K-Medoids — hanaml.KMedoid • hana.ml.r

hanaml.KMedoid is an R wrapper for the SAP HANA PAL K-Medoids algorithm.

hanaml.KMedoid(
  data,
  key,
  features = NULL,
  n.clusters,
  init = NULL,
  max.iter = NULL,
  tol = NULL,
  thread.ratio = NULL,
  distance.level = NULL,
  minkowski.power = NULL,
  category.weights = NULL,
  normalization = NULL,
  categorical.variable = NULL
)

Arguments

data	`DataFrame` DataFrame containing the data.
key	`character` Name of the ID column.
features	`character or list of characters, optional` Names of the feature columns. If not provided, it defaults to all non-key columns of `data`.
n.clusters	`integer` Number of groups.
init	`character, optional` Controls how the initial centers are selected: `"first.k"`: the first k observations. `"replace"`: random with replacement. `"no.replace"`: random without replacement. `"patent"`: the patented method for selecting the initial centers (US 6,882,998 B1). Defaults to `"patent"`.
max.iter	`integer, optional` Maximum number of iterations. Defaults to 100.
tol	`double, optional` Threshold for exiting the iteration. Defaults to 1e-6.
thread.ratio	`double, optional` Controls the proportion of available threads that this function can use. The value ranges from 0 to 1, where 0 indicates a single thread and 1 indicates all available threads. Values between 0 and 1 use up to that percentage of available threads. Values outside this range are ignored. Defaults to 0.
distance.level	`character, optional` Specifies how to compute the distance between an item and a cluster center. One of `"manhattan"`, `"euclidean"`, `"minkowski"`, `"chebyshev"`, or `"cosine"`. Defaults to `"euclidean"`.
minkowski.power	`double, optional` The power used by the Minkowski distance. Only valid when `distance.level` is `"minkowski"`. Defaults to 3.0.
category.weights	`double, optional` Represents the weight of category attributes. Defaults to 0.707.
normalization	`character, optional` Specifies the normalization type: `'no'`: no normalization. `'l1.norm'`: each point X = (x₁,x₂,...,x_n) is normalized to X' = (x₁/S,x₂/S,...,x_n/S), where S = |x₁|+|x₂|+...+|x_n|. `'min.max'`: for each column C, the minimum and maximum of C are computed and each value is rescaled as C[i] = (C[i]-min)/(max-min). Defaults to `'no'`.
categorical.variable	`character or list/vector of characters, optional` Indicates which features should be treated as categorical variables. By default, the column type decides: "VARCHAR" and "NVARCHAR" columns are categorical, while "INTEGER" and "DOUBLE" columns are continuous. This argument is valid only for columns of "INTEGER" type and is ignored otherwise. No default value.
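
The normalization formulas and distance options described above can be illustrated locally in base R. This is only a sketch for building intuition: the helper names (`l1_normalize`, `min_max_normalize`, `minkowski`, `chebyshev`, `cosine_distance`) are hypothetical and not part of hana.ml.r, the cosine distance is assumed to be 1 minus the cosine similarity, and PAL's internal computation may differ in detail.

```r
# Illustrative base-R helpers; NOT part of hana.ml.r or PAL.

# 'l1.norm': divide each point (row) by the sum of its absolute values.
l1_normalize <- function(m) {
  m / rowSums(abs(m))
}

# 'min.max': rescale each column to the range [0, 1].
# A constant column would give 0/0 here; this sketch does not handle that case.
min_max_normalize <- function(m) {
  apply(m, 2, function(col) (col - min(col)) / (max(col) - min(col)))
}

# Minkowski distance with power p; p = 1 is Manhattan, p = 2 is Euclidean.
minkowski <- function(x, y, p = 3) {
  sum(abs(x - y)^p)^(1 / p)
}

# Chebyshev distance: the largest absolute coordinate difference.
chebyshev <- function(x, y) {
  max(abs(x - y))
}

# Cosine distance, assumed here to be 1 minus the cosine similarity.
cosine_distance <- function(x, y) {
  1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Four two-dimensional points, one per row.
m <- matrix(c(0.5, 0.5,
              1.5, 0.5,
              1.5, 1.5,
              0.5, 1.5), ncol = 2, byrow = TRUE)

l1_normalize(m)                    # every row now has absolute values summing to 1
min_max_normalize(m)               # every column now spans 0 to 1
minkowski(m[1, ], m[3, ], p = 2)   # Euclidean distance between rows 1 and 3
chebyshev(m[1, ], m[3, ])
```

The Minkowski helper covers three of the listed options at once, since `p = 1` and `p = 2` reduce to the Manhattan and Euclidean distances.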

Value

A "KMedoid" object with the following attributes:

labels : DataFrame
Label assigned to each sample.
cluster.centers : DataFrame
Coordinates of cluster centers.

Details

The K-Medoids clustering algorithm partitions n observations into K clusters by assigning each observation to its nearest cluster center. Each center is a medoid, that is, an actual observation from the data rather than a computed mean. This makes K-Medoids more robust to noise and outliers than K-Means.
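
To make the idea of a medoid concrete, the following base-R sketch compares the mean of a small point set with its medoid when one outlier is present. The helper `medoid_index` is hypothetical and not part of hana.ml.r; it assumes Euclidean distance and shows only how a single medoid can be chosen, not the full iterative algorithm that PAL runs.

```r
# Hypothetical helper, not part of hana.ml.r:
# returns the row index of the medoid of the points stored in the rows of m.
medoid_index <- function(m) {
  d <- as.matrix(dist(m))   # pairwise Euclidean distances between rows
  which.min(rowSums(d))     # row with the smallest total distance to all others
}

# Four points near the origin plus one distant outlier.
pts <- matrix(c(0.5,  0.5,
                1.5,  0.5,
                1.5,  1.5,
                0.5,  1.5,
                50.0, 50.0), ncol = 2, byrow = TRUE)

colMeans(pts)               # the mean is pulled toward the outlier
pts[medoid_index(pts), ]    # the medoid is still one of the original points
```

Because the medoid must be an existing observation, a single extreme value cannot drag the center away from the bulk of the data the way it drags the mean.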

Examples

Input DataFrame data:

> data$Collect()
     ID  V000 V001  V002
1    0   0.5    A   0.5
2    1   1.5    A   0.5
3    2   1.5    A   1.5
4    3   0.5    A   1.5
5    4   1.1    B   1.2
......
16  15  15.5    D   0.5
17  16  16.5    D   0.5
18  17  16.5    D   1.5
19  18  15.5    D   1.5
20  19  15.7    A   1.6

Call the function:

> kmed <- hanaml.KMedoid(data = data,
                         key = "ID",
                         n.clusters = 4,
                         init = "first.k",
                         max.iter = 100,
                         tol = 1.0E-6,
                         thread.ratio = 0.3,
                         distance.level = "euclidean",
                         category.weights = 0.5)

Output:

> kmed$cluster.centers$Collect()
   CLUSTER_ID  V000 V001  V002
1           0   1.5    A   1.5
2           1  15.5    D   1.5
3           2  15.5    C  16.5
4           3   1.5    B  16.5

> kmed$labels$Collect()
   ID CLUSTER_ID  DISTANCE
1   0          0 1.4142136
2   1          0 1.0000000
3   2          0 0.0000000
4   3          0 1.0000000
......
17 16          1 1.4142136
18 17          1 1.0000000
19 18          1 0.0000000
20 19          1 0.9307136
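
Once the labels are collected into a local data frame, they can be summarized with ordinary base-R tools. The sketch below assumes that `Collect()` returns a regular `data.frame`, as the printed output suggests, and rebuilds only the eight rows visible in the excerpt above (the elided rows are left out), so the counts do not describe the full result. The name `labels_local` is hypothetical.

```r
# Only the rows visible in the excerpt above; the elided rows are not reproduced.
labels_local <- data.frame(
  ID         = c(0, 1, 2, 3, 16, 17, 18, 19),
  CLUSTER_ID = c(0, 0, 0, 0, 1, 1, 1, 1),
  DISTANCE   = c(1.4142136, 1.0000000, 0.0000000, 1.0000000,
                 1.4142136, 1.0000000, 0.0000000, 0.9307136)
)

# Number of observations per cluster.
table(labels_local$CLUSTER_ID)

# Mean distance to the assigned center, per cluster.
aggregate(DISTANCE ~ CLUSTER_ID, data = labels_local, FUN = mean)

# Observations with distance 0 coincide with their cluster's medoid.
labels_local[labels_local$DISTANCE == 0, "ID"]
```

In this excerpt the zero-distance rows are IDs 2 and 18, whose feature values match the centers listed for clusters 0 and 1.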