K-Medians — hanaml.KMedian • hana.ml.r

hanaml.KMedian is an R wrapper for the SAP HANA PAL K-Medians algorithm.

hanaml.KMedian(
  data,
  key,
  features = NULL,
  n.clusters,
  init = NULL,
  max.iter = NULL,
  tol = NULL,
  thread.ratio = NULL,
  distance.level = NULL,
  minkowski.power = NULL,
  category.weights = NULL,
  normalization = NULL,
  categorical.variable = NULL
)

Arguments

data	`DataFrame` DataFrame containing the data.
key	`character` Name of the ID column.
features	`character or list of characters, optional` Names of the feature columns. If not provided, it defaults to all non-key columns of `data`.
n.clusters	`integer` Number of groups.
init	`character, optional` Controls how the initial centers are selected: `"first.k"`: First k observations. `"replace"`: Random with replacement. `"no.replace"`: Random without replacement. `"patent"`: Patented method for selecting the initial centers (US 6,882,998 B1). Defaults to "patent".
max.iter	`integer, optional` Maximum number of iterations. Defaults to 100.
tol	`double, optional` Convergence threshold for exiting iterations. Defaults to 1e-6.
thread.ratio	`double, optional` Controls the proportion of available threads that can be used by this function. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored. Defaults to 0.
distance.level	`character, optional` Specifies how to compute the distance between an item and a cluster center. One of `"manhattan"`, `"euclidean"`, `"minkowski"`, `"chebyshev"`, or `"cosine"`. Defaults to "euclidean".
minkowski.power	`double, optional` Specifies the power (exponent) used by the Minkowski distance. Only valid when distance.level is "minkowski". Defaults to 3.0.
category.weights	`double, optional` Represents the weight of category attributes. Defaults to 0.707.
normalization	`character, optional` Specifies the normalization type: `"no"`: No normalization. `"l1.norm"`: For each point X = (x₁,x₂,...,x_n), the normalized value will be X' = (x₁/S,x₂/S,...,x_n/S), where S = |x₁|+|x₂|+...+|x_n|. `"min.max"`: For each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min). Defaults to "no".
categorical.variable	`character or list/vector of characters, optional` Indicates which features should be treated as categorical variables. By default, "VARCHAR" and "NVARCHAR" columns are treated as categorical, and "INTEGER" and "DOUBLE" columns as continuous. Only valid for columns of "INTEGER" type; ignored otherwise. No default value.
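
The two normalization formulas described for `normalization` can be illustrated locally in base R. This is only a sketch of the arithmetic, not the PAL implementation; the helper names `l1_normalize` and `min_max_normalize` and the sample matrix `pts` are hypothetical and chosen for illustration.

```r
# Illustrative only: base R versions of the "l1.norm" and "min.max"
# formulas from the argument description. Not the PAL implementation.

l1_normalize <- function(x) {
  # Divide each row (point) by S, the sum of absolute values of its entries.
  s <- rowSums(abs(x))
  x / s
}

min_max_normalize <- function(x) {
  # Rescale each column with its own minimum and maximum.
  apply(x, 2, function(col) (col - min(col)) / (max(col) - min(col)))
}

# Hypothetical sample: four points with two numeric features.
pts <- matrix(c(0.5, 0.5,
                1.5, 0.5,
                1.5, 1.5,
                0.5, 1.5),
              ncol = 2, byrow = TRUE,
              dimnames = list(NULL, c("V000", "V002")))

l1 <- l1_normalize(pts)       # every row of abs(l1) sums to 1
mm <- min_max_normalize(pts)  # every column now spans 0 to 1
```

Note that this sketch does not guard against an all-zero row (S = 0) or a constant column (max = min); both would produce `NaN` here.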

Value

A "KMedian" object with the following attributes:

labels : DataFrame
Label assigned to each sample.
cluster.centers : DataFrame
Coordinates of cluster centers.

Details

The K-Medians clustering algorithm partitions n observations into K clusters, assigning each observation to its nearest cluster center. Each cluster center is computed from the per-feature medians of the observations in that cluster.
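
The assign-then-update loop described above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not what PAL runs: it handles numeric features only, uses Manhattan distance, initializes with the first k rows, and ignores normalization and categorical weights. The function name `kmedians_sketch` and the toy matrix `x` are hypothetical.

```r
# Simplified K-Medians sketch (numeric features, Manhattan distance).
# Assumes nrow(x) > 1 and k <= nrow(x). Not the PAL implementation.
kmedians_sketch <- function(x, k, max_iter = 100, tol = 1e-6) {
  x <- as.matrix(x)
  centers <- x[seq_len(k), , drop = FALSE]  # naive "first k" initialization
  labels <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: distance from every point to every center.
    dists <- sapply(seq_len(k), function(j) {
      rowSums(abs(sweep(x, 2, centers[j, ])))
    })
    labels <- max.col(-dists, ties.method = "first")  # index of nearest center
    # Update step: each center becomes the per-feature median of its members.
    new_centers <- centers
    for (j in seq_len(k)) {
      members <- x[labels == j, , drop = FALSE]
      if (nrow(members) > 0) {
        new_centers[j, ] <- apply(members, 2, median)
      }
    }
    shift <- max(abs(new_centers - centers))
    centers <- new_centers
    if (shift < tol) break  # centers stopped moving
  }
  list(labels = labels, centers = centers)
}

# Toy data: two well-separated groups, interleaved.
x <- rbind(c(0.5, 0.5), c(15.5, 0.5), c(1.5, 0.5),
           c(16.5, 1.5), c(1.5, 1.5), c(15.5, 1.5))
res <- kmedians_sketch(x, k = 2)
res$labels   # 1 2 1 2 1 2
res$centers  # rows: (1.5, 0.5) and (15.5, 1.5)
```

Manhattan distance is used here because the per-feature median minimizes the sum of absolute deviations within a cluster; hanaml.KMedian itself lets you choose the distance through `distance.level`.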

Examples

Input DataFrame data:

> data$Collect()
    ID  V000 V001  V002
1    0   0.5    A   0.5
2    1   1.5    A   0.5
3    2   1.5    A   1.5
4    3   0.5    A   1.5
5    4   1.1    B   1.2
......
17  16  16.5    D   0.5
18  17  16.5    D   1.5
19  18  15.5    D   1.5
20  19  15.7    A   1.6

Call the function:

> kmedian <- hanaml.KMedian(data = data,
                            key = "ID",
                            n.clusters = 4,
                            init = "first.k",
                            max.iter = 100,
                            tol = 1.0E-6,
                            thread.ratio = 0.3,
                            distance.level = "euclidean",
                            category.weights = 0.5)

Output:

> kmedian$cluster.centers$Collect()
    CLUSTER_ID  V000 V001  V002
1           0   1.1    A   1.2
2           1  15.7    D   1.5
3           2  15.6    C  16.2
4           3   1.2    B  16.1

> kmedian$labels$Collect()
    ID CLUSTER_ID  DISTANCE
1   0          0 0.9219544
2   1          0 0.8062258
3   2          0 0.5000000
4   3          0 0.6708204
......
17 16          1 1.2806248
18 17          1 0.8000000
19 18          1 0.2000000
20 19          1 0.8071068
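
As a local sanity check, the DISTANCE values for IDs 0 to 3 can be reproduced in base R from the numbers printed above. The sketch assumes that a matching category (V001 is "A" for these rows and for the center of cluster 0) contributes nothing to the distance, so only the numeric features V000 and V002 are used; it does not attempt to reproduce how PAL weights a category mismatch.

```r
# Values copied from the example input and output above.
center0 <- c(V000 = 1.1, V002 = 1.2)   # center of cluster 0 (numeric part)
pts <- rbind(c(0.5, 0.5),              # ID 0
             c(1.5, 0.5),              # ID 1
             c(1.5, 1.5),              # ID 2
             c(0.5, 1.5))              # ID 3

# Euclidean distance from each point to the cluster center.
d <- sqrt(rowSums(sweep(pts, 2, center0)^2))
round(d, 7)  # 0.9219544 0.8062258 0.5000000 0.6708204
```

These match the DISTANCE column reported for IDs 0 to 3, which is consistent with `distance.level = "euclidean"` being applied to the numeric features.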