K-Medians — hanaml.KMedian • hana.ml.r

hanaml.KMedian is an R wrapper for the SAP HANA PAL K-Medians algorithm.

hanaml.KMedian(
  data,
  key,
  features = NULL,
  n.clusters,
  init = NULL,
  max.iter = NULL,
  tol = NULL,
  thread.ratio = NULL,
  distance.level = NULL,
  minkowski.power = NULL,
  category.weights = NULL,
  normalization = NULL,
  categorical.variable = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key columns of data.

n.clusters

integer
Number of clusters (groups) to form.

init

character, optional
Controls how the initial centers are selected:

"first.k": First k observations.
"replace": Random with replacement.
"no.replace": Random without replacement.
"patent": Initial centers selected by the patented method (US 6,882,998 B1).

Defaults to "patent".
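
As a rough illustration of the first three options, the sketch below picks row indices for the initial centers in base R. It is only an assumption about what the option names mean, not PAL's actual selection code, and the variable names (n, k, first.k, with.replacement, without.replacement) are hypothetical. The "patent" method is not sketched because the source does not describe it.

```r
# Illustrative only: choosing k initial center rows out of n observations.
set.seed(1)   # fixed seed so the random picks are reproducible
n <- 20       # number of observations (hypothetical)
k <- 4        # number of clusters (hypothetical)

first.k <- seq_len(k)                           # "first.k": rows 1..k
with.replacement <- sample(n, k, replace = TRUE) # "replace": a row may be drawn twice
without.replacement <- sample(n, k)              # "no.replace": k distinct rows
```

Sampling with replacement can select the same observation more than once, which would start two clusters from identical centers.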

max.iter

integer, optional
Maximum number of iterations.
Defaults to 100.

tol

double, optional
Convergence threshold for exiting iterations. Defaults to 1e-6.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.

distance.level

character, optional
Specifies how to compute the distance between the item and the cluster center.

"manhattan"
"euclidean"
"minkowski"
"chebyshev"
"cosine"

Defaults to "euclidean".

minkowski.power

double, optional
Power (exponent) used when computing the Minkowski distance.
Only valid when distance.level is "minkowski".
Defaults to 3.0.
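
The distance options can be sketched in base R for numeric features as follows. This is a minimal illustration using the textbook definitions, not PAL's implementation: point.distance is a hypothetical helper, the cosine case assumes the distance is one minus the cosine similarity, and categorical features are not handled.

```r
# Hypothetical helper, not part of hana.ml.r: distance between a numeric
# point x and a numeric cluster center, using textbook definitions.
point.distance <- function(x, center, level = "euclidean", power = 3.0) {
  d <- abs(x - center)
  switch(level,
    manhattan = sum(d),
    euclidean = sqrt(sum(d^2)),
    minkowski = sum(d^power)^(1 / power),
    chebyshev = max(d),
    # Assumption: cosine distance = 1 - cosine similarity.
    cosine = 1 - sum(x * center) / (sqrt(sum(x^2)) * sqrt(sum(center^2))),
    stop("unknown distance level: ", level)
  )
}

x <- c(0.5, 0.5)
center <- c(1.1, 1.2)
point.distance(x, center, "manhattan")
point.distance(x, center, "euclidean")
point.distance(x, center, "minkowski", power = 3.0)
```

With power = 1 the Minkowski distance reduces to the Manhattan distance, and with power = 2 to the Euclidean distance.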

category.weights

double, optional
Weight applied to categorical attributes.
Defaults to 0.707.

normalization

character, optional
Specifies the normalization type:

"no": No normalization.
"l1.norm": For each point X = (x₁,x₂,...,x_n), the normalized value will be X' = (x₁/S,x₂/S,...,x_n/S), where S = |x₁|+|x₂|+...+|x_n|.
"min.max": For each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to "no".
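
The two normalization formulas can be tried out locally with a short base R sketch. The helpers l1.normalize and min.max.normalize are hypothetical names introduced here for illustration; they operate on a plain numeric matrix and are not how PAL applies normalization internally.

```r
# Hypothetical helpers illustrating the two formulas on a numeric matrix
# whose rows are points and whose columns are features.
l1.normalize <- function(m) {
  s <- rowSums(abs(m))  # S = |x1| + |x2| + ... + |xn| for each row
  m / s                 # recycling divides row i by its own S
}

min.max.normalize <- function(m) {
  # Rescale every column to the range [0, 1].
  apply(m, 2, function(col) (col - min(col)) / (max(col) - min(col)))
}

m <- matrix(c(1, 3, 2, 6), nrow = 2)  # rows: (1, 2) and (3, 6)
l1.normalize(m)       # each row now sums to 1 in absolute value
min.max.normalize(m)  # each column now spans 0 to 1
```

Note that an all-zero row (for l1.normalize) or a constant column (for min.max.normalize) leads to a division by zero in this sketch.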

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
By default, the column type determines how a feature is treated:

"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.

Value

A "KMedian" object with the following attributes:

labels : DataFrame
Label assigned to each sample.
cluster.centers : DataFrame
Coordinates of cluster centers.

Details

K-Medians is a clustering algorithm that partitions n observations into K clusters, assigning each observation to its nearest cluster center. Cluster centers are computed from the median of each feature over the observations in the cluster.
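
To make the assign-then-update loop concrete, here is a minimal base R sketch for purely numeric data. It assumes "first.k"-style initialization and Euclidean distance, and stops when no center coordinate moves by more than tol. The function kmedians.sketch and the sample matrix are hypothetical and only illustrate the general technique; PAL's implementation, its convergence criterion, and its categorical handling may differ.

```r
# Minimal K-Medians sketch for a numeric matrix (rows = observations).
# Hypothetical helper, not PAL's code.
kmedians.sketch <- function(x, k, max.iter = 100, tol = 1e-6) {
  centers <- x[seq_len(k), , drop = FALSE]  # start from the first k rows
  labels <- integer(nrow(x))
  for (iter in seq_len(max.iter)) {
    # Assignment step: Euclidean distance from every row to every center.
    d <- sapply(seq_len(k), function(j) {
      sqrt(rowSums(sweep(x, 2, centers[j, ])^2))
    })
    labels <- max.col(-d, ties.method = "first")  # index of nearest center
    # Update step: each center becomes the per-feature median of its members.
    new.centers <- centers
    for (j in seq_len(k)) {
      members <- x[labels == j, , drop = FALSE]
      if (nrow(members) > 0) new.centers[j, ] <- apply(members, 2, median)
    }
    shift <- max(abs(new.centers - centers))
    centers <- new.centers
    if (shift <= tol) break  # centers stopped moving
  }
  list(labels = labels, centers = centers)
}

x <- rbind(c(0.5, 0.5), c(16.5, 0.5), c(1.5, 0.5),
           c(1.5, 1.5), c(16.5, 1.5), c(15.5, 1.5))
fit <- kmedians.sketch(x, k = 2)
fit$labels
fit$centers
```

Using the median instead of the mean makes each center less sensitive to outliers within its cluster.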

Examples

Input DataFrame data:


> data$Collect()
    ID  V000 V001  V002
1    0   0.5    A   0.5
2    1   1.5    A   0.5
3    2   1.5    A   1.5
4    3   0.5    A   1.5
5    4   1.1    B   1.2
......
17  16  16.5    D   0.5
18  17  16.5    D   1.5
19  18  15.5    D   1.5
20  19  15.7    A   1.6

Call the function:


> kmedian <- hanaml.KMedian(data = data,
                            key = "ID",
                            n.clusters = 4,
                            init = "first.k",
                            max.iter = 100,
                            tol = 1.0E-6,
                            thread.ratio = 0.3,
                            distance.level = "euclidean",
                            category.weights = 0.5)

Output:


> kmedian$cluster.centers$Collect()
    CLUSTER_ID  V000 V001  V002
1           0   1.1    A   1.2
2           1  15.7    D   1.5
3           2  15.6    C  16.2
4           3   1.2    B  16.1

> kmedian$labels$Collect()
    ID CLUSTER_ID  DISTANCE
1   0          0 0.9219544
2   1          0 0.8062258
3   2          0 0.5000000
4   3          0 0.6708204
......
17 16          1 1.2806248
18 17          1 0.8000000
19 18          1 0.2000000
20 19          1 0.8071068
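
As a sanity check on the output, the DISTANCE values for IDs 0 to 3 can be recomputed locally. These four rows share the category value A with the center of cluster 0, so only the numeric columns V000 and V002 are used here; how PAL combines category.weights with a categorical mismatch is not covered by this sketch. The variable names are hypothetical.

```r
# Rows with ID 0..3 (V000, V002) from the input shown above, and the
# center of cluster 0 (V000 = 1.1, V002 = 1.2) from the output.
points <- rbind(c(0.5, 0.5), c(1.5, 0.5), c(1.5, 1.5), c(0.5, 1.5))
center <- c(1.1, 1.2)

# Euclidean distance of each row to the center.
distances <- sqrt(rowSums(sweep(points, 2, center)^2))
round(distances, 7)
# 0.9219544 0.8062258 0.5000000 0.6708204
```

The rounded values agree with the DISTANCE column reported for those four rows.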