R: Kmeans

hanaml.Kmeans {hana.ml.r}

R Documentation

Kmeans

Description

hanaml.Kmeans is an R wrapper for the PAL K-means and accelerated K-means algorithms.

Usage

hanaml.Kmeans(conn.context,
              data = NULL,
              key = NULL,
              features = NULL,
              n.clusters = NULL,
              n.clusters.min = NULL,
              n.clusters.max = NULL,
              init = NULL,
              max.iter = NULL,
              tol = NULL,
              thread.ratio = NULL,
              distance.level = NULL,
              minkowski.power = NULL,
              category.weights = NULL,
              normalization = NULL,
              categorical.variable = NULL,
              memory.mode = NULL,
              accelerated = FALSE)

Arguments

`conn.context`	`ConnectionContext` The connection to the SAP HANA system.
`data`	`DataFrame` DataFrame containing the data.
`key`	`character` Name of the ID column.
`features`	`character, optional` Name of the features columns.
`n.clusters`	`integer, optional` Number of groups. No default value.
`n.clusters.min`	`integer, optional` Lower boundary of the clustering range. No default value.
`n.clusters.max`	`integer, optional` Upper boundary of the clustering range. No default value.
`init`	`character, optional` Controls how the initial centers are selected: `'first_k'`: First k observations. `'replace'`: Random with replacement. `'no_replace'`: Random without replacement. `'patent'`: Initial centers are selected by the patented method (US 6,882,998 B1). Defaults to 'patent'.
`max.iter`	`integer, optional` Maximum number of iterations. Defaults to 100.
`thread.ratio`	`double, optional` Controls the proportion of available threads that can be used by this function. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads. Defaults to 0.
`distance.level`	`character, optional` Specifies how to compute the distance between the item and the cluster center. Valid options are 'manhattan', 'euclidean', 'minkowski', 'chebyshev', and 'cosine' (valid only when accelerated is FALSE). Defaults to 'euclidean'.
`minkowski.power`	`double, optional` When Minkowski distance is used, this parameter controls the value of the power. Only valid when `distance.level` is 'minkowski'. Defaults to 3.0.
`category.weights`	`double, optional` Represents the weight of category attributes. Defaults to 0.707.
`normalization`	`character, optional` Normalization type: `'no'`: No normalization. `'l1.norm'`: For each point X (x1,x2,...,xn), the normalized value is X'(x1/S,x2/S,...,xn/S), where S = |x1|+|x2|+...+|xn|. `'min.max'`: For each column C, compute the min and max of C, then set C[i] = (C[i]-min)/(max-min). Defaults to 'no'.
`categorical.variable`	`character or list of characters, optional` Column names in the data table to use as category variable. No default value.
`tol`	`double, optional` Convergence threshold for exiting iterations. Only valid when accelerated is FALSE. Defaults to 1.0e-6.
`memory.mode`	`character, optional` Indicates the memory mode the algorithm uses. `'auto'`: Chosen by the algorithm. `'optimized-speed'`: Prioritizes speed. `'optimized-space'`: Prioritizes saving memory. Only valid when accelerated is TRUE. Defaults to 'auto'.
`accelerated`	`logical, optional` Indicates whether or not to accelerate the calculation process. Defaults to FALSE.
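
The formulas given for `normalization` and the role of `minkowski.power` can be illustrated locally. The following is a minimal base R sketch, not the PAL implementation; the helper names `l1_norm`, `min_max`, and `minkowski` are hypothetical and exist only for this illustration.

```r
# Base R sketch of the formulas described for `normalization` and
# `minkowski.power`. Helper names are hypothetical, not part of hana.ml.r.

# l1.norm: divide each point (row) by S = |x1| + |x2| + ... + |xn|.
l1_norm <- function(x) x / rowSums(abs(x))

# min.max: rescale each column C to (C[i] - min) / (max - min).
# A constant column would divide by zero here; this sketch does not guard it.
min_max <- function(x) {
  apply(x, 2, function(col) (col - min(col)) / (max(col) - min(col)))
}

# Minkowski distance of order p between a point a and a center b.
minkowski <- function(a, b, p = 3) sum(abs(a - b)^p)^(1 / p)

# Three points taken from columns V000 and V002 of the example data.
x <- matrix(c(0.5, 0.5,
              1.5, 0.5,
              1.5, 1.5), ncol = 2, byrow = TRUE)

l1_norm(x)                        # every row now has absolute sum 1
min_max(x)                        # every column now spans [0, 1]
minkowski(x[1, ], x[3, ], p = 1)  # p = 1 matches Manhattan distance
minkowski(x[1, ], x[3, ], p = 2)  # p = 2 matches Euclidean distance
minkowski(x[1, ], x[3, ], p = 3)  # p = 3 is the documented default power
```

Setting p to 1 or 2 reproduces the 'manhattan' and 'euclidean' options, which is why `minkowski.power` only matters when `distance.level` is 'minkowski'.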

Format

R6Class object.

Value

labels : DataFrame
Label assigned to each sample.
cluster.centers : DataFrame
Coordinates of cluster centers.
model : DataFrame
Model content.
statistics : DataFrame
Statistic value.

Examples

## Not run: 
Input DataFrame data for training:
    ID V000  V001  V002
 1   0  0.5    A  0.5
 2   1  1.5    A  0.5
 3   2  1.5    A  1.5
 4   3  0.5    A  1.5
 5   4  1.1    B  1.2
 6   5  0.5    B 15.5
 7   6  1.5    B 15.5
 8   7  1.5    B 16.5
 9   8  0.5    B 16.5
 10  9  1.2    C 16.1
 11 10 15.5    C 15.5
 12 11 16.5    C 15.5
 13 12 16.5    C 16.5
 14 13 15.5    C 16.5
 15 14 15.6    D 16.2
 16 15 15.5    D  0.5
 17 16 16.5    D  0.5
 18 17 16.5    D  1.5
 19 18 15.5    D  1.5
 20 19 15.7    A  1.6

Model training; a "Kmeans" object km is returned:
> km <- hanaml.Kmeans(conn.context = conn,
                     data = data,
                     features = NULL,
                     n.clusters = 4,
                     init = "first_k",
                     max.iter = 100,
                     tol = 1.0E-6,
                     thread.ratio = 0.2,
                     distance.level = "euclidean",
                     category.weights = 0.5)
Expected output:
> km$labels$Collect()
     ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
1      0           0  0.891088           0.944370
2      1           0  0.863917           0.942478
3      2           0  0.806252           0.946288
4      3           0  0.835684           0.944942
5      4           0  0.744571           0.950234
6      5           3  0.891088           0.940733

## End(Not run)
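
As a further illustration, the sketch below wraps a call that searches a range of cluster counts instead of fixing `n.clusters`. It uses only arguments listed under Usage. The helper name `fit_kmeans_range`, its defaults, and the assumption that `key = "ID"` matches the example table are illustrative, and reading `cluster.centers` via `$Collect()` is assumed to work the same way as for `labels` above. Defining the function does not contact SAP HANA; it needs a live `ConnectionContext` only when called.

```r
# Sketch only: `fit_kmeans_range` is a hypothetical helper, not part of hana.ml.r.
# `conn` and `data` are assumed to be the ConnectionContext and DataFrame
# used in the example above.
fit_kmeans_range <- function(conn, data, k.min = 2, k.max = 6) {
  # Reject an empty or inverted range before sending anything to the server.
  stopifnot(k.min >= 1, k.max >= k.min)

  km <- hanaml.Kmeans(conn.context = conn,
                      data = data,
                      key = "ID",                # ID column of the example table
                      n.clusters.min = k.min,    # lower boundary of the range
                      n.clusters.max = k.max,    # upper boundary of the range
                      distance.level = "minkowski",
                      minkowski.power = 3.0,
                      normalization = "min.max")

  # Collect the result DataFrames into local data frames.
  list(labels = km$labels$Collect(),
       centers = km$cluster.centers$Collect())
}
```

A call such as `fit_kmeans_range(conn, data, k.min = 2, k.max = 6)` would then return the collected labels and centers, assuming the connection and table exist.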

[Package hana.ml.r version 1.0.8 Index]
