hanaml.Kmeans {hana.ml.r}    R Documentation

Kmeans

Description

hanaml.Kmeans is an R wrapper for the PAL K-means and accelerated K-means algorithms.

Usage

hanaml.Kmeans(conn.context,
              data = NULL,
              key = NULL,
              features = NULL,
              n.clusters = NULL,
              n.clusters.min = NULL,
              n.clusters.max = NULL,
              init = NULL,
              max.iter = NULL,
              tol = NULL,
              thread.ratio = NULL,
              distance.level = NULL,
              minkowski.power = NULL,
              category.weights = NULL,
              normalization = NULL,
              categorical.variable = NULL,
              memory.mode = NULL,
              accelerated = FALSE)

Arguments

conn.context

ConnectionContext
The connection to the SAP HANA system.

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

character, optional
Names of the feature columns.

n.clusters

integer, optional
Number of groups.
No default value.

n.clusters.min

integer, optional
Lower boundary of the clustering range.
No default value.

n.clusters.max

integer, optional
Upper boundary of the clustering range.
No default value.

init

character, optional
Controls how the initial centers are selected:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

  • 'patent': Patented method for selecting the initial centers (US 6,882,998 B1).


Defaults to 'patent'.
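
As a rough illustration of how the first three init options differ, the following base R sketch picks k starting rows from a small matrix. The helper pick_initial_centers is hypothetical and is not part of hana.ml.r; it only mirrors the option names above and does not attempt the 'patent' method, and PAL's actual selection logic may differ.

```r
# Hypothetical helper for illustration only; not a hana.ml.r or PAL function.
pick_initial_centers <- function(x, k,
                                 init = c("first_k", "replace", "no_replace")) {
  init <- match.arg(init)
  n <- nrow(x)
  idx <- switch(init,
    first_k    = seq_len(k),                          # first k observations
    replace    = sample.int(n, k, replace = TRUE),    # rows may repeat
    no_replace = sample.int(n, k, replace = FALSE)    # k distinct rows
  )
  x[idx, , drop = FALSE]
}

# Five two-dimensional points (values taken from the numeric columns
# of the example data further down this page).
x <- matrix(c(0.5, 1.5, 1.5, 0.5, 1.1,
              0.5, 0.5, 1.5, 1.5, 1.2), ncol = 2)

set.seed(1)
first  <- pick_initial_centers(x, 2, "first_k")
no_rep <- pick_initial_centers(x, 3, "no_replace")
```

With 'replace', the same observation can be drawn more than once, so two initial centers may coincide; 'no_replace' always yields k distinct observations.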

max.iter

integer, optional
Maximum number of iterations.
Defaults to 100.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads.
Defaults to 0.

distance.level

character, optional
Specifies how to compute the distance between the item and the cluster center. Valid options are 'manhattan', 'euclidean', 'minkowski', 'chebyshev', and 'cosine' (valid only when accelerated is FALSE).
Defaults to 'euclidean'.

minkowski.power

double, optional
When Minkowski distance is used, this parameter controls the value of power.
Only valid when distance.level is 'minkowski'.
Defaults to 3.0.
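
For reference, the distance.level options correspond to the following textbook distance definitions, sketched here in base R. The helper point_distance is hypothetical and runs locally; it is an assumption about what each option name means, not the PAL implementation, and the cosine case in particular uses the common "1 minus cosine similarity" form.

```r
# Hypothetical helper for illustration only; not a hana.ml.r or PAL function.
point_distance <- function(x, center, level = "euclidean", power = 3) {
  d <- abs(x - center)
  switch(level,
    manhattan = sum(d),
    euclidean = sqrt(sum(d^2)),
    minkowski = sum(d^power)^(1 / power),   # 'power' plays the role of minkowski.power
    chebyshev = max(d),
    cosine    = 1 - sum(x * center) /
                    (sqrt(sum(x^2)) * sqrt(sum(center^2))),
    stop("unknown distance level: ", level)
  )
}

p <- c(0, 0)
q <- c(3, 4)
d_euclid    <- point_distance(p, q, "euclidean")
d_manhattan <- point_distance(p, q, "manhattan")
d_chebyshev <- point_distance(p, q, "chebyshev")
d_minkowski <- point_distance(p, q, "minkowski", power = 3)
```

Under these definitions, a Minkowski power of 1 reproduces the Manhattan distance and a power of 2 reproduces the Euclidean distance.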

category.weights

double, optional
Weight applied to categorical attributes.
Defaults to 0.707.

normalization

character, optional
Normalization type:

  • 'no': no normalization.

  • 'l1.norm': for each point X (x1,x2,...,xn), the normalized value is X'(x1/S,x2/S,...,xn/S), where S = |x1|+|x2|+...+|xn|.

  • 'min.max': for each column C, take the minimum and maximum values of C, then set C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.
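
The two normalization formulas above can be reproduced locally in base R as follows. The function names l1_normalize and min_max_normalize are hypothetical and only restate the formulas on a numeric matrix; they are not how PAL applies normalization inside SAP HANA.

```r
# Hypothetical helpers for illustration only; not hana.ml.r or PAL functions.

# L1 norm: divide every row (point) by the sum of its absolute values.
# A row of all zeros would produce NaN.
l1_normalize <- function(x) {
  s <- rowSums(abs(x))
  x / s   # s is recycled down the columns, so row i is divided by s[i]
}

# Min-max: rescale every column to the range [0, 1].
# A constant column would produce NaN because max - min is 0.
min_max_normalize <- function(x) {
  apply(x, 2, function(col) (col - min(col)) / (max(col) - min(col)))
}

m <- matrix(c(1, 2, 3, 6), nrow = 2)   # rows are the points (1, 3) and (2, 6)
m_l1     <- l1_normalize(m)
m_minmax <- min_max_normalize(m)
```

Note that the L1 norm works per point (row), while min-max works per feature (column).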

categorical.variable

character or list of characters, optional
Names of the columns in the data table to treat as categorical variables.
No default value.

tol

double, optional
Convergence threshold for exiting iterations. Only valid when accelerated is FALSE.
Defaults to 1.0e-6.

memory.mode

character, optional
Indicates the memory mode the algorithm uses.

  • 'auto': Chosen by the algorithm.

  • 'optimized-speed': Prioritizes speed.

  • 'optimized-space': Prioritizes saving memory.

Only valid when accelerated is TRUE. Defaults to 'auto'.

accelerated

logical, optional
Indicates whether or not to accelerate the calculation process.
Defaults to FALSE.

Format

R6Class object.

Value

See Also

predict.Kmeans

Examples

## Not run: 
Input DataFrame data for training:
    ID V000  V001  V002
 1   0  0.5    A  0.5
 2   1  1.5    A  0.5
 3   2  1.5    A  1.5
 4   3  0.5    A  1.5
 5   4  1.1    B  1.2
 6   5  0.5    B 15.5
 7   6  1.5    B 15.5
 8   7  1.5    B 16.5
 9   8  0.5    B 16.5
 10  9  1.2    C 16.1
 11 10 15.5    C 15.5
 12 11 16.5    C 15.5
 13 12 16.5    C 16.5
 14 13 15.5    C 16.5
 15 14 15.6    D 16.2
 16 15 15.5    D  0.5
 17 16 16.5    D  0.5
 18 17 16.5    D  1.5
 19 18 15.5    D  1.5
 20 19 15.7    A  1.6

 Model training, which returns a "Kmeans" object km:
> km <- hanaml.Kmeans(conn.context = conn,
                     data = data,
                     features = NULL,
                     n.clusters = 4,
                     init = "first_k",
                     max.iter = 100,
                     tol = 1.0E-6,
                     thread.ratio = 0.2,
                     distance.level = "euclidean",
                     category.weights = 0.5)
Expected output:
> km$labels$Collect()
     ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
1      0           0  0.891088           0.944370
2      1           0  0.863917           0.942478
3      2           0  0.806252           0.946288
4      3           0  0.835684           0.944942
5      4           0  0.744571           0.950234
6      5           3  0.891088           0.940733

## End(Not run)

[Package hana.ml.r version 1.0.8 Index]