hanaml.KMeans is an R wrapper for the SAP HANA PAL K-means and accelerated K-means algorithms.

hanaml.KMeans(
  data = NULL,
  key = NULL,
  features = NULL,
  n.clusters = NULL,
  n.clusters.min = NULL,
  n.clusters.max = NULL,
  init = NULL,
  max.iter = NULL,
  tol = NULL,
  thread.ratio = NULL,
  distance.level = NULL,
  minkowski.power = NULL,
  category.weights = NULL,
  normalization = NULL,
  categorical.variable = NULL,
  memory.mode = NULL,
  accelerated = FALSE,
  use.fast.library = NULL,
  use.float = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key columns of data.

n.clusters

integer, optional
Number of groups.
Note: If this parameter is not specified, you must specify the range of k using n.clusters.min and n.clusters.max. The algorithm then iterates through the range and returns the k with the highest slight silhouette.
No default value.

n.clusters.min

integer, optional
Lower boundary of the clustering range.
Note: You must specify either an exact value or a range for k. If both are specified, the exact value will be used.
No default value.

n.clusters.max

integer, optional
Upper boundary of the clustering range.
Note: You must specify either an exact value or a range for k. If both are specified, the exact value will be used.
No default value.
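
The range search described for n.clusters, n.clusters.min and n.clusters.max can be pictured with a small base-R sketch. It does not call PAL: it uses stats::kmeans and a simplified, centroid-based silhouette as a stand-in for the slight silhouette (whether PAL computes the score exactly this way is an assumption). The helper names centroid.silhouette and pick.k are hypothetical, and the data is synthetic.

```r
# Simplified, centroid-based silhouette (an assumption, not PAL's exact formula):
# a = distance to the own center, b = distance to the nearest other center.
centroid.silhouette <- function(x, centers, cluster) {
  x <- as.matrix(x)
  # n x k matrix of Euclidean distances from every point to every center
  d <- sapply(seq_len(nrow(centers)), function(j) {
    sqrt(rowSums(sweep(x, 2, centers[j, ])^2))
  })
  idx <- cbind(seq_len(nrow(x)), cluster)
  a <- d[idx]
  d[idx] <- Inf            # mask the own center before taking the minimum
  b <- apply(d, 1, min)
  mean((b - a) / pmax(a, b))
}

# Try every k in [k.min, k.max] (k.min >= 2) and keep the best-scoring one.
pick.k <- function(x, k.min, k.max, max.iter = 100) {
  ks <- k.min:k.max
  scores <- sapply(ks, function(k) {
    fit <- kmeans(x, centers = k, iter.max = max.iter, nstart = 5)
    centroid.silhouette(x, fit$centers, fit$cluster)
  })
  list(k = ks[which.max(scores)], scores = setNames(scores, ks))
}

set.seed(1)
# Three well-separated synthetic groups (made-up data for illustration)
x <- rbind(matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2))
res <- pick.k(x, k.min = 2, k.max = 5)
res$k        # the selected number of clusters
res$scores   # one score per candidate k
```

With hanaml.KMeans the same search is requested by leaving n.clusters unset and passing n.clusters.min and n.clusters.max.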

init

character, optional
Controls how the initial centers are selected:

  • "first.k": First k observations.

  • "replace": Random with replacements.

  • "no.replace": Random without replacements.

  • "patent": Patented method of selecting the initial centers (US 6,882,998 B1).


Defaults to "patent".

max.iter

integer, optional
Maximum number of iterations.
Defaults to 100.

tol

double, optional
Convergence threshold for exiting iterations. Only valid when accelerated is FALSE.
Defaults to 1.0e-6.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.

distance.level

character, optional
Specifies how to compute the distance between the item and the cluster center.

  • "manhattan"

  • "euclidean"

  • "minkowski"

  • "chebyshev"

  • "cosine": Valid only when accelerated is FALSE.

Defaults to "euclidean".

minkowski.power

double, optional
When Minkowski distance is used, this parameter controls the value of power.
Only valid when distance.level is "minkowski".
Defaults to 3.0.
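
To make the role of minkowski.power concrete, here is a minimal base-R sketch of the Minkowski distance between an item and a cluster center. The function name minkowski and the sample vectors are illustrative assumptions; the code shows the textbook formula and is not taken from PAL.

```r
# Minkowski distance of order p between two numeric vectors
minkowski <- function(x, y, p = 3) {
  stopifnot(length(x) == length(y))
  sum(abs(x - y)^p)^(1 / p)
}

item   <- c(0.5, 0.5)
center <- c(3.5, 4.5)
minkowski(item, center, p = 1)  # same as the Manhattan distance: 7
minkowski(item, center, p = 2)  # same as the Euclidean distance: 5
minkowski(item, center, p = 3)  # power 3, the documented default
```

Powers 1 and 2 reproduce the "manhattan" and "euclidean" options, so minkowski.power mainly matters for other values.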

category.weights

double, optional
Represents the weight of category attributes.
Defaults to 0.707.

normalization

character, optional
Specifies the normalization type:

  • "no": No normalization.

  • "l1.norm": For each point X = (x1,x2,...,xn), the normalized value will be X' = (x1/S,x2/S,...,xn/S), where S = |x1|+|x2|+...+|xn|.

  • "min.max": For each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to "no".
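
The two normalization formulas above can be tried out in base R. This is only a sketch of the arithmetic: the helper names l1.norm and min.max are hypothetical, the input matrix is made up, and the code does not show how PAL treats edge cases such as an all-zero point or a constant column (both would divide by zero here).

```r
# "l1.norm": scale each point (row) by the sum of its absolute values
l1.norm <- function(m) {
  m <- as.matrix(m)
  m / rowSums(abs(m))
}

# "min.max": rescale each column to the range [0, 1]
min.max <- function(m) {
  m <- as.matrix(m)
  apply(m, 2, function(col) (col - min(col)) / (max(col) - min(col)))
}

m <- rbind(c(1, 3), c(2, 6), c(5, 5))
l1.norm(m)   # rows become (0.25, 0.75), (0.25, 0.75), (0.5, 0.5)
min.max(m)   # columns become (0, 0.25, 1) and (0, 1, 2/3)
```

Note that "l1.norm" works per point (row) while "min.max" works per column.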

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior is dependent on what input is given:

  • "VARCHAR" and "NVARCHAR": categorical

  • "INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.

memory.mode

character, optional
Indicates the memory mode the algorithm uses.

  • "auto": Chosen by the algorithm.

  • "optimized.speed": Prioritizes speed.

  • "optimized.space": Prioritizes saving memory.

Only valid when accelerated is TRUE. Defaults to "auto".

accelerated

logical, optional
Indicates whether or not to accelerate the calculation process.
Defaults to FALSE.

use.fast.library

logical, optional
Uses vectorized accelerated operations when set to TRUE.
Not valid when accelerated is TRUE.
Defaults to FALSE.

use.float

logical, optional
If TRUE, uses float; if FALSE, uses double.
Only valid when use.fast.library is TRUE. Not valid when accelerated is TRUE.
Defaults to TRUE.
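
Several arguments above are only valid for one setting of accelerated. The following base-R sketch collects those documented restrictions in one place. check.kmeans.args is a hypothetical helper, not part of the package, and how hanaml.KMeans itself reacts to such combinations (error, warning, or silently ignoring the argument) is not stated here.

```r
# Hypothetical helper: lists the argument combinations that the
# parameter descriptions mark as not valid. Not part of the package.
check.kmeans.args <- function(accelerated = FALSE, tol = NULL,
                              distance.level = NULL, memory.mode = NULL,
                              use.fast.library = NULL, use.float = NULL) {
  issues <- character(0)
  if (accelerated) {
    if (!is.null(tol))
      issues <- c(issues, "tol is only valid when accelerated is FALSE")
    if (identical(distance.level, "cosine"))
      issues <- c(issues, "\"cosine\" is only valid when accelerated is FALSE")
    if (!is.null(use.fast.library))
      issues <- c(issues, "use.fast.library is not valid when accelerated is TRUE")
    if (!is.null(use.float))
      issues <- c(issues, "use.float is not valid when accelerated is TRUE")
  } else {
    if (!is.null(memory.mode))
      issues <- c(issues, "memory.mode is only valid when accelerated is TRUE")
    if (!is.null(use.float) && !isTRUE(use.fast.library))
      issues <- c(issues, "use.float is only valid when use.fast.library is TRUE")
  }
  issues
}

# Two conflicts: tol and "cosine" combined with accelerated = TRUE
bad <- check.kmeans.args(accelerated = TRUE, tol = 1e-6,
                         distance.level = "cosine")
# No conflict: use.float together with use.fast.library = TRUE
ok <- check.kmeans.args(accelerated = FALSE, use.fast.library = TRUE,
                        use.float = FALSE)
```

An empty result means the combination is consistent with the descriptions above.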

Value

An R6 object of class "KMeans" with the following attributes and methods:
Attributes

  • labels : DataFrame
    Label assigned to each sample.

  • cluster.centers : DataFrame
    Coordinates of cluster centers.

  • model : DataFrame
    Model content.

  • statistics : DataFrame
    Statistic value.

Methods

CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)

Usage:


   > km <- hanaml.KMeans(data=df, key=df$columns[[1]])
   > km$CreateModelState()


Arguments:

  • model: DataFrame
    DataFrame containing the model for parsing.
    Defaults to self$model.

  • algorithm: character
    Specifies the PAL algorithm associated with model.
    Defaults to self$pal.algorithm.

  • func: character
    Specifies the functionality for Unified Classification/Regression.
    Valid only for object instance of R6Class "UnifiedClassification" or "UnifiedRegression".
    Defaults to self$func.

  • state.description: character
    A summary string for the generated model state.
    Defaults to "ModelState".

  • force: logical
    Specifies whether or not to replace the existing state for model.
    Defaults to FALSE.

After calling this method, an attribute state that contains the parsed info for model shall be assigned to the corresponding R6 object.

DeleteModelState(state=NULL)

Usage:
Assuming we have trained a hanaml model and created its model state, like the following:


   > km <- hanaml.KMeans(data=df, key=df$columns[[1]])
   > km$CreateModelState()


After using the model state for real-time scoring, we can delete the state by calling:


   > km$DeleteModelState()


Arguments:

  • state: DataFrame
    DataFrame containing the state info.
    Defaults to self$state.

After calling this method, the specified model state is cleaned up and the associated memory is released.

Examples

Input DataFrame data:


> data$Collect()
    ID V000 V001  V002
 1   0  0.5    A   0.5
 2   1  1.5    A   0.5
 3   2  1.5    A   1.5
 4   3  0.5    A   1.5
 5   4  1.1    B   1.2
 6   5  0.5    B  15.5
 7   6  1.5    B  15.5
 8   7  1.5    B  16.5
 9   8  0.5    B  16.5
 10  9  1.2    C  16.1
 11 10 15.5    C  15.5
 12 11 16.5    C  15.5
 13 12 16.5    C  16.5
 14 13 15.5    C  16.5
 15 14 15.6    D  16.2
 16 15 15.5    D   0.5
 17 16 16.5    D   0.5
 18 17 16.5    D   1.5
 19 18 15.5    D   1.5
 20 19 15.7    A   1.6
 

Call the function:


> km <- hanaml.KMeans(data = data,
                      key = "ID",
                      features = NULL,
                      n.clusters = 4,
                      init = "first.k",
                      max.iter = 100,
                      tol = 1.0E-6,
                      thread.ratio = 0.2,
                      distance.level = "euclidean",
                      category.weights = 0.5)

Output:


> km$labels$Collect()
      ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
1      0           0  0.891088           0.944370
2      1           0  0.863917           0.942478
3      2           0  0.806252           0.946288
4      3           0  0.835684           0.944942
5      4           0  0.744571           0.950234
6      5           3  0.891088           0.940733

See also