hanaml.UnifiedClustering is an R wrapper for SAP HANA PAL Unified Clustering.

hanaml.UnifiedClustering(
  data = NULL,
  func = NULL,
  key = NULL,
  features = NULL,
  massive = FALSE,
  group.key = NULL,
  group.params = NULL,
  ...
)

Arguments

data

DataFrame
DataFrame containing the data.

func

character
The clustering functionality to use. Valid options are:

  • "AgglomerateHierarchicalClustering"

  • "DBSCAN"

  • "GaussianMixture"

  • "AcceleratedKMeans"

  • "KMeans"

  • "KMedians"

  • "KMedoids"

  • "SOM"

  • "AffinityPropagation"

key

character, optional
Name of the ID column.
Defaults to the first column if not provided.

features

character or list of characters, optional
Names of the feature columns.
If not provided, defaults to all non-key columns of data.

massive

logical, optional
Specifies whether or not to use massive mode.
To set parameters in massive mode, you can use either group.params (see the example below) or the original parameters. Original parameters apply to all groups. However, if any parameter is defined for a group in group.params, none of the original parameter settings apply to that group.
An example is as follows:

udbscan <- hanaml.UnifiedClustering(data = df.fit,
                                        group.key = "GROUP_ID",
                                        func = 'DBSCAN',
                                        thread.ratio=1.0,
                                        key='ID',
                                        massive=TRUE,
                                        group.params = list(
                                          'Group_1'=list(metric='manhattan')))

In this example, because metric='manhattan' is set in group.params for Group_1, the setting thread.ratio=1.0 does not apply to Group_1.
Defaults to FALSE.
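The precedence rule described above can be sketched in plain R, without a HANA connection. The helper resolve.params and the group name "Group_2" are hypothetical and are not part of the package; the sketch only mirrors the documented behaviour and makes no claim about how the wrapper implements it.

```r
# Hypothetical helper, NOT part of the package: it mirrors the documented rule
# that a group with its own entry in group.params ignores the original parameters.
resolve.params <- function(group, common.params, group.params) {
  if (!is.null(group.params[[group]])) {
    # The group has its own settings, so the original settings are dropped entirely.
    return(group.params[[group]])
  }
  # No group-specific settings: the original parameters apply.
  common.params
}

common.params <- list(thread.ratio = 1.0)
group.params <- list("Group_1" = list(metric = "manhattan"))

# Group_1 keeps only metric; thread.ratio is not carried over.
str(resolve.params("Group_1", common.params, group.params))

# "Group_2" (a hypothetical group with no entry) falls back to the original parameters.
str(resolve.params("Group_2", common.params, group.params))
```

Under this reading, a group that needs both thread.ratio and metric would have to list both in its group.params entry.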

group.key

character, optional
Name of the group key column. The data type can be INT or NVARCHAR/VARCHAR. If the data type is INT, only parameters set in group.params are valid. This parameter is only valid when massive is TRUE.
Defaults to the first column of data if group.key is not provided.

group.params

list, optional
When massive mode is activated (massive=TRUE), the input data is divided into groups, and different parameters can be applied to each group.
An example is as follows:

udbscan <- hanaml.UnifiedClustering(data = df.fit,
                                        group.key = "GROUP_ID",
                                        func = 'DBSCAN',
                                        thread.ratio = 1.0,
                                        key = 'ID',
                                        massive = TRUE,
                                        group.params = list(
                                          'Group_1'=list(metric='manhattan')))
res <- predict(model = udbscan,
               data = df.predict,
               group.key = "GROUP_ID",
               key = 'ID')

Valid only when massive is TRUE.
Defaults to NULL.

...


Specifies other parameters for training a clustering model with the functionality specified in func.
See the documentation of the corresponding functions for details:
hanaml.AgglomerateHierarchical, hanaml.DBSCAN, hanaml.GaussianMixture, hanaml.KMeans, hanaml.KMedian, hanaml.KMedoid, hanaml.SOM, hanaml.AffinityPropagation

Value

Returns a "hanaml.UnifiedClustering" object with the following attributes and methods:

labels

DataFrame

  • DATA_ID - ID column in the input data.

  • CLUSTER_ID - The assigned cluster ID.

  • DISTANCE - Distance between a given point and the cluster center (k-means), the nearest core object (DBSCAN), or the weight vector (SOM); or the probability of a given point belonging to the corresponding cluster (GMM).

  • SLIGHT_SILHOUETTE - Estimated value (slight silhouette).

centers

DataFrame

  • CLUSTER_ID

  • VARIABLE_NAME - The name of variable.

  • VALUE - The value of variable.

model

DataFrame

  • ROW_INDEX - model row index.

  • PART_INDEX - Specifically for GMM's CLUSTER_ID.

  • MODEL_CONTENT - model content.

statistics

DataFrame

  • STAT_NAME - Statistics name.

  • STAT_VALUE - Statistics value.

optimal.param

DataFrame

  • PARM_NAME - parameter name.

  • INT_VALUE - integer value.

  • DOUBLE_VALUE - double value.

  • STRING_VALUE - character value.

error.msg

DataFrame
Error messages. Only valid when massive is TRUE.

Examples

The training data:


 > data.fit$Collect()
     ID  V000 V001  V002
 1    0   0.5    A   0.5
 2    1   1.5    A   0.5
 3    2   1.5    A   1.5
 4    3   0.5    A   1.5
 5    4   1.1    B   1.2
 6    5   0.5    B  15.5
 7    6   1.5    B  15.5
 8    7   1.5    B  16.5
 9    8   0.5    B  16.5
 10   9   1.2    C  16.1
 11  10  15.5    C  15.5
 12  11  16.5    C  15.5
 13  12  16.5    C  16.5
 14  13  15.5    C  16.5
 15  14  15.6    D  16.2
 16  15  15.5    D   0.5
 17  16  16.5    D   0.5
 18  17  16.5    D   1.5
 19  18  15.5    D   1.5
 20  19  15.7    A   1.6

Create a UnifiedClustering model for KMeans:

ukmeans <- hanaml.UnifiedClustering(data = data.fit,
                                    func = "KMeans",
                                    n.clusters=4,
                                    init="first.k",
                                    max.iter=100,
                                    tol=1.0E-6,
                                    thread.ratio=1.0,
                                    distance.level="Euclidean",
                                    category.weights=0.5)

Check the labels:


> ukmeans$labels$Collect()
    ID  CLUSTER_ID  DISTANCE SLIGHT_SILHOUETTE
1    0           0  0.891088          0.944370
2    1           0  0.863917          0.942478
3    2           0  0.806252          0.946288
4    3           0  0.835684          0.944942
......
17  16           1  0.976885          0.939386
18  17           1  0.818178          0.945878
19  18           1  0.722799          0.952170
20  19           1  1.102342          0.925679
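Once collected, the labels can be summarised per cluster with base R. The sketch below assumes that $Collect() returns an ordinary data.frame, as the printed output suggests. It re-enters only the eight rows shown above as a local data.frame, with column names following the Value section, so the result covers those rows and not the full data set.

```r
# The eight label rows printed above, re-entered locally so that the
# sketch runs without a HANA connection.
labels <- data.frame(
  ID = c(0, 1, 2, 3, 16, 17, 18, 19),
  CLUSTER_ID = c(0, 0, 0, 0, 1, 1, 1, 1),
  DISTANCE = c(0.891088, 0.863917, 0.806252, 0.835684,
               0.976885, 0.818178, 0.722799, 1.102342),
  SLIGHT_SILHOUETTE = c(0.944370, 0.942478, 0.946288, 0.944942,
                        0.939386, 0.945878, 0.952170, 0.925679)
)

# Number of points per cluster (for the rows shown only).
cluster.sizes <- table(labels$CLUSTER_ID)

# Mean distance and mean slight silhouette per cluster.
cluster.summary <- aggregate(cbind(DISTANCE, SLIGHT_SILHOUETTE) ~ CLUSTER_ID,
                             data = labels, FUN = mean)

print(cluster.sizes)
print(cluster.summary)
```

With a live model, the same two calls could be applied to the result of ukmeans$labels$Collect() in place of the local data.frame.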