hanaml.UnifiedClustering is an R wrapper for SAP HANA PAL Unified Clustering.

hanaml.UnifiedClustering(
  data = NULL,
  func = NULL,
  key = NULL,
  features = NULL,
  ...
)

Arguments

data

DataFrame
DataFrame containing the data.

func

character
The clustering functionality to run through the unified interface.
Valid values are:
"AgglomerateHierarchicalClustering", "DBSCAN", "GaussianMixture", "AcceleratedKMeans", "KMeans", "KMedians", "KMedoids", "SOM".

key

character
Name of the ID column.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key columns of data.

...


Specifies other parameters for training a clustering model with the functionality specified in func.
Please see the documentation of corresponding functionalities for more detail.
hanaml.AgglomerateHierarchical, hanaml.DBSCAN, hanaml.GaussianMixture, hanaml.KMeans, hanaml.KMedian, hanaml.KMedoid, hanaml.SOM
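
The default described for the features argument can be pictured with a small base R sketch. The helper resolve.features below is hypothetical and is not part of hanaml; it only illustrates the documented rule that omitted features fall back to every column except the key.

```r
# Hypothetical helper (not part of hanaml): mimic the documented default,
# where features falls back to all non-key columns of the data.
resolve.features <- function(columns, key, features = NULL) {
  if (is.null(features)) {
    setdiff(columns, key)        # keeps the original column order
  } else {
    unlist(features)             # accept a character vector or a list
  }
}

# Column names taken from the training data in the Examples section.
cols <- c("ID", "V000", "V001", "V002")

default.features <- resolve.features(cols, key = "ID")
chosen.features  <- resolve.features(cols, key = "ID",
                                     features = list("V000", "V002"))
print(default.features)
print(chosen.features)
```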

Value

Returns a "UnifiedClustering" object with the following attributes and methods:

labels DataFrame

  • DATA_ID - ID column in the input data.

  • CLUSTER_ID - The assigned cluster ID.

  • DISTANCE - Distance between a given point and the cluster center (k-means), the nearest core object (DBSCAN), or the weight vector (SOM); or the probability of a given point belonging to the corresponding cluster (GMM).

  • SLIGHT_SILHOUETTE - Estimated value (slight silhouette).
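
As a rough illustration of the labels columns for the k-means case, the base R sketch below assigns points to their nearest center and derives a center-based silhouette estimate. It rests on several assumptions: the centers are invented, only the numeric columns V000 and V002 are used (the categorical V001 and category.weights are ignored), and the formula (b - a) / max(a, b) is a common simplified silhouette, not a formula confirmed by this page. The numbers therefore will not match output produced by PAL.

```r
# Eight numeric rows taken from the training data (IDs 0-3 and 15-18).
pts <- data.frame(
  V000 = c(0.5, 1.5, 1.5, 0.5, 15.5, 16.5, 16.5, 15.5),
  V002 = c(0.5, 0.5, 1.5, 1.5, 0.5, 0.5, 1.5, 1.5)
)

# Hypothetical cluster centers, one row per cluster (assumed for illustration).
centers <- rbind(c(1, 1), c(16, 1))

# n x k matrix of Euclidean distances from every point to every center.
dist.to.centers <- sapply(seq_len(nrow(centers)), function(j) {
  sqrt(rowSums(sweep(as.matrix(pts), 2, centers[j, ])^2))
})

# Nearest center per point, and the distance to it (a).
cluster.idx <- max.col(-dist.to.centers, ties.method = "first")
own <- cbind(seq_len(nrow(pts)), cluster.idx)
a <- dist.to.centers[own]

# Distance to the nearest *other* center (b).
other <- dist.to.centers
other[own] <- Inf
b <- apply(other, 1, min)

# Assumed simplified silhouette based on center distances.
result <- data.frame(
  CLUSTER_ID        = cluster.idx - 1,   # zero-based, as in the labels output
  DISTANCE          = a,
  SLIGHT_SILHOUETTE = (b - a) / pmax(a, b)
)
print(result)
```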

centers DataFrame

  • CLUSTER_ID

  • VARIABLE_NAME - The name of variable.

  • VALUE - The value of variable.
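
The centers DataFrame is in long format, with one row per cluster and variable. If a wide table with one row per cluster is more convenient, the collected result could be pivoted in base R along these lines. The values below are invented, and the sketch assumes that the collected data is an ordinary data.frame with a numeric VALUE column.

```r
# Hypothetical collected centers in long format (values invented).
centers.long <- data.frame(
  CLUSTER_ID    = c(0, 0, 1, 1),
  VARIABLE_NAME = c("V000", "V002", "V000", "V002"),
  VALUE         = c(1.0, 1.0, 16.0, 1.0),
  stringsAsFactors = FALSE
)

# Pivot to one row per cluster and one column per variable.
centers.wide <- reshape(centers.long,
                        idvar = "CLUSTER_ID",
                        timevar = "VARIABLE_NAME",
                        direction = "wide")

# reshape() names the new columns VALUE.<variable>; drop the prefix.
names(centers.wide) <- sub("^VALUE\\.", "", names(centers.wide))
print(centers.wide)
```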

model DataFrame

  • ROW_INDEX - Model row index.

  • PART_INDEX - Used specifically for GMM, where it corresponds to CLUSTER_ID.

  • MODEL_CONTENT - Model content.

statistics DataFrame

  • STAT_NAME - Statistics name.

  • STAT_VALUE - Statistics value.

optimal.param DataFrame

  • PARM_NAME - Parameter name.

  • INT_VALUE - Integer value.

  • DOUBLE_VALUE - Double value.

  • STRING_VALUE - Character value.

Examples

The training data:

 > data.fit$Collect()
     ID  V000 V001  V002
 1    0   0.5    A   0.5
 2    1   1.5    A   0.5
 3    2   1.5    A   1.5
 4    3   0.5    A   1.5
 5    4   1.1    B   1.2
 6    5   0.5    B  15.5
 7    6   1.5    B  15.5
 8    7   1.5    B  16.5
 9    8   0.5    B  16.5
 10   9   1.2    C  16.1
 11  10  15.5    C  15.5
 12  11  16.5    C  15.5
 13  12  16.5    C  16.5
 14  13  15.5    C  16.5
 15  14  15.6    D  16.2
 16  15  15.5    D   0.5
 17  16  16.5    D   0.5
 18  17  16.5    D   1.5
 19  18  15.5    D   1.5
 20  19  15.7    A   1.6

Create a UnifiedClustering model for KMeans:

ukmeans <- hanaml.UnifiedClustering(data = data.fit,
                                    func = "KMeans",
                                    key = "ID",
                                    n.clusters=4,
                                    init="first.k",
                                    max.iter=100,
                                    tol=1.0E-6,
                                    thread.ratio=1.0,
                                    distance.level="Euclidean",
                                    category.weights=0.5)

Check the labels:

> ukmeans$labels$Collect()
    ID  CLUSTER_ID  DISTANCE SLIGHT_SILHOUETTE
1    0           0  0.891088          0.944370
2    1           0  0.863917          0.942478
3    2           0  0.806252          0.946288
4    3           0  0.835684          0.944942
......
17  16           1  0.976885          0.939386
18  17           1  0.818178          0.945878
19  18           1  0.722799          0.952170
20  19           1  1.102342          0.925679
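
Once collected, the labels can be summarised with base R. The sketch below assumes that Collect() returns an ordinary data.frame and rebuilds one from the eight rows printed above (the elided rows are left out), so the summaries cover only those rows.

```r
# Rows copied from the printed labels output; elided rows are omitted.
labels.df <- data.frame(
  ID = c(0, 1, 2, 3, 16, 17, 18, 19),
  CLUSTER_ID = c(0, 0, 0, 0, 1, 1, 1, 1),
  DISTANCE = c(0.891088, 0.863917, 0.806252, 0.835684,
               0.976885, 0.818178, 0.722799, 1.102342),
  SLIGHT_SILHOUETTE = c(0.944370, 0.942478, 0.946288, 0.944942,
                        0.939386, 0.945878, 0.952170, 0.925679)
)

# Number of points per cluster.
cluster.sizes <- table(labels.df$CLUSTER_ID)

# Average slight silhouette per cluster.
mean.silhouette <- tapply(labels.df$SLIGHT_SILHOUETTE,
                          labels.df$CLUSTER_ID, mean)

# The point lying farthest from its assigned cluster center.
farthest <- labels.df[order(-labels.df$DISTANCE), ][1, ]

print(cluster.sizes)
print(mean.silhouette)
print(farthest)
```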

See also