UnifiedClustering

class hana_ml.algorithms.pal.unified_clustering.UnifiedClustering(func, massive=False, group_params=None, **kwargs)

The Python wrapper for the SAP HANA PAL Unified Clustering function.

The clustering algorithms include:

  • 'AgglomerateHierarchicalClustering'

  • 'DBSCAN'

  • 'GaussianMixture'

  • 'AcceleratedKMeans'

  • 'KMeans'

  • 'KMedians'

  • 'KMedoids'

  • 'SOM'

  • 'AffinityPropagation'

  • 'SpectralClustering'

For GaussianMixture, you must set init_mode together with either n_components or init_centers; these parameters define INITIALIZE_PARAMETER in SAP HANA PAL.

Unlike the original KMedians and KMedoids, UnifiedClustering creates a model during training and then performs cluster assignment through that model.

Parameters:
func : str

The name of a specified clustering algorithm.

The following algorithms are supported:

  • 'AgglomerateHierarchicalClustering'

  • 'DBSCAN'

  • 'GaussianMixture'

  • 'AcceleratedKMeans'

  • 'KMeans'

  • 'KMedians'

  • 'KMedoids'

  • 'SOM'

  • 'AffinityPropagation'

  • 'SpectralClustering'

massive : bool, optional

Specifies whether or not to use massive mode of unified clustering.

  • True : massive mode.

  • False : single mode.

For parameter setting in massive mode, you can use group_params (see the example below), the original parameters, or both. Original parameters apply to all groups. However, if you define any parameters for a group in group_params, none of the original parameter settings apply to that group.

An example is as follows:
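
A minimal sketch of such a setup, with the override rule written out in plain Python. The helper effective_params is hypothetical and only mirrors the rule stated above; it is not part of hana_ml. The values 'manhattan' and 0.6 are assumed for illustration.

```python
def effective_params(global_params, group_params, group):
    """Return the settings that apply to `group` (illustrative helper, not hana_ml API)."""
    group_params = group_params or {}
    if group in group_params:
        # A group with its own entry ignores every global ("original") setting.
        return dict(group_params[group])
    # Groups without an entry fall back to the global settings.
    return dict(global_params)


# Settings as they would be passed to UnifiedClustering(func=..., massive=True, ...).
global_params = {'metric': 'manhattan'}
group_params = {'Group_1': {'thread_ratio': 0.6}}

print(effective_params(global_params, group_params, 'Group_1'))  # {'thread_ratio': 0.6}
print(effective_params(global_params, group_params, 'Group_2'))  # {'metric': 'manhattan'}
```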

In this example, because 'thread_ratio' is set in group_params for Group_1, the 'metric' setting does not apply to Group_1.

Defaults to False.

group_params : dict, optional

If massive mode is activated (massive is True), the input data for clustering is divided into groups, and different clustering parameters can be applied to each group. This parameter specifies the parameter values of the chosen clustering algorithm func for the different groups as a dict, where each key is a value of group_key and each value is a dict of clustering algorithm parameter assignments.

An example is as follows:

Valid only when massive is True and defaults to None.
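
As an illustration, group_params for two groups could be built as below. The group names, the choice of DBSCAN, and all parameter values are assumptions; minpts, eps and thread_ratio are taken to be the DBSCAN parameter names used by hana_ml. The hana_ml import is deferred into a function so that the dict can be inspected without hana_ml installed.

```python
# Keys are values found in the group_key column; values are per-group parameter dicts.
group_params = {
    'Group_1': {'minpts': 3, 'eps': 2.0},
    'Group_2': {'minpts': 5, 'eps': 1.0, 'thread_ratio': 0.5},
}


def check_group_params(params_by_group):
    """Verify the dict-of-dicts shape described above (illustrative helper)."""
    for group, params in params_by_group.items():
        if not isinstance(params, dict):
            raise TypeError(f"parameters for group {group!r} must be a dict")
    return True


def build_estimator(params_by_group):
    """Create the estimator; requires hana_ml to be installed."""
    from hana_ml.algorithms.pal.unified_clustering import UnifiedClustering
    return UnifiedClustering(func='DBSCAN', massive=True,
                             group_params=params_by_group)


check_group_params(group_params)
```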

**kwargs : keyword arguments

Arbitrary keyword arguments. Please refer to the corresponding algorithm for the supported parameter key-value pairs.

Note

Some parameters of the underlying clustering algorithms are disabled in UnifiedClustering. Each func value corresponds to the following class:

  • 'AgglomerateHierarchicalClustering' : AgglomerateHierarchicalClustering

    • Note that distance_level is also supported and has the same options as affinity. If both parameters are given, distance_level takes precedence over affinity.

  • 'DBSCAN' : DBSCAN

    • Note that distance_level is also supported and has the same options as metric. If both parameters are given, distance_level takes precedence over metric.

  • 'GaussianMixture' : GaussianMixture

  • 'AcceleratedKMeans' : KMeans

    • Note that the parameter accelerated is not valid in this function.

  • 'KMeans' : KMeans

  • 'KMedians' : KMedians

  • 'KMedoids' : KMedoids

  • 'SOM' : SOM

  • 'AffinityPropagation' : AffinityPropagation

  • 'SpectralClustering' : SpectralClustering

For the full parameter mappings between hana_ml and SAP HANA PAL, please refer to the doc page: Parameter Mappings

References

For using a precomputed distance matrix as input data, please see:

  1. precomputed Distance Matrix as input data

Examples

Input DataFrame:

>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Create a UnifiedClustering instance:

>>> from hana_ml.algorithms.pal.unified_clustering import UnifiedClustering
>>> kmeans_params = dict(n_clusters=4, init='first_k', max_iter=100,
...                      tol=1.0E-6, thread_ratio=1.0, distance_level='Euclidean',
...                      category_weights=0.5)
>>> ukmeans = UnifiedClustering(func='KMeans', **kmeans_params)

Fit the UnifiedClustering instance with the df:

>>> ukmeans.fit(data = df, key = 'ID')

Check the resulting labels:

>>> ukmeans.labels_.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETE
0    0           0  0.891088          0.944370
1    1           0  0.863917          0.942478
2    2           0  0.806252          0.946288
3    3           0  0.835684          0.944942
......
16  16           1  0.976885          0.939386
17  17           1  0.818178          0.945878
18  18           1  0.722799          0.952170
19  19           1  1.102342          0.925679

Data for prediction (cluster assignment) is a DataFrame df_pred that contains the ID column and the same feature columns as df.

Perform predict():

>>> result = ukmeans.predict(data = df_pred, key = 'ID')
>>> result.collect()
   ID  CLUSTER_ID  DISTANCE
0  88           3  0.981659
1  89           3  0.826454
2  90           2  1.990205
3  91           2  0.325812

Attributes:
labels_ : DataFrame

Label assigned to each sample. Also includes the distance between a given point and the cluster center (k-means), the nearest core object (DBSCAN), or the weight vector (SOM), or the probability of a given point belonging to the corresponding cluster (GMM).

centers_ : DataFrame

Coordinates of cluster centers.

model_ : DataFrame

Model content.

statistics_ : DataFrame

Names and values of statistics.

optimal_param_ : DataFrame

Provides optimal parameters selected.

Available only when parameter selection is triggered.

error_msg_ : DataFrame

Error message. Only valid if massive is True when initializing a 'UnifiedClustering' instance.

Methods

fit(data[, key, features, group_key, ...])

Fit function for unified clustering.

predict(data[, key, group_key, features, model])

Predict with the clustering model.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

fit(data, key=None, features=None, group_key=None, group_params=None, categorical_variable=None, string_variable=None, variable_weight=None)

Fit function for unified clustering.

Parameters:
data : DataFrame

Training data.

If a precomputed distance matrix is used as input data, provide a DataFrame with the following structure:

  • single mode:

    • 1st column, type INTEGER, VARCHAR, or NVARCHAR, left point.

    • 2nd column, type INTEGER, VARCHAR, or NVARCHAR, right point.

    • 3rd column, type DOUBLE, distance.

  • massive mode:

    • 1st column, type INTEGER, VARCHAR, or NVARCHAR, group ID.

    • 2nd column, type INTEGER, VARCHAR, or NVARCHAR, left point.

    • 3rd column, type INTEGER, VARCHAR, or NVARCHAR, right point.

    • 4th column, type DOUBLE, distance.
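
The single-mode and massive-mode layouts above can be sketched in plain Python. The sketch uses Euclidean distance on three points taken from the V000 and V002 columns of the example data (IDs 0, 1 and 2) and emits each unordered pair once; whether PAL also expects the reversed pairs is not stated here, so treat that choice as an assumption. The helper names are hypothetical.

```python
import math
from itertools import combinations


def distance_rows(points):
    """Return (left point, right point, distance) rows for single mode."""
    return [
        (left, right, math.dist(p, q))
        for (left, p), (right, q) in combinations(points.items(), 2)
    ]


def massive_rows(points_by_group):
    """Prefix each row with its group ID for massive mode."""
    return [
        (group, left, right, dist)
        for group, points in points_by_group.items()
        for left, right, dist in distance_rows(points)
    ]


# ID -> (V000, V002) for the first three rows of the example data.
points = {0: (0.5, 0.5), 1: (1.5, 0.5), 2: (1.5, 1.5)}
for row in distance_rows(points):
    print(row)
```

The resulting rows would still have to be loaded into a HANA table and wrapped in a hana_ml DataFrame before being passed to fit().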

key : str, optional

Name of the ID column. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

group_key : str, optional

Name of the group column. The data type can be INT or NVARCHAR/VARCHAR. If the data type is INT, only parameters set in group_params are valid.

This parameter is only valid when massive mode is activated in class instance initialization (i.e. parameter massive is set to True).

Defaults to the first column of data if the index columns of data are not provided. Otherwise, defaults to the first of the index columns.

categorical_variable : str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

No default value.

string_variable : str or list of str, optional

Indicates string column(s) that store non-categorical data.

Levenshtein distance is used to calculate similarity between two strings.

Ignored if it is not a string column. Only valid for DBSCAN.

Defaults to None.
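
For reference, the textbook Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below implements that definition only; how PAL scales the value or combines it with other variables is not described here.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (row-by-row dynamic programming)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(levenshtein('kitten', 'sitting'))  # 3
```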

variable_weight : dict, optional

Specifies the weight of a variable participating in distance calculation. The value must be greater than or equal to 0.

Defaults to 1 for variables not specified. Only valid for DBSCAN.

Defaults to None.
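
The defaulting rule above can be sketched as follows. The helper resolve_weights is hypothetical and not part of hana_ml; it only shows that listed variables keep their given weight, unlisted ones get 1, and negative values are rejected.

```python
def resolve_weights(features, variable_weight=None):
    """Return the weight used for each feature (illustrative helper, not hana_ml API)."""
    variable_weight = variable_weight or {}
    for name, weight in variable_weight.items():
        if weight < 0:
            raise ValueError(f"weight for {name!r} must be >= 0, got {weight}")
    # Variables that are not listed keep the default weight of 1.
    return {name: variable_weight.get(name, 1) for name in features}


print(resolve_weights(['V000', 'V002'], {'V000': 0.5}))  # {'V000': 0.5, 'V002': 1}
```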

group_params : dict, optional

If massive mode is activated (massive is set to True in class instance initialization), the input data for clustering is divided into groups, and different clustering parameters can be applied to each group. This parameter specifies the parameter values of the chosen clustering algorithm func for the different groups as a dict, where each key is a value of group_key and each value is a dict of clustering algorithm parameter assignments.

An example is as follows:

Valid only when massive is set as True in class instance initialization.

Defaults to None.

Returns:
A fitted object of 'UnifiedClustering'.

predict(data, key=None, group_key=None, features=None, model=None)

Predict with the clustering model.

Cluster assignment is a unified interface to call a cluster assignment algorithm to assign data to clusters that are previously generated by some clustering methods, including K-Means, Accelerated K-Means, K-Medians, K-Medoids, DBSCAN, SOM, and GMM.

AgglomerateHierarchicalClustering does not provide a predict function.

Parameters:
data : DataFrame

Data to be predicted.

If a precomputed distance matrix is used as input data, provide a DataFrame with the following structure:

  • single mode:

    • 1st column, type INTEGER, VARCHAR, or NVARCHAR, left point.

    • 2nd column, type INTEGER, VARCHAR, or NVARCHAR, right point.

    • 3rd column, type DOUBLE, distance.

  • massive mode:

    • 1st column, type INTEGER, VARCHAR, or NVARCHAR, group ID.

    • 2nd column, type INTEGER, VARCHAR, or NVARCHAR, left point.

    • 3rd column, type INTEGER, VARCHAR, or NVARCHAR, right point.

    • 4th column, type DOUBLE, distance.

key : str, optional

Name of the ID column. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

group_key : str, optional

Name of the group column. The data type can be INT or NVARCHAR/VARCHAR. If the data type is INT, only parameters specified in group_params in class instance initialization are valid.

This parameter is only valid when massive mode is activated (i.e. massive is set to True in class instance initialization).

Defaults to the first column of data if the index columns of data are not provided. Otherwise, defaults to the first of the index columns.

features : list of str, optional

Names of feature columns in data for prediction.

Defaults to all non-ID columns in data if not provided.

model : DataFrame, optional

Fitted clustering model.

Defaults to self.model_.

Returns:
DataFrame 1

Cluster assignment result, structured as follows:

1st column : Data ID

2nd column : Assigned cluster ID

3rd column : Distance or probability measure relating a given point to its assigned cluster. Depending on the function, this is:

  • Distance between a given point and the cluster center (k-means, k-medians, k-medoids)

  • Distance between a given point and the nearest core object (DBSCAN)

  • Distance between a given point and the weight vector (SOM)

  • Probability of a given point belonging to the corresponding cluster (GMM)
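
For the k-means family, the three result columns can be pictured as a nearest-center assignment. The sketch below uses Euclidean distance on numeric features only; the centers and points are made-up values, and the code is an illustration rather than PAL's implementation.

```python
import math


def assign_to_centers(points, centers):
    """Return (data ID, assigned cluster ID, distance to that center) rows."""
    rows = []
    for point_id, coords in points.items():
        # Pick the center with the smallest Euclidean distance to this point.
        cluster_id, center = min(
            centers.items(), key=lambda item: math.dist(coords, item[1])
        )
        rows.append((point_id, cluster_id, math.dist(coords, center)))
    return rows


centers = {0: (1.0, 1.0), 1: (16.0, 1.0)}   # hypothetical cluster centers
points = {88: (1.0, 2.0), 89: (15.0, 1.0)}  # hypothetical data to assign

for row in assign_to_centers(points, centers):
    print(row)
```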

DataFrame 2 (optional)

Error message. Only valid if massive is True when initializing a 'UnifiedClustering' instance.

Inherited Methods from PALBase

Besides the methods mentioned above, the UnifiedClustering class also inherits methods from the PALBase class. Please refer to PAL Base for more details.