UnifiedClustering
- class hana_ml.algorithms.pal.unified_clustering.UnifiedClustering(func, massive=False, group_params=None, **kwargs)
The Python wrapper for SAP HANA PAL Unified Clustering function.
The clustering algorithms include:
'AgglomerateHierarchicalClustering'
'DBSCAN'
'GaussianMixture'
'AcceleratedKMeans'
'KMeans'
'KMedians'
'KMedoids'
'SOM'
'AffinityPropagation'
'SpectralClustering'
For GaussianMixture, you must configure the init_mode and n_components or init_centers parameters to define INITIALIZE_PARAMETER in SAP HANA PAL.
Compared to the original KMedians and KMedoids, UnifiedClustering creates a model after training and then performs cluster assignment through that model.
- Parameters:
- func : str
The name of a specified clustering algorithm. The following algorithms are supported:
'AgglomerateHierarchicalClustering'
'DBSCAN'
'GaussianMixture'
'AcceleratedKMeans'
'KMeans'
'KMedians'
'KMedoids'
'SOM'
'AffinityPropagation'
'SpectralClustering'
- massive : bool, optional
Specifies whether or not to use massive mode of unified clustering.
True : massive mode.
False : single mode.
For parameter setting in massive mode, you can use either group_params or the original parameters. Original parameters apply to all groups. However, if you define any parameters for a group in group_params, none of the original parameter settings apply to that group.
For example, if 'thread_ratio' is set in group_params for Group_1 and 'metric' is set as an original parameter, the 'metric' setting does not apply to Group_1.
Defaults to False.
- group_params : dict, optional
If massive mode is activated (massive is True), input data for clustering is divided into different groups, with different clustering parameters applied to each. This parameter specifies the parameter values of the chosen clustering algorithm func for the different groups in dict format, where the keys correspond to group_key values and each value is a dict of clustering algorithm parameter assignments.
Valid only when massive is True. Defaults to None.
- **kwargs : keyword arguments
Arbitrary keyword arguments. Please refer to the corresponding algorithm for the parameters' key-value pairs.
Note
Some parameters are disabled in the clustering algorithm!
'AgglomerateHierarchicalClustering' : AgglomerateHierarchicalClustering
Note that distance_level is supported and has the same options as affinity. If both parameters are entered, distance_level takes precedence over affinity.
'DBSCAN' : DBSCAN
Note that distance_level is supported and has the same options as metric. If both parameters are entered, distance_level takes precedence over metric.
'GMM' : GaussianMixture
'AcceleratedKMeans' : KMeans
Note that the parameter accelerated is not valid in this function.
'KMeans' : KMeans
'KMedians' : KMedians
'KMedoids' : KMedoids
'SOM' : SOM
'AffinityPropagation' : AffinityPropagation
'SpectralClustering' : SpectralClustering
For more parameter mappings of hana_ml and HANA PAL, please refer to the doc page: Parameter Mappings
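The massive-mode rule described for massive and group_params can be pictured with a small pure-Python sketch. The helper resolve_group_params is hypothetical and not part of hana_ml; it only models the documented behavior that a group with its own entry in group_params does not inherit the original (keyword) parameters. The group names and parameter values are illustrative assumptions.

```python
def resolve_group_params(groups, original_params, group_params=None):
    """Return the parameters that would apply to each group (illustrative only)."""
    group_params = group_params or {}
    resolved = {}
    for group in groups:
        if group in group_params:
            # A group with its own settings does not inherit original parameters.
            resolved[group] = dict(group_params[group])
        else:
            # Groups without an entry fall back to the original parameters.
            resolved[group] = dict(original_params)
    return resolved


# Hypothetical settings: 'metric' is an original parameter,
# 'thread_ratio' is set only for Group_1.
original = {'metric': 'manhattan'}
per_group = {'Group_1': {'thread_ratio': 0.6}}
effective = resolve_group_params(['Group_1', 'Group_2'], original, per_group)
print(effective)
```

In this sketch Group_1 ends up without 'metric', while Group_2 keeps it, which mirrors the rule stated for massive mode.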
- Attributes:
- labels_ : DataFrame
Label assigned to each sample. Also includes the distance between a given point and the cluster center (k-means), the nearest core object (DBSCAN), or the weight vector (SOM), or the probability of a given point belonging to the corresponding cluster (GMM).
- centers_ : DataFrame
Coordinates of cluster centers.
- model_ : DataFrame
Model content.
- statistics_ : DataFrame
Statistics.
- optimal_param_ : DataFrame
Provides optimal parameters selected.
Available only when parameter selection is triggered.
- error_msg_ : DataFrame
Error message. Only valid if massive is True when initializing a 'UnifiedClustering' instance.
Methods
fit(data[, key, features, group_key, ...]): Fit function for unified clustering.
predict(data[, key, group_key, features, model]): Predict with the clustering model.
References
For precomputed distance matrix as input data, please see:
Examples
>>> from hana_ml.algorithms.pal.unified_clustering import UnifiedClustering
>>> kmeans_params = dict(n_clusters=4, init='first_k', max_iter=100,
...                      tol=1.0E-6, thread_ratio=1.0,
...                      distance_level='Euclidean', category_weights=0.5)
>>> ukmeans = UnifiedClustering(func='KMeans', **kmeans_params)
Perform fit():
>>> ukmeans.fit(data=df_train, key='ID')
>>> ukmeans.labels_.collect()
Perform predict():
>>> result = ukmeans.predict(data=df_predict, key='ID')
>>> result.collect()
- fit(data, key=None, features=None, group_key=None, group_params=None, categorical_variable=None, string_variable=None, variable_weight=None)
Fit function for unified clustering.
- Parameters:
- data : DataFrame
Training data.
If a precomputed distance matrix is used as input data, provide the DataFrame in the following structure:
single mode, structured as follows:
1st column : type INTEGER, VARCHAR, or NVARCHAR, left point.
2nd column : type INTEGER, VARCHAR, or NVARCHAR, right point.
3rd column : type DOUBLE, distance.
massive mode, structured as follows:
1st column : type INTEGER, VARCHAR, or NVARCHAR, group ID.
2nd column : type INTEGER, VARCHAR, or NVARCHAR, left point.
3rd column : type INTEGER, VARCHAR, or NVARCHAR, right point.
4th column : type DOUBLE, distance.
- key : str, optional
Name of the ID column. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- group_key : str, optional
Name of the group column. Data type can be INT or NVARCHAR/VARCHAR. This parameter is only valid when massive mode is activated in class instance initialization (i.e. parameter massive is set to True).
Defaults to the first column of data if the index columns of data are not provided. Otherwise, defaults to the first column of the index columns.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- string_variable : str or a list of str, optional
Indicates a string column that stores non-categorical data.
Levenshtein distance is used to calculate similarity between two strings.
Ignored if it is not a string column. Only valid for DBSCAN.
Defaults to None.
- variable_weight : dict, optional
Specifies the weight of a variable participating in distance calculation. The value must be greater than or equal to 0.
Defaults to 1 for variables not specified. Only valid for DBSCAN.
Defaults to None.
- group_params : dict, optional
If massive mode is activated (massive is set to True in class instance initialization), input data for clustering is divided into different groups, with different clustering parameters applied to each. This parameter specifies the parameter values of the chosen clustering algorithm func for the different groups in dict format, where the keys correspond to group_key values and each value is a dict of clustering algorithm parameter assignments.
Valid only when massive is set to True in class instance initialization.
Defaults to None.
- Returns:
- A fitted object of class "UnifiedClustering".
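The column layout described for a precomputed distance matrix can be sketched in plain Python. This is only an illustration of the three-column (single mode) and four-column (massive mode) structure. The helper names are hypothetical, Euclidean distance is an arbitrary choice, and emitting each unordered pair once is an assumption, since this page does not state whether both orderings or self-pairs are expected. Loading the rows into a HANA table is outside the sketch.

```python
from itertools import combinations
from math import dist


def distance_rows(points):
    """Yield (left point, right point, distance) rows for a dict of ID -> coordinates."""
    for (left, p), (right, q) in combinations(sorted(points.items()), 2):
        yield (left, right, dist(p, q))


def massive_distance_rows(grouped_points):
    """Prefix each row with its group ID, matching the four-column massive layout."""
    for group_id, points in sorted(grouped_points.items()):
        for left, right, d in distance_rows(points):
            yield (group_id, left, right, d)


# Single mode: left point, right point, distance.
single = list(distance_rows({1: (0.0, 0.0), 2: (3.0, 4.0), 3: (6.0, 8.0)}))

# Massive mode: group ID, left point, right point, distance.
massive = list(massive_distance_rows({
    'Group_1': {1: (0.0, 0.0), 2: (3.0, 4.0)},
    'Group_2': {1: (1.0, 1.0), 2: (1.0, 2.0)},
}))
```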
- predict(data, key=None, group_key=None, features=None, model=None)
Predict with the clustering model.
Cluster assignment provides a unified interface for assigning data to clusters previously generated by a clustering method, including K-Means, Accelerated K-Means, K-Medians, K-Medoids, DBSCAN, SOM, and GMM.
AgglomerateHierarchicalClustering does not provide a predict function.
- Parameters:
- data : DataFrame
Data to be predicted.
If a precomputed distance matrix is used as input data, provide the DataFrame in the following structure:
single mode, structured as follows:
1st column : type INTEGER, VARCHAR, or NVARCHAR, left point.
2nd column : type INTEGER, VARCHAR, or NVARCHAR, right point.
3rd column : type DOUBLE, distance.
massive mode, structured as follows:
1st column : type INTEGER, VARCHAR, or NVARCHAR, group ID.
2nd column : type INTEGER, VARCHAR, or NVARCHAR, left point.
3rd column : type INTEGER, VARCHAR, or NVARCHAR, right point.
4th column : type DOUBLE, distance.
- key : str, optional
Name of the ID column. Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- group_key : str, optional
Name of the group column. Data type can be INT or NVARCHAR/VARCHAR. If the data type is INT, only parameters specified in group_params in class instance initialization are valid.
This parameter is only valid when massive mode is activated (i.e. massive is set to True in class instance initialization).
Defaults to the first column of data if the index columns of data are not provided. Otherwise, defaults to the first column of the index columns.
- features : a list of str, optional
Names of feature columns in data for prediction.
Defaults to all non-ID columns in data if not provided.
- model : DataFrame, optional
A fitted clustering model.
Defaults to self.model_.
- Returns:
- DataFrame 1
Cluster assignment result, structured as follows:
1st column : Data ID
2nd column : Assigned cluster ID
3rd column : Distance metric between a given point and the assigned cluster. For different functions, this could be:
Distance between a given point and the cluster center (k-means, k-medians, k-medoids)
Distance between a given point and the nearest core object (DBSCAN)
Distance between a given point and the weight vector (SOM)
Probability of a given point belonging to the corresponding cluster (GMM)
- DataFrame 2 (optional)
Error message. Only valid if massive is True when initializing a 'UnifiedClustering' instance.
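To make the three-column result structure concrete, the following is a minimal pure-Python sketch of a k-means-style assignment, where each point is assigned to the nearest cluster center and the distance to that center is reported. It is not the PAL implementation, and it does not cover DBSCAN, SOM, or GMM. The function name, the sample IDs, and the center coordinates are assumptions made for illustration.

```python
from math import dist


def assign_to_nearest_center(rows, centers):
    """Return (data ID, assigned cluster ID, distance) tuples.

    rows: iterable of (data_id, coordinates).
    centers: dict of cluster_id -> coordinates.
    """
    result = []
    for data_id, point in rows:
        # Pick the center with the smallest Euclidean distance to the point.
        cluster_id, center = min(centers.items(),
                                 key=lambda item: dist(point, item[1]))
        result.append((data_id, cluster_id, dist(point, center)))
    return result


# Hypothetical cluster centers and data to be assigned.
centers = {0: (0.0, 0.0), 1: (10.0, 10.0)}
rows = [(101, (3.0, 4.0)), (102, (10.0, 7.0))]
assignments = assign_to_nearest_center(rows, centers)
```

Each tuple in assignments follows the same column order as the cluster assignment result: data ID, assigned cluster ID, and distance to the assigned cluster.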
Inherited Methods from PALBase
In addition to the methods described above, the UnifiedClustering class inherits methods from the PALBase class. Please refer to PAL Base for more details.