UnifiedClustering
- class hana_ml.algorithms.pal.unified_clustering.UnifiedClustering(func, massive=False, group_params=None, **kwargs)
The Python wrapper for SAP HANA PAL Unified Clustering function.
The clustering algorithms include:
'AgglomerateHierarchicalClustering'
'DBSCAN'
'GaussianMixture'
'AcceleratedKMeans'
'KMeans'
'KMedians'
'KMedoids'
'SOM'
'AffinityPropagation'
'SpectralClustering'
For GaussianMixture, you must configure the init_mode and n_components or init_centers parameters to define INITIALIZE_PARAMETER in SAP HANA PAL.
Compared to the original KMedians and KMedoids, UnifiedClustering creates a model after training and then performs cluster assignment through that model.
- Parameters:
- func : str
The name of a specified clustering algorithm. The following algorithms are supported:
'AgglomerateHierarchicalClustering'
'DBSCAN'
'GaussianMixture'
'AcceleratedKMeans'
'KMeans'
'KMedians'
'KMedoids'
'SOM'
'AffinityPropagation'
'SpectralClustering'
- massive : bool, optional
Specifies whether or not to use massive mode of unified clustering.
True : massive mode.
False : single mode.
In massive mode, parameters can be set either through group_params or through the original parameters. Original parameters apply to all groups. However, once any parameter is defined for a group in group_params, none of the original parameter settings apply to that group.
An example is as follows:
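A minimal sketch of such a configuration, assuming DBSCAN as the algorithm (it accepts 'metric') and illustrative parameter values. The helper name build_massive_dbscan is hypothetical, and the hana_ml import is deferred so that the parameter dicts can be inspected without a HANA environment:

```python
# Original parameters: apply to every group that has no entry in group_params.
original_params = {'metric': 'manhattan'}

# Group-specific parameters: Group_1 defines its own setting, so none of
# the original parameters (here 'metric') apply to Group_1.
group_params = {'Group_1': {'thread_ratio': 0.6}}


def build_massive_dbscan():
    """Create a massive-mode UnifiedClustering instance (requires hana_ml)."""
    from hana_ml.algorithms.pal.unified_clustering import UnifiedClustering
    return UnifiedClustering(func='DBSCAN',
                             massive=True,
                             group_params=group_params,
                             **original_params)
```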
In this example, because 'thread_ratio' is set in group_params for Group_1, the 'metric' setting does not apply to Group_1.
Defaults to False.
- group_params : dict, optional
If massive mode is activated (massive is True), the input data for clustering is divided into groups, and different clustering parameters can be applied to each group. This parameter specifies the parameter values of the chosen clustering algorithm func for the different groups in a dict format, where each key corresponds to a group in group_key and each value is a dict of parameter value assignments for the clustering algorithm.
An example is as follows:
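A minimal sketch, assuming KMeans as the algorithm and two groups named 'Group_1' and 'Group_2' (the group names, the parameter values, and the helper name build_grouped_kmeans are illustrative assumptions, not taken from the library documentation):

```python
# Keys are group IDs from the group_key column; values are dicts of
# algorithm parameters that apply to that group only.
group_params = {
    'Group_1': {'n_clusters': 3, 'max_iter': 100},
    'Group_2': {'n_clusters': 5, 'init': 'first_k'},
}


def build_grouped_kmeans():
    """Create a massive-mode KMeans instance with per-group settings (requires hana_ml)."""
    from hana_ml.algorithms.pal.unified_clustering import UnifiedClustering
    return UnifiedClustering(func='KMeans',
                             massive=True,
                             group_params=group_params)
```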
Valid only when massive is True. Defaults to None.
- **kwargs : keyword arguments
Arbitrary keyword arguments. Please refer to the corresponding algorithm for the parameters' key-value pairs.
Note
Some parameters are disabled in the clustering algorithm!
'AgglomerateHierarchicalClustering' : AgglomerateHierarchicalClustering
Note that distance_level is supported and has the same options as affinity. If both parameters are entered, distance_level takes precedence over affinity.
'DBSCAN' : DBSCAN
Note that distance_level is supported and has the same options as metric. If both parameters are entered, distance_level takes precedence over metric.
'GMM' : GaussianMixture
'AcceleratedKMeans' : KMeans
Note that the parameter accelerated is not valid in this function.
'KMeans' : KMeans
'KMedians' : KMedians
'KMedoids' : KMedoids
'SOM' : SOM
'AffinityPropagation' : AffinityPropagation
'SpectralClustering' : SpectralClustering
For more parameter mappings of hana_ml and HANA PAL, please refer to the doc page: Parameter Mappings
References
For precomputed distance matrix as input data, please see:
Examples
>>> from hana_ml.algorithms.pal.unified_clustering import UnifiedClustering
>>> kmeans_params = dict(n_clusters=4, init='first_k', max_iter=100, tol=1.0E-6, thread_ratio=1.0, distance_level='Euclidean', category_weights=0.5)
>>> ukmeans = UnifiedClustering(func='KMeans', **kmeans_params)
Perform fit():
>>> ukmeans.fit(data=df_train, key='ID')
>>> ukmeans.labels_.collect()
Perform predict():
>>> result = ukmeans.predict(data=df_predict, key='ID')
>>> result.collect()
- Attributes:
- labels_ : DataFrame
Label assigned to each sample. Also includes the distance between a given point and the cluster center (k-means), nearest core object (DBSCAN), or weight vector (SOM), or the probability of a given point belonging to the corresponding cluster (GMM).
- centers_ : DataFrame
Coordinates of cluster centers.
- model_ : DataFrame
Model content.
- statistics_ : DataFrame
Statistics.
- optimal_param_ : DataFrame
Provides optimal parameters selected.
Available only when parameter selection is triggered.
- error_msg_ : DataFrame
Error message. Only valid if massive is True when initializing a 'UnifiedClustering' instance.
Methods
fit(data[, key, features, group_key, ...]) : Fit function for unified clustering.
get_model_metrics() : Get the model metrics.
get_score_metrics() : Get the score metrics.
predict(data[, key, group_key, features, model]) : Predict with the clustering model.
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
- fit(data, key=None, features=None, group_key=None, group_params=None, categorical_variable=None, string_variable=None, variable_weight=None)
Fit function for unified clustering.
- Parameters:
- data : DataFrame
Training data.
If a precomputed distance matrix is used as input data, please provide the DataFrame in the following structure:
Single mode:
- 1st column, type INTEGER, VARCHAR, or NVARCHAR, left point.
- 2nd column, type INTEGER, VARCHAR, or NVARCHAR, right point.
- 3rd column, type DOUBLE, distance.
Massive mode:
- 1st column, type INTEGER, VARCHAR, or NVARCHAR, group ID.
- 2nd column, type INTEGER, VARCHAR, or NVARCHAR, left point.
- 3rd column, type INTEGER, VARCHAR, or NVARCHAR, right point.
- 4th column, type DOUBLE, distance.
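The column layout above can be sketched in plain Python. This is only an illustration under stated assumptions: Euclidean distance is chosen arbitrarily, each unordered pair is emitted once (whether PAL also expects the reverse orientation or the diagonal is not stated here), the group ID 'Group_1' is hypothetical, and the rows would still need to be loaded into a HANA table before being passed as data:

```python
import itertools
import math


def pairwise_distance_rows(points):
    """Return (left point, right point, distance) rows for every unordered pair."""
    rows = []
    for (left_id, left), (right_id, right) in itertools.combinations(points.items(), 2):
        rows.append((left_id, right_id, math.dist(left, right)))
    return rows


# Point IDs mapped to coordinates (illustrative data).
points = {1: (0.0, 0.0), 2: (3.0, 4.0), 3: (6.0, 8.0)}

# Single mode: left point, right point, distance.
single_mode_rows = pairwise_distance_rows(points)

# Massive mode: the same rows with the group ID prepended as the 1st column.
massive_mode_rows = [('Group_1',) + row for row in single_mode_rows]
```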
- key : str, optional
Name of the ID column. Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not set, key must be provided.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- group_key : str, optional
The column of group_key. Data type can be INT or NVARCHAR/VARCHAR. If data type is INT, only parameters set in the group_params are valid.
This parameter is only valid when massive mode is activated in class instance initialization (i.e. parameter massive is set as True).
Defaults to the first column of data if data has no index columns. Otherwise, defaults to the first index column.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- string_variable : str or a list of str, optional
Indicates a string column that stores non-categorical data.
Levenshtein distance is used to calculate similarity between two strings.
Ignored if it is not a string column. Only valid for DBSCAN.
Defaults to None.
- variable_weight : dict, optional
Specifies the weight of a variable participating in distance calculation. The value must be greater than or equal to 0.
Defaults to 1 for variables not specified. Only valid for DBSCAN.
Defaults to None.
- group_params : dict, optional
If massive mode is activated (massive is set as True in class instance initialization), the input data for clustering is divided into groups, and different clustering parameters can be applied to each group. This parameter specifies the parameter values of the chosen clustering algorithm func for the different groups in a dict format, where each key corresponds to a group in group_key and each value is a dict of parameter value assignments for the clustering algorithm.
An example is as follows:
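A minimal sketch of passing group_params at fit time. It assumes that uc was created with massive=True and that df_train is a hana_ml DataFrame with columns named 'GROUP_ID' and 'ID'; the column names, group names, parameter values, and the helper name fit_per_group are all hypothetical:

```python
# Per-group parameter assignments; keys must match values of the group column.
group_params = {
    'Group_1': {'n_clusters': 3},
    'Group_2': {'n_clusters': 4},
}


def fit_per_group(uc, df_train):
    """Fit a massive-mode instance and return its labels and error messages."""
    uc.fit(data=df_train,
           key='ID',
           group_key='GROUP_ID',
           group_params=group_params)
    # error_msg_ is documented as valid only in massive mode.
    return uc.labels_, uc.error_msg_
```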
Valid only when massive is set as True in class instance initialization.
Defaults to None.
- Returns:
- A fitted object of class "UnifiedClustering".
- predict(data, key=None, group_key=None, features=None, model=None)
Predict with the clustering model.
Cluster assignment is a unified interface to call a cluster assignment algorithm to assign data to clusters that are previously generated by some clustering methods, including K-Means, Accelerated K-Means, K-Medians, K-Medoids, DBSCAN, SOM, and GMM.
AgglomerateHierarchicalClustering does not provide a predict function.
- Parameters:
- data : DataFrame
Data to be predicted.
If a precomputed distance matrix is used as input data, please provide the DataFrame in the following structure:
Single mode:
- 1st column, type INTEGER, VARCHAR, or NVARCHAR, left point.
- 2nd column, type INTEGER, VARCHAR, or NVARCHAR, right point.
- 3rd column, type DOUBLE, distance.
Massive mode:
- 1st column, type INTEGER, VARCHAR, or NVARCHAR, group ID.
- 2nd column, type INTEGER, VARCHAR, or NVARCHAR, left point.
- 3rd column, type INTEGER, VARCHAR, or NVARCHAR, right point.
- 4th column, type DOUBLE, distance.
- key : str, optional
Name of the ID column. Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not set, key must be provided.
- group_key : str, optional
The column of group_key. Data type can be INT or NVARCHAR/VARCHAR. If data type is INT, only parameters specified in the group_params in class instance initialization are valid.
This parameter is only valid when massive mode is activated (i.e. massive is set as True in class instance initialization).
Defaults to the first column of data if data has no index columns. Otherwise, defaults to the first index column.
- features : a list of str, optional
Names of feature columns in data for prediction.
Defaults to all non-ID columns in data if not provided.
- model : DataFrame, optional
A fitted clustering model.
Defaults to self.model_.
- Returns:
- DataFrame 1
Cluster assignment result, structured as follows:
1st column : Data ID
2nd column : Assigned cluster ID
3rd column : Distance metric between a given point and the assigned cluster. For different functions, this could be:
Distance between a given point and the cluster center(k-means, k-medians, k-medoids)
Distance between a given point and the nearest core object(DBSCAN)
Distance between a given point and the weight vector(SOM)
Probability of a given point belonging to the corresponding cluster(GMM)
- DataFrame 2 (optional)
Error message. Only valid if massive is True when initializing a 'UnifiedClustering' instance.
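The two return shapes described above can be handled as in the following sketch. It assumes that predict returns a single DataFrame in single mode and a pair of DataFrames (result, error message) in massive mode; the helper name assign_clusters and the column names 'ID' and 'GROUP_ID' are hypothetical:

```python
def assign_clusters(uc, df_predict, massive=False):
    """Run predict and return (assignments, error messages or None)."""
    if massive:
        # Massive mode: the error message DataFrame is returned alongside the result.
        result, error_msg = uc.predict(data=df_predict,
                                       key='ID',
                                       group_key='GROUP_ID')
        return result, error_msg
    # Single mode: only the cluster assignment result is returned.
    result = uc.predict(data=df_predict, key='ID')
    return result, None
```

The first element holds the data ID, the assigned cluster ID, and the distance or probability column; calling collect() on it fetches the rows to the client.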
Inherited Methods from PALBase
Besides the methods mentioned above, the UnifiedClustering class also inherits methods from the PALBase class. Please refer to PAL Base for more details.