KMedoids
- class hana_ml.algorithms.pal.clustering.KMedoids(n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)
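K-Medoids clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. Unlike K-Means, which represents each cluster by the mean of its members, K-Medoids uses the most central observation of the cluster, known as the medoid, as the cluster center. This makes K-Medoids more robust to noise and outliers than K-Means.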
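As an illustration of the idea described above, the following is a minimal pure-Python sketch of an alternating K-Medoids loop. It is not the PAL implementation: the function names `euclidean`, `assign` and `k_medoids` are hypothetical, the initial medoids are simply the first k observations (comparable to init='first_k'), only numeric features with Euclidean distance are handled, and convergence is detected when the medoids stop changing rather than through tol.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, medoids):
    """Label each point with the position of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: euclidean(p, points[medoids[c]]))
        for p in points
    ]

def k_medoids(points, n_clusters, max_iter=100):
    """Return (medoid indices, labels) for a list of numeric points."""
    medoids = list(range(n_clusters))  # first k observations as initial medoids
    for _ in range(max_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for c in range(n_clusters):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster ends up empty
                new_medoids.append(medoids[c])
                continue
            # The medoid is the member with the smallest total distance
            # to all other members of its cluster.
            new_medoids.append(min(
                members,
                key=lambda i: sum(euclidean(points[i], points[j]) for j in members),
            ))
        if new_medoids == medoids:  # medoids unchanged: converged
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)

# Made-up sample data: two well-separated groups of three points each.
points = [(0.5, 0.5), (1.5, 0.5), (1.0, 1.0), (15.5, 1.5), (15.7, 1.6), (15.0, 1.0)]
medoids, labels = k_medoids(points, n_clusters=2)
print(medoids)  # [2, 3]
print(labels)   # [0, 0, 0, 1, 1, 1]
```

Because every cluster center is an actual observation, a single extreme point cannot drag a center away from the bulk of its cluster the way it can shift a mean.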
- Parameters:
- n_clusters : int
Number of clusters.
- init : {'first_k', 'replace', 'no_replace', 'patent'}, optional
Controls how the initial centers are selected:
'first_k': First k observations.
'replace': Random with replacement.
'no_replace': Random without replacement.
'patent': Selects the initial centers using the patented method (US 6,882,998 B1).
Defaults to 'patent'.
- max_iter : int, optional
Maximum number of iterations.
Defaults to 100.
- tol : float, optional
Convergence threshold for exiting iterations.
Defaults to 1.0e-6.
- thread_ratio : float, optional
Controls the proportion of available threads to use, in the range from 0 to 1. A value of 0 uses a single thread, while a value of 1 uses all currently available threads. Values outside this range are ignored, and the function determines the number of threads heuristically.
Defaults to 0.
- distance_level : {'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional
Method used to compute the distance between an observation and a cluster center.
Defaults to 'euclidean'.
- minkowski_power : float, optional
When Minkowski distance is used, this parameter controls the value of power. Only valid when distance_level is 'minkowski'.
Defaults to 3.0.
- category_weights : float, optional
Weight applied to categorical attributes.
Defaults to 0.707.
- normalization : {'no', 'l1_norm', 'min_max'}, optional
Normalization type.
'no': No normalization is applied.
'l1_norm': For each point X (x1, x2, ..., xn), the normalized value is X'(x1/S, x2/S, ..., xn/S), where S = |x1| + |x2| + ... + |xn|.
'min_max': For each column C, the minimum and maximum values of C are determined, and then C[i] = (C[i] - min) / (max - min).
Defaults to 'no'.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
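The distance_level and normalization options listed above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not PAL's code: all function names are hypothetical, the inputs are purely numeric, and the cosine distance is assumed to be 1 minus the cosine similarity. The two normalization helpers follow the formulas given for 'l1_norm' (per point) and 'min_max' (per column).

```python
import math

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, power=3.0):
    """Minkowski distance; power=3.0 mirrors the documented default."""
    return sum(abs(x - y) ** power for x, y in zip(a, b)) ** (1.0 / power)

def euclidean(a, b):
    """Minkowski distance with power 2."""
    return minkowski(a, b, power=2.0)

def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):
    """Assumption: cosine distance taken as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def l1_norm(point):
    """'l1_norm': divide each coordinate by S = |x1| + |x2| + ... + |xn|."""
    s = sum(abs(x) for x in point)
    return [x / s for x in point]

def min_max(column):
    """'min_max': rescale a column so that C[i] = (C[i] - min) / (max - min)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

a, b = (0.0, 0.0), (3.0, 4.0)
print(manhattan(a, b))          # 7.0
print(chebyshev(a, b))          # 4.0
print(l1_norm([1.0, -1.0, 2.0]))  # [0.25, -0.25, 0.5]
print(min_max([2.0, 4.0, 6.0]))   # [0.0, 0.5, 1.0]
```

Note that the sketch does not guard against a zero vector in `cosine` or `l1_norm`, or a constant column in `min_max`; how PAL treats those cases is not stated here.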
Examples
Input DataFrame df:
>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
...
18  18  15.5    D   1.5
19  19  15.7    A   1.6
Creating a KMedoids instance:
>>> from hana_ml.algorithms.pal.clustering import KMedoids
>>> kmedoids = KMedoids(n_clusters=4, init='first_k',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='euclidean',
...                     thread_ratio=0.3, category_weights=0.5)
Performing fit() and obtaining the result:
>>> kmedoids.fit(data=df, key='ID')
>>> kmedoids.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.5    A   1.5
1           1  15.5    D   1.5
2           2  15.5    C  16.5
3           3   1.5    B  16.5
Performing fit_predict():
>>> kmedoids.fit_predict(data=df, key='ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  1.414214
1    1           0  1.000000
...
18  18           1  0.000000
19  19           1  0.930714
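The DISTANCE values of the first two rows in the output above can be checked by hand. The sketch below assumes that a categorical column contributes nothing to the distance when its value matches that of the medoid (both rows and the cluster 0 center have V001 = 'A'), so only the numeric columns V000 and V002 enter the Euclidean distance. The helper name `euclidean` is hypothetical, and rows whose category differs from the medoid's are deliberately left out because the weighting rule is not spelled out here.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# (V000, V002) of the cluster 0 center, copied from cluster_centers_.
medoid_0 = (1.5, 1.5)

# (V000, V002) of the rows with ID 0 and ID 1; both have V001 = 'A'.
rows = {0: (0.5, 0.5), 1: (1.5, 0.5)}

distances = {row_id: round(euclidean(p, medoid_0), 6) for row_id, p in rows.items()}
print(distances)  # {0: 1.414214, 1: 1.0}
```

Both results agree with the DISTANCE column shown for ID 0 and ID 1.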
- Attributes:
- cluster_centers_ : DataFrame
Coordinates of cluster centers.
- labels_ : DataFrame
Cluster assignment and distance to cluster center for each point.
Methods
- fit(data[, key, features, categorical_variable]): Fit the model to the training dataset.
- fit_predict(data[, key, features, ...]): Perform clustering algorithm and return labels.
- get_model_metrics(): Get the model metrics.
- get_score_metrics(): Get the score metrics.
- fit(data, key=None, features=None, categorical_variable=None)
Fit the model to the training dataset.
- Parameters:
- data : DataFrame
DataFrame containing the input data.
- key : str, optional
Name of the ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features : a list of str, optional
Names of feature columns. If features is not provided, it defaults to all non-key columns.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- fit_predict(data, key=None, features=None, categorical_variable=None)
Perform clustering algorithm and return labels.
- Parameters:
- data : DataFrame
DataFrame containing input data.
- key : str, optional
Name of the ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features : a list of str, optional
Names of feature columns. If features is not provided, it defaults to all non-key columns.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- Returns:
- DataFrame
Fit result, structured as follows:
ID column, with the same name and type as data's ID column.
CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.
DISTANCE, type DOUBLE, the distance between the given point and the cluster center.
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
Besides the methods mentioned above, the KMedoids class also inherits methods from the PALBase class. Please refer to PAL Base for more details.