KMedoids

class hana_ml.algorithms.pal.clustering.KMedoids(n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)

K-Medoids clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. K-medoids uses the most central observation, known as the medoid. K-Medoids is more robust to noise and outliers.

Parameters:
n_clustersint

Number of groups.

init{'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

  • 'patent': Patent of selecting the init center (US 6,882,998 B1).

Defaults to 'patent'.

max_iterint, optional

Max iterations.

Defaults to 100.

tolfloat, optional

Convergence threshold for exiting iterations.

Defaults to 1.0e-6.

thread_ratiofloat, optional

Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between the item and the cluster center.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power. Only valid when distance_level is 'minkowski'.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

  • 'no': No, normalization will not be applied.

  • 'l1_norm': Yes, for each point X (x1, x2, ..., xn), the normalized

    value will be X'(x1/S,x2/S,...,xn/S), where S = |x1|+|x2|+...|xn|.

  • 'min_max': Yes, for each column C, get the min and max value of C,

    and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.

categorical_variablestr or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

Examples

Input DataFrame df:

>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
...
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Creating a KMedoids instance:

>>> kmedoids = KMedoids(n_clusters=4, init='first_K',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='Euclidean',
...                     thread_ratio=0.3, category_weights=0.5)

Performing fit() and obtain the result:

>>> kmedoids.fit(data=df, key='ID')
>>> kmedoids.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.5    A   1.5
1           1  15.5    D   1.5
2           2  15.5    C  16.5
3           3   1.5    B  16.5

Performing fit_predict():

>>> kmedoids.fit_predict(data=df, key='ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  1.414214
1    1           0  1.000000
...
18  18           1  0.000000
19  19           1  0.930714
Attributes:
cluster_centers_DataFrame

Coordinates of cluster centers.

labels_DataFrame

Cluster assignment and distance to cluster center for each point.

Methods

fit(data[, key, features, categorical_variable])

Fit the model to the training dataset.

fit_predict(data[, key, features, ...])

Perform clustering algorithm and return labels.

get_model_metrics()

Get the model metrics.

get_score_metrics()

Get the score metrics.

fit(data, key=None, features=None, categorical_variable=None)

Fit the model to the training dataset.

Parameters:
dataDataFrame

DataFrame containing the input data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of feature columns. If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

fit_predict(data, key=None, features=None, categorical_variable=None)

Perform clustering algorithm and return labels.

Parameters:
dataDataFrame

DataFrame containing input data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

Returns:
DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as data 's ID column.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.

  • DISTANCE, type DOUBLE, the distance between the given point and the cluster center.

get_model_metrics()

Get the model metrics.

Returns:
DataFrame

The model metrics.

get_score_metrics()

Get the score metrics.

Returns:
DataFrame

The score metrics.

Inherited Methods from PALBase

Besides those methods mentioned above, the KMedoids class also inherits methods from PALBase class, please refer to PAL Base for more details.