KMedoids

class hana_ml.algorithms.pal.clustering.KMedoids(n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)

K-Medoids clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. K-medoids uses the most central observation, known as the medoid. K-Medoids is more robust to noise and outliers.

Parameters:
n_clustersint

Number of groups.

init{'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

  • 'patent': Patent of selecting the init center (US 6,882,998 B1).

Defaults to 'patent'.

max_iterint, optional

Max iterations.

Defaults to 100.

tolfloat, optional

Convergence threshold for exiting iterations.

Defaults to 1.0e-6.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.

Values between 0 and 1 will use up to that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between the item and the cluster center.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power.

Only valid when distance_level is 'minkowski'.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

  • 'no': No, normalization will not be applied.

  • 'l1_norm': Yes, for each point X (x1, x2, ..., xn), the normalized value will be X'(x1/S,x2/S,...,xn/S), where S = |x1|+|x2|+...|xn|.

  • 'min_max': Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

Defaults to None.

Examples

Input dataframe df1 for clustering:

>>> df1.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Creating a KMedoids instance:

>>> kmedoids = KMedoids(n_clusters=4, init='first_K',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='Euclidean',
...                     thread_ratio=0.3, category_weights=0.5)

Performing fit() on given dataframe:

>>> kmedoids.fit(data=df1, key='ID')
>>> kmedoids.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.5    A   1.5
1           1  15.5    D   1.5
2           2  15.5    C  16.5
3           3   1.5    B  16.5

Performing fit_predict() on given dataframe:

>>> kmedoids.fit_predict(data=df1, key='ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  1.414214
1    1           0  1.000000
2    2           0  0.000000
3    3           0  1.000000
4    4           0  1.207107
5    5           3  1.414214
6    6           3  1.000000
7    7           3  0.000000
8    8           3  1.000000
9    9           3  1.207107
10  10           2  1.000000
11  11           2  1.414214
12  12           2  1.000000
13  13           2  0.000000
14  14           2  1.023335
15  15           1  1.000000
16  16           1  1.414214
17  17           1  1.000000
18  18           1  0.000000
19  19           1  0.930714
Attributes:
cluster_centers_DataFrame

Coordinates of cluster centers.

labels_DataFrame

Cluster assignment and distance to cluster center for each point.

Methods

fit(data[, key, features, categorical_variable])

Perform clustering on input dataset.

fit_predict(data[, key, features, ...])

Perform clustering algorithm and return labels.

fit(data, key=None, features=None, categorical_variable=None)

Perform clustering on input dataset.

Parameters:
dataDataFrame

DataFrame contains input data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER columns that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Defaults to None.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

fit_predict(data, key=None, features=None, categorical_variable=None)

Perform clustering algorithm and return labels.

Parameters:
dataDataFrame

DataFrame containing input data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Defaults to None.

Returns:
DataFrame

Fit result, structured as follows:

  • ID column, with the same name and type as data 's ID column.

  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.

  • DISTANCE, type DOUBLE, the distance between the given point and the cluster center.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides those methods mentioned above, the KMedoids class also inherits methods from PALBase class, please refer to PAL Base for more details.