KMeans

class hana_ml.algorithms.pal.clustering.KMeans(n_clusters=None, n_clusters_min=None, n_clusters_max=None, init=None, max_iter=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None, tol=None, memory_mode=None, accelerated=False, use_fast_library=None, use_float=None)

K-means is one of the simplest and most commonly used unsupervised machine learning algorithms for partitioning a data set into K distinct, non-overlapping clusters based on the distances between the center of the cluster (centroid) and the data points.

Parameters:
n_clustersint, optional

Specifies the number of clusters (k) required. The acceptable range is between 2 to the number of training records.

If this parameter is not specified, you must specify the range of k using n_clusters_min'' and ``n_clusters_max parameters instead. Then the algorithm will iterate through the range and return the k with the highest slight silhouette.

n_clusters_minint, optional

Provides the lower boundary of the range that k falls in.

You must specify either an exact value (n_clusters) or a range (n_clusters_min'' and ``n_clusters_max) for k. If both are specified, the exact value will be used.

No default value.

n_clusters_maxint, optional

Provides the upper boundary of the range that k falls in.

You must specify either an exact value (n_clusters) or a range (n_clusters_min'' and ``n_clusters_max) for k. If both are specified, the exact value will be used.

No default value.

init{'first_k', 'replace', 'no_replace', 'patent'}, optional

Governs the selection of initial cluster centers:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

  • 'patent': Patent of selecting the init center (US 6,882,998 B1).

Defaults to 'patent'.

max_iterint, optional

Specifies the maximal number of iterations to be performed.

Defaults to 100.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between the item and the cluster center.

'cosine' is only applicable when accelerated is set to False.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power.

Only valid when distance_level is 'minkowski'.

Defaults to 3.0.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

  • 'no': No normalization will be applied.

  • 'l1_norm': Yes, for each point X (x1, x2, ..., xn), the normalized value will be X'(x1 /S,x2 /S,...,xn /S), where S = |x1|+|x2|+...|xn|.

  • 'min_max': Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

Defaults to None.

tolfloat, optional

Threshold (actual value) for exiting the iterations.

Only valid when accelerated is False.

Defaults to 1.0e-6.

memory_mode{'auto', 'optimize-speed', 'optimize-space'}, optional

Indicates the memory mode that the algorithm uses.

  • 'auto': Chosen by algorithm.

  • 'optimize-speed': Prioritizes speed.

  • 'optimize-space': Prioritizes memory.

Only applicable when accelerated is set to True.

Defaults to 'auto'.

acceleratedbool, optional

Indicates whether to use technology like cache to accelerate the calculation process:

  • If True, the calculation process will be accelerated.

  • If False, the calculation process will not be accelerated.

Defaults to False.

use_fast_librarybool, optional

Use vectorized accelerated operation when it is set to True.

Defaults to False.

use_floatbool, optional

Floating point type:

  • True: float

  • False: double

Only valid when use_fast_library is set to True.

Defaults to True.

Examples

Input dataframe df for K Means:

>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Create a KMeans instance:

>>> km = clustering.KMeans(n_clusters=4, init='first_k',
...                        max_iter=100, tol=1.0E-6, thread_ratio=0.2,
...                        distance_level='Euclidean', category_weights=0.5)

Perform fit_predict():

>>> labels = km.fit_predict(data=df, 'ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  0.891088           0.944370
1    1           0  0.863917           0.942478
2    2           0  0.806252           0.946288
3    3           0  0.835684           0.944942
4    4           0  0.744571           0.950234
5    5           3  0.891088           0.940733
6    6           3  0.835684           0.944412
7    7           3  0.806252           0.946519
8    8           3  0.863917           0.946121
9    9           3  0.744571           0.949899
10  10           2  0.825527           0.945092
11  11           2  0.933886           0.937902
12  12           2  0.881692           0.945008
13  13           2  0.764318           0.949160
14  14           2  0.923456           0.939283
15  15           1  0.901684           0.940436
16  16           1  0.976885           0.939386
17  17           1  0.818178           0.945878
18  18           1  0.722799           0.952170
19  19           1  1.102342           0.925679

Input dataframe df for Accelerated K-Means :

>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A     0
1    1   1.5    A     0
2    2   1.5    A     1
3    3   0.5    A     1
4    4   1.1    B     1
5    5   0.5    B    15
6    6   1.5    B    15
7    7   1.5    B    16
8    8   0.5    B    16
9    9   1.2    C    16
10  10  15.5    C    15
11  11  16.5    C    15
12  12  16.5    C    16
13  13  15.5    C    16
14  14  15.6    D    16
15  15  15.5    D     0
16  16  16.5    D     0
17  17  16.5    D     1
18  18  15.5    D     1
19  19  15.7    A     1

Create Accelerated Kmeans instance:

>>> akm = clustering.KMeans(init='first_k',
...                         thread_ratio=0.5, n_clusters=4,
...                         distance_level='euclidean',
...                         max_iter=100, category_weights=0.5,
...                         categorical_variable=['V002'],
...                         accelerated=True)

Perform fit_predict() and obtain the result:

>>> labels = akm.fit_predict(data=df, key='ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  1.198938           0.006767
1    1           0  1.123938           0.068899
2    2           3  0.500000           0.572506
3    3           3  0.500000           0.598267
4    4           0  0.621517           0.229945
5    5           0  1.037500           0.308333
6    6           0  0.962500           0.358333
7    7           0  0.895513           0.402992
8    8           0  0.970513           0.352992
9    9           0  0.823938           0.313385
10  10           1  1.038276           0.931555
11  11           1  1.178276           0.927130
12  12           1  1.135685           0.929565
13  13           1  0.995685           0.934165
14  14           1  0.849615           0.944359
15  15           1  0.995685           0.934548
16  16           1  1.135685           0.929950
17  17           1  1.089615           0.932769
18  18           1  0.949615           0.937555
19  19           1  0.915565           0.937717
Attributes:
labels_DataFrame

Label assigned to each sample.

cluster_centers_DataFrame

Coordinates of cluster centers.

model_DataFrame

Model content.

statistics_DataFrame

Statistic value.

Methods

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

fit(data[, key, features, categorical_variable])

Fit the model when given training dataset.

fit_predict(data[, key, features, ...])

Fit with the dataset and return the labels.

predict(data[, key, features])

Assign clusters to data based on a fitted model.

set_model_state(state)

Set the model state by state information.

fit(data, key=None, features=None, categorical_variable=None)

Fit the model when given training dataset.

Parameters:
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column. Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Defaults to None.

Returns:
A fitted object of class "KMeans".
fit_predict(data, key=None, features=None, categorical_variable=None)

Fit with the dataset and return the labels.

Parameters:
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Defaults to None.

Returns:
DataFrame

Label assigned to each sample.

predict(data, key=None, features=None)

Assign clusters to data based on a fitted model.

The output structure of this method does not match that of fit_predict().

Parameters:
dataDataFrame

Data points to match against computed clusters.

This dataframe's column structure should match that of the data used for fit().

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional.

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

Returns:
DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, INTEGER type, representing the cluster the data point is assigned to.

  • DISTANCE, DOUBLE type, representing the distance between the data point and the cluster center.

create_model_state(model=None, function=None, pal_funcname='PAL_CLUSTER_ASSIGNMENT', state_description=None, force=False)

Create PAL model state.

Parameters:
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

A placeholder parameter, not effective for cluster assignment.

pal_funcnameint or str, optional

PAL function name. Must be a valid PAL procedure that supports model state.

Defaults to 'PAL_CLUSTER_ASSIGNMENT'.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True it will delete the existing state.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters:
stateDataFrame, optional

Specified the state.

Defaults to self.state.

set_model_state(state)

Set the model state by state information.

Parameters:
state: DataFrame or dict

If state is DataFrame, it has the following structure:

  • NAME: VARCHAR(100), it mush have STATE_ID, HINT, HOST and PORT.

  • VALUE: VARCHAR(1000), the values according to NAME.

If state is dict, the key must have STATE_ID, HINT, HOST and PORT.

Inherited Methods from PALBase

Besides those methods mentioned above, the KMeans class also inherits methods from PALBase class, please refer to PAL Base for more details.