KMeans
- class hana_ml.algorithms.pal.clustering.KMeans(n_clusters=None, n_clusters_min=None, n_clusters_max=None, init=None, max_iter=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None, tol=None, memory_mode=None, accelerated=False, use_fast_library=None, use_float=None)
K-Means model that handles clustering problems.
- Parameters
- n_clusters: int, optional
Number of clusters. If this parameter is not specified, you must specify n_clusters_min and n_clusters_max instead.
- n_clusters_min: int, optional
Cluster range minimum.
- n_clusters_max: int, optional
Cluster range maximum.
- init: {'first_k', 'replace', 'no_replace', 'patent'}, optional
Controls how the initial centers are selected:
'first_k': First k observations.
'replace': Random with replacement.
'no_replace': Random without replacement.
'patent': The initial-center selection method described in US patent 6,882,998 B1.
Defaults to 'patent'.
- max_iter: int, optional
Maximum number of iterations.
Defaults to 100.
- thread_ratio: float, optional
Specifies the ratio of the total number of threads that can be used by this function.
The value ranges from 0 to 1, where 0 means using only 1 thread and 1 means using at most all the currently available threads.
Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to 0.
- distance_level: {'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional
Method used to compute the distance between an item and a cluster center.
'cosine' is only valid when accelerated is False.
Defaults to 'euclidean'.
- minkowski_power: float, optional
When Minkowski distance is used, this parameter controls the value of the power.
Only valid when distance_level is 'minkowski'.
Defaults to 3.0.
- category_weights: float, optional
Weight of categorical attributes.
Defaults to 0.707.
- normalization: {'no', 'l1_norm', 'min_max'}, optional
Normalization type.
'no': No normalization is applied.
'l1_norm': For each point X = (x1, x2, ..., xn), the normalized value is X' = (x1/S, x2/S, ..., xn/S), where S = |x1| + |x2| + ... + |xn|.
'min_max': For each column C, compute the minimum and maximum of C, then set C[i] = (C[i] - min) / (max - min).
Defaults to 'no'.
- categorical_variable: str or a list of str, optional
Specifies INTEGER column(s) in the data that should be treated as categorical.
Defaults to None.
- tol: float, optional
Convergence threshold for exiting iterations.
Only valid when accelerated is False.
Defaults to 1.0e-6.
- memory_mode: {'auto', 'optimize-speed', 'optimize-space'}, optional
Indicates the memory mode that the algorithm uses.
'auto': Chosen by the algorithm.
'optimize-speed': Prioritizes speed.
'optimize-space': Prioritizes memory.
Only valid when accelerated is True.
Defaults to 'auto'.
- accelerated: bool, optional
Indicates whether to use techniques such as caching to accelerate the calculation:
If True, the calculation is accelerated.
If False, the calculation is not accelerated.
Defaults to False.
- use_fast_library: bool, optional
If True, vectorized accelerated operations are used.
Defaults to False.
- use_float: bool, optional
Selects the floating-point type used:
False: double
True: float
Only valid when use_fast_library is True.
Defaults to True.
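The normalization and distance_level options above are defined by simple formulas. The following is a minimal plain-Python sketch of those formulas, not the PAL implementation. The function names are hypothetical, and the handling of an all-zero point, a constant column, and the cosine distance (taken here as 1 minus the cosine similarity) are assumptions.

```python
import math


def l1_normalize(point):
    """'l1_norm': divide each coordinate by S = |x1| + ... + |xn|."""
    s = sum(abs(x) for x in point)
    # Assumption: an all-zero point is returned unchanged (avoids division by zero).
    return list(point) if s == 0 else [x / s for x in point]


def min_max_normalize(column):
    """'min_max': rescale a column via (C[i] - min) / (max - min)."""
    lo, hi = min(column), max(column)
    # Assumption: a constant column is mapped to all zeros.
    if hi == lo:
        return [0.0 for _ in column]
    return [(c - lo) / (hi - lo) for c in column]


def distance(a, b, level="euclidean", minkowski_power=3.0):
    """Distance between two numeric points for a given distance_level name."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if level == "manhattan":
        return sum(diffs)
    if level == "euclidean":
        return math.sqrt(sum(d * d for d in diffs))
    if level == "minkowski":
        return sum(d ** minkowski_power for d in diffs) ** (1.0 / minkowski_power)
    if level == "chebyshev":
        return max(diffs)
    if level == "cosine":
        # Assumption: cosine distance = 1 - cosine similarity, non-zero vectors only.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / (norm_a * norm_b)
    raise ValueError("unknown distance level: %r" % level)
```

For example, distance([0.0, 0.0], [3.0, 4.0]) gives 5.0 with the default 'euclidean' level, 7.0 with 'manhattan', and 4.0 with 'chebyshev'.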
Examples
Input dataframe df for K-Means:
>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6
Create a KMeans instance:
>>> from hana_ml.algorithms.pal import clustering
>>> km = clustering.KMeans(n_clusters=4, init='first_k',
...                        max_iter=100, tol=1.0E-6, thread_ratio=0.2,
...                        distance_level='euclidean',
...                        category_weights=0.5)
Perform fit_predict:
>>> labels = km.fit_predict(data=df, key='ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  0.891088           0.944370
1    1           0  0.863917           0.942478
2    2           0  0.806252           0.946288
3    3           0  0.835684           0.944942
4    4           0  0.744571           0.950234
5    5           3  0.891088           0.940733
6    6           3  0.835684           0.944412
7    7           3  0.806252           0.946519
8    8           3  0.863917           0.946121
9    9           3  0.744571           0.949899
10  10           2  0.825527           0.945092
11  11           2  0.933886           0.937902
12  12           2  0.881692           0.945008
13  13           2  0.764318           0.949160
14  14           2  0.923456           0.939283
15  15           1  0.901684           0.940436
16  16           1  0.976885           0.939386
17  17           1  0.818178           0.945878
18  18           1  0.722799           0.952170
19  19           1  1.102342           0.925679
Input dataframe df for Accelerated K-Means:
>>> df = conn.table("PAL_ACCKMEANS_DATA_TBL")
>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A     0
1    1   1.5    A     0
2    2   1.5    A     1
3    3   0.5    A     1
4    4   1.1    B     1
5    5   0.5    B    15
6    6   1.5    B    15
7    7   1.5    B    16
8    8   0.5    B    16
9    9   1.2    C    16
10  10  15.5    C    15
11  11  16.5    C    15
12  12  16.5    C    16
13  13  15.5    C    16
14  14  15.6    D    16
15  15  15.5    D     0
16  16  16.5    D     0
17  17  16.5    D     1
18  18  15.5    D     1
19  19  15.7    A     1
Create an accelerated KMeans instance:
>>> akm = clustering.KMeans(init='first_k',
...                         thread_ratio=0.5, n_clusters=4,
...                         distance_level='euclidean',
...                         max_iter=100, category_weights=0.5,
...                         categorical_variable=['V002'],
...                         accelerated=True)
Perform fit_predict:
>>> labels = akm.fit_predict(data=df, key='ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  1.198938           0.006767
1    1           0  1.123938           0.068899
2    2           3  0.500000           0.572506
3    3           3  0.500000           0.598267
4    4           0  0.621517           0.229945
5    5           0  1.037500           0.308333
6    6           0  0.962500           0.358333
7    7           0  0.895513           0.402992
8    8           0  0.970513           0.352992
9    9           0  0.823938           0.313385
10  10           1  1.038276           0.931555
11  11           1  1.178276           0.927130
12  12           1  1.135685           0.929565
13  13           1  0.995685           0.934165
14  14           1  0.849615           0.944359
15  15           1  0.995685           0.934548
16  16           1  1.135685           0.929950
17  17           1  1.089615           0.932769
18  18           1  0.949615           0.937555
19  19           1  0.915565           0.937717
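To make the roles of init='first_k', max_iter and tol more concrete, here is a minimal plain-Python sketch of the standard Lloyd iteration on numeric data. It is not the PAL implementation and does not handle categorical columns. The function names are hypothetical, and the stopping rule (largest center shift not exceeding tol) is an assumption, since the exact convergence criterion is not stated here.

```python
import math


def _euclidean(p, q):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans_first_k(points, n_clusters, max_iter=100, tol=1.0e-6):
    """Lloyd iteration with the first k observations as initial centers."""
    # init='first_k': the first k observations are the starting centers.
    centers = [list(p) for p in points[:n_clusters]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(n_clusters), key=lambda c: _euclidean(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[c])
        shift = max(_euclidean(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        # Assumed stopping rule: stop once no center moved more than tol.
        if shift <= tol:
            break
    return labels, centers
```

With points = [(0, 0), (10, 10), (0, 1), (10, 11)] and n_clusters=2, the sketch starts from the centers (0, 0) and (10, 10) and stops on the second pass, because the centers no longer move.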
- Attributes
- labels_: DataFrame
Label assigned to each sample.
- cluster_centers_: DataFrame
Coordinates of cluster centers.
- model_: DataFrame
Model content.
- statistics_: DataFrame
Statistic values.
Methods
create_model_state([model, function, ...]): Create PAL model state.
delete_model_state([state]): Delete PAL model state.
fit(data[, key, features, categorical_variable]): Fit the model to the given training dataset.
fit_predict(data[, key, features, ...]): Fit with the dataset and return the labels.
predict(data[, key, features]): Assign clusters to data based on a fitted model.
set_model_state(state): Set the model state by state information.
- fit(data, key=None, features=None, categorical_variable=None)
Fit the model to the given training dataset.
- Parameters
- data: DataFrame
DataFrame containing the data.
- key: str, optional
Name of ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be specified.
- features: a list of str, optional
Names of feature columns.
If features is not provided, it defaults to all non-key columns.
- categorical_variable: str or a list of str, optional
Specifies INTEGER column(s) that should be treated as categorical.
Other INTEGER columns will be treated as continuous.
Defaults to None.
- Returns
- A fitted object of class "KMeans".
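The defaulting rules for key and features can be summarized in a few lines. The helper below is a hypothetical illustration of those rules on plain column names; it is not part of hana_ml, and the index argument stands in for data.index.

```python
def resolve_columns(columns, key=None, features=None, index=None):
    """Apply the documented defaults for key and features.

    columns: all column names of the data.
    index: name of the index column of the data, or None if not set.
    """
    if key is None:
        # key defaults to the index column when one is set.
        key = index
    if key is None:
        raise ValueError("key must be specified when the data has no index column")
    if features is None:
        # features defaults to all non-key columns.
        features = [c for c in columns if c != key]
    return key, features
```

Called with the columns of the example data and key='ID', this returns 'ID' together with ['V000', 'V001', 'V002'].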
- fit_predict(data, key=None, features=None, categorical_variable=None)
Fit with the dataset and return the labels.
- Parameters
- data: DataFrame
DataFrame containing the data.
- key: str, optional
Name of ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be specified.
- features: a list of str, optional
Names of feature columns.
If features is not provided, it defaults to all non-key columns.
- categorical_variable: str or a list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
Defaults to None.
- Returns
- DataFrame
Label assigned to each sample.
- predict(data, key=None, features=None)
Assign clusters to data based on a fitted model.
The output of this method has a different structure from that of fit_predict(): it contains only the three columns listed under Returns, without SLIGHT_SILHOUETTE.
- Parameters
- data: DataFrame
Data points to match against computed clusters.
This dataframe's column structure should match that of the data used for fit().
- key: str, optional
Name of ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be specified.
- features: a list of str, optional
Names of feature columns.
If features is not provided, it defaults to all non-key columns.
- Returns
- DataFrame
Cluster assignment results, with 3 columns:
Data point ID, with name and type taken from the input ID column.
CLUSTER_ID, INTEGER type, representing the cluster the data point is assigned to.
DISTANCE, DOUBLE type, representing the distance between the data point and the cluster center.
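The three-column result described above can be pictured as a nearest-center lookup. The following is a minimal sketch for numeric points with Euclidean distance. The function name and the input layout are hypothetical, and the sketch is not how PAL performs the assignment internally.

```python
import math


def assign_clusters(rows, centers):
    """Return (ID, CLUSTER_ID, DISTANCE) tuples for numeric points.

    rows: iterable of (row_id, point) pairs.
    centers: list of cluster-center points; the list position is the cluster ID.
    """
    result = []
    for row_id, point in rows:
        # Euclidean distance from this point to every center.
        distances = [
            math.sqrt(sum((x - c) ** 2 for x, c in zip(point, center)))
            for center in centers
        ]
        # The nearest center determines the cluster ID.
        cluster_id = min(range(len(centers)), key=distances.__getitem__)
        result.append((row_id, cluster_id, distances[cluster_id]))
    return result
```

For instance, with centers (0, 0) and (3, 0), the point (3, 4) is assigned to cluster 1 at distance 4.0, since its distance to cluster 0 is 5.0.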
- create_model_state(model=None, function=None, pal_funcname='PAL_CLUSTER_ASSIGNMENT', state_description=None, force=False)
Create PAL model state.
- Parameters
- model: DataFrame, optional
Specifies the model for the AFL state.
Defaults to self.model_.
- function: str, optional
Specifies the function in the unified API.
A placeholder parameter, not effective for cluster assignment.
- pal_funcname: int or str, optional
PAL function name. Must be a valid PAL procedure that supports model state.
Defaults to 'PAL_CLUSTER_ASSIGNMENT'.
- state_description: str, optional
Description of the state as model container.
Defaults to None.
- force: bool, optional
If True, the existing state is deleted.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters
- state: DataFrame, optional
Specifies the state.
Defaults to self.state.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
- set_model_state(state)
Set the model state by state information.
- Parameters
- state: DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values according to NAME.
If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
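The two accepted forms of state carry the same information. The sketch below checks a dict for the required keys and lays it out as NAME/VALUE pairs, mirroring the DataFrame structure described above. The helper name is hypothetical and the values in the usage note are placeholders. The helper only illustrates the expected shape and is not part of hana_ml.

```python
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def state_dict_to_rows(state):
    """Validate a state dict and return it as (NAME, VALUE) string pairs."""
    missing = [k for k in REQUIRED_STATE_KEYS if k not in state]
    if missing:
        raise KeyError("state is missing required keys: " + ", ".join(missing))
    # VALUE is a VARCHAR column, so every value is rendered as a string.
    return [(name, str(state[name])) for name in REQUIRED_STATE_KEYS]
```

A dict such as {"STATE_ID": "abc", "HINT": "h", "HOST": "localhost", "PORT": 30015} (placeholder values) yields four pairs in the order STATE_ID, HINT, HOST, PORT, with the port rendered as the string "30015".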