KMedians
- class hana_ml.algorithms.pal.clustering.KMedians(n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)
K-Medians clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. It uses medians of each feature to calculate cluster centers.
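The description above can be sketched in plain Python. This is a minimal illustration of the general K-medians idea for numeric features only, not the PAL implementation. The function name `k_medians`, the sample points, the handling of empty clusters, and the use of the largest center shift as the convergence measure are all assumptions made for the example. It uses first-k initialization and Euclidean distance.

```python
import math
from statistics import median


def k_medians(points, k, max_iter=100, tol=1.0e-6):
    """Cluster numeric points (hypothetical helper); returns (centers, labels)."""
    # First-k initialization: the first k observations become the initial centers.
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center (Euclidean distance).
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center becomes the per-feature median of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([median(col) for col in zip(*members)])
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
        # Assumption: stop once no center moves by more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, labels


# Illustrative data: two well-separated groups of five points each.
sample = [
    (0.5, 0.5), (15.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5),
    (1.1, 1.2), (16.5, 0.5), (16.5, 1.5), (15.5, 1.5), (15.7, 1.6),
]
centers, labels = k_medians(sample, k=2)
print(centers)
print(labels)
```

Because each center coordinate is a median of observed values, every coordinate of a center coincides with a coordinate of some member point when the cluster has an odd number of members.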
- Parameters
- n_clusters : int
Number of groups.
- init : {'first_k', 'replace', 'no_replace', 'patent'}, optional
Controls how the initial centers are selected:
'first_k': First k observations.
'replace': Random with replacement.
'no_replace': Random without replacement.
'patent': Selects the initial centers using a patented method (US 6,882,998 B1).
Defaults to 'patent'.
- max_iter : int, optional
Maximum number of iterations.
Defaults to 100.
- tol : float, optional
Convergence threshold for exiting iterations.
Defaults to 1.0e-6.
- thread_ratio : float, optional
Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
- distance_level : {'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional
Method used to compute the distance between an item and a cluster center.
Defaults to 'euclidean'.
- minkowski_power : float, optional
When Minkowski distance is used, this parameter controls the value of power.
Only valid when distance_level is 'minkowski'.
Defaults to 3.0.
- category_weights : float, optional
Weight of categorical attributes.
Defaults to 0.707.
- normalization : {'no', 'l1_norm', 'min_max'}, optional
Normalization type.
'no': No normalization is applied.
'l1_norm': For each point X = (x1, x2, ..., xn), the normalized value is X' = (x1/S, x2/S, ..., xn/S), where S = |x1| + |x2| + ... + |xn|.
'min_max': For each column C, the min and max values of C are computed, and then C[i] = (C[i] - min)/(max - min).
Defaults to 'no'.
- categorical_variable : str or a list of str, optional
Specifies INTEGER column(s) in the data that should be treated as categorical.
Defaults to None.
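The normalization and distance_level options above can be written out as small pure-Python helpers. This is a sketch under stated assumptions, not the PAL code: the helper names `normalize` and `distance` are hypothetical, the formulas for 'l1_norm' and 'min_max' follow the parameter descriptions, the distance measures use their common textbook definitions (cosine distance is taken as 1 minus cosine similarity, which may differ from PAL's definition), and categorical attributes are not handled. The sketch also assumes no all-zero rows for 'l1_norm' and no constant columns for 'min_max'.

```python
import math


def normalize(rows, mode="no"):
    """Apply the 'no', 'l1_norm', or 'min_max' rule to a list of numeric rows."""
    if mode == "no":
        return [list(r) for r in rows]
    if mode == "l1_norm":
        # Row-wise: divide each point by S = |x1| + |x2| + ... + |xn|.
        out = []
        for r in rows:
            s = sum(abs(x) for x in r)
            out.append([x / s for x in r])
        return out
    if mode == "min_max":
        # Column-wise: C[i] = (C[i] - min) / (max - min).
        cols = list(zip(*rows))
        lo = [min(c) for c in cols]
        hi = [max(c) for c in cols]
        return [[(x - l) / (h - l) for x, l, h in zip(r, lo, hi)] for r in rows]
    raise ValueError(f"unknown normalization: {mode}")


def distance(x, c, level="euclidean", power=3.0):
    """Distance between a numeric item x and a center c (textbook definitions)."""
    diffs = [abs(a - b) for a, b in zip(x, c)]
    if level == "manhattan":
        return sum(diffs)
    if level == "euclidean":
        return math.sqrt(sum(d * d for d in diffs))
    if level == "minkowski":
        # power plays the role of minkowski_power.
        return sum(d ** power for d in diffs) ** (1.0 / power)
    if level == "chebyshev":
        return max(diffs)
    if level == "cosine":
        # Assumption: cosine distance = 1 - cosine similarity.
        dot = sum(a * b for a, b in zip(x, c))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in c))
        return 1.0 - dot / norm
    raise ValueError(f"unknown distance_level: {level}")


print(normalize([[1, -3], [2, 2]], "l1_norm"))
print(normalize([[1, 10], [3, 20], [5, 30]], "min_max"))
print(distance((0, 0), (3, 4), "euclidean"))
```

Note that 'l1_norm' rescales each row independently, while 'min_max' rescales each column, so the two options change the geometry of the data in different ways before distances are computed.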
Examples
Input dataframe df1 for clustering:
>>> df1.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6
Creating KMedians instance:
>>> kmedians = KMedians(n_clusters=4, init='first_k',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='euclidean',
...                     thread_ratio=0.3, category_weights=0.5)
Performing fit() on given dataframe:
>>> kmedians.fit(data=df1, key='ID')
>>> kmedians.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.1    A   1.2
1           1  15.7    D   1.5
2           2  15.6    C  16.2
3           3   1.2    B  16.1
Performing fit_predict() on given dataframe:
>>> kmedians.fit_predict(data=df1, key='ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  0.921954
1    1           0  0.806226
2    2           0  0.500000
3    3           0  0.670820
4    4           0  0.707107
5    5           3  0.921954
6    6           3  0.670820
7    7           3  0.500000
8    8           3  0.806226
9    9           3  0.707107
10  10           2  0.707107
11  11           2  1.140175
12  12           2  0.948683
13  13           2  0.316228
14  14           2  0.707107
15  15           1  1.019804
16  16           1  1.280625
17  17           1  0.800000
18  18           1  0.200000
19  19           1  0.807107
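The example output can be cross-checked locally without a database connection. The sketch below copies the rows of df1 together with the reported CLUSTER_ID and DISTANCE values, then recomputes the per-cluster medians of the numeric columns (V000, V002) and the Euclidean distance to the assigned center. Only rows whose V001 category equals the category of their center are compared, because this section does not specify how a category mismatch contributes to the distance. The variable and helper names are illustrative assumptions.

```python
import math
from statistics import median

# (ID, V000, V001, V002, CLUSTER_ID, DISTANCE), copied from the example output.
rows = [
    (0, 0.5, "A", 0.5, 0, 0.921954), (1, 1.5, "A", 0.5, 0, 0.806226),
    (2, 1.5, "A", 1.5, 0, 0.500000), (3, 0.5, "A", 1.5, 0, 0.670820),
    (4, 1.1, "B", 1.2, 0, 0.707107), (5, 0.5, "B", 15.5, 3, 0.921954),
    (6, 1.5, "B", 15.5, 3, 0.670820), (7, 1.5, "B", 16.5, 3, 0.500000),
    (8, 0.5, "B", 16.5, 3, 0.806226), (9, 1.2, "C", 16.1, 3, 0.707107),
    (10, 15.5, "C", 15.5, 2, 0.707107), (11, 16.5, "C", 15.5, 2, 1.140175),
    (12, 16.5, "C", 16.5, 2, 0.948683), (13, 15.5, "C", 16.5, 2, 0.316228),
    (14, 15.6, "D", 16.2, 2, 0.707107), (15, 15.5, "D", 0.5, 1, 1.019804),
    (16, 16.5, "D", 0.5, 1, 1.280625), (17, 16.5, "D", 1.5, 1, 0.800000),
    (18, 15.5, "D", 1.5, 1, 0.200000), (19, 15.7, "A", 1.6, 1, 0.807107),
]

# CLUSTER_ID -> (V000, V001, V002), copied from cluster_centers_.
centers = {
    0: (1.1, "A", 1.2), 1: (15.7, "D", 1.5),
    2: (15.6, "C", 16.2), 3: (1.2, "B", 16.1),
}


def numeric_medians(cluster_id):
    """Medians of V000 and V002 over the rows assigned to cluster_id."""
    members = [r for r in rows if r[4] == cluster_id]
    return (median(r[1] for r in members), median(r[3] for r in members))


def euclidean_to_center(row):
    """Euclidean distance over the numeric columns to the assigned center."""
    cx, _, cy = centers[row[4]]
    return math.hypot(row[1] - cx, row[3] - cy)


# Each center's numeric coordinates should equal the medians of its members.
medians_match = all(
    numeric_medians(cid) == (c[0], c[2]) for cid, c in centers.items()
)

# Compare distances only where the row's category equals the center's category.
same_category = [r for r in rows if r[2] == centers[r[4]][1]]
distances_match = all(
    abs(euclidean_to_center(r) - r[5]) < 1e-6 for r in same_category
)

print(medians_match, distances_match, len(same_category))  # True True 16
```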
- Attributes
- cluster_centers_ : DataFrame
Coordinates of cluster centers.
- labels_ : DataFrame
Cluster assignment and distance to cluster center for each point.
Methods
fit(data[, key, features, categorical_variable]): Perform clustering on input dataset.
fit_predict(data[, key, features, ...]): Perform clustering algorithm and return labels.
- fit(data, key=None, features=None, categorical_variable=None)
Perform clustering on input dataset.
- Parameters
- data : DataFrame
DataFrame containing input data.
- key : str, optional
Name of ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features : a list of str, optional
Names of feature columns.
If features is not provided, it defaults to all non-key columns.
- categorical_variable : str or a list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
Defaults to None.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- fit_predict(data, key=None, features=None, categorical_variable=None)
Perform clustering algorithm and return labels.
- Parameters
- data : DataFrame
DataFrame containing input data.
- key : str, optional
Name of ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features : a list of str, optional
Names of feature columns.
If features is not provided, it defaults to all non-key columns.
- categorical_variable : str or a list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
Defaults to None.
- Returns
- DataFrame
Fit result, structured as follows:
ID column, with the same name and type as data's ID column.
CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.
DISTANCE, type DOUBLE, the distance between the given point and the cluster center.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.