SlightSilhouette
- hana_ml.algorithms.pal.clustering.SlightSilhouette(data, features=None, label=None, distance_level=None, minkowski_power=None, normalization=None, thread_number=None, categorical_variable=None, category_weights=None)
Silhouette is a method for validating how well data has been clustered. SAP HANA PAL provides a light version of silhouette called slight silhouette. SlightSilhouette is a wrapper for this light version of the silhouette method.
- Parameters:
- data: DataFrame
DataFrame containing the data.
- features: a list of str, optional
Names of feature columns.
If features is not provided, it defaults to all non-label columns.
- label: str, optional
Name of the column containing the cluster labels.
If label is not provided, it defaults to the last column of data.
- distance_level: {'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional
Method used to compute the distance between an item and the cluster center. 'cosine' is only valid when accelerated is False.
Defaults to 'euclidean'.
- minkowski_power: float, optional
When Minkowski distance is used, this parameter controls the value of power.
Only valid when distance_level is 'minkowski'.
Defaults to 3.0.
- normalization: {'no', 'l1_norm', 'min_max'}, optional
Normalization type.
'no': No normalization will be applied.
'l1_norm': For each point X = (x1, x2, ..., xn), the normalized value is X' = (x1/S, x2/S, ..., xn/S), where S = |x1| + |x2| + ... + |xn|.
'min_max': For each column C, the minimum and maximum values of C are found, and each entry is replaced by C[i] = (C[i] - min) / (max - min).
Defaults to 'no'.
- thread_number: int, optional
Number of threads.
Defaults to 1.
- categorical_variable: str or a list of str, optional
Specifies which columns should be treated as categorical variables even though their data type is INTEGER.
By default, VARCHAR and NVARCHAR columns are treated as categorical variables, and INTEGER and DOUBLE columns as continuous variables.
Defaults to None.
- category_weights: float, optional
Represents the weight of category attributes.
Defaults to 0.707.
- Returns:
- DataFrame
A DataFrame containing the validation value of Slight Silhouette.
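The two normalization formulas given for the normalization parameter can be sketched in plain Python. This is only an illustration of the arithmetic, not the PAL implementation: the helper names l1_normalize and min_max_normalize are hypothetical, and the handling of all-zero points and constant columns is an assumption made to avoid division by zero.

```python
def l1_normalize(point):
    """'l1_norm': divide each coordinate by S = |x1| + |x2| + ... + |xn|."""
    s = sum(abs(x) for x in point)
    if s == 0:
        # Assumption: an all-zero point is returned unchanged.
        return list(point)
    return [x / s for x in point]


def min_max_normalize(column):
    """'min_max': map each entry of a column to (C[i] - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column is mapped to zeros.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


normalized_point = l1_normalize([1.0, -3.0])          # one point, normalized row-wise
normalized_column = min_max_normalize([0.5, 1.5, 1.0])  # one column, normalized column-wise
```

Note that 'l1_norm' works per point (row), while 'min_max' works per column.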
Examples
Input dataframe df:
>>> df.collect()
    V000 V001  V002  CLUSTER
0    0.5    A   0.5        0
1    1.5    A   0.5        0
2    1.5    A   1.5        0
3    0.5    A   1.5        0
4    1.1    B   1.2        0
5    0.5    B  15.5        1
6    1.5    B  15.5        1
7    1.5    B  16.5        1
8    0.5    B  16.5        1
9    1.2    C  16.1        1
10  15.5    C  15.5        2
11  16.5    C  15.5        2
12  16.5    C  16.5        2
13  15.5    C  16.5        2
14  15.6    D  16.2        2
15  15.5    D   0.5        3
16  16.5    D   0.5        3
17  16.5    D   1.5        3
18  15.5    D   1.5        3
19  15.7    A   1.6        3
Call the function:
>>> res = SlightSilhouette(df, label="CLUSTER")
Result:
>>> res.collect()
   VALIDATE_VALUE
0       0.9385944
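To give a feel for what a validation value of this kind measures, here is a minimal local sketch of a center-based silhouette. It assumes, based only on the distance_level description ("distance between the item and the cluster center"), that the light version compares each point's distance to its own cluster center with its distance to the nearest other center. This is not the PAL implementation: the function names are hypothetical, only the numeric columns V000 and V002 are used, and categorical columns and category_weights are ignored, so the result is not expected to match the value above.

```python
def minkowski(p, q, power=3.0):
    """Minkowski distance; power=1 gives Manhattan, power=2 gives Euclidean."""
    return sum(abs(a - b) ** power for a, b in zip(p, q)) ** (1.0 / power)


def euclidean(p, q):
    return minkowski(p, q, power=2.0)


def center_based_silhouette(points, labels, distance=euclidean):
    """Average of (b - a) / max(a, b) over all points, where a is the distance
    to the point's own cluster center and b is the distance to the nearest
    other cluster center. Requires at least two clusters."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    # Each center is the coordinate-wise mean of the cluster's points.
    centers = {
        label: [sum(col) / len(col) for col in zip(*members)]
        for label, members in groups.items()
    }
    scores = []
    for point, label in zip(points, labels):
        a = distance(point, centers[label])
        b = min(distance(point, c) for other, c in centers.items() if other != label)
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Numeric columns (V000, V002) and CLUSTER labels from the example DataFrame.
points = [
    (0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5), (1.1, 1.2),
    (0.5, 15.5), (1.5, 15.5), (1.5, 16.5), (0.5, 16.5), (1.2, 16.1),
    (15.5, 15.5), (16.5, 15.5), (16.5, 16.5), (15.5, 16.5), (15.6, 16.2),
    (15.5, 0.5), (16.5, 0.5), (16.5, 1.5), (15.5, 1.5), (15.7, 1.6),
]
labels = [0] * 5 + [1] * 5 + [2] * 5 + [3] * 5

score = center_based_silhouette(points, labels)
print(score)
```

Values close to 1 indicate that points lie much nearer to their own cluster center than to any other center, which is the case for the four well-separated groups in this data.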