SlightSilhouette

hana_ml.algorithms.pal.clustering.SlightSilhouette(data, features=None, label=None, distance_level=None, minkowski_power=None, normalization=None, thread_number=None, categorical_variable=None, category_weights=None)

Silhouette refers to a method used to validate the cluster of data. SAP HNAN PAL provides a light version of silhouette called slight silhouette. SlightSihouette is an wrapper for this light version silhouette method.

Parameters
dataDataFrame

DataFrame containing the data.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-label columns.

label: str, optional

Name of the ID column.

If label is not provided, it defaults to the last column of data.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between the item and the cluster center. 'cosine' is only valid when accelerated is False.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power.

Only valid when distance_level is 'minkowski'.

Defaults to 3.0.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

  • 'no': No normalization will be applied.

  • 'l1_norm': Yes, for each point X (x1, x2, ..., xn), the normalized value will be X'(x1 /S,x2 /S,...,xn /S), where S = |x1|+|x2|+...|xn|.

  • 'min_max': Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.

thread_numberint, optional

Number of threads.

Defaults to 1.

categorical_variablestr or a list of str, optional

Indicates whether or not a column of data is actually corresponding to a category variable even the data type of this column is INTEGER.

By default, VARCHAR or NVARCHAR is a category variable, and INTEGER or DOUBLE is a continuous variable.

Defaults to None.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

Returns
DataFrame

A DataFrame containing the validation value of Slight Silhouette.

Examples

Input dataframe df:

>>> df.collect()
    V000 V001 V002 CLUSTER
0    0.5    A  0.5       0
1    1.5    A  0.5       0
2    1.5    A  1.5       0
3    0.5    A  1.5       0
4    1.1    B  1.2       0
5    0.5    B 15.5       1
6    1.5    B 15.5       1
7    1.5    B 16.5       1
8    0.5    B 16.5       1
9    1.2    C 16.1       1
10  15.5    C 15.5       2
11  16.5    C 15.5       2
12  16.5    C 16.5       2
13  15.5    C 16.5       2
14  15.6    D 16.2       2
15  15.5    D  0.5       3
16  16.5    D  0.5       3
17  16.5    D  1.5       3
18  15.5    D  1.5       3
19  15.7    A  1.6       3

Call the function:

>>> res = SlightSilhouette(df, label="CLUSTER")

Result:

>>> res.collect()
  VALIDATE_VALUE
0      0.9385944