SlightSilhouette

hana_ml.algorithms.pal.clustering.SlightSilhouette(data, features=None, label=None, distance_level=None, minkowski_power=None, normalization=None, thread_number=None, categorical_variable=None, category_weights=None)

Silhouette refers to a method used to validate the cluster of data. SAP HNAN PAL provides a light version of silhouette called slight silhouette. SlightSihouette is an wrapper for this light version silhouette method.

Parameters:
dataDataFrame

DataFrame containing the data.

featuresa list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-label columns.

label: str, optional

Name of the ID column.

If label is not provided, it defaults to the last column of data.

distance_level{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Ways to compute the distance between the item and the cluster center. 'cosine' is only valid when accelerated is False.

Defaults to 'euclidean'.

minkowski_powerfloat, optional

When Minkowski distance is used, this parameter controls the value of power.

Only valid when distance_level is 'minkowski'.

Defaults to 3.0.

normalization{'no', 'l1_norm', 'min_max'}, optional

Normalization type.

  • 'no': No normalization will be applied.

  • 'l1_norm': Yes, for each point X (x1, x2, ..., xn), the normalized value will be X'(x1 /S,x2 /S,...,xn /S), where S = |x1|+|x2|+...|xn|.

  • 'min_max': Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to 'no'.

thread_numberint, optional

Number of threads.

Defaults to 1.

categorical_variablestr or a list of str, optional

Indicates whether or not a column of data is actually corresponding to a category variable even the data type of this column is INTEGER.

By default, VARCHAR or NVARCHAR is a category variable, and INTEGER or DOUBLE is a continuous variable.

Defaults to None.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

Returns:
DataFrame

A DataFrame containing the validation value of Slight Silhouette.

Examples

Input dataframe df:

>>> df.collect()
    V000 V001 V002 CLUSTER
0    0.5    A  0.5       0
1    1.5    A  0.5       0
2    1.5    A  1.5       0
3    0.5    A  1.5       0
4    1.1    B  1.2       0
5    0.5    B 15.5       1
6    1.5    B 15.5       1
7    1.5    B 16.5       1
8    0.5    B 16.5       1
9    1.2    C 16.1       1
10  15.5    C 15.5       2
11  16.5    C 15.5       2
12  16.5    C 16.5       2
13  15.5    C 16.5       2
14  15.6    D 16.2       2
15  15.5    D  0.5       3
16  16.5    D  0.5       3
17  16.5    D  1.5       3
18  15.5    D  1.5       3
19  15.7    A  1.6       3

Call the function:

>>> res = SlightSilhouette(df, label="CLUSTER")

Result:

>>> res.collect()
  VALIDATE_VALUE
0      0.9385944