SlightSilhouette

hana_ml.algorithms.pal.clustering.SlightSilhouette(data, features=None, label=None, distance_level=None, minkowski_power=None, normalization=None, thread_number=None, categorical_variable=None, category_weights=None)

Silhouette is a method for validating how well data has been clustered. SAP HANA PAL provides a lightweight version of silhouette called slight silhouette. SlightSilhouette is a wrapper for this lightweight silhouette method.

Parameters:

data: DataFrame

DataFrame containing the data.

features: a list of str, optional

Names of feature columns.

If features is not provided, it defaults to all non-label columns.

label: str, optional

Name of the column holding the cluster labels.

If label is not provided, it defaults to the last column of data.

distance_level: {'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}, optional

Method used to compute the distance between an item and the cluster center. 'cosine' is only valid when accelerated is False.

Defaults to 'euclidean'.

minkowski_power: float, optional

When Minkowski distance is used, this parameter controls the value of power.

Only valid when distance_level is 'minkowski'.

Defaults to 3.0.
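As an illustration of how the distance_level options differ, the following is a minimal pure-Python sketch of the five distances under their usual textbook definitions. The helper name `distance` is hypothetical and is not part of hana_ml, and the cosine variant is assumed here to be one minus the cosine similarity; PAL's internal definitions may differ in detail.

```python
import math

def distance(x, y, distance_level="euclidean", minkowski_power=3.0):
    """Distance between two equal-length numeric sequences (illustrative only)."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if distance_level == "manhattan":
        return sum(diffs)
    if distance_level == "euclidean":
        return math.sqrt(sum(d * d for d in diffs))
    if distance_level == "minkowski":
        # minkowski_power only matters on this branch
        return sum(d ** minkowski_power for d in diffs) ** (1.0 / minkowski_power)
    if distance_level == "chebyshev":
        return max(diffs)
    if distance_level == "cosine":
        # Assumption: cosine distance = 1 - cosine similarity
        dot = sum(a * b for a, b in zip(x, y))
        norm_x = math.sqrt(sum(a * a for a in x))
        norm_y = math.sqrt(sum(b * b for b in y))
        return 1.0 - dot / (norm_x * norm_y)
    raise ValueError("unknown distance_level: %r" % (distance_level,))
```

With minkowski_power set to 1.0 or 2.0, the Minkowski branch reduces to the Manhattan and Euclidean distances respectively.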

normalization: {'no', 'l1_norm', 'min_max'}, optional

Normalization type.

'no': No normalization will be applied.

'l1_norm': For each point X = (x₁, x₂, ..., x_n), the normalized value is X' = (x₁/S, x₂/S, ..., x_n/S), where S = |x₁| + |x₂| + ... + |x_n|.

'min_max': For each column C, take the minimum and maximum values of C, then set C[i] = (C[i] - min) / (max - min).

Defaults to 'no'.
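The two normalization formulas can be sketched in plain Python as follows. The function names `l1_norm` and `min_max` are hypothetical helpers, not hana_ml functions, and the handling of a zero sum or a constant column is an assumption made only to keep the sketch free of division by zero.

```python
def l1_norm(rows):
    """Row-wise: divide each point by the sum of absolute values of its components."""
    out = []
    for row in rows:
        s = sum(abs(v) for v in row)
        # Assumption: leave an all-zero point unchanged
        out.append([v / s for v in row] if s else list(row))
    return out

def min_max(rows):
    """Column-wise: rescale each column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        # Assumption: a constant column maps to 0.0
        [(v - lo) / (hi - lo) if hi != lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]
```

Note the difference in direction: 'l1_norm' operates on each point (row), while 'min_max' operates on each feature (column).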

thread_number: int, optional

Number of threads.

Defaults to 1.

categorical_variable: str or a list of str, optional

Specifies which INTEGER columns should be treated as categorical variables.

By default, VARCHAR and NVARCHAR columns are treated as categorical, and INTEGER and DOUBLE columns are treated as continuous.

Defaults to None.

category_weights: float, optional

Weight applied to categorical attributes.

Defaults to 0.707.

Returns:

DataFrame: A DataFrame containing the validation value of Slight Silhouette.

Examples

Input dataframe df:

>>> df.collect()
V000 V001 V002 CLUSTER
 0.5    A  0.5       0
 1.5    A  0.5       0
 1.5    A  1.5       0
 0.5    A  1.5       0
 1.1    B  1.2       0
 0.5    B 15.5       1
 1.5    B 15.5       1
 1.5    B 16.5       1
 0.5    B 16.5       1
 1.2    C 16.1       1
15.5    C 15.5       2
16.5    C 15.5       2
16.5    C 16.5       2
15.5    C 16.5       2
15.6    D 16.2       2
15.5    D  0.5       3
16.5    D  0.5       3
16.5    D  1.5       3
15.5    D  1.5       3
15.7    A  1.6       3

Call the function:

>>> res = SlightSilhouette(df, label="CLUSTER")

Result:

>>> res.collect()
  VALIDATE_VALUE
0      0.9385944
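For intuition about what a validation value like this measures, the following is a minimal sketch of a simplified silhouette score as it is commonly defined: each point is compared against cluster centers rather than against every other point. This is an assumption about the general technique, not a statement of PAL's exact algorithm. The function name `simplified_silhouette` is hypothetical, the sketch handles numeric features with Euclidean distance only, and it does not treat categorical columns such as V001, so it is not expected to reproduce the value above.

```python
import math
from collections import defaultdict

def simplified_silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points, where a is the distance to the
    point's own cluster center and b is the distance to the nearest other center.
    Requires at least two distinct labels."""
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)
    # Cluster center = per-feature mean of the cluster's points
    centers = {lab: [sum(col) / len(col) for col in zip(*pts)]
               for lab, pts in groups.items()}
    scores = []
    for p, lab in zip(points, labels):
        a = math.dist(p, centers[lab])
        b = min(math.dist(p, c) for other, c in centers.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Two well-separated toy clusters (made-up data for illustration)
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
score = simplified_silhouette(points, labels)
```

Values close to 1 indicate that points sit much nearer to their own cluster center than to any other, which matches the tight, well-separated clusters in the example data.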