outlier_detection_kmeans

hana_ml.algorithms.pal.clustering.outlier_detection_kmeans(data, key=None, features=None, n_clusters=None, distance_level=None, contamination=None, sum_distance=True, init=None, max_iter=None, normalization=None, tol=None, distance_threshold=None, thread_number=None)

Outlier detection based on k-means clustering.

Parameters:

data : DataFrame

Input data for outlier detection using k-means clustering.

key : str, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be specified.

features : str or ListOfStrings, optional

Names of the feature columns in data that are used to calculate the distances between points for clustering.

Feature columns must be numerical.

Defaults to all non-key columns if not provided.

n_clusters : int, optional

Number of clusters to be grouped.

If this number is not specified, the G-means method will be used to determine the number of clusters.

distance_level : {'manhattan', 'euclidean', 'minkowski'}, optional

Specifies the type of distance measured between data points and cluster centers.

'manhattan' : Manhattan distance

'euclidean' : Euclidean distance

'minkowski' : Minkowski distance

Defaults to 'euclidean'.
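
As an illustration of the three distance types, the following is a minimal pure-Python sketch, not the PAL implementation. The helper names and the order parameter p are hypothetical; the function signature above exposes no Minkowski order, so the value PAL uses is not stated here.

```python
def minkowski_distance(point, center, p):
    """Minkowski distance of order p between two equal-length sequences.

    p is a hypothetical parameter used only for this illustration.
    """
    return sum(abs(a - b) ** p for a, b in zip(point, center)) ** (1.0 / p)


def manhattan_distance(point, center):
    # Manhattan distance is the Minkowski distance of order 1.
    return minkowski_distance(point, center, 1)


def euclidean_distance(point, center):
    # Euclidean distance is the Minkowski distance of order 2.
    return minkowski_distance(point, center, 2)


point, center = (0.0, 0.0), (3.0, 4.0)
print(manhattan_distance(point, center))  # 7.0
print(euclidean_distance(point, center))  # 5.0
```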

contamination : float, optional

Specifies the proportion of outliers in data.

Expected to be a positive number no greater than 1.

Defaults to 0.1.

sum_distance : bool, optional

Specifies whether to use the sum of a point's distances to all cluster centers as its distance value for the outlier score. If False, only the distance from the point to the center of the cluster it belongs to is used.

Defaults to True.
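
The two scoring modes can be sketched as follows. This is an illustrative helper, not part of hana_ml; it assumes Euclidean distance and assumes that the center a point belongs to is its nearest center.

```python
import math


def outlier_score(point, centers, sum_distance=True):
    """Hypothetical helper: distance value of one point given cluster centers."""
    distances = [math.dist(point, c) for c in centers]
    if sum_distance:
        # Sum of the distances to all cluster centers.
        return sum(distances)
    # Assumption: the point's own cluster is the one with the nearest center.
    return min(distances)


centers = [(0.0, 0.0), (3.0, 4.0)]
print(outlier_score((3.0, 0.0), centers, sum_distance=True))   # 7.0
print(outlier_score((3.0, 0.0), centers, sum_distance=False))  # 3.0
```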

init : {'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected:

'first_k': First k observations.

'replace': Random with replacement.

'no_replace': Random without replacement.

'patent': Patented method for selecting the initial centers (US 6,882,998 B1).

Defaults to 'patent'.

max_iter : int, optional

Maximum number of iterations for k-means clustering.

Defaults to 100.

normalization : {'no', 'l1_norm', 'min_max'}, optional

Normalization type.

'no': No normalization will be applied.

'l1_norm': Each point X(x₁, x₂, ..., xₙ) is normalized to X'(x₁/S, x₂/S, ..., xₙ/S), where S = |x₁| + |x₂| + ... + |xₙ|.

'min_max': For each column C, the minimum and maximum values of C are determined, and each entry is rescaled as C[i] = (C[i] - min) / (max - min).

Defaults to 'no'.
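
The two normalization formulas can be sketched in plain Python as below. The function names are hypothetical, and how PAL treats degenerate inputs (an all-zero point, or a constant column) is not covered here; this sketch would raise ZeroDivisionError in those cases.

```python
def l1_normalize(point):
    """Row-wise L1 normalization: divide each coordinate by S = sum of |x_i|."""
    s = sum(abs(x) for x in point)
    return [x / s for x in point]


def min_max_normalize(column):
    """Column-wise min-max scaling: (C[i] - min) / (max - min)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


print(l1_normalize([1.0, -3.0]))          # [0.25, -0.75]
print(min_max_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```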

tol : float, optional

Convergence threshold for exiting iterations in k-means clustering.

Defaults to 1.0e-9.

distance_threshold : float, optional

Specifies the threshold distance value for outlier detection.

A point whose distance value is no greater than the threshold is not considered an outlier.

Defaults to -1.
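
One plausible reading of how contamination and distance_threshold interact is sketched below. This is an assumption for illustration only: the description above does not state how the candidate count is rounded or in which order the two filters are applied, and select_outliers is a hypothetical helper.

```python
def select_outliers(scores, contamination=0.1, distance_threshold=-1):
    """Hypothetical selection rule over a mapping of point ID -> distance value.

    Assumption: the top fraction of points by score are candidates, and a
    candidate is kept only if its score exceeds the threshold.
    """
    n_candidates = int(contamination * len(scores))  # truncation is an assumption
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [k for k in ranked[:n_candidates] if scores[k] > distance_threshold]


scores = {"a": 9.0, "b": 1.0, "c": 5.0, "d": 2.0}
print(select_outliers(scores, contamination=0.5))                        # ['a', 'c']
print(select_outliers(scores, contamination=0.5, distance_threshold=6))  # ['a']
```

Under this reading, the default threshold of -1 filters nothing, since distance values are non-negative.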

thread_number : int, optional

Specifies the number of threads that can be used by this function.

Defaults to 1.

Returns:

DataFrames

DataFrame 1, detected outliers, structured as follows:

1st column : ID of detected outliers in data.

other columns : feature values for detected outliers

DataFrame 2, statistics of detected outliers, structured as follows:

1st column : ID of detected outliers in data.

2nd column : ID of the corresponding cluster centers.

3rd column : Outlier score, which is the distance value.

DataFrame 3, centers of clusters produced by k-means algorithm, structured as follows:

1st column : ID of cluster center.

other columns : Coordinate (i.e. feature) values of cluster center.

Examples

Input data for outlier detection:

>>> df.collect()
ID  V000  V001
 0   0.5   0.5
 1   1.5   0.5
 2   1.5   1.5
 3   0.5   1.5
 4   1.1   1.2
 5   0.5  15.5
 6   1.5  15.5
 7   1.5  16.5
 8   0.5  16.5
 9   1.2  16.1
10  15.5  15.5
11  16.5  15.5
12  16.5  16.5
13  15.5  16.5
14  15.6  16.2
15  15.5   0.5
16  16.5   0.5
17  16.5   1.5
18  15.5   1.5
19  15.7   1.6
20  -1.0  -1.0

>>> from hana_ml.algorithms.pal.clustering import outlier_detection_kmeans
>>> outliers, stats, centers = outlier_detection_kmeans(df, key='ID',
...                                                     distance_level='euclidean',
...                                                     contamination=0.15,
...                                                     sum_distance=True,
...                                                     distance_threshold=3)
>>> outliers.collect()
   ID  V000  V001
0  20  -1.0  -1.0
1  16  16.5   0.5
2  12  16.5  16.5
>>> stats.collect()
   ID  CLUSTER_ID      SCORE
0  20           2  60.619864
1  16           1  54.110424
2  12           3  53.954274
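
The SCORE values above can be cross-checked offline with a short pure-Python sketch. It rests on an assumption that is not shown in the output: that k-means converged to the four visible groups of points, with ID 20 assigned to the group nearest the origin, so that each center is the mean of its group. The centers DataFrame returned by the function is the authoritative source.

```python
import math

# Sample points from the input data, grouped by assumed cluster membership.
groups = [
    [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5), (1.1, 1.2), (-1.0, -1.0)],
    [(0.5, 15.5), (1.5, 15.5), (1.5, 16.5), (0.5, 16.5), (1.2, 16.1)],
    [(15.5, 15.5), (16.5, 15.5), (16.5, 16.5), (15.5, 16.5), (15.6, 16.2)],
    [(15.5, 0.5), (16.5, 0.5), (16.5, 1.5), (15.5, 1.5), (15.7, 1.6)],
]

# Each assumed center is the coordinate-wise mean of its group.
centers = [tuple(sum(axis) / len(group) for axis in zip(*group)) for group in groups]


def sum_distance_score(point):
    """Sum of Euclidean distances from a point to every assumed center."""
    return sum(math.dist(point, c) for c in centers)


# Compare with the SCORE column reported for IDs 20, 16 and 12.
for point_id, point in [(20, (-1.0, -1.0)), (16, (16.5, 0.5)), (12, (16.5, 16.5))]:
    print(point_id, round(sum_distance_score(point), 4))
```

Under these assumptions the computed sums agree with the reported scores to about four decimal places, which is consistent with sum_distance=True and distance_level='euclidean'.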