KMeansOutlier

class hana_ml.algorithms.pal.clustering.KMeansOutlier(n_clusters=None, distance_level=None, contamination=None, sum_distance=True, init=None, max_iter=None, normalization=None, tol=None, distance_threshold=None)

Outlier detection of datasets using k-means clustering.

Parameters:
n_clusters : int, optional

Number of clusters to group the data into.

If this number is not specified, the G-means method will be used to determine the number of clusters.

distance_level : {'manhattan', 'euclidean', 'minkowski'}, optional

Specifies the type of distance measured between data points and cluster centers.

  • 'manhattan' : Manhattan distance

  • 'euclidean' : Euclidean distance

  • 'minkowski' : Minkowski distance

Defaults to 'euclidean'.
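The three distance types can be sketched in plain Python as follows. This is only an illustration of the standard definitions, not the PAL implementation. The function name point_to_center_distance and the exponent argument p are hypothetical; this class does not document a Minkowski exponent, so p=3.0 is an assumption made for the example.

```python
def point_to_center_distance(x, c, distance_level="euclidean", p=3.0):
    """Distance between a point x and a cluster center c (illustrative only).

    p is a hypothetical Minkowski exponent; it is not a documented
    parameter of KMeansOutlier.
    """
    # Absolute coordinate-wise differences between point and center.
    diffs = [abs(a - b) for a, b in zip(x, c)]
    if distance_level == "manhattan":
        return sum(diffs)
    if distance_level == "euclidean":
        return sum(d ** 2 for d in diffs) ** 0.5
    if distance_level == "minkowski":
        return sum(d ** p for d in diffs) ** (1.0 / p)
    raise ValueError(f"unknown distance_level: {distance_level!r}")


d_manhattan = point_to_center_distance([0.0, 0.0], [3.0, 4.0], "manhattan")
d_euclidean = point_to_center_distance([0.0, 0.0], [3.0, 4.0], "euclidean")
d_minkowski = point_to_center_distance([0.0, 0.0], [3.0, 4.0], "minkowski", p=3.0)
```

With p=1 the Minkowski distance reduces to the Manhattan distance, and with p=2 to the Euclidean distance.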

contamination : float, optional

Specifies the proportion of outliers within the input data to be detected.

Expected to be a positive number no greater than 1.

Defaults to 0.1.

sum_distance : bool, optional

Specifies whether or not to use the sum of the distances from a point to all cluster centers as its distance value for the outlier score. If False, only the distance from a point to the center of the cluster it belongs to is used as its distance value.

Defaults to True.

init : {'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

  • 'patent': Patented method for selecting the initial centers (US 6,882,998 B1).

Defaults to 'patent'.

max_iter : int, optional

Maximum number of iterations for k-means clustering.

Defaults to 100.

normalization : {'no', 'l1_norm', 'min_max'}, optional

Normalization type.

  • 'no': No normalization will be applied.

  • 'l1_norm': L1 normalization. Each point X = (x1, x2, ..., xn) is normalized to X' = (x1/S, x2/S, ..., xn/S), where S = |x1| + |x2| + ... + |xn|.

  • 'min_max': Min-max normalization. For each column C, the minimum and maximum values of C are computed, and then C[i] = (C[i] - min) / (max - min).

Defaults to 'no'.
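The two normalization formulas above can be sketched in plain Python as follows. The helper names l1_normalize and min_max_normalize are hypothetical, and the handling of the degenerate cases (S = 0, or max equal to min) is an assumption of this sketch, since the formulas above leave those cases undefined.

```python
def l1_normalize(point):
    """'l1_norm': divide each coordinate of a point by S = |x1| + ... + |xn|."""
    s = sum(abs(v) for v in point)
    # Assumption of this sketch: an all-zero point (S = 0) is left unchanged.
    if s == 0:
        return list(point)
    return [v / s for v in point]


def min_max_normalize(column):
    """'min_max': rescale a column with C[i] = (C[i] - min) / (max - min)."""
    lo, hi = min(column), max(column)
    # Assumption of this sketch: a constant column (max == min) maps to 0.0.
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


row = l1_normalize([1.0, -3.0])             # works on one point (row)
col = min_max_normalize([0.0, 5.0, 10.0])   # works on one feature (column)
```

Note that 'l1_norm' operates on each point (row) while 'min_max' operates on each feature (column), so the two options are not interchangeable.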

tol : float, optional

Convergence threshold for exiting iterations in k-means clustering.

Defaults to 1.0e-9.

distance_threshold : float, optional

Specifies the threshold distance value for outlier detection.

A point whose distance value is no greater than the threshold is not considered to be an outlier.

Defaults to -1.
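One possible reading of how contamination and distance_threshold interact is sketched below. How PAL combines the two is not spelled out here, so the order of the steps and the rounding rule (floor) are assumptions, and the function name select_outliers is hypothetical.

```python
import math


def select_outliers(scores, contamination=0.1, distance_threshold=-1.0):
    """Return IDs of candidate outliers, largest distance value first.

    scores maps each point ID to its distance value (outlier score).
    Assumed logic, not confirmed by the PAL documentation:
      1. keep the floor(contamination * n) points with the largest scores;
      2. drop any of them whose score is no greater than distance_threshold.
    """
    n_candidates = math.floor(contamination * len(scores))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [i for i in ranked[:n_candidates] if scores[i] > distance_threshold]


# Made-up scores for eight points, for illustration only.
example_scores = {1: 0.5, 2: 9.0, 3: 1.0, 4: 2.5, 5: 0.7, 6: 0.2, 7: 1.1, 8: 0.9}
with_threshold = select_outliers(example_scores, contamination=0.25, distance_threshold=3.0)
without_threshold = select_outliers(example_scores, contamination=0.25)
```

Under this reading, the default threshold of -1 never excludes a point, because distance values are non-negative.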

Examples

Input data for outlier detection:

>>> df.collect()
    ID  V000  V001
0    0   0.5   0.5
1    1   1.5   0.5
2    2   1.5   1.5
3    3   0.5   1.5
4    4   1.1   1.2
5    5   0.5  15.5
6    6   1.5  15.5
7    7   1.5  16.5
8    8   0.5  16.5
9    9   1.2  16.1
10  10  15.5  15.5
11  11  16.5  15.5
12  12  16.5  16.5
13  13  15.5  16.5
14  14  15.6  16.2
15  15  15.5   0.5
16  16  16.5   0.5
17  17  16.5   1.5
18  18  15.5   1.5
19  19  15.7   1.6
20  20  -1.0  -1.0

Initialize the class instance and run outlier detection:

>>> from hana_ml.algorithms.pal.clustering import KMeansOutlier
>>> kmsodt = KMeansOutlier(distance_level='euclidean',
...                        contamination=0.15,
...                        sum_distance=True,
...                        distance_threshold=3)
>>> outliers, stats, centers = kmsodt.fit_predict(df, key='ID')
>>> outliers.collect()
   ID  V000  V001
0  20  -1.0  -1.0
1  16  16.5   0.5
2  12  16.5  16.5
>>> stats.collect()
   ID  CLUSTER_ID      SCORE
0  20           2  60.619864
1  16           1  54.110424
2  12           3  53.954274
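
The SCORE values above can be checked by hand. The sketch below runs locally in plain Python, without a HANA connection. It assumes that k-means converged to the four visually separated groups in the input data, with point 20 assigned to the lower-left group, and that each center is the mean of its group. It then applies the sum_distance=True definition with Euclidean distance. The grouping and the helper name sum_distance_score are assumptions of this sketch, and the cluster numbering is arbitrary.

```python
import math

# Feature values (V000, V001) from the example input, keyed by ID.
points = {
    0: (0.5, 0.5), 1: (1.5, 0.5), 2: (1.5, 1.5), 3: (0.5, 1.5), 4: (1.1, 1.2),
    5: (0.5, 15.5), 6: (1.5, 15.5), 7: (1.5, 16.5), 8: (0.5, 16.5), 9: (1.2, 16.1),
    10: (15.5, 15.5), 11: (16.5, 15.5), 12: (16.5, 16.5), 13: (15.5, 16.5), 14: (15.6, 16.2),
    15: (15.5, 0.5), 16: (16.5, 0.5), 17: (16.5, 1.5), 18: (15.5, 1.5), 19: (15.7, 1.6),
    20: (-1.0, -1.0),
}

# Assumed final cluster membership (not returned by the example above).
groups = [
    [0, 1, 2, 3, 4, 20],
    [5, 6, 7, 8, 9],
    [10, 11, 12, 13, 14],
    [15, 16, 17, 18, 19],
]

# Each center is the coordinate-wise mean of its group.
centers = [
    (
        sum(points[i][0] for i in group) / len(group),
        sum(points[i][1] for i in group) / len(group),
    )
    for group in groups
]


def sum_distance_score(point):
    """Sum of Euclidean distances from a point to all cluster centers."""
    return sum(math.hypot(point[0] - cx, point[1] - cy) for cx, cy in centers)


score_20 = sum_distance_score(points[20])
score_16 = sum_distance_score(points[16])
score_12 = sum_distance_score(points[12])
```

Under these assumptions the three values agree with the SCORE column above to about three decimal places.
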
Attributes:
fit_hdbprocedure

Returns the generated hdbprocedure for fit.

predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Methods

fit_predict(data[, key, features, thread_number])

Performs k-means clustering on an input dataset and extracts the corresponding outliers.

fit_predict(data, key=None, features=None, thread_number=None)

Performs k-means clustering on an input dataset and extracts the corresponding outliers.

Parameters:
data : DataFrame

Input data for outlier detection using k-means clustering.

key : str, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be specified.

features : str or a list of str, optional

Names of the feature columns in data that are used to calculate the distances between points for clustering.

Feature columns must be numerical.

Defaults to all non-key columns if not provided.

thread_number : int, optional

Specifies the number of threads that can be used by this function.

Defaults to 1.

Returns:
DataFrame

DataFrame 1, detected outliers, structured as follows:

  • 1st column : ID of detected outliers in data.

  • other columns : Feature values of detected outliers.

DataFrame 2, statistics of detected outliers, structured as follows:

  • 1st column : ID of detected outliers in data.

  • 2nd column : ID of the corresponding cluster centers.

  • 3rd column : Outlier score, which is the distance value.

DataFrame 3, centers of clusters produced by k-means algorithm, structured as follows:

  • 1st column : ID of cluster center.

  • other columns : Coordinate (i.e. feature) values of cluster center.

Inherited Methods from PALBase

Besides the methods listed above, the KMeansOutlier class also inherits methods from the PALBase class. Please refer to PAL Base for more details.