class hana_ml.algorithms.pal.clustering.KMeansOutlier(n_clusters=None, distance_level=None, contamination=None, sum_distance=True, init=None, max_iter=None, normalization=None, tol=None, distance_threshold=None)

Outlier detection based on k-means clustering. The data is first clustered with the k-means algorithm, and the points farthest from the cluster centers are reported as outliers.

n_clusters : int, optional

Number of clusters to be grouped.

If this number is not specified, the G-means method will be used to determine the number of clusters.

distance_level : {'manhattan', 'euclidean', 'minkowski'}, optional

Specifies the type of distance computed between data points and cluster centers.

  • 'manhattan' : Manhattan distance

  • 'euclidean' : Euclidean distance

  • 'minkowski' : Minkowski distance

Defaults to 'euclidean'.
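
For orientation, the three distance types can be sketched in plain Python. This is an illustration only, not PAL's implementation: the helper names are hypothetical, and the Minkowski exponent `p` is an assumption, since this parameter list does not say which power is used.

```python
def manhattan(x, c):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, c))


def euclidean(x, c):
    """Square root of the sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5


def minkowski(x, c, p=3):
    """Generalized distance: p=1 gives Manhattan, p=2 gives Euclidean.

    The exponent p is a hypothetical argument for this sketch.
    """
    return sum(abs(a - b) ** p for a, b in zip(x, c)) ** (1.0 / p)


point, center = (-1.0, -1.0), (2.0, 3.0)
print(manhattan(point, center))   # 7.0
print(euclidean(point, center))   # 5.0
```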

contamination : float, optional

Specifies the proportion of outliers within the input data to be detected.

Expected to be a positive number no greater than 1.

Defaults to 0.1.

sum_distance : bool, optional

Specifies whether to use the sum of a point's distances to all cluster centers as its distance value for the outlier score. If False, only the distance from the point to the center of the cluster it belongs to is used.

Defaults to True.
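
The two scoring modes can be pictured with a small sketch. It assumes Euclidean distance and a hypothetical helper `outlier_score`; it shows the idea described above and is not the code PAL runs.

```python
import math


def outlier_score(point, centers, assigned, sum_distance=True):
    """Distance value of one point, using Euclidean distance.

    centers  : list of cluster-center coordinates
    assigned : index of the center the point belongs to
    """
    distances = [math.dist(point, c) for c in centers]
    if sum_distance:
        return sum(distances)      # distances to all centers, summed
    return distances[assigned]     # distance to the point's own center only


centers = [(0.0, 0.0), (3.0, 4.0)]
point = (3.0, 0.0)                 # nearest center is index 0
print(outlier_score(point, centers, assigned=0, sum_distance=True))   # 7.0
print(outlier_score(point, centers, assigned=0, sum_distance=False))  # 3.0
```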

init : {'first_k', 'replace', 'no_replace', 'patent'}, optional

Controls how the initial centers are selected:

  • 'first_k': First k observations.

  • 'replace': Random with replacement.

  • 'no_replace': Random without replacement.

  • 'patent': Patented method for selecting the initial centers (US 6,882,998 B1).

Defaults to 'patent'.

max_iter : int, optional

Maximum number of iterations for k-means clustering.

Defaults to 100.

normalization : {'no', 'l1_norm', 'min_max'}, optional

Normalization type.

  • 'no': No normalization will be applied.

  • 'l1_norm': For each point X(x1, x2, ..., xn), the normalized value is X'(x1/S, x2/S, ..., xn/S), where S = |x1| + |x2| + ... + |xn|.

  • 'min_max': For each column C, the min and max values of C are computed, and then C[i] = (C[i] - min)/(max - min).

Defaults to 'no'.
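
The two normalization formulas can be checked by hand with a short sketch. The function names are hypothetical and the code works on plain lists of rows. It does not handle an all-zero row or a constant column (both would divide by zero), and how PAL treats those cases is not stated here.

```python
def l1_norm(rows):
    """Scale each point by S = |x1| + |x2| + ... + |xn| (row-wise)."""
    out = []
    for row in rows:
        s = sum(abs(v) for v in row)
        out.append([v / s for v in row])
    return out


def min_max(rows):
    """Rescale each column to [0, 1] via (C[i] - min) / (max - min)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
            for row in rows]


rows = [[1.0, 3.0], [2.0, 2.0], [3.0, 4.0]]
print(l1_norm(rows)[0])   # [0.25, 0.75]
print(min_max(rows)[0])   # [0.0, 0.5]
```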

tol : float, optional

Convergence threshold for exiting iterations.

Defaults to 1.0e-9.

distance_threshold : float, optional

Specifies the threshold distance value for outlier detection.

A point whose distance value is no greater than the threshold is not considered an outlier.

Defaults to -1.
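
One way to read contamination and distance_threshold together is sketched below. Both the order of the two steps and the rounding of the candidate count are assumptions made for illustration, and `select_outliers` is a hypothetical helper, not part of hana_ml. Under this reading, the default threshold of -1 filters nothing out, because distance values are never negative.

```python
def select_outliers(scores, contamination=0.1, distance_threshold=-1):
    """Return IDs of candidate outliers, highest score first.

    scores : dict mapping point ID -> distance value (outlier score)
    """
    # Assumption: the candidate count is the proportion rounded down.
    n_candidates = int(contamination * len(scores))
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Points whose distance value does not exceed the threshold are dropped.
    return [i for i in ranked[:n_candidates] if scores[i] > distance_threshold]


scores = {i: float(i) for i in range(8)}   # toy scores 0.0 .. 7.0
print(select_outliers(scores, contamination=0.25))                        # [7, 6]
print(select_outliers(scores, contamination=0.25, distance_threshold=6))  # [7]
```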


Input data df:

>>> df.collect()
    ID  V000  V001
0    0   0.5   0.5
1    1   1.5   0.5
19  19  15.7   1.6
20  20  -1.0  -1.0

Initialize a KMeansOutlier instance and run outlier detection:

>>> kmsodt = KMeansOutlier(distance_level='euclidean',
...                        contamination=0.15,
...                        sum_distance=True,
...                        distance_threshold=3)
>>> outliers, stats, centers = kmsodt.fit_predict(data=df, key='ID')
>>> outliers.collect()
   ID  V000  V001
0  20  -1.0  -1.0
1  16  16.5   0.5
2  12  16.5  16.5
>>> stats.collect()
0  20           2  60.619864
1  16           1  54.110424
2  12           3  53.954274


fit_predict(data, key=None, features=None, thread_number=None)

Performs k-means clustering on an input dataset and extracts the corresponding outliers.


data : DataFrame

Input data for outlier detection.

key : str, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be specified.

features : str or list of str, optional

Names of the feature columns in data that are used to calculate distances between points for clustering.

Feature columns must be numerical.

Defaults to all non-key columns if not provided.

thread_number : int, optional

Specifies the number of threads that can be used by this function.

Defaults to 1.


DataFrame 1, detected outliers, structured as follows:

  • 1st column : ID of detected outliers in data.

  • other columns : feature values for detected outliers

DataFrame 2, statistics of detected outliers, structured as follows:

  • 1st column : ID of detected outliers in data.

  • 2nd column : ID of the corresponding cluster centers.

  • 3rd column : Outlier score, which is the distance value.

DataFrame 3, centers of clusters produced by k-means algorithm, structured as follows:

  • 1st column : ID of cluster center.

  • other columns : Coordinate (i.e. feature) values of cluster center.

Inherited Methods from PALBase

In addition to the methods described above, the KMeansOutlier class inherits methods from the PALBase class. Refer to PAL Base for more details.