outlier_detection_kmeans
- hana_ml.algorithms.pal.clustering.outlier_detection_kmeans(data, key=None, features=None, n_clusters=None, distance_level=None, contamination=None, sum_distance=True, init=None, max_iter=None, normalization=None, tol=None, distance_threshold=None, thread_number=None)
Outlier detection based on k-means clustering. The data is clustered with the k-means algorithm, and the points that lie farthest from the cluster centers are reported as outliers.
- Parameters:
- data : DataFrame
Input data.
- key : str, optional
Name of ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features : str or ListOfStrings, optional
Names of the feature columns in data that are used for calculating distances between points in data for clustering.
Feature columns must be numerical.
Defaults to all non-key columns if not provided.
- n_clusters : int, optional
Number of clusters to be grouped.
If this number is not specified, the G-means method will be used to determine the number of clusters.
- distance_level : {'manhattan', 'euclidean', 'minkowski'}, optional
Specifies the type of distance measured between data points and cluster centers.
'manhattan' : Manhattan distance.
'euclidean' : Euclidean distance.
'minkowski' : Minkowski distance.
Defaults to 'euclidean'.
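The three distance_level options name standard metrics. The following is a minimal pure-Python sketch for illustration only, not the PAL implementation: the helper names and the Minkowski exponent p are assumptions introduced here, since the signature above exposes no exponent parameter.

```python
def manhattan(x, c):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, c))


def euclidean(x, c):
    """Square root of the sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5


def minkowski(x, c, p=3):
    """Generalized distance; p is a hypothetical exponent chosen for illustration."""
    return sum(abs(a - b) ** p for a, b in zip(x, c)) ** (1.0 / p)


point, center = (-1.0, -1.0), (2.0, 3.0)
print(manhattan(point, center))  # 7.0
print(euclidean(point, center))  # 5.0
print(minkowski(point, center))
```

With p=1 the Minkowski distance reduces to the Manhattan distance, and with p=2 to the Euclidean distance.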
- contamination : float, optional
Specifies the proportion of outliers in data.
Expected to be a positive number no greater than 1.
Defaults to 0.1.
- sum_distance : bool, optional
Specifies whether to use the sum of the distances from a point to all cluster centers as its distance value for the outlier score. If False, only the distance from a point to the center of the cluster it belongs to is used as its distance value.
Defaults to True.
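The difference between the two sum_distance settings can be sketched as follows. This is an illustration under stated assumptions, not the PAL implementation: the function outlier_score, its own_cluster argument, the Euclidean metric, and the cluster centers are all made up for the example.

```python
import math


def outlier_score(point, centers, own_cluster, sum_distance=True):
    """Distance value of a point (hypothetical helper, Euclidean metric assumed)."""
    dists = [math.dist(point, c) for c in centers]
    if sum_distance:
        # Sum of the distances to every cluster center.
        return sum(dists)
    # Only the distance to the center of the point's own cluster.
    return dists[own_cluster]


centers = [(0.0, 0.0), (3.0, 4.0)]  # made-up centers
point = (3.0, 0.0)                  # assumed to belong to cluster 0

print(outlier_score(point, centers, own_cluster=0, sum_distance=True))   # 7.0
print(outlier_score(point, centers, own_cluster=0, sum_distance=False))  # 3.0
```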
- init : {'first_k', 'replace', 'no_replace', 'patent'}, optional
Controls how the initial centers are selected:
'first_k': First k observations.
'replace': Random with replacement.
'no_replace': Random without replacement.
'patent': Initial centers are selected with the patented method (US 6,882,998 B1).
Defaults to 'patent'.
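The first three init options can be sketched with the standard library. This is an illustrative sketch only: the helper initial_centers and its seed argument are hypothetical, and the 'patent' method is not reproduced here.

```python
import random


def initial_centers(points, k, init="first_k", seed=None):
    """Pick k initial centers from points (hypothetical helper)."""
    rng = random.Random(seed)
    if init == "first_k":
        # The first k observations, in input order.
        return list(points[:k])
    if init == "replace":
        # Random draw with replacement: the same point may be chosen twice.
        return rng.choices(points, k=k)
    if init == "no_replace":
        # Random draw without replacement: k distinct observations.
        return rng.sample(points, k)
    raise ValueError("unsupported init in this sketch: %r" % init)


points = [(0.5, 0.5), (1.5, 0.5), (0.5, 15.5), (15.5, 15.5), (15.5, 0.5)]
print(initial_centers(points, 2, init="first_k"))  # [(0.5, 0.5), (1.5, 0.5)]
print(initial_centers(points, 2, init="no_replace", seed=0))
```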
- max_iter : int, optional
Maximum number of iterations for k-means clustering.
Defaults to 100.
- normalization : {'no', 'l1_norm', 'min_max'}, optional
Normalization type.
'no': No normalization will be applied.
'l1_norm': For each point X = (x1, x2, ..., xn), the normalized value is X' = (x1/S, x2/S, ..., xn/S), where S = |x1| + |x2| + ... + |xn|.
'min_max': For each column C, the min and max values of C are computed, and each entry is rescaled as C[i] = (C[i] - min) / (max - min).
Defaults to 'no'.
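The two normalization formulas can be written out directly. A minimal sketch in pure Python follows; the helper names are introduced here for illustration and are not part of hana_ml. Note that 'l1_norm' operates on each point (row), while 'min_max' operates on each column.

```python
def l1_normalize(row):
    """Divide each coordinate of a point by S = |x1| + ... + |xn| (assumes S != 0)."""
    s = sum(abs(v) for v in row)
    return [v / s for v in row]


def min_max_normalize(column):
    """Rescale a column to [0, 1] via (C[i] - min) / (max - min) (assumes max != min)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


print(l1_normalize([1.0, -3.0]))            # [0.25, -0.75]
print(min_max_normalize([-1.0, 0.5, 2.0]))  # [0.0, 0.5, 1.0]
```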
- tol : float, optional
Convergence threshold for exiting iterations in k-means clustering.
Defaults to 1.0e-9.
- distance_threshold : float, optional
Specifies the threshold distance value for outlier detection.
A point whose distance value is no greater than the threshold is not considered an outlier.
Defaults to -1.
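How contamination and distance_threshold interact is not spelled out above. One plausible reading, shown purely as an assumption, is that the highest-scoring fraction of points is taken as candidates and each candidate must also exceed the threshold. The helper select_outliers and the scores below are hypothetical and do not describe the actual PAL logic.

```python
def select_outliers(scores, contamination=0.1, distance_threshold=-1):
    """Return (id, score) pairs flagged as outliers (assumed selection rule).

    scores maps a point ID to its distance value.
    """
    # Assumption: at most floor(contamination * n) points can be flagged.
    n_candidates = int(contamination * len(scores))
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # A point with a score no greater than the threshold is not an outlier.
    return [(i, s) for i, s in ranked[:n_candidates] if s > distance_threshold]


# Made-up scores for ten points.
scores = {0: 2.0, 1: 9.0, 2: 4.5, 3: 1.0, 4: 7.0,
          5: 3.0, 6: 2.5, 7: 1.5, 8: 0.5, 9: 3.5}

print(select_outliers(scores, contamination=0.2))                        # [(1, 9.0), (4, 7.0)]
print(select_outliers(scores, contamination=0.2, distance_threshold=8))  # [(1, 9.0)]
```

Because distance values are non-negative, a threshold of -1 filters nothing out under this reading.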
- thread_number : int, optional
Specifies the number of threads that can be used by this function.
Defaults to 1.
- Returns:
- DataFrames
DataFrame, detected outliers, structured as follows:
1st column : ID of detected outliers in data.
other columns : feature values for detected outliers.
DataFrame, statistics of detected outliers, structured as follows:
1st column : ID of detected outliers in data.
2nd column : ID of the corresponding cluster centers.
3rd column : Outlier score, which is the distance value.
DataFrame, centers of clusters produced by k-means algorithm, structured as follows:
1st column : ID of cluster center.
other columns : Coordinate (i.e. feature) values of cluster center.
Examples
Input data for outlier detection:
>>> df.collect()
    ID  V000  V001
0    0   0.5   0.5
1    1   1.5   0.5
2    2   1.5   1.5
3    3   0.5   1.5
4    4   1.1   1.2
5    5   0.5  15.5
6    6   1.5  15.5
7    7   1.5  16.5
8    8   0.5  16.5
9    9   1.2  16.1
10  10  15.5  15.5
11  11  16.5  15.5
12  12  16.5  16.5
13  13  15.5  16.5
14  14  15.6  16.2
15  15  15.5   0.5
16  16  16.5   0.5
17  17  16.5   1.5
18  18  15.5   1.5
19  19  15.7   1.6
20  20  -1.0  -1.0
Invoke the function and obtain the results:
>>> from hana_ml.algorithms.pal.clustering import outlier_detection_kmeans
>>> outliers, stats, centers = outlier_detection_kmeans(data=df, key='ID',
...                                                     distance_level='euclidean',
...                                                     contamination=0.15,
...                                                     sum_distance=True,
...                                                     distance_threshold=3)
>>> outliers.collect()
   ID  V000  V001
0  20  -1.0  -1.0
1  16  16.5   0.5
2  12  16.5  16.5
>>> stats.collect()
   ID  CLUSTER_ID      SCORE
0  20           2  60.619864
1  16           1  54.110424
2  12           3  53.954274