KMeansOutlier
- class hana_ml.algorithms.pal.clustering.KMeansOutlier(n_clusters=None, distance_level=None, contamination=None, sum_distance=True, init=None, max_iter=None, normalization=None, tol=None, distance_threshold=None)
Outlier detection of datasets using k-means clustering.
- Parameters:
- n_clusters : int, optional
Number of clusters to be grouped.
If this number is not specified, the G-means method will be used to determine the number of clusters.
- distance_level : {'manhattan', 'euclidean', 'minkowski'}, optional
Specifies the type of distance computed between data points and cluster centers.
'manhattan' : Manhattan distance
'euclidean' : Euclidean distance
'minkowski' : Minkowski distance
Defaults to 'euclidean'.
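As a rough illustration of the three distance types listed above, the following minimal sketch computes each one in plain Python. It is not the PAL implementation. The `power` argument of `minkowski` is a hypothetical parameter introduced only for this sketch; the class signature above does not expose a Minkowski power.

```python
import math


def manhattan(point, center):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(point, center))


def euclidean(point, center):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point, center)))


def minkowski(point, center, power=3.0):
    # Generalized distance; `power` is an assumption made for this sketch.
    # power=1 reduces to Manhattan, power=2 to Euclidean.
    return sum(abs(a - b) ** power for a, b in zip(point, center)) ** (1.0 / power)


print(manhattan((0, 0), (3, 4)))
print(euclidean((0, 0), (3, 4)))
print(minkowski((0, 0), (3, 4)))
```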
- contamination : float, optional
Specifies the proportion of outliers within the input data to be detected.
Expected to be a positive number no greater than 1.
Defaults to 0.1.
- sum_distance : bool, optional
Specifies whether to use the sum of a point's distances to all cluster centers as its distance value for the outlier score. If False, only the distance from a point to the center of the cluster it belongs to is used as its distance value.
Defaults to True.
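The difference between the two settings can be sketched as follows. This is an illustration of the description above using Euclidean distance, not the PAL implementation; the function name `outlier_score` and the `assigned` argument (index of the cluster the point belongs to) are hypothetical.

```python
import math


def outlier_score(point, centers, assigned, sum_distance=True):
    # Distance from the point to every cluster center.
    dists = [math.dist(point, c) for c in centers]
    if sum_distance:
        # Score is the sum of distances to all centers.
        return sum(dists)
    # Score is the distance to the point's own cluster center only.
    return dists[assigned]


centers = [(0.0, 0.0), (3.0, 4.0)]
print(outlier_score((0.0, 0.0), centers, assigned=0, sum_distance=True))
print(outlier_score((0.0, 0.0), centers, assigned=0, sum_distance=False))
```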
- init : {'first_k', 'replace', 'no_replace', 'patent'}, optional
Controls how the initial centers are selected:
'first_k': First k observations.
'replace': Random with replacement.
'no_replace': Random without replacement.
'patent': Patented method for selecting the initial centers (US 6,882,998 B1).
Defaults to 'patent'.
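The three non-patented strategies can be pictured with the sketch below. It only illustrates the option descriptions above and is not the PAL implementation; the function name `initial_centers` and the `seed` argument are hypothetical, and the 'patent' option is deliberately not sketched.

```python
import random


def initial_centers(points, k, init="first_k", seed=None):
    rng = random.Random(seed)
    if init == "first_k":
        # Take the first k observations as centers.
        return list(points[:k])
    if init == "replace":
        # Draw k observations at random; the same one may be drawn twice.
        return [rng.choice(points) for _ in range(k)]
    if init == "no_replace":
        # Draw k distinct observations at random.
        return rng.sample(points, k)
    raise ValueError(f"init option not covered by this sketch: {init!r}")


points = [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5)]
print(initial_centers(points, 2, init="first_k"))
```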
- max_iter : int, optional
Maximum number of iterations for k-means clustering.
Defaults to 100.
- normalization : {'no', 'l1_norm', 'min_max'}, optional
Normalization type.
'no': No normalization will be applied.
'l1_norm': For each point X = (x1, x2, ..., xn), the normalized value is X' = (x1/S, x2/S, ..., xn/S), where S = |x1| + |x2| + ... + |xn|.
'min_max': For each column C, the minimum and maximum values of C are computed, and then C[i] = (C[i] - min) / (max - min).
Defaults to 'no'.
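The two normalization formulas above can be written out as a short sketch. This is not the PAL implementation. How degenerate inputs are handled (a row whose absolute values sum to zero, or a constant column) is not stated in this page, so the fallbacks in the code are assumptions.

```python
def l1_normalize(rows):
    # Divide each row by the sum of the absolute values of its entries.
    result = []
    for row in rows:
        s = sum(abs(x) for x in row)
        # Assumption: a row with S == 0 is left unchanged.
        result.append([x / s for x in row] if s else list(row))
    return result


def min_max_normalize(rows):
    # Rescale each column to [0, 1] using its own minimum and maximum.
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # Assumption: a constant column (max == min) maps to 0.0.
        [(x - lo) / (hi - lo) if hi != lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


print(l1_normalize([[1, 3]]))
print(min_max_normalize([[0, 10], [5, 20], [10, 30]]))
```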
- tol : float, optional
Convergence threshold for exiting iterations in k-means clustering.
Defaults to 1.0e-9.
- distance_threshold : float, optional
Specifies the threshold distance value for outlier detection.
A point whose distance value is no greater than the threshold is not considered to be an outlier.
Defaults to -1.
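One plausible reading of how contamination and distance_threshold interact is sketched below: rank points by distance value, keep the top fraction given by contamination, then drop any candidate whose value does not exceed the threshold. This is an assumption made for illustration, not a description of the PAL implementation; in particular, rounding the candidate count down is assumed, and `select_outliers` is a hypothetical helper.

```python
def select_outliers(scores, contamination=0.1, distance_threshold=-1.0):
    # scores maps each point ID to its distance value (outlier score).
    # Assumption: the number of candidates is rounded down.
    n_candidates = int(len(scores) * contamination)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Keep only candidates whose score is strictly greater than the threshold.
    return [(pid, s) for pid, s in ranked[:n_candidates] if s > distance_threshold]


# 21 hypothetical points whose score equals their ID.
scores = {i: float(i) for i in range(21)}
print(select_outliers(scores, contamination=0.15))
print(select_outliers(scores, contamination=0.15, distance_threshold=18.5))
```

Under this reading, 21 input rows with contamination=0.15 give 3 candidates, and a default threshold of -1 filters nothing because distance values are non-negative.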
Examples
Input data for outlier detection:
>>> df.collect()
    ID  V000  V001
0    0   0.5   0.5
1    1   1.5   0.5
2    2   1.5   1.5
3    3   0.5   1.5
4    4   1.1   1.2
5    5   0.5  15.5
6    6   1.5  15.5
7    7   1.5  16.5
8    8   0.5  16.5
9    9   1.2  16.1
10  10  15.5  15.5
11  11  16.5  15.5
12  12  16.5  16.5
13  13  15.5  16.5
14  14  15.6  16.2
15  15  15.5   0.5
16  16  16.5   0.5
17  17  16.5   1.5
18  18  15.5   1.5
19  19  15.7   1.6
20  20  -1.0  -1.0
Initialize the class instance and run outlier detection:
>>> kmsodt = KMeansOutlier(distance_level='euclidean',
...                        contamination=0.15,
...                        sum_distance=True,
...                        distance_threshold=3)
>>> outliers, stats, centers = kmsodt.fit_predict(df, key='ID')
>>> outliers.collect()
   ID  V000  V001
0  20  -1.0  -1.0
1  16  16.5   0.5
2  12  16.5  16.5
>>> stats.collect()
   ID  CLUSTER_ID      SCORE
0  20           2  60.619864
1  16           1  54.110424
2  12           3  53.954274
- Attributes:
fit_hdbprocedure
Returns the generated hdbprocedure for fit.
predict_hdbprocedure
Returns the generated hdbprocedure for predict.
Methods
fit_predict(data[, key, features, thread_number])
Performing k-means clustering on an input dataset and extracting the corresponding outliers.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- fit_predict(data, key=None, features=None, thread_number=None)
Performing k-means clustering on an input dataset and extracting the corresponding outliers.
- Parameters:
- data : DataFrame
Input data for outlier detection using k-means clustering.
- key : str, optional
Name of the ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features : str or a list of str, optional
Names of the feature columns in data that are used for calculating distances between points for clustering.
Feature columns must be numerical.
Defaults to all non-key columns if not provided.
- thread_number : int, optional
Specifies the number of threads that can be used by this function.
Defaults to 1.
- Returns:
- DataFrame
DataFrame 1, detected outliers, structured as follows:
1st column : ID of detected outliers in data.
other columns : feature values for detected outliers.
DataFrame 2, statistics of detected outliers, structured as follows:
1st column : ID of detected outliers in data.
2nd column : ID of the corresponding cluster centers.
3rd column : Outlier score, which is the distance value.
DataFrame 3, centers of clusters produced by k-means algorithm, structured as follows:
1st column : ID of cluster center.
other columns : Coordinate (i.e. feature) values of cluster center.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
Inherited Methods from PALBase
Besides the methods mentioned above, the KMeansOutlier class also inherits methods from the PALBase class. Please refer to PAL Base for more details.