OutlierDetectionTS
- class hana_ml.algorithms.pal.tsa.outlier_detection.OutlierDetectionTS(window_size=None, outlier_method=None, threshold=None, detect_seasonality=None, alpha=None, extrapolation=None, periods=None, random_state=None, n_estimators=None, max_samples=None, bootstrap=None, contamination=None, minpts=None, eps=None, thread_ratio=None)
Outlier detection for time-series.
- Parameters
- window_size : int, optional
The window size for the median filter. Must be an odd number not less than 3.
Defaults to 3.
- outlier_method : str, optional
The method for calculating the outlier score from the residual.
'z1' : Z1 score.
'z2' : Z2 score.
'iqr' : IQR score.
'mad' : MAD score.
'isolationforest' : isolation forest score.
'dbscan' : DBSCAN.
Defaults to 'z1'.
- threshold : float, optional
The threshold for the outlier score. If the absolute value of the outlier score exceeds the threshold, the corresponding data point is considered an outlier.
Only valid when outlier_method is 'iqr', 'isolationforest', 'mad', 'z1' or 'z2'. For outlier_method = 'isolationforest', when contamination is provided, threshold is not valid and outliers are decided by contamination.
Defaults to 3 when outlier_method is 'mad', 'z1' or 'z2'. Defaults to 1.5 when outlier_method is 'iqr'. Defaults to 0.7 when outlier_method is 'isolationforest'.
- detect_seasonality : bool, optional
When calculating the residual,
False: Does not consider the seasonal decomposition.
True: Considers the seasonal decomposition.
Defaults to False.
- alpha : float, optional
The criterion for the autocorrelation coefficient. The value range is (0, 1).
A larger value indicates a stricter requirement for seasonality.
Only valid when detect_seasonality is True.
Defaults to 0.2.
- extrapolation : bool, optional
Specifies whether to extrapolate the endpoints. Set to True when there is an end-point issue.
Only valid when detect_seasonality is True.
Defaults to False if auto is False, and to True if auto is True.
- periods : int, optional
When this parameter is not specified, the algorithm searches for the seasonal period. When it is specified with a value between 2 and half of the series length, the autocorrelation value is calculated for this number of periods and the result is compared to the alpha parameter. If the correlation value is equal to or higher than alpha, decomposition is executed with the value of periods. Otherwise, the residual is calculated without decomposition. For any other value of periods, the residual is also calculated without decomposition.
Only valid when detect_seasonality is True. If the seasonal period is known, specifying periods can speed up the calculation, especially when the time series is long.
No default value.
- random_state : int, optional
Specifies the seed for the random number generator.
0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.
Only valid when outlier_method is 'isolationforest'.
Defaults to 0.
- n_estimators : int, optional
Specifies the number of trees to grow.
Only valid when outlier_method is 'isolationforest'.
Defaults to 100.
- max_samples : int, optional
Specifies the number of samples to draw from the input to train each tree. If max_samples is larger than the number of samples provided, all samples are used for all trees.
Only valid when outlier_method is 'isolationforest'.
Defaults to 256.
- bootstrap : bool, optional
Specifies the sampling method.
False: Sampling without replacement.
True: Sampling with replacement.
Only valid when outlier_method is 'isolationforest'.
Defaults to False.
- contamination : float, optional
The proportion of outliers in the data set. Should be in the range (0, 0.5].
Only valid when outlier_method is 'isolationforest'. When contamination is specified, threshold is not valid.
No default value.
- minpts : int, optional
Specifies the minimum number of points required to form a cluster. The point itself is not included in minpts.
Only valid when outlier_method is 'dbscan'.
Defaults to 1.
- eps : float, optional
Specifies the scan radius.
Only valid when outlier_method is 'dbscan'.
Defaults to 0.5.
- thread_ratio : float, optional
The ratio of available threads.
0: single thread.
0~1: percentage.
Others: heuristically determined.
Only valid when detect_seasonality is True or outlier_method is 'isolationforest' or 'dbscan'.
Defaults to -1.
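As a rough illustration of what window_size controls when detect_seasonality is False, the sketch below computes residuals as the raw value minus the median of a centred window. It is a plain-Python approximation, not the PAL implementation: the helper name median_filter_residuals is hypothetical, and leaving the endpoints unfiltered (residual 0) is an assumption, since endpoint handling is not described in this document.

```python
from statistics import median


def median_filter_residuals(values, window_size=3):
    """Return raw value minus the median of a centred window (hypothetical helper)."""
    if window_size < 3 or window_size % 2 == 0:
        raise ValueError("window_size must be an odd number not less than 3")
    half = window_size // 2
    residuals = []
    for i, value in enumerate(values):
        if i < half or i >= len(values) - half:
            # Assumption: points without a full window are left unfiltered.
            residuals.append(0.0)
        else:
            window = values[i - half:i + half + 1]
            residuals.append(value - median(window))
    return residuals


# RAW_DATA for IDs 15 to 20 of the example series in this documentation.
tail = [5.3, 10.0, 4.6, 4.4, 4.8, 5.1]
print([round(r, 1) for r in median_filter_residuals(tail)])
# [0.0, 4.7, 0.0, -0.2, 0.0, 0.0]
```

For these six points the sketch gives the same values as the RESIDUAL column in the example output, which suggests the spike at ID 16 is isolated by the filter while its neighbours are left almost untouched.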
References
The outlier detection methods implemented in this class generally consist of two steps: extracting the residual from the time series, and detecting outliers from the residual.
Please refer to the above links for a detailed description of all methods and their related parameters.
Examples
Time series DataFrame df:
>>> df.collect().head()
    ID  RAW_DATA
0    1       2.0
1    2       2.5
2    3       3.2
3    4       2.8
......
14  15       5.3
15  16      10.0
16  17       4.6
17  18       4.4
18  19       4.8
19  20       5.1
Initialize the class and run the detection:
>>> tsod = OutlierDetectionTS(detect_seasonality=False, outlier_method='z1', window_size=3, threshold=3.0)
>>> res = tsod.fit_predict(data=df, key='ID', endog='RAW_DATA')
Outputs and attributes:
>>> res.collect()
    TIMESTAMP  RAW_DATA  RESIDUAL  OUTLIER_SCORE  IS_OUTLIER
0           1       2.0       0.0      -0.297850           0
1           2       2.5       0.0      -0.297850           0
2           3       3.2       0.4      -0.010766           0
......
13         14       5.1       0.0      -0.297850           0
14         15       5.3       0.0      -0.297850           0
15         16      10.0       4.7       3.075387           1
16         17       4.6       0.0      -0.297850           0
17         18       4.4      -0.2      -0.441392           0
18         19       4.8       0.0      -0.297850           0
19         20       5.1       0.0      -0.297850           0
>>> tsod.stats_.collect()
            STAT_NAME STAT_VALUE
0  DETECT_SEASONALITY          0
1          OutlierNum          1
2                Mean      0.415
3  Standard Deviation    1.39332
4          HandleZero          0
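The scores in this example appear consistent with a standard z-score of the residual, computed with the Mean and Standard Deviation reported in stats_. The sketch below checks that reading against a few rows. The formula is inferred from the example numbers rather than taken from the PAL specification, and z1_score is a hypothetical helper.

```python
# Values reported by tsod.stats_ in the example above.
MEAN = 0.415
STD = 1.39332
THRESHOLD = 3.0  # the threshold passed to OutlierDetectionTS in the example


def z1_score(residual, mean=MEAN, std=STD):
    """Assumed form of the 'z1' score: (residual - mean) / standard deviation."""
    return (residual - mean) / std


# Residuals taken from the RESIDUAL column of the example result.
for residual in (0.0, 0.4, 4.7, -0.2):
    score = z1_score(residual)
    is_outlier = int(abs(score) > THRESHOLD)
    print(f"residual={residual:5.1f}  score={score:9.5f}  is_outlier={is_outlier}")
```

Under this assumption only the residual of 4.7 (ID 16) scores above the threshold of 3.0, in line with OutlierNum = 1.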
- Attributes
- stats_ : DataFrame
Data statistics, structured as follows:
STAT_NAME : Name of statistics.
STAT_VALUE : Value of statistics.
- metrics_ : DataFrame
Relevant metrics for time-series outlier detection, structured as follows:
NAME : Metric name.
VALUE : Metric value.
Methods
fit_predict(data[, key, endog]) : Detection of outliers in time-series data.
- fit_predict(data, key=None, endog=None)
Detection of outliers in time-series data.
- Parameters
- data : DataFrame
Input data containing the target time-series.
data should have at least two columns: an ID column and a raw data column.
- key : str, optional
Specifies the ID column, in this case the column that shows the order of the time-series.
It is recommended that you always specify this column manually.
Defaults to the first column of data if the index column of data is not provided. Otherwise, defaults to the index column of data.
- endog : str, optional
Specifies the column that contains the values of time-series to be tested.
Defaults to the first non-key column.
- Returns
- DataFrame
- Outlier detection result, structured as follows:
TIMESTAMP : ID of data.
RAW_DATA : Original value.
RESIDUAL : Residual.
OUTLIER_SCORE : Outlier score.
IS_OUTLIER : 0: normal, 1: outlier.
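The relationship between OUTLIER_SCORE, threshold and IS_OUTLIER described for the threshold parameter can be sketched as follows. This is an illustration only: flag_outliers and DEFAULT_THRESHOLDS are hypothetical names, "beyond the threshold" is read here as strictly greater than, and the 'isolationforest' case with contamination (where threshold is not valid) is not modelled.

```python
# Default thresholds per outlier_method, as listed in the parameter description.
DEFAULT_THRESHOLDS = {
    "z1": 3.0,
    "z2": 3.0,
    "mad": 3.0,
    "iqr": 1.5,
    "isolationforest": 0.7,
}


def flag_outliers(scores, outlier_method="z1", threshold=None):
    """Return 1 for scores whose absolute value exceeds the threshold, else 0."""
    if outlier_method not in DEFAULT_THRESHOLDS:
        # threshold is documented as not valid for other methods such as 'dbscan'.
        raise ValueError(f"threshold is not valid for outlier_method={outlier_method!r}")
    if threshold is None:
        threshold = DEFAULT_THRESHOLDS[outlier_method]
    return [1 if abs(score) > threshold else 0 for score in scores]


# OUTLIER_SCORE values for IDs 15 to 17 of the example result.
print(flag_outliers([-0.297850, 3.075387, -0.441392]))
# [0, 1, 0]
```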
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
Inherited Methods from PALBase
In addition to the methods described above, the OutlierDetectionTS class inherits methods from the PALBase class. Please refer to PAL Base for more details.