OutlierDetectionTS

class hana_ml.algorithms.pal.tsa.outlier_detection.OutlierDetectionTS(auto=None, detect_intermittent_ts=None, smooth_method=None, window_size=None, loess_lag=None, current_value_flag=None, outlier_method=None, threshold=None, detect_seasonality=None, alpha=None, extrapolation=None, periods=None, random_state=None, n_estimators=None, max_samples=None, bootstrap=None, contamination=None, minpts=None, eps=None, thread_ratio=None)

Outlier detection for time-series.

Parameters:

auto : bool, optional

True : automatic method to get residual.
False : manual method to get residual.

Defaults to True.

detect_intermittent_ts : bool, optional

True : detects whether the time series is intermittent.
False : does not detect whether the time series is intermittent.

Only valid when auto is True. If the input data is an intermittent time series, outlier detection is not performed.

Defaults to False.

smooth_method : str, optional

The method used to get the residual.

'median' : median filter.
'loess' : LOESS (locally estimated scatterplot smoothing) or LOWESS (locally weighted scatterplot smoothing) is a locally weighted linear regression method. It is applicable to non-seasonal time series and is also suitable for non-smooth time series.
'super' : super smoother. This method combines a set of LOESS methods. Like LOESS, this method is applicable to non-seasonal time series. This method is also suitable for non-smooth time series.

Only valid when auto is False.

Defaults to 'median'.

window_size : int, optional

The window size for the median filter. Must be an odd number not less than 3.

The value 1 means the median filter is not applied. Only valid when auto is False and smooth_method is 'median'.

Defaults to 3.
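
As a rough illustration of how a median filter yields a residual, the sketch below subtracts a centered running median from each value. It is not PAL's implementation: the function name median_filter_residual is hypothetical, and leaving the first and last points unsmoothed (residual 0) is an assumption chosen because it agrees with the residuals shown in the Examples section. Only the odd window_size >= 3 case is covered.

```python
from statistics import median


def median_filter_residual(values, window_size=3):
    """Return value minus centered running median for each point.

    Hypothetical helper for illustration only, not part of hana_ml.
    """
    if window_size < 3 or window_size % 2 == 0:
        raise ValueError("window_size must be an odd number >= 3")
    half = window_size // 2
    residuals = []
    for i, value in enumerate(values):
        if i < half or i >= len(values) - half:
            # Assumption: points without a full window are left unsmoothed.
            residuals.append(0.0)
        else:
            window = values[i - half:i + half + 1]
            residuals.append(value - median(window))
    return residuals


# Last six RAW_DATA values from the Examples section (IDs 15 to 20).
tail = [5.3, 10.0, 4.6, 4.4, 4.8, 5.1]
residuals = median_filter_residual(tail, window_size=3)
print([round(r, 6) for r in residuals])
```

The spike at ID 16 keeps a large residual because the median of its neighbours ignores it, which is what makes the median filter useful for isolating outliers.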

loess_lag : int, optional

The lag for LOESS. Must be an odd number not less than 3.

Only valid when auto is False and smooth_method is 'loess'.

Defaults to 7.

current_value_flag : bool, optional

Whether to take the current data point when using the LOESS smoothing method.

True : takes the current data point.
False : does not take the current data point.

For example, to estimate the value at time t with the window [t-3, t-2, t-1, t, t+1, t+2, t+3], taking the current data point means estimating the value at t with the real data points at [t-3, t-2, t-1, t, t+1, t+2, t+3], while not taking the current data point means estimating the value at t with the real data points at [t-3, t-2, t-1, t+1, t+2, t+3], without the real data point at t.

Only valid when auto is False and smooth_method is 'loess'.

Defaults to False.
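
The window described above can be sketched as a small index helper. The name loess_window is hypothetical, and mapping a lag of 7 to the symmetric window [t-3, ..., t+3] is an assumption based on the example in the text, not a statement about PAL internals.

```python
def loess_window(t, lag=7, current_value_flag=False):
    """Return the time indices used to estimate the value at time t.

    Hypothetical helper for illustration only, not part of hana_ml.
    """
    half = lag // 2
    indices = list(range(t - half, t + half + 1))
    if not current_value_flag:
        # Leave out the point being estimated.
        indices.remove(t)
    return indices


with_current = loess_window(10, lag=7, current_value_flag=True)
without_current = loess_window(10, lag=7, current_value_flag=False)
print(with_current)     # [7, 8, 9, 10, 11, 12, 13]
print(without_current)  # [7, 8, 9, 11, 12, 13]
```

Excluding the current point keeps a suspicious value from pulling its own estimate towards itself.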

outlier_method : str, optional

The method for calculating the outlier score from the residual.

'z1' : Z1 score.
'z2' : Z2 score.
'iqr' : IQR score.
'mad' : MAD score.
'isolationforest' : isolation forest score.
'dbscan' : DBSCAN.

Defaults to 'z1'.

threshold : float, optional

The threshold for outlier score. If the absolute value of outlier score is beyond the threshold, we consider the corresponding data point as an outlier.

Only valid when outlier_method = 'iqr', 'isolationforest', 'mad', 'z1', 'z2'. For outlier_method = 'isolationforest', when contamination is provided, threshold is not valid and outliers are decided by contamination.

Defaults to 3 when outlier_method is 'mad', 'z1' and 'z2'. Defaults to 1.5 when outlier_method is 'iqr'. Defaults to 0.7 when outlier_method is 'isolationforest'.
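
The threshold rule and its per-method defaults can be summarised in a short sketch. The helper is_outlier and the DEFAULT_THRESHOLDS table are hypothetical names that restate the defaults listed above; treating "beyond" as a strict comparison is an assumption. The contamination override for 'isolationforest' is not modelled here.

```python
# Defaults as listed in the parameter description above.
DEFAULT_THRESHOLDS = {
    'z1': 3.0,
    'z2': 3.0,
    'mad': 3.0,
    'iqr': 1.5,
    'isolationforest': 0.7,
}


def is_outlier(score, outlier_method='z1', threshold=None):
    """Flag a point when the absolute outlier score exceeds the threshold.

    Hypothetical helper for illustration only, not part of hana_ml.
    """
    if outlier_method not in DEFAULT_THRESHOLDS:
        raise ValueError("threshold does not apply to " + repr(outlier_method))
    if threshold is None:
        threshold = DEFAULT_THRESHOLDS[outlier_method]
    # Assumption: "beyond the threshold" means strictly greater.
    return abs(score) > threshold


# Scores taken from the Examples section output.
flag_spike = is_outlier(3.075387, outlier_method='z1')
flag_normal = is_outlier(-0.441392, outlier_method='z1')
print(flag_spike, flag_normal)  # True False
```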

detect_seasonality : bool, optional

When calculating the residual,

False: Does not consider the seasonal decomposition.
True: Considers the seasonal decomposition.

Only valid when auto is False and smooth_method is 'median'.

Defaults to False.

alpha : float, optional

The criterion for the autocorrelation coefficient. The value range is (0, 1).

A larger value indicates a stricter requirement for seasonality.

Only valid when detect_seasonality is True.

Defaults to 0.2 if auto is False and defaults to 0.4 if auto is True.

extrapolation : bool, optional

Specifies whether to extrapolate the endpoints. Set to True when there is an end-point issue.

Only valid when detect_seasonality is True.

Defaults to False if auto is False and defaults to True if auto is True.

periods : int, optional

When this parameter is not specified, the algorithm searches for the seasonal period. When this parameter is set to a value between 2 and half of the series length, the autocorrelation value is calculated for this number of periods and the result is compared to the alpha parameter. If the correlation value is equal to or higher than alpha, decomposition is executed with the value of periods. Otherwise, the residual is calculated without decomposition. For any other value of periods, the residual is also calculated without decomposition.

Only valid when detect_seasonality is True. If the user knows the seasonal period, specifying periods can speed up the calculation, especially when the time series is long.

No default value.
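
The decision rule for a user-specified periods value can be sketched as follows. Both function names are hypothetical, the autocorrelation shown is the common textbook sample estimator (the exact estimator used by PAL is not stated here), and treating the bounds 2 and half the series length as inclusive is an assumption.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag (textbook estimator)."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        return 0.0
    num = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return num / denom


def should_decompose(series, periods, alpha=0.2):
    """Return True when seasonal decomposition would be applied.

    Hypothetical helper for illustration only, not part of hana_ml.
    """
    # Assumption: both bounds are inclusive.
    if not 2 <= periods <= len(series) // 2:
        return False
    return autocorrelation(series, periods) >= alpha


# A toy series that repeats every 4 points.
wave = [0.0, 1.0, 0.0, -1.0] * 6
print(should_decompose(wave, periods=4))   # True
print(should_decompose(wave, periods=2))   # False
print(should_decompose(wave, periods=13))  # False
```

With periods=4 the series lines up with itself and the autocorrelation is well above alpha. With periods=2 it is negative, and periods=13 exceeds half the series length, so neither would trigger decomposition in this sketch.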

random_state : int, optional

Specifies the seed for random number generator.

0: Uses the current time (in seconds) as seed.
Others: Uses the specified value as seed.

Only valid when outlier_method is 'isolationforest'.

Defaults to 0.

n_estimators : int, optional

Specifies the number of trees to grow.

Only valid when outlier_method is 'isolationforest'.

Defaults to 100.

max_samples : int, optional

Specifies the number of samples to draw from input to train each tree. If max_samples is larger than the number of samples provided, all samples will be used for all trees.

Only valid when outlier_method is 'isolationforest'.

Defaults to 256.

bootstrap : bool, optional

Specifies sampling method.

False: Sampling without replacement.
True: Sampling with replacement.

Only valid when outlier_method is 'isolationforest'.

Defaults to False.

contamination : double, optional

The proportion of outliers in the data set. Should be in the range (0, 0.5].

Only valid when outlier_method is 'isolationforest'. When outlier_method is 'isolationforest' and contamination is specified, threshold is not valid.

No default value.
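
The interplay between threshold and contamination for 'isolationforest' can be pictured with the sketch below. It is only one plausible reading: the helper name flag_isolation_forest is hypothetical, and flagging the floor(contamination * n) highest-scoring points is an assumption about how a contamination proportion could be turned into a decision, not a description of PAL's algorithm. The scores are made-up illustrative values.

```python
import math


def flag_isolation_forest(scores, threshold=0.7, contamination=None):
    """Flag outliers either by score threshold or by expected proportion.

    Hypothetical helper for illustration only, not part of hana_ml.
    """
    if contamination is None:
        # No contamination given: fall back to the threshold rule.
        return [abs(s) > threshold for s in scores]
    if not 0 < contamination <= 0.5:
        raise ValueError("contamination must be in the range (0, 0.5]")
    # Assumption: flag the top floor(contamination * n) scores.
    n_outliers = math.floor(contamination * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_outliers])
    return [i in flagged for i in range(len(scores))]


# Illustrative scores only, not produced by PAL.
scores = [0.45, 0.50, 0.82, 0.48, 0.71, 0.40, 0.52, 0.49, 0.47, 0.51]
by_threshold = flag_isolation_forest(scores, threshold=0.7)
by_contamination = flag_isolation_forest(scores, contamination=0.1)
print(sum(by_threshold), sum(by_contamination))  # 2 1
```

In this sketch the threshold is ignored as soon as contamination is supplied, mirroring the rule stated above.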

minpts : int, optional

Specifies the minimum number of points required to form a cluster. The point itself is not included in minpts.

Only valid when outlier_method is 'dbscan'.

Defaults to 1.

eps : float, optional

Specifies the scan radius.

Only valid when outlier_method is 'dbscan'.

Defaults to 0.5.

thread_ratio : float, optional

The ratio of available threads.

0: single thread.
0~1: percentage.
Others: heuristically determined.

Only valid when detect_seasonality is True, outlier_method is 'isolationforest' or 'dbscan', or auto is True.

Defaults to -1.

References

Outlier detection methods implemented in this class generally consist of two steps:

Residual Extraction

Outlier Detection from Residual

Please refer to the above links for a detailed description of all methods and their related parameters.

Examples

Time series DataFrame df:

>>> df.collect()
    ID  RAW_DATA
  1       2.0
  2       2.5
  3       3.2
  4       2.8
......
15       5.3
16      10.0
17       4.6
18       4.4
19       4.8
20       5.1

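Initialize the class and detect outliers:

>>> from hana_ml.algorithms.pal.tsa.outlier_detection import OutlierDetectionTS
>>> tsod = OutlierDetectionTS(detect_seasonality=False,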
                              outlier_method='z1',
                              window_size=3,
                              threshold=3.0)
>>> res = tsod.fit_predict(data=df,
                           key='ID',
                           endog='RAW_DATA')

Outputs and attributes:

>>> res.collect()
    TIMESTAMP  RAW_DATA  RESIDUAL  OUTLIER_SCORE  IS_OUTLIER
         1       2.0       0.0      -0.297850           0
         2       2.5       0.0      -0.297850           0
         3       3.2       0.4      -0.010766           0
......
       14       5.1       0.0      -0.297850           0
       15       5.3       0.0      -0.297850           0
       16      10.0       4.7       3.075387           1
       17       4.6       0.0      -0.297850           0
       18       4.4      -0.2      -0.441392           0
       19       4.8       0.0      -0.297850           0
       20       5.1       0.0      -0.297850           0

>>> tsod.stats_.collect()
            STAT_NAME STAT_VALUE
DETECT_SEASONALITY          0
        OutlierNum          1
              Mean      0.415
Standard Deviation    1.39332
        HandleZero          0
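
The scores in this output appear to be consistent with a plain z-score of the residual, computed from the Mean and Standard Deviation reported in stats_. The sketch below checks that arithmetic. This is an observation drawn from the numbers shown, not a documented definition of 'z1', and the helper name z_score is hypothetical.

```python
def z_score(residual, mean, std):
    """Standardise a residual using the given mean and standard deviation."""
    return (residual - mean) / std


# Values copied from tsod.stats_ in the example above.
MEAN = 0.415
STD = 1.39332

# Residuals copied from the example result (IDs 1, 3, 16 and 18).
example_residuals = [0.0, 0.4, 4.7, -0.2]
scores = [z_score(r, MEAN, STD) for r in example_residuals]
for residual, score in zip(example_residuals, scores):
    print(residual, round(score, 4))
```

Only the residual at ID 16 yields a score whose absolute value exceeds the threshold of 3.0, which matches the single row flagged with IS_OUTLIER = 1.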

Attributes:

stats_ : DataFrame

Data statistics, structured as follows:

STAT_NAME : Name of statistics.
STAT_VALUE : Value of statistics.

metrics_ : DataFrame

Relevant metrics, structured as follows:

NAME : Metric name.
VALUE : Metric value.

Methods

fit_predict(data[, key, endog])

Detection of outliers in time-series data.

fit_predict(data, key=None, endog=None)

Detection of outliers in time-series data.

Parameters:

data : DataFrame

Input data containing the target time-series.

data should have at least two columns: an ID column and a raw data column.

key : str, optional

Specifies the ID column, in this case the column that shows the order of time-series.

It is recommended that you always specify this column manually.

Defaults to the first column of data if the index column of data is not provided. Otherwise, defaults to the index column of data.

endog : str, optional

Specifies the column that contains the values of time-series to be tested.

Defaults to the first non-key column.

Returns:

DataFrame

Outlier detection result, structured as follows:

TIMESTAMP : ID of data.
RAW_DATA : Original value.
RESIDUAL : Residual.
OUTLIER_SCORE : Outlier score.
IS_OUTLIER : 0: normal, 1: outlier.

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods mentioned above, the OutlierDetectionTS class also inherits methods from the PALBase class. Please refer to PAL Base for more details.