time series outlier detection algorithm — hanaml.OutlierDetectionTS • hana.ml.r

hanaml.OutlierDetectionTS is an R wrapper for SAP HANA PAL outlier detection for time series.

hanaml.OutlierDetectionTS(
  data = NULL,
  key = NULL,
  endog = NULL,
  window.size = NULL,
  outlier.method = NULL,
  threshold = NULL,
  detect.seasonality = NULL,
  alpha = NULL,
  extrapolation = NULL,
  periods = NULL,
  random.state = NULL,
  n.estimators = NULL,
  max.samples = NULL,
  bootstrap = NULL,
  contamination = NULL,
  minpts = NULL,
  eps = NULL,
  thread.ratio = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

endog

character, optional
The endogenous variable, i.e. time series.
Defaults to the first non-ID column.

window.size

integer, optional
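The window size for the median filter. Must be an odd number not less than 3; the value 1 is used only when the residual comes from seasonal decomposition alone (see Methods to Get Residuals).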
Defaults to 3.

outlier.method

character, optional
The method used to calculate the outlier score from the residual.

"z1": Z1 score.
"z2": Z2 score.
"iqr": IQR score.
"mad": MAD score.
"isolationforest": isolation forest score.
"dbscan": DBSCAN.

Defaults to "z1".

threshold

double, optional
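The threshold for the outlier score. A data point is considered an outlier if the absolute value of its outlier score exceeds the threshold.
Defaults to 3 for "z1", "z2" and "mad", 1.5 for "iqr", and 0.7 for "isolationforest", as described under Methods to Detect Outliers from Residual.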

detect.seasonality

logical, optional
Specifies whether seasonal decomposition is considered when calculating the residual.

FALSE: Does not consider the seasonal decomposition.
TRUE: Considers the seasonal decomposition.

Defaults to FALSE.

alpha

double, optional
The criterion for the autocorrelation coefficient.
The valid value range is (0, 1). A larger value indicates a stricter requirement for seasonality.
Only valid when detect.seasonality is TRUE.
Defaults to 0.2.

extrapolation

logical, optional
Specifies whether to extrapolate the endpoints. Set to TRUE when there is an end-point issue.
Only valid when detect.seasonality is TRUE.
Defaults to FALSE.

periods

integer, optional
Specifies the seasonal period. When this parameter is not specified, the algorithm searches for the seasonal period.
When this parameter is set to a value between 2 and half of the series length, the autocorrelation value is calculated for this number of periods and compared with alpha:

If the autocorrelation value is equal to or higher than alpha, decomposition is executed with the value of periods.
Otherwise, the residual is calculated without decomposition.

For any other value of periods, the residual is also calculated without decomposition.
No default value.

random.state

integer, optional
Specifies the seed for random number generator.

0: Uses the current time (in seconds) as seed.
Others: Uses the specified value as seed.

Only valid when outlier.method is 'isolationforest'.
Defaults to 0.

n.estimators

integer, optional
Specifies the number of trees to grow.
Only valid when outlier.method is 'isolationforest'.
Defaults to 100.

max.samples

integer, optional
Specifies the number of samples to draw from input to train each tree. If max.samples is larger than the number of samples provided, all samples will be used for all trees.
Only valid when outlier.method is 'isolationforest'.
Defaults to 256.

bootstrap

logical, optional
Specifies sampling method.

FALSE: Sampling without replacement.
TRUE: Sampling with replacement.

Defaults to FALSE.

contamination

double, optional
The proportion of outliers in the data set. Should be in the range (0, 0.5].
Only valid when outlier.method is 'isolationforest'.
When outlier.method is 'isolationforest' and contamination is specified, threshold is not valid.
Defaults to 0.2.

minpts

integer, optional
Specifies the minimum number of points required to form a cluster. The point itself is not included in minpts.
Only valid when outlier.method is 'dbscan'.
Defaults to 1.

eps

double, optional
Specifies the scan radius.
Only valid when outlier.method is 'dbscan'.
Defaults to 0.5.

thread.ratio

double, optional
The ratio of available threads.

0: single thread.
0~1: percentage.
Others: heuristically determined.

Only valid when detect.seasonality is TRUE.
Defaults to -1.

Value

Returns an "OutlierDetectionTS" object with the following attributes:

result DataFrame
Result of outlier detection.
stats DataFrame
Data statistics related to time-series outlier detection.
metrics DataFrame
Relevant metrics for time-series outlier detection.

Methods to Get Residuals

In many typical outlier detection methods for time-series data, the first step is to decompose the series and extract its residual component. The current implementation of this package offers three methods to extract the residual, each selected by a specific choice of parameters.

Getting Residual from Median Filter

This method is suitable for smooth time series.
Relevant Parameters

window.size: Set to a value no less than 3.
detect.seasonality: Set to FALSE.
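
As a rough illustration of this residual (a sketch, not PAL's implementation), the following base R code applies a running median of width 3 to the first seven RAW_DATA values from the Examples section and subtracts it from the series. It relies on stats::runmed with endrule = "keep", which is an assumption about how end points are handled; PAL may treat them differently.

```r
# First seven RAW_DATA values from the Examples section.
x <- c(2.0, 2.5, 3.2, 2.8, 2.4, 2.9, 3.1)

# Running median of width 3; end points are kept unchanged (an assumption).
smoothed <- as.numeric(stats::runmed(x, k = 3, endrule = "keep"))

# Residual = original series minus its median-filtered version.
residual <- x - smoothed
print(round(residual, 6))
```

For these rows the sketch yields the residuals 0.0, 0.0, 0.4, 0.0, -0.4, 0.0, 0.0, which agree with the RESIDUAL column printed in the Examples output.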

Getting Residual from Seasonal Decompose

This method can handle time-series data with strong seasonality.
Note that if the given time-series data fails the seasonality test, seasonal decomposition is not applied and the residual becomes all zeros.
Relevant Parameters

window.size: Set to 1.
detect.seasonality: Set to TRUE.
periods: Specifies the seasonal period, if it is known. If not provided, the value is detected automatically from the given time-series data through the seasonality test.
alpha: Specifies the significance threshold for the autocorrelation coefficients, with valid range (0, 1). A larger value indicates a stricter requirement for seasonality. If all autocorrelation coefficients are below the specified value, the time-series data is considered to fail the seasonality test (i.e. to have no seasonality).
extrapolation: Setting this parameter to TRUE is suggested, so that end-point issues are handled through end-point extrapolation when seasonal decomposition is applied.
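
The interplay of periods, alpha and the end-point issue can be sketched in base R. This is only an approximation under stated assumptions: the series is hypothetical, the seasonality test is reduced to a single autocorrelation value at the candidate period, and stats::decompose stands in for whatever decomposition PAL performs internally.

```r
# Hypothetical series: period-4 pattern plus a linear trend, with a spike at t = 12.
period <- 4
x <- rep(c(1, 3, 2, 5), times = 6) + 0.1 * seq_len(24)
x[12] <- x[12] + 5

# Seasonality test (assumed form): autocorrelation at the candidate period vs alpha.
alpha <- 0.2
acf_values <- stats::acf(x, lag.max = period, plot = FALSE)$acf
seasonal_acf <- acf_values[period + 1]  # element 1 corresponds to lag 0

if (seasonal_acf >= alpha) {
  # Classical additive decomposition; the remainder serves as the residual.
  parts <- stats::decompose(ts(x, frequency = period), type = "additive")
  residual <- as.numeric(parts$random)
} else {
  # Mirrors the documented behaviour: no decomposition, all-zero residual.
  residual <- rep(0, length(x))
}

# The moving-average trend is undefined at both ends, so the residual is NA there.
print(which(is.na(residual)))
# Position of the largest absolute residual (NA values are ignored).
print(which.max(abs(residual)))
```

In this sketch the residual is missing at the first and last two positions, which is the kind of end-point issue that extrapolation = TRUE is meant to address, and the largest residual falls on the injected spike.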

Getting Residual from Median Filter + Seasonal Decompose

This method can also handle time-series data with strong seasonality. It is better suited to cases where the time-series data becomes relatively smooth after the seasonal component is removed.
Note that if the given time-series data fails the seasonality test, only the median filter is applied.
Relevant Parameters

window.size: Set to a value no less than 3.
detect.seasonality: Set to TRUE.
periods: Specifies the seasonal period, if it is known. If not provided, the value is detected automatically from the given time-series data through the seasonality test.
alpha: Specifies the significance threshold for the autocorrelation coefficients, with valid range (0, 1). A larger value indicates a stricter requirement for seasonality. If all autocorrelation coefficients are below the specified value, the time-series data is considered to fail the seasonality test (i.e. to have no seasonality).
extrapolation: Setting this parameter to TRUE is suggested, so that end-point issues are handled through end-point extrapolation when seasonal decomposition is applied.

Parameter Combination that Should be Avoided

detect.seasonality = FALSE, window.size = 1.
With the above parameter combination, the residual component becomes all zeros, no matter what the input time-series data is.

Methods to Detect Outliers from Residual

After extracting the residual, we detect outliers from it. Six methods are implemented: Z1 score, Z2 score, Inter-Quartile Range (IQR) score, Mean Absolute Deviation (MAD) score, Isolation Forest score, and DBSCAN.

Z1 score

Relevant Parameters

outlier.method: To use Z1 score method, set the value of this parameter as "z1".
threshold: The default value of this parameter for Z1 score is 3, which works well in most cases. A larger value of this parameter means stricter requirement for identifying outliers.

Z2 score

Relevant Parameters

outlier.method: To use Z2 score method, set the value of this parameter as "z2".
threshold: The default value of this parameter for Z2 score is 3, which usually works well. A larger value of this parameter means stricter requirement for identifying outliers.

IQR score

Relevant Parameters

outlier.method: To use IQR score method, set the value of this parameter as "iqr".
threshold: The default value of this parameter for IQR score is 1.5. A larger value of this parameter means stricter requirement for identifying outliers.
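
This page does not state the formula PAL uses for the IQR score. As a textbook sketch only, the code below applies the common quartile-fence rule, in which a value is flagged when it lies more than threshold times the inter-quartile range outside the first or third quartile. The residuals are the eight values shown in the Examples output; the quantile definition (R's default) and the rule itself are assumptions, so results from PAL may differ.

```r
# Residuals of the first eight rows shown in the Examples output.
residual <- c(0.0, 0.0, 0.4, 0.0, -0.4, 0.0, 0.0, 4.2)
threshold <- 1.5  # documented default for the IQR score

# Quartiles and inter-quartile range (R's default quantile definition, an assumption).
q <- stats::quantile(residual, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Textbook fences: values outside [Q1 - t * IQR, Q3 + t * IQR] are flagged.
lower <- q[1] - threshold * iqr
upper <- q[2] + threshold * iqr
is_outlier <- residual < lower | residual > upper
print(which(is_outlier))
```

Because most of these residuals are exactly zero, the inter-quartile range is small and the fences are narrow, so this rule flags more points here than a standard-score rule with threshold 3 would. Raising threshold widens the fences.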

MAD score

Relevant Parameters

outlier.method: To use MAD score method, set the value of this parameter as "mad".
threshold: The default value of this parameter for MAD score is 3. A larger value of this parameter means stricter requirement for identifying outliers.

Isolation Forest score

Relevant Parameters

outlier.method: To use Isolation Forest score method, set the value of this parameter as "isolationforest".
threshold: The default value of this parameter for Isolation Forest score is 0.7. A larger value of this parameter means stricter requirement for identifying outliers.
random.state: This parameter specifies the seed for random number generation.
n.estimators: This parameter specifies the number of trees to grow in isolation forest.
max.samples: This parameter specifies the number of samples to draw from input to train each tree. If the specified value exceeds the number of samples provided, all samples will be used for all trees in the training phase.
bootstrap: Specifies whether or not to use bootstrap resampling (i.e. random sampling with replacement) when drawing samples from the input data. Set the value to TRUE to use bootstrap resampling.
contamination: This parameter specifies the proportion of outliers in the dataset. If it is specified, then threshold is not valid.

DBSCAN

Relevant Parameters

outlier.method: Set the value of this parameter to be "dbscan" to use DBSCAN method for outlier detection.
minpts: This parameter specifies the minimum number of neighbors for a point to be considered as a core point, where the point itself is excluded.
eps: This parameter specifies the maximum distance between two points for being neighbors of each other.
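
The roles of minpts and eps can be illustrated with a small base R sketch on one-dimensional residuals. It uses the standard DBSCAN definitions (core point, noise point) and the eight residuals shown in the Examples output. Whether PAL rescales the residuals before clustering is not stated here, so applying eps to the raw values is an assumption.

```r
# Residuals of the first eight rows shown in the Examples output.
residual <- c(0.0, 0.0, 0.4, 0.0, -0.4, 0.0, 0.0, 4.2)
eps <- 0.5     # scan radius (documented default)
minpts <- 1    # minimum neighbours, the point itself excluded (documented default)

# Pairwise distances; in one dimension these are absolute differences.
within_eps <- abs(outer(residual, residual, "-")) <= eps

# Neighbour count per point, excluding the point itself.
n_neighbours <- rowSums(within_eps) - 1
is_core <- n_neighbours >= minpts

# Noise points are neither core points nor within eps of any core point.
near_core <- apply(within_eps, 1, function(row) any(row & is_core))
is_noise <- !is_core & !near_core
print(which(is_noise))
```

Only the eighth residual (4.2) has no neighbour within eps, so it is the single noise point in this sketch. A larger eps or a smaller minpts makes it harder for a point to be isolated.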

Examples

Input DataFrame data:


> data$Collect()
   ID RAW_DATA
1   1      2.0
2   2      2.5
3   3      3.2
4   4      2.8
5   5      2.4
6   6      2.9
7   7      3.1
......
18 18      4.4
19 19      4.8
20 20      5.1

Invoke OutlierDetectionTS:


> od <- hanaml.OutlierDetectionTS(data = data,
                                  key = "ID",
                                  endog = "RAW_DATA",
                                  detect.seasonality = FALSE,
                                  outlier.method = "z1",
                                  window.size = 3,
                                  threshold = 3.0)

Output:


> print(od$result$Collect())
   TIMESTAMP RAW_DATA RESIDUAL OUTLIER_SCORE IS_OUTLIER
1          1      2.0      0.0   -0.29784963          0
2          2      2.5      0.0   -0.29784963          0
3          3      3.2      0.4   -0.01076565          0
4          4      2.8      0.0   -0.29784963          0
5          5      2.4     -0.4   -0.58493360          0
6          6      2.9      0.0   -0.29784963          0
7          7      3.1      0.0   -0.29784963          0
8          8      8.0      4.2    2.71653214          0
......
> print(od$stats$Collect())
           STAT_NAME STAT_VALUE
1 DETECT_SEASONALITY          0
2         OutlierNum          1
3               Mean      0.415
4 Standard Deviation    1.39332
5         HandleZero          0
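
The OUTLIER_SCORE column in this output is consistent with a standard score computed from each residual and the Mean and Standard Deviation rows of stats. The base R sketch below checks that reading for the eight rows shown. It is an inference from the printed numbers, not a statement of PAL's internal formula.

```r
# Values copied from the Examples output.
residual <- c(0.0, 0.0, 0.4, 0.0, -0.4, 0.0, 0.0, 4.2)
stat_mean <- 0.415    # "Mean" row of od$stats
stat_sd <- 1.39332    # "Standard Deviation" row of od$stats
threshold <- 3.0

# Standard score of each residual, using the reported statistics.
score <- (residual - stat_mean) / stat_sd
is_outlier <- as.integer(abs(score) > threshold)

print(round(score, 5))
print(is_outlier)
```

The recomputed scores match the printed OUTLIER_SCORE values to about five decimal places. Row 8 has the largest score among the rows shown (about 2.72) but stays below the threshold of 3, so the single outlier counted in OutlierNum must lie in the rows not shown.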