Methods for Residual Extraction in Time-Series Outlier Detection
In SAP HANA PAL (and in the hana_ml.algorithms.pal package), the main idea of outlier detection for time-series data is to first decompose the given time series into a regular part and a residual part, and then mark as outliers the points with extreme residual values. The extraction of residuals from the input time series is therefore of crucial importance.
In the implementation, three methods for residual extraction are available. Their application scenarios and relevant parameters are listed below:
1. Residual from Median Filter
This method is applicable to time series that do not oscillate strongly on a local scale.
Relevant Parameters
window_size
: to use the median filter, set this parameter to a value no less than 3.
detect_seasonality
: to use the median filter alone, set this parameter to False.
Example
>>> from hana_ml.algorithms.pal.tsa.outlier_detection import OutlierDetectionTS
>>> odt = OutlierDetectionTS(outlier_method='z1',
... detect_seasonality=False,  # do not detect seasonality
... window_size=5)  # window_size no less than 3
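To make the idea concrete, here is a minimal pure-Python sketch of a median-filter residual. It is an illustration only, not PAL's implementation: the helper name median_filter_residual and the way the window is truncated at the two ends of the series are assumptions made for this example.

```python
from statistics import median


def median_filter_residual(values, window_size=5):
    """Residual = value minus the median of a centered window (illustrative only)."""
    if window_size < 3:
        raise ValueError("window_size must be no less than 3 to use the median filter")
    half = window_size // 2
    residuals = []
    for i, v in enumerate(values):
        # Assumption: the window is simply truncated at both ends of the series.
        window = values[max(0, i - half): i + half + 1]
        residuals.append(v - median(window))
    return residuals


# A flat series with one spike at index 4.
series = [1.0, 1.1, 0.9, 1.0, 9.0, 1.1, 1.0, 0.9, 1.0]
residuals = median_filter_residual(series, window_size=5)
```

In this toy series the spike at index 4 keeps a residual of about 8.0, while all other residuals stay close to zero, which is what lets a later scoring step single it out.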
2. Residual from Seasonal Decomposition
This method is recommended if the time-series has strong seasonality.
Note
If the time series is not seasonal (i.e. the autocorrelation is below the specified criterion alpha), this method will fail to detect outliers: in that case seasonal decomposition is not applied to the original time series, and the residuals are all zeros.
Relevant Parameters
window_size
: to use seasonal decomposition alone for residual extraction, set this parameter to 1.
detect_seasonality
: set to True to use seasonal decomposition.
periods
: the seasonal period. It is detected automatically if not specified; since auto-detection can be time-consuming, specify the actual value if it is known, to speed up the calculation.
alpha
: the criterion (threshold) for the autocorrelation coefficient, with valid range (0, 1). A larger value means a stricter requirement for seasonality.
extrapolation
: set to True if there is an end-point issue.
Example
>>> odt = OutlierDetectionTS(outlier_method='z1',
... threshold=3,  # 3-sigma test for z1 score
... detect_seasonality=True,
... window_size=1,
... periods=7,  # e.g. daily data with weekly seasonality
... extrapolation=True)
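The role of alpha and the all-zero residual for non-seasonal input can be sketched in plain Python as follows. This is a simplified stand-in, not PAL's decomposition: the helper names, the trend-free additive model (seasonal component = mean of each phase), and the value alpha=0.2 are all assumptions chosen for illustration.

```python
from statistics import mean


def autocorrelation(values, lag):
    """Sample autocorrelation coefficient of `values` at the given lag."""
    m = mean(values)
    denom = sum((v - m) ** 2 for v in values)
    numer = sum((values[i] - m) * (values[i + lag] - m)
                for i in range(len(values) - lag))
    return numer / denom


def seasonal_residual(values, period, alpha=0.2):
    """Residual from a very simple additive seasonal model (illustrative only)."""
    # Not seasonal enough: no decomposition is applied and the residual
    # is all zeros, as described in the note above.
    if autocorrelation(values, period) < alpha:
        return [0.0] * len(values)
    # Assumption: no trend; the seasonal component is the mean of each phase.
    phase_means = [mean(values[p::period]) for p in range(period)]
    return [v - phase_means[i % period] for i, v in enumerate(values)]


series = [10.0, 20.0, 30.0, 20.0] * 6  # period-4 pattern, 6 cycles
series[13] += 15.0                     # inject one anomaly at index 13
residuals = seasonal_residual(series, period=4)

# An alternating series has negative autocorrelation at lag 3,
# so it is treated as non-seasonal for period=3.
not_seasonal = seasonal_residual([1.0, -1.0] * 6, period=3)
```

The injected anomaly ends up with the largest residual, whereas the second call returns only zeros because the seasonality test is not passed.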
3. Residual Extraction from Median Filter and Seasonal Decomposition
This method is better suited to time-series data with both strong seasonality and a relatively smooth trend. It first decomposes the time series using seasonal decomposition to obtain a residual (say, residual 1), and then smooths the resulting trend component with the median filter to extract another residual (say, residual 2). The two residuals are added together to form the final residual.
Note
If the time series is not seasonal (i.e. the autocorrelation is below the specified criterion alpha), the method directly applies the median filter to the original time series without seasonal decomposition.
Relevant Parameters
window_size
: to extract the residual by seasonal decomposition combined with the median filter, set this parameter to a value no less than 3.
detect_seasonality
: set to True to use seasonal decomposition.
periods
: the seasonal period. It is detected automatically if not specified; since auto-detection can be time-consuming, specify the actual value if it is known, to speed up the calculation.
alpha
: the criterion (threshold) for the autocorrelation coefficient, with valid range (0, 1). A larger value means a stricter requirement for seasonality.
extrapolation
: set to True if there is an end-point issue.
Example
>>> odt = OutlierDetectionTS(outlier_method='z1',
... threshold=3,  # 3-sigma test for z1 score
... detect_seasonality=True,  # (try to) use seasonal decomposition
... window_size=5,  # set to a number no less than 3 to activate the median filter
... periods=7,  # e.g. daily data with weekly seasonality
... extrapolation=True)
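The two-stage residual described in this section can be sketched as below. This is a rough illustration under stated assumptions, not PAL's algorithm: the trend is taken as a centered moving average over one (odd) period, the seasonal component as the per-phase mean of the detrended series, and windows are truncated at the series ends. All function names are hypothetical.

```python
from statistics import mean, median


def moving_average(values, window):
    """Centered moving average; the window is truncated at both ends (assumption)."""
    half = window // 2
    return [mean(values[max(0, i - half): i + half + 1]) for i in range(len(values))]


def median_filter(values, window_size):
    """Centered running median; the window is truncated at both ends (assumption)."""
    half = window_size // 2
    return [median(values[max(0, i - half): i + half + 1]) for i in range(len(values))]


def combined_residual(values, period, window_size=5):
    # Step 1: simple additive decomposition gives trend, seasonal part, residual 1.
    trend = moving_average(values, period)  # assumes an odd period for symmetry
    detrended = [v - t for v, t in zip(values, trend)]
    phase_means = [mean(detrended[p::period]) for p in range(period)]
    residual_1 = [d - phase_means[i % period] for i, d in enumerate(detrended)]
    # Step 2: median-filter the trend; what the filter removes is residual 2.
    smoothed = median_filter(trend, window_size)
    residual_2 = [t - s for t, s in zip(trend, smoothed)]
    # Step 3: the final residual is the sum of the two.
    return [r1 + r2 for r1, r2 in zip(residual_1, residual_2)]


pattern = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]                 # weekly shape
series = [pattern[i % 7] + 0.05 * i for i in range(42)]       # plus a slow trend
series[20] += 50.0                                            # one anomaly
final = combined_residual(series, period=7, window_size=5)
```

In this toy series the final residual is by far the largest at index 20, the position of the injected anomaly.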
4. Meaningless Parameter Combination to be Avoided
The following parameter combination should be avoided: detect_seasonality = False, window_size = 1.
In this case neither seasonal decomposition nor the median filter is activated, so the residual is always all zeros.
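The mapping from the two parameters to the residual-extraction method, as described in the four items above, can be summarized in a small helper. The function residual_extraction_mode is hypothetical (it is not part of hana_ml), and rejecting window sizes other than 1 or at least 3 is an assumption, since this section does not cover them.

```python
def residual_extraction_mode(detect_seasonality, window_size):
    """Name the residual-extraction method implied by the two parameters."""
    if window_size >= 3:
        if detect_seasonality:
            return "seasonal decomposition + median filter"
        return "median filter"
    if window_size == 1:
        if detect_seasonality:
            return "seasonal decomposition"
        # The combination to avoid: nothing is activated.
        raise ValueError("detect_seasonality=False with window_size=1 "
                         "activates no method; the residual would be all zeros")
    # Assumption: other window sizes are not covered by this section.
    raise ValueError("window_size should be 1 or no less than 3")


modes = {
    (False, 5): residual_extraction_mode(False, 5),
    (True, 1): residual_extraction_mode(True, 1),
    (True, 5): residual_extraction_mode(True, 5),
}
```

A check of this kind before constructing the detector is one way to catch the meaningless combination early.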
Methods of Outlier Detection from Residual
Outlier detection follows immediately once the residual has been extracted from the time series. In our implementation, six methods are available for this purpose:
Z1 score
Z2 score
IQR (inter-quartile range) score
MAD (median absolute deviation) score
Isolation score (derived using isolation forest)
DBSCAN
1. Z1 Score
The Z1 score method is a general outlier detection method in statistics. It needs a threshold, which is usually set to 3.
This method uses the deviation from the residual mean as the measure of anomaly, and it usually works well. However, an extreme outlier that deviates greatly from the mean also distorts the mean and standard deviation themselves, so mistakes could occur.
Relevant Parameters
outlier_method
: set this parameter to 'z1' to use the Z1 score method.
threshold
: the default value is 3 and it usually works well. A larger value indicates a stricter requirement for outliers, resulting in fewer outliers detected.
Example
>>> odt = OutlierDetectionTS(outlier_method='z1',
... threshold=3,
... detect_seasonality=False,
... window_size=5)
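A minimal sketch of Z1 scoring on an already extracted residual is shown below. The helper z1_outliers is hypothetical, and the use of the population standard deviation is an assumption; PAL may normalize differently.

```python
from statistics import mean, pstdev


def z1_outliers(residuals, threshold=3.0):
    """Flag residuals whose distance from the sample mean exceeds threshold * std."""
    m = mean(residuals)
    s = pstdev(residuals)  # assumption: population standard deviation
    return [abs(r - m) / s > threshold for r in residuals]


residuals = [0.1, -0.1] * 10
residuals[7] = 8.0                      # one extreme residual among 20
flags = z1_outliers(residuals)          # only index 7 is flagged

# With only 9 points, a single extreme value inflates the mean and the
# standard deviation so much that its own score stays below 3.
short = [0.1, -0.1, 0.1, -0.1, 100.0, 0.1, -0.1, 0.1, -0.1]
short_flags = z1_outliers(short)        # nothing is flagged
```

The second call illustrates the weakness mentioned above: the outlier itself enters the mean and standard deviation, so on short series it can mask its own detection.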
2. Z2 Score
The Z2 score method is a modification of the Z1 score that replaces the sample mean with the statistical mean (0 for an additive residual, 1 for a multiplicative residual). Like the Z1 score, it also needs a threshold value, which is usually set to 3 (the default value).
Relevant Parameters
outlier_method
: set this parameter to 'z2' to use the Z2 score method.
threshold
: the default value is 3 and it usually works well. A larger value indicates a stricter requirement for outliers, resulting in fewer outliers detected.
Example
>>> odt = OutlierDetectionTS(outlier_method='z2',
... threshold=3,
... detect_seasonality=False,
... window_size=5)
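The difference from Z1 can be sketched as follows. The helper z2_outliers and its multiplicative flag are hypothetical, and measuring the spread as the root-mean-square deviation about the fixed center is an assumption made for this example.

```python
from math import sqrt


def z2_outliers(residuals, threshold=3.0, multiplicative=False):
    """Like Z1, but centered on the statistical mean instead of the sample mean."""
    center = 1.0 if multiplicative else 0.0
    # Assumption: spread is the root-mean-square deviation about the fixed center.
    spread = sqrt(sum((r - center) ** 2 for r in residuals) / len(residuals))
    return [abs(r - center) / spread > threshold for r in residuals]


additive = [0.1, -0.1] * 10
additive[7] = 8.0
additive_flags = z2_outliers(additive)

multiplicative = [1.02, 0.98] * 10
multiplicative[7] = 3.0
multiplicative_flags = z2_outliers(multiplicative, multiplicative=True)
```

In both toy residuals only index 7 is flagged; the center is fixed in advance, so it is not shifted by the outlier.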
3. IQR Score
The inter-quartile range (IQR) score method is a common statistical method for detecting outliers. Unlike the mean and standard deviation used in the Z1 and Z2 score methods, quartiles and the inter-quartile range are robust statistics that are not easily affected by extreme outliers in the residual. In this sense, it is a robust method for outlier detection. This method also needs a threshold, which is usually set to 1.5 (the default value for the IQR score).
Relevant Parameters
outlier_method
: set this parameter to 'iqr' to use the IQR score method.
threshold
: multiplier for the inter-quartile range, defaults to 1.5. A larger value means a stricter requirement for outliers, resulting in fewer detected outliers.
Example
>>> odt = OutlierDetectionTS(outlier_method='iqr',
... threshold=1.6,
... detect_seasonality=False,
... window_size=5)
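As a sketch, the IQR rule can be written with the standard library as below. The helper iqr_outliers is hypothetical, the fences Q1 - threshold * IQR and Q3 + threshold * IQR follow the usual textbook form, and the quartile convention of statistics.quantiles may differ from the one PAL uses.

```python
from statistics import quantiles


def iqr_outliers(residuals, threshold=1.5):
    """Flag residuals outside [Q1 - threshold * IQR, Q3 + threshold * IQR]."""
    q1, _, q3 = quantiles(residuals, n=4)  # quartile convention is an assumption
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    return [r < lower or r > upper for r in residuals]


residuals = [0.1, -0.1] * 10
residuals[7] = 8.0
flags = iqr_outliers(residuals)
```

Because the quartiles barely move when one value becomes extreme, only index 7 falls outside the fences here.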
4. MAD Score
Median absolute deviation (MAD) is another common statistical method for outlier detection. This method uses the sample median and the MAD to define the range for inliers. It is also robust, since both the median and the MAD are robust statistics that are not easily affected by extreme outliers.
This method also needs a threshold, which is usually set to 3 (the default value).
Relevant Parameters
outlier_method
: set this parameter to 'mad' to use MAD score method.
threshold
: multiplier for the MAD, defaults to 3. A larger value means a stricter requirement for outliers, resulting in fewer detected outliers.
Example
>>> odt = OutlierDetectionTS(outlier_method='mad',
... threshold=3,
... detect_seasonality=False,
... window_size=5)
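A minimal sketch of the MAD rule follows. The helper mad_outliers is hypothetical, and the MAD is used unscaled here (no consistency constant), which is an assumption; PAL's exact definition may differ.

```python
from statistics import median


def mad_outliers(residuals, threshold=3.0):
    """Flag residuals farther than threshold * MAD from the sample median."""
    med = median(residuals)
    # Assumption: plain (unscaled) median absolute deviation.
    mad = median([abs(r - med) for r in residuals])
    return [abs(r - med) > threshold * mad for r in residuals]


residuals = [0.1, -0.1] * 10
residuals[7] = 8.0
flags = mad_outliers(residuals)
```

In this example the inlier range is the median plus or minus three times the MAD, and only index 7 falls outside it.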
5. Isolation Forest Score
Isolation forest is effective in detecting outliers that are isolated from normal/regular data points.
The isolation forest score is between 0 and 1. It also needs a threshold, which is usually set to 0.7 (the default value).
Relevant Parameters
outlier_method
: set to 'isolationforest' to use isolation forest score.
threshold
: defaults to 0.7. A higher value results in fewer outliers detected.
random_state
: the random seed.
n_estimators
: number of trees to grow, defaults to 100.
max_samples
: the number of samples drawn from input data to train each tree.
bootstrap
: whether or not to use bootstrap method to draw samples from input.
contamination
: the proportion of outliers in the dataset, as specified by the user; it takes precedence over threshold.
Example
>>> odt = OutlierDetectionTS(outlier_method='isolationforest',
... threshold=0.9,
... random_state=2022,
... bootstrap=True,
... detect_seasonality=False,
... window_size=5)
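To show where a score between 0 and 1 comes from, here is a compact one-dimensional isolation forest in pure Python, using the standard score 2 ** (-average_path_length / c(n)). It is a teaching sketch, not PAL's implementation: every tree is grown on all points, so max_samples, bootstrap and contamination are not modeled, and all function names are hypothetical.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(sample, rng, depth, max_depth):
    """Grow one isolation tree on 1-D data with uniformly random split points."""
    if len(sample) <= 1 or depth >= max_depth:
        return {"size": len(sample)}
    lo, hi = min(sample), max(sample)
    if lo == hi:
        return {"size": len(sample)}
    split = rng.uniform(lo, hi)
    return {"split": split,
            "left": build_tree([v for v in sample if v < split], rng, depth + 1, max_depth),
            "right": build_tree([v for v in sample if v >= split], rng, depth + 1, max_depth)}


def path_length(x, node, depth=0):
    """Number of splits needed to reach x's leaf, plus a correction for leaf size."""
    if "split" not in node:
        return depth + c_factor(node["size"])
    child = node["left"] if x < node["split"] else node["right"]
    return path_length(x, child, depth + 1)


def isolation_scores(values, n_estimators=100, random_state=2022):
    rng = random.Random(random_state)
    n = len(values)
    max_depth = math.ceil(math.log2(max(n, 2)))
    # Assumption: every tree is trained on all points (no subsampling).
    trees = [build_tree(values, rng, 0, max_depth) for _ in range(n_estimators)]
    return [2.0 ** (-sum(path_length(x, t) for t in trees) / n_estimators / c_factor(n))
            for x in values]


residuals = [0.10, -0.12, 0.05, -0.03, 0.08, -0.07, 0.02, 8.0, -0.01, 0.04, -0.09, 0.11]
scores = isolation_scores(residuals)
flags = [s > 0.7 for s in scores]  # 0.7 mirrors the default threshold above
```

The isolated residual at index 7 is separated after very few random splits, so its score is the highest and is the only one above the threshold in this example.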
6. DBSCAN
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is applied to the normalized residual. As a result, normal points that lie close enough to one another are gathered into clusters, while (possibly) a few points that are isolated from all clusters are left out as noise (outliers). This method does not need a threshold.
Relevant Parameters
outlier_method
: set this parameter to 'dbscan' to use DBSCAN.
minpts
: the minimum number of points to form a cluster.
eps
: the scan radius for neighbour search in DBSCAN.
Example
>>> odt = OutlierDetectionTS(outlier_method='dbscan',
... minpts=3,
... eps=1.0,
... detect_seasonality=False,
... window_size=5)
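The way DBSCAN separates noise from clusters can be sketched for a one-dimensional residual as below. This is an illustration, not PAL's implementation: the helper dbscan_outliers is hypothetical, the normalization is assumed to be a z-score, and a point is counted as part of its own neighbourhood when comparing against minpts.

```python
from statistics import mean, pstdev


def dbscan_outliers(residuals, eps=1.0, minpts=3):
    """Flag points that DBSCAN leaves as noise on the normalized residual."""
    m, s = mean(residuals), pstdev(residuals)
    z = [(r - m) / s for r in residuals]  # assumption: z-score normalization
    n = len(z)
    neighbors = [[j for j in range(n) if abs(z[i] - z[j]) <= eps] for i in range(n)]
    core = [len(nb) >= minpts for nb in neighbors]  # a point counts as its own neighbor
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None or not core[i]:
            continue
        # Grow a new cluster from an unvisited core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] is None:
                    labels[q] = cluster
                    if core[q]:  # only core points keep expanding the cluster
                        stack.append(q)
        cluster += 1
    # Points that never joined a cluster are noise, i.e. outliers.
    return [label is None for label in labels]


residuals = [0.10, -0.12, 0.05, -0.03, 0.08, -0.07, 0.02, 8.0, -0.01, 0.04, -0.09, 0.11]
flags = dbscan_outliers(residuals, eps=1.0, minpts=3)
```

Here the eleven small residuals form a single cluster, and the residual at index 7 has no neighbour within eps, so it is left out as noise.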