hanaml.OutlierDetectionTS.Rd
hanaml.OutlierDetectionTS is an R wrapper for SAP HANA PAL outlier detection for time series.
hanaml.OutlierDetectionTS(
  data = NULL,
  key = NULL,
  endog = NULL,
  window.size = NULL,
  outlier.method = NULL,
  threshold = NULL,
  detect.seasonality = NULL,
  alpha = NULL,
  extrapolation = NULL,
  periods = NULL,
  random.state = NULL,
  n.estimators = NULL,
  max.samples = NULL,
  bootstrap = NULL,
  contamination = NULL,
  minpts = NULL,
  eps = NULL,
  thread.ratio = NULL
)
data: DataFrame
DataFrame containing the data.
key: character
Name of the ID column.
endog: character, optional
The endogenous variable, i.e. time series.
Defaults to the first non-ID column.
window.size: integer, optional
The window size for the median filter. Must be an odd number not less than 3.
Defaults to 3.
outlier.method: character, optional
The method used to calculate the outlier score from the residual.
"z1": Z1 score.
"z2": Z2 score.
"iqr": IQR score.
"mad": MAD score.
"isolationforest": isolation forest score.
"dbscan": DBSCAN.
Defaults to "z1".
threshold: double, optional
The threshold for the outlier score. A data point is considered an outlier if the absolute
value of its outlier score exceeds the threshold.
Defaults to 3.
detect.seasonality: logical, optional
Specifies whether seasonal decomposition is considered when calculating the residual.
FALSE: Does not consider the seasonal decomposition.
TRUE: Considers the seasonal decomposition.
Defaults to FALSE.
alpha: double, optional
The criterion for the autocorrelation coefficient.
The valid value range is (0, 1).
A larger value indicates a stricter requirement for seasonality.
Only valid when detect.seasonality is TRUE.
Defaults to 0.2.
extrapolation: logical, optional
Specifies whether to extrapolate the endpoints.
Set to TRUE when there is an end-point issue.
Only valid when detect.seasonality is TRUE.
Defaults to FALSE.
periods: integer, optional
The seasonal period. When this parameter is not specified, the algorithm searches for the seasonal period.
When this parameter is set to a value between 2 and half of the series length, the autocorrelation value
is calculated for this number of periods and the result is compared to the alpha parameter:
If the correlation value is equal to or higher than alpha, decomposition is executed with the value of periods.
Otherwise, the residual is calculated without decomposition.
For any other value of periods, the residual is also calculated without decomposition.
No default value.
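The decision rule described for periods can be sketched locally in base R. This is only an illustration: the helper name use_decomposition is hypothetical, stats::acf is an assumed stand-in for whatever autocorrelation estimate PAL computes internally, and the test series is made up.

```r
# Hypothetical helper illustrating the documented rule for `periods`;
# stats::acf is an assumed stand-in for PAL's autocorrelation estimate.
use_decomposition <- function(x, periods, alpha = 0.2) {
  n <- length(x)
  # Outside [2, n / 2]: the residual is computed without decomposition.
  if (periods < 2 || periods > n / 2) {
    return(FALSE)
  }
  # Autocorrelation at lag `periods` (element 1 of $acf is lag 0).
  r <- stats::acf(x, lag.max = periods, plot = FALSE)$acf[periods + 1]
  # Decompose only if the correlation reaches alpha.
  r >= alpha
}

x <- rep(c(1, 5, 2, 8), times = 10)  # made-up series with period 4
print(use_decomposition(x, periods = 4))   # strong correlation at lag 4
print(use_decomposition(x, periods = 3))   # negative correlation at lag 3
print(use_decomposition(x, periods = 30))  # more than half the series length
```

Only the first call selects decomposition; the other two fall back to a residual computed without it.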
random.state: integer, optional
Specifies the seed for the random number generator.
0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.
Only valid when outlier.method is 'isolationforest'.
Defaults to 0.
n.estimators: integer, optional
Specifies the number of trees to grow.
Only valid when outlier.method is 'isolationforest'.
Defaults to 100.
max.samples: integer, optional
Specifies the number of samples to draw from the input to train each tree.
If max.samples is larger than the number of samples provided,
all samples will be used for all trees.
Only valid when outlier.method is 'isolationforest'.
Defaults to 256.
bootstrap: logical, optional
Specifies the sampling method.
FALSE: Sampling without replacement.
TRUE: Sampling with replacement.
Defaults to FALSE.
contamination: double, optional
The proportion of outliers in the data set. Should be in the range (0, 0.5].
Only valid when outlier.method is 'isolationforest'.
When outlier.method is 'isolationforest' and contamination is specified, threshold is not valid.
Defaults to 0.2.
minpts: integer, optional
Specifies the minimum number of points required to form a cluster. The point itself is not included in minpts.
Only valid when outlier.method is 'dbscan'.
Defaults to 1.
eps: double, optional
Specifies the scan radius.
Only valid when outlier.method is 'dbscan'.
Defaults to 0.5.
thread.ratio: double, optional
The ratio of available threads.
0: Single thread.
0 to 1: Percentage of available threads.
Others: Heuristically determined.
Only valid when detect.seasonality is TRUE.
Defaults to -1.
Returns an "OutlierDetectionTS" object with the following attributes:
result DataFrame
Result of outlier detection.
stats DataFrame
Data statistics related to time-series outlier detection.
metrics DataFrame
Relevant metrics for time-series outlier detection.
In many typical outlier detection methods for time-series data, the first step is
to decompose the series and extract its residual component. In the current implementation of
this package, three methods are available to extract the residual, each selected by
a specific choice of parameters.
Getting Residual from Median Filter
This method is suitable for smooth time series.
Relevant Parameters
window.size: Set this parameter to a value no less than 3.
detect.seasonality: Set this parameter to FALSE.
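As a local illustration of the median-filter residual, the sketch below applies base R's stats::runmed to the first eight RAW_DATA values from the example on this page. It assumes the residual is the series minus its running median and that end points are kept unchanged (endrule = "keep"). Neither choice is confirmed as PAL's implementation, although the first seven values reproduce the RESIDUAL column of the example output.

```r
# First eight RAW_DATA values from the example on this page.
x <- c(2.0, 2.5, 3.2, 2.8, 2.4, 2.9, 3.1, 8.0)

# Running median with window size 3; endrule = "keep" leaves the two
# end points unchanged (an assumption, not PAL's documented behavior).
smoothed <- as.numeric(stats::runmed(x, k = 3, endrule = "keep"))

# Residual = original series minus the median-filtered series.
residual <- x - smoothed
print(round(residual, 3))
```

The last element is an end point of this truncated vector, so its residual here differs from the value in the full example, where point 8 has a right-hand neighbor.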
Getting Residual from Seasonal Decompose
This method can handle time-series data with strong seasonality.
Note that if the given time-series data fails the seasonality test, seasonal decompose
is not applied to the time-series data and the residual becomes all zeros.
Relevant Parameters
window.size: Set this parameter to 1.
detect.seasonality: Set this parameter to TRUE.
periods: Specifies the seasonal period, if known. If not provided, the value is detected
automatically from the given time-series data through the seasonality test.
alpha: Specifies the significance threshold for autocorrelation coefficients, with valid
range (0, 1). A larger value indicates a stricter requirement for seasonality. If all
autocorrelation coefficients are below the specified value, the time-series data is considered
to fail the seasonality test (i.e. to have no seasonality).
extrapolation: Setting this parameter to TRUE is suggested, to handle end-point issues
through end-point extrapolation when seasonal decompose is applied.
Getting Residual from Median Filter + Seasonal Decompose
This method can also handle time-series data with strong seasonality. It is better suited to cases
where the time-series data becomes relatively smooth after the seasonal component is removed.
Note that if the given time-series data fails the seasonality test, only the median filter
is applied.
Relevant Parameters
window.size: Set this parameter to a value no less than 3.
detect.seasonality: Set this parameter to TRUE.
periods: Specifies the seasonal period, if known. If not provided, the value is detected
automatically from the given time-series data through the seasonality test.
alpha: Specifies the significance threshold for autocorrelation coefficients, with valid
range (0, 1). A larger value indicates a stricter requirement for seasonality. If all
autocorrelation coefficients are below the specified value, the time-series data is considered
to fail the seasonality test (i.e. to have no seasonality).
extrapolation: Setting this parameter to TRUE is suggested, to handle end-point issues
through end-point extrapolation when seasonal decompose is applied.
Parameter Combination that Should be Avoided
detect.seasonality = FALSE, window.size = 1.
With the above parameter combination, the residual component becomes all zeros, no matter
what the input time-series data is.
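The parameter combinations described in this section can be summarized with a small lookup. The helper below is hypothetical (it is not part of the package) and only encodes the rules stated above, so that an invalid or degenerate combination can be caught before calling hanaml.OutlierDetectionTS.

```r
# Hypothetical helper: maps window.size and detect.seasonality to the
# residual-extraction method described in this section.
residual_method <- function(window.size = 3, detect.seasonality = FALSE) {
  valid_window <- window.size == 1 ||
    (window.size >= 3 && window.size %% 2 == 1)
  if (!valid_window) {
    stop("window.size must be 1 or an odd number not less than 3")
  }
  if (window.size == 1 && !detect.seasonality) {
    # The combination this page says to avoid.
    stop("detect.seasonality = FALSE with window.size = 1 ",
         "yields an all-zero residual")
  }
  if (window.size == 1) {
    "seasonal decompose"
  } else if (detect.seasonality) {
    "median filter + seasonal decompose"
  } else {
    "median filter"
  }
}

print(residual_method(3, FALSE))
print(residual_method(1, TRUE))
print(residual_method(5, TRUE))
```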
After extracting the residual, we detect outliers from it. Six methods are implemented:
Z1 score, Z2 score, Inter-Quartile Range (IQR) score, Mean Absolute Deviation (MAD) score,
Isolation Forest score, and DBSCAN.
Z1 score
Relevant Parameters
outlier.method: To use the Z1 score method, set this parameter to "z1".
threshold: The default value of this parameter for Z1 score is 3, which works
well in most cases. A larger value means a stricter requirement for identifying
outliers.
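The example output on this page is consistent with the Z1 score being an ordinary z-score of the residual, (residual - mean) / standard deviation, using the Mean and Standard Deviation reported in the stats table. The sketch below checks this with values copied from that example. The formula is inferred from the printed numbers, not taken from a PAL specification.

```r
# Residuals, mean and standard deviation copied from the example output.
residual <- c(0.0, 0.0, 0.4, 0.0, -0.4, 0.0, 0.0, 4.2)
res_mean <- 0.415
res_sd <- 1.39332

# Z1 score as a standard z-score (assumed from the example output).
z1 <- (residual - res_mean) / res_sd

# threshold = 3.0, as in the example call.
is_outlier <- abs(z1) > 3.0

print(round(z1, 4))
print(as.integer(is_outlier))
```

With threshold = 3.0, none of these eight points is flagged, which matches the IS_OUTLIER column shown for rows 1 to 8.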
Z2 score
Relevant Parameters
outlier.method: To use the Z2 score method, set this parameter to "z2".
threshold: The default value of this parameter for Z2 score is 3, which usually works
well. A larger value means a stricter requirement for identifying outliers.
IQR score
Relevant Parameters
outlier.method: To use the IQR score method, set this parameter to "iqr".
threshold: The default value of this parameter for IQR score is 1.5.
A larger value means a stricter requirement for identifying outliers.
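A default threshold of 1.5 matches the conventional interquartile-range fence rule, in which a point is flagged when it lies more than threshold times the IQR below the first quartile or above the third. The sketch below applies that rule in base R to the residuals from the example on this page. It is an assumption that PAL uses this rule, and R's default quantile definition (type 7) may differ from the one PAL uses.

```r
# Residuals copied from the example output on this page.
residual <- c(0.0, 0.0, 0.4, 0.0, -0.4, 0.0, 0.0, 4.2)
threshold <- 1.5

# First and third quartiles (R's default type 7 definition, an assumption).
q <- stats::quantile(residual, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Conventional IQR fences scaled by the threshold.
lower <- q[1] - threshold * iqr
upper <- q[2] + threshold * iqr

is_outlier <- residual < lower | residual > upper
print(which(is_outlier))
```

Because most residuals are exactly zero, the IQR is small and the fences are tight; a larger threshold widens them and flags fewer points.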
MAD score
Relevant Parameters
outlier.method: To use the MAD score method, set this parameter to "mad".
threshold: The default value of this parameter for MAD score is 3.
A larger value means a stricter requirement for identifying outliers.
Isolation Forest score
Relevant Parameters
outlier.method: To use the Isolation Forest score method, set this parameter
to "isolationforest".
threshold: The default value of this parameter for Isolation Forest score is 0.7.
A larger value means a stricter requirement for identifying outliers.
random.state: Specifies the seed for random number generation.
n.estimators: Specifies the number of trees to grow in the isolation forest.
max.samples: Specifies the number of samples to draw from the input to train
each tree. If the specified value exceeds the number of samples provided, all samples are used for
all trees in the training phase.
bootstrap: Specifies whether to use bootstrap resampling (i.e. random sampling
with replacement) when drawing samples from the input data. Set to TRUE to use bootstrap resampling.
contamination: Specifies the proportion of outliers in the dataset.
If it is specified, threshold is not valid.
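A call that selects the Isolation Forest method might look like the sketch below. The wrapper name detect_with_iforest and the argument df are hypothetical; df is assumed to be a DataFrame with the ID and RAW_DATA columns used in the example on this page, and the parameter values are illustrative only. The wrapper is defined but not called here, since calling it requires a live SAP HANA connection.

```r
# Hypothetical wrapper; `df` is assumed to be a DataFrame with columns
# ID and RAW_DATA, as in the example on this page.
detect_with_iforest <- function(df) {
  hanaml.OutlierDetectionTS(
    data = df,
    key = "ID",
    endog = "RAW_DATA",
    window.size = 3,
    detect.seasonality = FALSE,
    outlier.method = "isolationforest",
    random.state = 1,     # non-zero: use this value as the seed
    n.estimators = 100,   # number of trees
    max.samples = 256,    # samples drawn per tree
    bootstrap = FALSE,    # sample without replacement
    contamination = 0.1   # when specified, threshold is not valid
  )
}
```

Since contamination is set, threshold is deliberately left out of the call.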
DBSCAN
Relevant Parameters
outlier.method: Set this parameter to "dbscan" to use the DBSCAN
method for outlier detection.
minpts: Specifies the minimum number of neighbors
for a point to be considered a core point, where the point itself is excluded.
eps: Specifies the maximum distance between two points for them to be
neighbors of each other.
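The roles of minpts and eps can be illustrated with a simplified one-dimensional DBSCAN noise test in base R. This sketch works directly on the raw residuals from the example on this page. Whether PAL scales or otherwise preprocesses the residual before clustering is not stated here, so treat the result as an illustration of the definitions only.

```r
# Residuals copied from the example output on this page.
residual <- c(0.0, 0.0, 0.4, 0.0, -0.4, 0.0, 0.0, 4.2)
eps <- 0.5     # scan radius (documented default)
minpts <- 1    # minimum number of neighbors, excluding the point itself

# Pairwise neighborhood matrix: TRUE when two points are within eps.
within_eps <- abs(outer(residual, residual, "-")) <= eps

# Neighbor count per point, excluding the point itself.
n_neighbors <- rowSums(within_eps) - 1
is_core <- n_neighbors >= minpts

# A point is noise if no core point (including itself) lies within eps.
near_core <- rowSums(within_eps[, is_core, drop = FALSE]) > 0
is_noise <- !near_core
print(which(is_noise))
```

Only the residual of 4.2 has no neighbor within eps, so it is the single point left outside every cluster.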
Input DataFrame data:
> data$Collect()
   ID RAW_DATA
1   1      2.0
2   2      2.5
3   3      3.2
4   4      2.8
5   5      2.4
6   6      2.9
7   7      3.1
......
18 18      4.4
19 19      4.8
20 20      5.1
Invoke OutlierDetectionTS:
> od <- hanaml.OutlierDetectionTS(data = data,
                                  key = "ID",
                                  endog = "RAW_DATA",
                                  detect.seasonality = FALSE,
                                  outlier.method = "z1",
                                  window.size = 3,
                                  threshold = 3.0)
Output:
> print(od$result$Collect())
  TIMESTAMP RAW_DATA RESIDUAL OUTLIER_SCORE IS_OUTLIER
1         1      2.0      0.0   -0.29784963          0
2         2      2.5      0.0   -0.29784963          0
3         3      3.2      0.4   -0.01076565          0
4         4      2.8      0.0   -0.29784963          0
5         5      2.4     -0.4   -0.58493360          0
6         6      2.9      0.0   -0.29784963          0
7         7      3.1      0.0   -0.29784963          0
8         8      8.0      4.2    2.71653214          0
......
> print(od$stats$Collect())
           STAT_NAME STAT_VALUE
1 DETECT_SEASONALITY          0
2         OutlierNum          1
3               Mean      0.415
4 Standard Deviation    1.39332
5         HandleZero          0