OutlierDetectionTS
- class hana_ml.algorithms.pal.tsa.outlier_detection.OutlierDetectionTS(auto=None, detect_intermittent_ts=None, smooth_method=None, window_size=None, loess_lag=None, current_value_flag=None, outlier_method=None, threshold=None, detect_seasonality=None, alpha=None, extrapolation=None, periods=None, random_state=None, n_estimators=None, max_samples=None, bootstrap=None, contamination=None, minpts=None, eps=None, distiance_method=None, dbscan_normalization=None, dbscan_outlier_from_cluster=None, residual_usage=None, voting_config=None, voting_outlier_method_criterion=None, thread_ratio=None, massive=False, group_params=None)
Outlier detection for time-series. In time series, an outlier is a data point that is different from the general behavior of remaining data points. In this algorithm, the outlier detection procedure is divided into two steps. In step 1, we get the residual from the original series. In step 2, we detect the outliers from the residual.
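The two steps can be illustrated with a minimal pure-Python sketch. It is not the PAL implementation: it assumes a median filter with window size 3 for step 1 and a plain z-score (mean and standard deviation of the residual) for step 2. The helper names `median_residual` and `zscore_outliers` and the sample series are hypothetical.

```python
from statistics import mean, median, pstdev

def median_residual(series, window_size=3):
    """Step 1 (assumed): residual = value minus the median of its centered window."""
    half = window_size // 2
    residual = []
    for i, value in enumerate(series):
        # The window is truncated at both ends of the series.
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        residual.append(value - median(series[lo:hi]))
    return residual

def zscore_outliers(residual, threshold=3.0):
    """Step 2 (assumed): flag points whose absolute z-score is beyond the threshold."""
    mu, sigma = mean(residual), pstdev(residual)
    if sigma == 0:
        return [0] * len(residual)
    return [1 if abs((r - mu) / sigma) > threshold else 0 for r in residual]

# Made-up series with a single spike at index 4.
series = [1.0, 1.1, 0.9, 1.0, 9.0, 1.0, 1.1, 0.9, 1.0, 1.1, 1.0, 0.9]
flags = zscore_outliers(median_residual(series))
```

Here `flags` plays the role of the IS_OUTLIER column: one 0/1 value per data point.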
- Parameters:
- auto : bool, optional
True : automatic method to get residual.
False : manual method to get residual.
Defaults to True.
- detect_intermittent_ts : bool, optional
True : detects whether the time series is intermittent.
False : does not detect whether the time series is intermittent.
Only valid when auto is True. If the input data is an intermittent time series, outlier detection is not performed.
Defaults to False.
- smooth_method : str, optional
The method used to get the residual.
'no' : no smoothing method is used.
'median' : median filter.
'loess' : LOESS (locally estimated scatterplot smoothing) or LOWESS (locally weighted scatterplot smoothing) is a locally weighted linear regression method. This method is applicable to the time series which is non-seasonal. This method is also suitable for non-smooth time series.
'super' : super smoother. This method combines a set of LOESS methods. Like LOESS, this method is applicable to non-seasonal time series. This method is also suitable for non-smooth time series.
Only valid when auto is False.
Defaults to 'median'.
- window_size : int, optional
The window size for the median filter. Must be an odd number not less than 3.
The value 1 means the median filter is not applied.
Only valid when auto is False and smooth_method is 'median'.
Defaults to 3.
- loess_lag : int, optional
The lag for LOESS. Must be an odd number not less than 3.
Only valid when auto is False and smooth_method is 'loess'.
Defaults to 7.
- current_value_flag : bool, optional
Whether to take the current data point when using LOESS smoothing method.
True : takes the current data point.
False : does not take the current data point.
For example, to estimate the value at time t with the window [t-3, t-2, t-1, t, t+1, t+2, t+3], taking the current data point means estimating the value at t with the real data points at [t-3, t-2, t-1, t, t+1, t+2, t+3], while not taking the current data point means estimating the value at t with the real data points at [t-3, t-2, t-1, t+1, t+2, t+3], without the real data point at t.
Only valid when auto is False and smooth_method is 'median'.
Defaults to False.
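The window example above can be sketched as follows. The helper `loess_window_indices` is hypothetical and only lists which time indices contribute real data points to the estimate at t; it does not perform the LOESS fit itself.

```python
def loess_window_indices(t, half_width=3, current_value_flag=False):
    """Return the time indices whose real data points are used to estimate the value at t."""
    indices = list(range(t - half_width, t + half_width + 1))
    if not current_value_flag:
        # Leave out the real data point at t itself.
        indices.remove(t)
    return indices

with_current = loess_window_indices(10, current_value_flag=True)
without_current = loess_window_indices(10, current_value_flag=False)
```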
- outlier_method : str, optional
The method for calculating the outlier score from the residual.
'z1' : Z1 score.
'z2' : Z2 score.
'iqr' : IQR score.
'mad' : MAD score.
'isolationforest' : isolation forest score.
'dbscan' : DBSCAN.
Defaults to 'z1'.
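As a rough illustration of score-based detection, the sketch below computes MAD and IQR scores under commonly used definitions. These definitions are assumptions and may differ from the formulas PAL actually uses. The function names and the sample residual are hypothetical.

```python
from statistics import median, quantiles

def mad_scores(residual):
    """Assumed MAD score: deviation from the median in units of the median absolute deviation."""
    med = median(residual)
    mad = median([abs(r - med) for r in residual])
    if mad == 0:
        return [0.0] * len(residual)
    return [(r - med) / mad for r in residual]

def iqr_scores(residual):
    """Assumed IQR score: distance outside [Q1, Q3] in units of the interquartile range."""
    q1, _, q3 = quantiles(residual, n=4)
    iqr = q3 - q1
    if iqr == 0:
        return [0.0] * len(residual)
    scores = []
    for r in residual:
        if r < q1:
            scores.append((r - q1) / iqr)
        elif r > q3:
            scores.append((r - q3) / iqr)
        else:
            scores.append(0.0)
    return scores

def flag_outliers(scores, threshold):
    """A point is an outlier if the absolute value of its score is beyond the threshold."""
    return [1 if abs(s) > threshold else 0 for s in scores]

# Made-up residual with one large value at the end.
residual = [-0.2, -0.1, 0.0, 0.1, 0.2, 0.0, -0.1, 0.1, 5.0]
mad_flags = flag_outliers(mad_scores(residual), threshold=3)
iqr_flags = flag_outliers(iqr_scores(residual), threshold=1.5)
```

The thresholds 3 and 1.5 are the documented defaults for 'mad' and 'iqr'.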
- threshold : float, optional
The threshold for the outlier score. If the absolute value of the outlier score is beyond the threshold, the corresponding data point is considered an outlier.
Only valid when outlier_method is 'iqr', 'isolationforest', 'mad', 'z1' or 'z2'. For outlier_method = 'isolationforest', when contamination is provided, threshold is not valid and outliers are decided by contamination.
Defaults to 3 when outlier_method is 'mad', 'z1' or 'z2'. Defaults to 1.5 when outlier_method is 'iqr'. Defaults to 0.7 when outlier_method is 'isolationforest'.
- detect_seasonality : bool, optional
When calculating the residual,
False: Does not consider the seasonal decomposition.
True: Considers the seasonal decomposition.
Only valid when auto is False and smooth_method is 'median'.
Defaults to False.
- alpha : float, optional
The criterion for the autocorrelation coefficient. The value range is (0, 1).
A larger value indicates a stricter requirement for seasonality.
Only valid when detect_seasonality is True.
Defaults to 0.2 if auto is False and defaults to 0.4 if auto is True.
- extrapolation : bool, optional
Specifies whether to extrapolate the endpoints. Set to True when there is an end-point issue.
Only valid when detect_seasonality is True.
Defaults to False if auto is False and defaults to True if auto is True.
- periods : int, optional
When this parameter is not specified, the algorithm searches for the seasonal period. When this parameter is specified with a value between 2 and half of the series length, the autocorrelation value is calculated for this number of periods and the result is compared to the alpha parameter. If the correlation value is equal to or higher than alpha, decomposition is executed with the value of periods. Otherwise, the residual is calculated without decomposition. For other values of periods, the residual is also calculated without decomposition.
Only valid when detect_seasonality is True. If the user knows the seasonal period, specifying periods can speed up the calculation, especially when the time series is long.
No default value.
- random_state : int, optional
Specifies the seed for the random number generator.
0: Uses the current time (in seconds) as seed.
Others: Uses the specified value as seed.
Only valid when outlier_method is 'isolationforest'.
Defaults to 0.
- n_estimators : int, optional
Specifies the number of trees to grow.
Only valid when outlier_method is 'isolationforest'.
Defaults to 100.
- max_samples : int, optional
Specifies the number of samples to draw from the input to train each tree. If max_samples is larger than the number of samples provided, all samples will be used for all trees.
Only valid when outlier_method is 'isolationforest'.
Defaults to 256.
- bootstrap : bool, optional
Specifies the sampling method.
False: Sampling without replacement.
True: Sampling with replacement.
Only valid when outlier_method is 'isolationforest'.
Defaults to False.
- contamination : float, optional
The proportion of outliers in the dataset. Should be in the range (0, 0.5].
Only valid when outlier_method is 'isolationforest'. When outlier_method is 'isolationforest' and contamination is specified, threshold is not valid.
No default value.
- minpts : int, optional
Specifies the minimum number of points required to form a cluster. The point itself is not included in minpts.
Only valid when outlier_method is 'dbscan'.
Defaults to 1.
- eps : float, optional
Specifies the scan radius.
Only valid when outlier_method is 'dbscan'.
Defaults to 0.5.
- distiance_method : {'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'standardized_euclidean', 'cosine'}, optional
Specifies the method to compute the distance between two points.
Only valid when outlier_method is 'dbscan' or when voting_config includes 'dbscan' as a voting outlier detection method.
Defaults to 'euclidean'.
- dbscan_normalization : bool, optional
Specifies whether to normalize the data before applying the DBSCAN method.
False: Does not normalize the data.
True: Normalizes the data.
Only valid when outlier_method is 'dbscan' or when voting_config includes 'dbscan' as a voting outlier detection method.
Defaults to False.
- dbscan_outlier_from_cluster : bool, optional
Specifies how to take outliers from the DBSCAN result.
False: Takes the largest cluster as normal points and others as outliers.
True: Takes the points with CLUSTER_ID = -1 as outliers.
Only valid when outlier_method is 'dbscan' or when voting_config includes 'dbscan' as a voting outlier detection method.
Defaults to False.
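The two ways of reading outliers off a DBSCAN result can be sketched as follows. The function `outliers_from_clusters` is hypothetical and takes a plain list of cluster IDs. Excluding the noise label -1 when looking for the largest cluster is an assumption of this sketch, as is the handling of the case where no cluster exists.

```python
from collections import Counter

def outliers_from_clusters(cluster_ids, dbscan_outlier_from_cluster=False):
    """Return one 0/1 outlier flag per point, given its DBSCAN cluster ID."""
    if dbscan_outlier_from_cluster:
        # True: only points with CLUSTER_ID = -1 are outliers.
        return [1 if cid == -1 else 0 for cid in cluster_ids]
    # False: the largest cluster is normal, every other point is an outlier.
    counts = Counter(cid for cid in cluster_ids if cid != -1)
    if not counts:
        # Assumption: with no cluster at all, every point is treated as an outlier.
        return [1] * len(cluster_ids)
    largest = counts.most_common(1)[0][0]
    return [0 if cid == largest else 1 for cid in cluster_ids]

# Made-up cluster IDs: a cluster of 3, a cluster of 2 and one noise point.
cluster_ids = [0, 0, 0, 1, 1, -1]
flags_largest = outliers_from_clusters(cluster_ids, dbscan_outlier_from_cluster=False)
flags_noise = outliers_from_clusters(cluster_ids, dbscan_outlier_from_cluster=True)
```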
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside the range are ignored, and the function heuristically determines the number of threads to use.
Only valid when detect_seasonality is True, or outlier_method is 'isolationforest' or 'dbscan', or auto is True.
Defaults to -1.
- residual_usage : {"outlier_detection", "outlier_correction"}, optional
Specifies which residual to output.
'outlier_detection': Residual for outlier detection.
'outlier_correction': Residual for outlier correction.
Defaults to 'outlier_detection'.
- voting_config : dict, optional
Specifies the outlier detection methods used in the voting and their corresponding parameters and values. For each method, the available parameters are as follows:
'z1' : threshold.
'z2' : threshold.
'mad' : threshold.
'iqr' : threshold.
'isolationforest' : random_state, n_estimators, max_samples, bootstrap, threshold, contamination.
'dbscan' : eps, minpts, distiance_method, dbscan_normalization, dbscan_outlier_from_cluster.
An example is:
>>> od = OutlierDetectionTS(
...     voting_config={"z1": {"threshold": 10}, "z2": {"threshold": 1},
...                    "mad": {"threshold": 3}, "iqr": {"threshold": 2},
...                    "isolationforest": {"contamination": 0.2},
...                    "dbscan": {"minpts": 1, "eps": 0.5, "distiance_method": "euclidean",
...                               "dbscan_normalization": True, "dbscan_outlier_from_cluster": False}},
...     residual_usage="outlier_correction")
No default value.
- voting_outlier_method_criterion : float, optional
The criterion for outlier voting. Suppose the number of voters is N. If more than int(criterion * N) voters detect a point as an outlier, the point is treated as an outlier.
Only valid when voting_config is not None.
Defaults to 0.5.
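The voting rule can be written out directly. The function `vote_outlier` is a hypothetical helper that applies the rule to the 0/1 votes of all voters for a single data point.

```python
def vote_outlier(votes, criterion=0.5):
    """Return 1 if more than int(criterion * N) of the N voters flag the point, else 0."""
    n_voters = len(votes)
    return 1 if sum(votes) > int(criterion * n_voters) else 0

# Six voters: int(0.5 * 6) = 3, so at least four votes are needed.
three_of_six = vote_outlier([1, 1, 1, 0, 0, 0])
four_of_six = vote_outlier([1, 1, 1, 1, 0, 0])
# Five voters: int(0.5 * 5) = 2, so three votes are enough.
three_of_five = vote_outlier([1, 1, 1, 0, 0])
```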
- massive : bool, optional
Specifies whether or not to use massive mode.
True : massive mode.
False : single mode.
For parameter setting in massive mode, you can use either group_params (please see the example below) or the original parameters. The original parameters apply to all groups. However, if you define some parameters for a group, none of the original parameter settings apply to that group.
An example is as follows:
>>> od = OutlierDetectionTS(massive=True)
>>> od.fit_predict(data=df, key='ID', endog='Y', group_key="GROUP_ID")
Defaults to False.
- group_params : dict, optional
If massive mode is activated (massive is True), input data shall be divided into different groups with different parameters applied.
An example is as follows:
>>> od = OutlierDetectionTS(massive=True, group_params={'Group_1' : {'auto' : False}, 'Group_2' : {'auto' : True}})
>>> od.fit_predict(data=df, key='ID', endog='Y', group_key="GROUP_ID")
Valid only when massive is True.
Defaults to None.
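The interplay between the original parameters and group_params in massive mode, as described under massive, can be pictured with the following sketch. It only mirrors the documented rule and is not how the library resolves parameters internally. The helper `resolve_group_params` and the group names are hypothetical.

```python
def resolve_group_params(original_params, group_params, group_id):
    """Return the parameter settings that would apply to one group."""
    if group_id in group_params:
        # Parameters are defined for this group: the original settings do not apply to it.
        return dict(group_params[group_id])
    # No group-specific parameters: the original settings apply.
    return dict(original_params)

original_params = {"outlier_method": "iqr", "threshold": 2.0}
group_params = {"Group_1": {"auto": False}}

params_group_1 = resolve_group_params(original_params, group_params, "Group_1")
params_group_2 = resolve_group_params(original_params, group_params, "Group_2")
```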
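References
The outlier detection methods implemented in this class commonly consist of two steps:
1. Getting the residual from the original series.
2. Detecting the outliers from the residual.
Please refer to the documentation of these two steps for a detailed description of all methods and their related parameters.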
Examples
>>> tsod = OutlierDetectionTS(detect_seasonality=False, outlier_method='z1', window_size=3, threshold=3.0)
>>> res = tsod.fit_predict(data=df, key='ID', endog='Y')
Outputs:
>>> res.collect()
>>> tsod.stats_.collect()
- Attributes:
- stats_ : DataFrame
Data statistics, structured as follows:
STAT_NAME : Name of statistics.
STAT_VALUE : Value of statistics.
- error_msg_ : DataFrame
Error message. Only valid if massive is True when initializing an 'OutlierDetectionTS' instance.
Methods
fit_predict(data[, key, endog, group_key, ...]) : Detection of outliers in time-series data.
get_model_metrics() : Get the model metrics.
get_score_metrics() : Get the score metrics.
- fit_predict(data, key=None, endog=None, group_key=None, group_params=None)
Detection of outliers in time-series data.
- Parameters:
- data : DataFrame
Input data containing the target time-series.
data should have at least two columns: one is the ID column, the other is the raw data.
- key : str, optional
Specifies the ID column, in this case the column that shows the order of time-series.
It is recommended that you always specify this column manually.
Defaults to the first column of data if the index column of data is not provided. Otherwise, defaults to the index column of data.
- endog : str, optional
Specifies the column that contains the values of time-series to be tested.
Defaults to the first non-key column.
- group_key : str, optional
The column of group_key. Data type can be INT or NVARCHAR/VARCHAR. If data type is INT, only parameters set in the group_params are valid.
This parameter is only valid when massive is True in class instance initialization.
Defaults to the first column of data if the index columns of data are not provided. Otherwise, defaults to the first column of the index columns.
- group_params : dict, optional
If massive mode is activated (massive is set as True in class instance initialization), input data shall be divided into different groups with different parameters applied. This parameter specifies the parameter values w.r.t. different groups in a dict format, where the keys correspond to the values of group_key and each value should be a dict of parameter value assignments.
An example, written for UnifiedClassification, is as follows:
>>> uc = UnifiedClassification(func='logisticregression', multi_class=True, massive=True, max_iter=10, group_params={'Group_1': {'solver': 'auto'}})
>>> uc.fit(data=df, key='ID', features=["OUTLOOK", "TEMP", "HUMIDITY", "WINDY"], label="CLASS", group_key="GROUP_ID", background_size=4, group_params={'Group_1': {'background_random_state': 2}})
Valid only when massive is set as True in class instance initialization.
Defaults to None.
- Returns:
- DataFrame
Outlier detection result, structured as follows:
TIMESTAMP : ID of data.
RAW_DATA : Original value.
RESIDUAL : Residual.
OUTLIER_SCORE : Outlier score.
IS_OUTLIER : 0: normal, 1: outlier.
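Once the result has been collected, the flagged points can be picked out by the IS_OUTLIER column. The sketch below uses a plain list of dicts as a hypothetical stand-in for the collected result, with made-up values, so that it runs without a database connection.

```python
# Hypothetical stand-in for the collected result table (values are made up).
rows = [
    {"TIMESTAMP": 1, "RAW_DATA": 1.0, "RESIDUAL": 0.0, "OUTLIER_SCORE": 0.1, "IS_OUTLIER": 0},
    {"TIMESTAMP": 2, "RAW_DATA": 9.0, "RESIDUAL": 8.0, "OUTLIER_SCORE": 3.3, "IS_OUTLIER": 1},
    {"TIMESTAMP": 3, "RAW_DATA": 1.1, "RESIDUAL": 0.1, "OUTLIER_SCORE": 0.2, "IS_OUTLIER": 0},
]

# Keep the IDs of the points flagged as outliers (IS_OUTLIER == 1).
outlier_timestamps = [row["TIMESTAMP"] for row in rows if row["IS_OUTLIER"] == 1]
```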
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
Besides the methods mentioned above, the OutlierDetectionTS class also inherits methods from the PALBase class. Please refer to PAL Base for more details.