OutlierDetectionRegression
- class hana_ml.algorithms.pal.preprocessing.OutlierDetectionRegression(regression_model=None, threshold=None, iter_num=None, max_depth=None, thread_ratio=None, eta=None)
In regression, an outlier is a data point that deviates from the general behavior of the remaining data points. Outlier detection depends on the regression model used. This procedure supports two commonly used regression models: the linear model and the tree model.
Outlier detection in regression proceeds in two steps. In step 1, the residual is computed from the original data and the selected model. In step 2, an outlier score is calculated for each data point from its residual and compared with the threshold. For the linear model, the outlier score is the deleted studentized residual. For the tree model, the outlier score is the z-score of the residual.
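The two scores named above can be illustrated with a minimal pure-Python sketch. It uses the textbook definitions of the deleted studentized residual (for a one-feature least-squares fit) and of the z-score; whether PAL computes them in exactly this way is an assumption, and the function names and sample data are hypothetical and not part of hana_ml.

```python
import math

def z_scores(residuals):
    """Z-score of each residual: (r - mean) / standard deviation."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Assumption: population standard deviation (divide by n).
    std = math.sqrt(sum((r - mean) ** 2 for r in residuals) / n)
    return [(r - mean) / std for r in residuals]

def deleted_studentized_residuals(x, y):
    """Residuals and deleted studentized residuals for y ~ a + b * x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    # Step 1: residuals from the data and the fitted model.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    scores = []
    for xi, e in zip(x, residuals):
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx  # leverage of this point
        # Residual variance of the fit with this point left out.
        s2_deleted = (sse - e ** 2 / (1.0 - h)) / (n - p - 1)
        scores.append(e / math.sqrt(s2_deleted * (1.0 - h)))
    return residuals, scores

def flag_outliers(scores, threshold=3.0):
    """Step 2: flag points whose absolute score exceeds the threshold."""
    return [1 if abs(s) > threshold else 0 for s in scores]

# Linear-model style scoring: the point at x = 8 lies far off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 30.0, 18.1, 19.8]
residuals, linear_scores = deleted_studentized_residuals(x, y)
linear_flags = flag_outliers(linear_scores, threshold=3.0)

# Tree-model style scoring: z-scores of residuals that are already given.
# The threshold is lowered to 2 here only because the sample is tiny.
tree_residuals = [0.1, -0.2, 0.0, 0.3, -0.1, 5.0, -0.3, 0.2]
tree_flags = flag_outliers(z_scores(tree_residuals), threshold=2.0)

print(linear_flags)
print(tree_flags)
```

The deleted studentized residual rescales each residual by a variance estimate that excludes the point itself, so a single extreme point does not inflate its own yardstick. In this sample, the same point that the linear score flags would stay below a threshold of 3 if only the plain z-score of its residual were used.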
- Parameters:
- regression_model : str, optional
'linear' : linear model.
'tree' : tree model.
Defaults to 'linear'.
- threshold : float, optional
The threshold for the outlier score. If the absolute value of a data point's outlier score exceeds the threshold, OutlierDetectionRegression considers that data point an outlier.
Defaults to 3.
- iter_num : int, optional
Total iteration number, which is equivalent to the number of trees in the final model.
Only valid when regression_model is 'tree'.
Defaults to 10.
- max_depth : int, optional
The maximum depth of each tree.
Only valid when regression_model is 'tree'.
Defaults to 6.
- eta : float, optional
Learning rate of each iteration. Range: (0, 1].
Only valid when regression_model is 'tree'.
Defaults to 0.3.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to -1.
Examples
>>> tsreg = OutlierDetectionRegression(regression_model='linear')
>>> res = tsreg.fit_predict(data=df, key='ID', label='Y')
>>> res.collect()
- Attributes:
- stats_ : DataFrame
Statistics.
- metrics_ : DataFrame
Relevant metrics.
Methods
fit_predict(data[, key, features, label, ...]) : Detection of outliers with regression model.
- fit_predict(data, key=None, features=None, label=None, categorical_variable=None)
Detection of outliers with regression model.
- Parameters:
- data : DataFrame
Input data.
- key : str, optional
Name of the ID column in data.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, key defaults to the first column.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- Returns:
- DataFrame
Result, structured as follows:
ID : ID of data.
TARGET : Dependent variable.
RESIDUAL : Residual.
OUTLIER_SCORE : Outlier score.
IS_OUTLIER : 0: normal, 1: outlier.
Inherited Methods from PALBase
In addition to the methods described above, the OutlierDetectionRegression class inherits methods from the PALBase class. Refer to PAL Base for more details.