OutlierDetectionRegression

class hana_ml.algorithms.pal.preprocessing.OutlierDetectionRegression(regression_model=None, threshold=None, iter_num=None, max_depth=None, thread_ratio=None, eta=None)

In regression, an outlier is a data point that deviates from the general behavior of the remaining data points. Outlier detection depends on the regression model. This procedure supports two commonly used regression models: the linear model and the tree model.

In regression, outlier detection proceeds in two steps. In step 1, the residual is computed from the original data and the selected model. In step 2, outliers are detected from the residual: an outlier score is calculated for each data point from its residual and compared with the threshold. For the linear model, the outlier score is the deleted studentized residual. For the tree model, the outlier score is the z-score of the residual.
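The two steps can be sketched in plain Python for the linear case. This is a minimal illustration using the textbook formula for the deleted (externally) studentized residual on a single-feature least-squares fit; it is not the PAL implementation, and the helper names `fit_line` and `deleted_studentized_residuals` as well as the sample data are assumptions made for this example.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b

def deleted_studentized_residuals(x, y):
    """Hypothetical helper: returns (residuals, scores) for a simple linear fit."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    a, b = fit_line(x, y)
    # Step 1: residuals from the data and the fitted model.
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sse = sum(e * e for e in residuals)
    # Step 2: score each point with the variance estimated without that point.
    scores = []
    for xi, e in zip(x, residuals):
        h = 1.0 / n + (xi - x_mean) ** 2 / sxx  # leverage of this point
        s2_deleted = (sse - e * e / (1.0 - h)) / (n - p - 1)
        scores.append(e / math.sqrt(s2_deleted * (1.0 - h)))
    return residuals, scores

# Illustrative data: roughly y = 2x, with one deliberately shifted point at x = 7.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 25.0, 16.1, 17.9, 20.2]

threshold = 3.0
residuals, scores = deleted_studentized_residuals(x, y)
flags = [1 if abs(s) > threshold else 0 for s in scores]
print(flags)
```

Because each score uses a variance estimate that leaves the scored point out, a single extreme point does not inflate its own denominator, which is why the shifted point stands out clearly here.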

Parameters:

regression_model : str, optional

'linear' : linear model.
'tree' : tree model.

Defaults to 'linear'.

threshold : float, optional

The threshold for the outlier score. If the absolute value of the outlier score exceeds the threshold, the corresponding data point is considered an outlier.

Defaults to 3.

iter_num : int, optional

Total number of iterations, which equals the number of trees in the final model.

Only valid when regression_model is 'tree'.

Defaults to 10.

max_depth : int, optional

The maximum depth of each tree.

Only valid when regression_model is 'tree'.

Defaults to 6.

eta : float, optional

Learning rate of each iteration. Range: (0, 1].

Only valid when regression_model is 'tree'.

Defaults to 0.3.

thread_ratio : float, optional

Specifies the proportion of available threads to use, from 0 to 1. A value of 0 means a single thread, while 1 means all currently available threads. Values outside this range are ignored, and the number of threads is then determined heuristically.

Defaults to -1.

Attributes:

stats_ : DataFrame

Statistics.

metrics_ : DataFrame

Relevant metrics.

Methods

fit_predict(data[, key, features, label, ...])

Detection of outliers with regression model.

Examples

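>>> from hana_ml.algorithms.pal.preprocessing import OutlierDetectionRegression
>>> # df is assumed to be an existing hana_ml DataFrame containing the columns 'ID' and 'Y'
>>> tsreg = OutlierDetectionRegression(regression_model='linear')
>>> res = tsreg.fit_predict(data=df, key='ID', label='Y')
>>> res.collect()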

fit_predict(data, key=None, features=None, label=None, categorical_variable=None)

Detection of outliers with regression model.

Parameters:

data : DataFrame

Input data.

key : str, optional

Name of the ID column in data.

If key is not provided, then:

if data is indexed by a single column, key defaults to that index column;
otherwise, key defaults to the first column.

features : a list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the name of the last non-ID column.

categorical_variable : str or a list of str, optional

Specifies which INTEGER columns are treated as categorical; all other INTEGER columns are treated as continuous.

No default value.
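The documented defaults for key, features, and label can be read as a small resolution rule. The sketch below restates that rule in plain Python; `resolve_columns` is a hypothetical helper written for this illustration and is not part of hana_ml, and the column names are made up.

```python
def resolve_columns(columns, index=None, key=None, features=None, label=None):
    """Hypothetical helper mirroring the documented defaults (not hana_ml code).

    columns: ordered list of column names in the input data.
    index:   index column name (str), a list of index columns, or None.
    """
    if key is None:
        if isinstance(index, str):
            key = index                      # indexed by a single column
        elif isinstance(index, (list, tuple)) and len(index) == 1:
            key = index[0]                   # single-element index list
        else:
            key = columns[0]                 # otherwise, the first column
    non_id = [c for c in columns if c != key]
    if label is None:
        label = non_id[-1]                   # last non-ID column
    if features is None:
        features = [c for c in non_id if c != label]  # all non-ID, non-label columns
    return key, features, label

default_result = resolve_columns(['ID', 'X1', 'X2', 'Y'])
indexed_result = resolve_columns(['X1', 'ROW', 'Y'], index='ROW')
print(default_result)
print(indexed_result)
```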

Returns:

DataFrame

Result, structured as follows:

ID : ID of data.
TARGET : Dependent variable.
RESIDUAL : Residual.
OUTLIER_SCORE : Outlier score.
IS_OUTLIER : 0: normal, 1: outlier.
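As a reading aid for the result layout, the following plain-Python sketch picks out the flagged rows and checks that IS_OUTLIER agrees with the threshold rule described for the threshold parameter. The rows are made-up illustrative values, not output from PAL, and holding the result as a list of tuples is an assumption for this example.

```python
# Illustrative rows in the documented column order:
# (ID, TARGET, RESIDUAL, OUTLIER_SCORE, IS_OUTLIER). Values are made up.
rows = [
    (1, 2.1, -0.10, -0.03, 0),
    (2, 25.0, 9.57, 4.20, 1),
    (3, 16.1, -1.53, -0.50, 0),
    (4, 3.0, -8.90, -3.60, 1),
]

threshold = 3.0  # documented default threshold

# IDs of the rows flagged as outliers.
outlier_ids = [r[0] for r in rows if r[4] == 1]

# A row is flagged exactly when the absolute outlier score exceeds the threshold.
consistent = all((abs(r[3]) > threshold) == (r[4] == 1) for r in rows)

print(outlier_ids, consistent)
```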

Inherited Methods from PALBase

Besides the methods described above, the OutlierDetectionRegression class also inherits methods from the PALBase class. Please refer to PAL Base for more details.