IsolationForest

class hana_ml.algorithms.pal.preprocessing.IsolationForest(n_estimators=None, max_samples=None, max_features=None, bootstrap=None, random_state=None, thread_ratio=None)

Isolation Forest generates the anomaly score of each sample.

Parameters
n_estimators : int, optional

Specifies the number of trees to grow.

Defaults to 100.

max_samples : int, optional

Specifies the number of samples to draw from the input to train each tree. If max_samples is larger than the number of samples provided, all samples are used for every tree.

Defaults to 256.

max_features : int, optional

Specifies the number of features to draw from the input to train each tree. 0 means no feature sampling.

Defaults to 0.
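The sampling rules for max_samples and max_features can be sketched in plain Python. This is only an illustration of the documented behaviour, not PAL code: the function name effective_sizes is hypothetical, and clamping max_features to the number of available columns is an assumption that the text above does not state.

```python
def effective_sizes(n_rows, n_cols, max_samples=256, max_features=0):
    """Return (samples per tree, features per tree) under the documented rules."""
    # Documented: if max_samples exceeds the number of rows, all rows are used.
    samples = min(max_samples, n_rows)
    # Documented: 0 means no feature sampling.
    # Assumption: a value above n_cols is clamped to n_cols.
    features = n_cols if max_features == 0 else min(max_features, n_cols)
    return samples, features


# With the defaults and an 8-row, 2-feature input, every tree sees all the data.
print(effective_sizes(n_rows=8, n_cols=2))
```

For small inputs such as the 8-row example on this page, the default max_samples of 256 therefore has no subsampling effect.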

bootstrap : bool, optional

Specifies the sampling method.

  • False: Sampling without replacement.

  • True: Sampling with replacement.

Defaults to False.

random_state : int, optional

Specifies the seed for the random number generator.

  • 0: Uses the current time (in seconds) as the seed.

  • Others: Uses the specified value as the seed.

Defaults to 0.

thread_ratio : float, optional

Specifies the ratio of available threads to use.

  • 0: single thread.

  • 0~1: that percentage of the available threads.

  • Others: heuristically determined.

Defaults to -1.
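The three thread_ratio cases can be restated as a small helper. This is a reading aid under the rules listed above, not a hana_ml function; the name thread_mode is hypothetical.

```python
def thread_mode(thread_ratio):
    """Describe how a thread_ratio value is interpreted per the rules above."""
    if thread_ratio == 0:
        return "single thread"
    if 0 < thread_ratio <= 1:
        # Formats 0.5 as "50%".
        return f"{thread_ratio:.0%} of available threads"
    # Anything outside [0, 1], including the default -1.
    return "heuristically determined by PAL"


for value in (0, 0.5, 1, -1):
    print(value, "->", thread_mode(value))
```

The default of -1 falls into the last case, so the number of threads is left to PAL unless a value in [0, 1] is given.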

Examples

Input dataframe df:

>>> df.collect()
    ID  V000  V001
0    0  -2.0  -1.0
1    1  -1.0  -1.0
2    2  -1.0  -2.0
3    3   1.0   1.0
4    4   1.0   2.0
5    5   2.0   1.0
6    6   6.0   3.0
7    7  -4.0   7.0

Create an IsolationForest object:

>>> isof = IsolationForest(random_state=2,
                           thread_ratio=0)

Perform fit on the given data:

>>> isof.fit(df, key='ID', features=['V000', 'V001'])

Output:

>>> isof.model_.collect()
    TREE_INDEX                                      MODEL_CONTENT
0            0  {"NS":8,"NF":2,"FX":[0,1],"1":{"SF":0,"SV":5.5...
1            1  {"NS":8,"NF":2,"FX":[0,1],"1":{"SF":0,"SV":5.5...
2            2  {"NS":8,"NF":2,"FX":[1,0],"1":{"SF":0,"SV":4.6...
3            3  {"NS":8,"NF":2,"FX":[1,0],"1":{"SF":0,"SV":4.6...
4            4  {"NS":8,"NF":2,"FX":[1,0],"1":{"SF":0,"SV":5.3...
    ......

Perform predict on the fitted model:

>>> res = isof.predict(df,
                       key='ID',
                       features=['V000', 'V001'],
                       contamination=0.25)

Output:

>>> res.collect()
    ID     SCORE  LABEL
0    0  0.446897      1
1    1  0.411048      1
2    2  0.498931      1
3    3  0.407796      1
4    4  0.423264      1
5    5  0.443270      1
6    6  0.619513     -1
7    7  0.638874     -1
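The relation between contamination and the LABEL column in the output above can be reproduced with a short sketch. It assumes that the highest-scoring fraction of samples is labeled -1 and that the outlier count is obtained by rounding contamination times the number of samples; how PAL actually derives its threshold is not stated here, and the function name label_by_contamination is hypothetical.

```python
def label_by_contamination(scores, contamination=0.1):
    """Label the top `contamination` fraction of scores as outliers (-1)."""
    if not 0 < contamination <= 0.5:
        raise ValueError("contamination must be in the range (0, 0.5]")
    # Assumption: the outlier count is the rounded product.
    n_outliers = round(contamination * len(scores))
    # Rank IDs from highest to lowest anomaly score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    outliers = set(ranked[:n_outliers])
    return {key: (-1 if key in outliers else 1) for key in scores}


# SCORE values copied from the example output above.
scores = {0: 0.446897, 1: 0.411048, 2: 0.498931, 3: 0.407796,
          4: 0.423264, 5: 0.443270, 6: 0.619513, 7: 0.638874}
labels = label_by_contamination(scores, contamination=0.25)
print(labels)
```

With contamination=0.25 and 8 samples, the two highest scores (IDs 6 and 7) receive -1, which agrees with the LABEL column shown above.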
Attributes
model_ : DataFrame

  • TREE_INDEX: type INTEGER, indicates the tree number.

  • MODEL_CONTENT: type NCLOB, model content.

Methods

fit(data[, key, features])

Trains the isolation forest on the input data.

fit_predict(data[, key, features, ...])

Fits the model on data and returns the anomaly score and label of each sample (1 for inliers, -1 for outliers).

predict(data[, key, features, ...])

Predicts the anomaly score and label of each sample using the fitted isolation forest.

fit(data, key=None, features=None)

Trains the isolation forest on the input data.

Parameters
data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column.

If key is not provided, then:

  • if data is indexed by a single column, then key defaults to that index column

  • otherwise, it is assumed that data contains no ID column

features : str or a list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

Returns
A fitted object of class "IsolationForest".

predict(data, key=None, features=None, contamination=None, thread_ratio=None)

Predicts the anomaly score and label of each sample using the fitted isolation forest.

Parameters
data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

contamination : float, optional

The proportion of outliers in the data set. Should be in the range (0, 0.5].

Defaults to 0.1.

thread_ratio : float, optional

Controls the proportion of available threads to be used for prediction.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to -1.

Returns
DataFrame

The anomaly detection results, structured as follows:

  • ID, type INTEGER, ID column name.

  • SCORE, type DOUBLE, anomaly score of the sample.

  • LABEL, type INTEGER, -1 for outliers and 1 for inliers.
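This page does not spell out how SCORE is computed. For orientation only, the sketch below implements the score from the original Isolation Forest formulation by Liu, Ting and Zhou, where a sample's mean path length over all trees is normalized by the average path length of an unsuccessful binary search tree lookup. Whether PAL uses exactly this normalization is an assumption, and the function names are illustrative rather than part of hana_ml.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(n):
    """c(n): average path length of an unsuccessful BST search over n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    # H(n - 1) is approximated by ln(n - 1) + gamma.
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); values near 1 suggest an anomaly."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


n = 256  # subsample size, matching the default max_samples
print(anomaly_score(2.0, n))                     # short path, score above 0.5
print(anomaly_score(average_path_length(n), n))  # average path, score of 0.5
```

Under this formulation, samples that are isolated after few splits score above 0.5 and samples buried deep in the trees score below it, which fits the direction of the scores in the example (the two outliers have the highest values).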

fit_predict(data, key=None, features=None, contamination=None, thread_ratio=None)

Fits the model on data and returns the anomaly score and label of each sample (1 for inliers, -1 for outliers).

Parameters
data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

contamination : float, optional

The proportion of outliers in the data set. Should be in the range (0, 0.5].

Defaults to 0.1.

thread_ratio : float, optional

Controls the proportion of available threads to be used for prediction.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to -1.

Returns
DataFrame

The anomaly detection results, structured as follows:

  • ID, type INTEGER, ID column name.

  • SCORE, type DOUBLE, anomaly score of the sample.

  • LABEL, type INTEGER, -1 for outliers and 1 for inliers.
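A common follow-up to fit_predict is to pull out the keys of the rows labeled as outliers. The helper below is a minimal sketch: outlier_ids is a hypothetical name, and it assumes only that the estimator exposes fit_predict with the signature shown above and that the collected result allows column lookup by name (as a pandas DataFrame does).

```python
def outlier_ids(estimator, data, key="ID", features=None, contamination=0.1):
    """Return the key values that fit_predict labels as outliers (-1)."""
    result = estimator.fit_predict(data,
                                   key=key,
                                   features=features,
                                   contamination=contamination)
    table = result.collect()  # fetch the result to the client
    # Pair each key with its label and keep the outliers.
    return [k for k, label in zip(table[key], table["LABEL"]) if label == -1]


# Possible usage with a live SAP HANA connection (not executed here):
# isof = IsolationForest(random_state=2, thread_ratio=0)
# ids = outlier_ids(isof, df, key='ID', contamination=0.25)
```

Passing the estimator in as an argument keeps the helper independent of a database connection, so it can be exercised with a stub object in unit tests.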

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods mentioned above, the IsolationForest class also inherits methods from the PALBase class. Please refer to PAL Base for more details.