IsolationForest

class hana_ml.algorithms.pal.preprocessing.IsolationForest(n_estimators=None, max_samples=None, max_features=None, bootstrap=None, random_state=None, thread_ratio=None, massive=False, group_params=None)

Isolation Forest generates an anomaly score for each sample.

Parameters:

n_estimators : int, optional

Specifies the number of trees to grow.

Defaults to 100.

max_samples : int, optional

Specifies the number of samples to draw from the input to train each tree. If max_samples is larger than the number of samples provided, all samples are used for every tree.

Defaults to 256.

max_features : int, optional

Specifies the number of features to draw from the input to train each tree. 0 means no feature sampling.

Defaults to 0.

bootstrap : bool, optional

Specifies the sampling method.

False: Sampling without replacement.
True: Sampling with replacement.

Defaults to False.

random_state : int, optional

Specifies the seed for the random number generator.

0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.

Defaults to 0.

thread_ratio : float, optional

Specifies the ratio of available threads to use, in the range 0 to 1. A value of 0 means a single thread is used, while 1 means all currently available threads are used. Values outside this range are ignored, and the number of threads to use is determined heuristically.

Defaults to -1.

massive : bool, optional

Specifies whether or not to use massive mode.

True: massive mode.
False: single mode.

In massive mode, parameters can be set either through group_params or through the original parameters. Original parameters apply to all groups. However, if any parameter is defined for a group in group_params, none of the original parameter settings apply to that group.

For example, if 'n_estimators' is set in group_params for Group_1, the original setting of 'random_state' does not apply to Group_1.

Defaults to False.

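group_params : dict, optional

If massive mode is activated (massive is True), the input data is divided into groups, and different parameters can be applied to each group. This parameter specifies the parameter values per group as a dict: each key is a group_key value, and each value is a dict of parameter assignments for that group.

Valid only when massive is True. Defaults to None.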
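The precedence between original parameters and group_params in massive mode can be sketched in plain Python. This is only an illustration of the rule documented above, not the library's implementation: the helper name effective_params and the parameter values are hypothetical.

```python
def effective_params(original, group_params, group):
    """Return the parameter settings that apply to one group in massive mode.

    Hypothetical helper mirroring the documented rule: once a group has
    any entry in group_params, the original (constructor-level) settings
    no longer apply to that group.
    """
    per_group = (group_params or {}).get(group)
    if per_group:
        return dict(per_group)
    return dict(original)


original = {"random_state": 2}                    # set as an original parameter
group_params = {"Group_1": {"n_estimators": 50}}  # illustrative values

print(effective_params(original, group_params, "Group_1"))  # {'n_estimators': 50}
print(effective_params(original, group_params, "Group_2"))  # {'random_state': 2}
```

Group_1 loses the 'random_state' setting because it has its own entry in group_params, while Group_2 keeps the original setting.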

Examples

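>>> from hana_ml.algorithms.pal.preprocessing import IsolationForest
>>> # df_fit and df_predict are assumed to be hana_ml DataFrames with columns ID, V000, V001
>>> isof = IsolationForest(random_state=2, thread_ratio=0)
>>> isof.fit(data=df_fit, key='ID', features=['V000', 'V001'])
>>> res = isof.predict(data=df_predict,
                       key='ID',
                       features=['V000', 'V001'],
                       contamination=0.25)
>>> res.collect()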

Attributes:

model_ : DataFrame

Model content.

error_msg_ : DataFrame

Error message. Only valid if massive is True when initializing an 'IsolationForest' instance.

Methods

`fit`(data[, key, features, group_key])	Fit the model to the training dataset.
`fit_predict`(data[, key, features, ...])	Train the Isolation Forest model and return labels for the input data.
`get_model_metrics`()	Get the model metrics.
`get_score_metrics`()	Get the score metrics.
`predict`(data[, key, features, ...])	Obtain the anomaly score of each sample based on the given Isolation Forest model.

fit(data, key=None, features=None, group_key=None)

Fit the model to the training dataset.

Parameters:

data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column.

If key is not provided:

If data is indexed by a single column, key defaults to that index column.

Otherwise, data is assumed to contain no ID column.

features : str or a list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

group_key : str, optional

Name of the group column. The data type can be INT or NVARCHAR/VARCHAR. If the data type is INT, only parameters set in group_params are valid.

This parameter is only valid when massive is True.

Defaults to the first column of data if data has no index columns. Otherwise, defaults to the first index column.

Returns:

A fitted object of class "IsolationForest".
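The default rules for key and group_key in fit can be summarized with a small plain-Python sketch. It only restates the rules listed above and does not reproduce the library's code. The helpers default_key and default_group_key are hypothetical, and the DataFrame index is modelled here as None, a single column name, or a list of column names.

```python
def default_key(key, index):
    """Resolve the ID column for fit(); returns None when data has no ID column."""
    if key is not None:
        return key
    if isinstance(index, str):
        return index                  # indexed by a single column
    if isinstance(index, (list, tuple)) and len(index) == 1:
        return index[0]               # single-column index given as a list
    return None                       # no index, or a multi-column index


def default_group_key(group_key, columns, index):
    """Resolve the group column used in massive mode."""
    if group_key is not None:
        return group_key
    if not index:
        return columns[0]             # no index columns: first column of data
    if isinstance(index, str):
        return index
    return index[0]                   # otherwise: first index column


print(default_key(None, "ID"))                              # ID
print(default_key(None, ["GROUP_ID", "ID"]))                # None
print(default_group_key(None, ["GROUP_ID", "ID"], None))    # GROUP_ID
```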

predict(data, key=None, features=None, contamination=None, thread_ratio=None, group_key=None, group_params=None)

Obtain the anomaly score of each sample based on the given Isolation Forest model.

Parameters:

data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

contamination : float, optional

The proportion of outliers in the dataset. Should be in the range (0, 0.5].

Defaults to 0.1.

thread_ratio : float, optional

Specifies the ratio of available threads to use, in the range 0 to 1. A value of 0 means a single thread is used, while 1 means all currently available threads are used. Values outside this range are ignored, and the number of threads to use is determined heuristically.

Defaults to -1.

group_key : str, optional

Name of the group column. The data type can be INT or NVARCHAR/VARCHAR. If the data type is INT, only parameters set in group_params are valid.

This parameter is only valid when massive is set to True at class instance initialization.

Defaults to the first column of data if data has no index columns. Otherwise, defaults to the first index column.

group_params : dict, optional

If massive mode is activated (massive is set to True at class instance initialization), the input data is divided into groups, and different parameters can be applied to each group. This parameter specifies the parameter values per group as a dict: each key is a group_key value, and each value is a dict of parameter assignments for that group.

Valid only when massive is set to True at class instance initialization.

Defaults to None.

Returns:

DataFrame 1

Anomaly detection results, structured as follows:

ID, type INTEGER, ID column name.

SCORE, type DOUBLE, scoring result.

LABEL, type INTEGER, -1 for outliers and 1 for inliers.

DataFrame 2

Error message. Only valid if massive is True when initializing an 'IsolationForest' instance.
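To make the relationship between contamination, SCORE, and LABEL concrete, here is a minimal sketch of one common way to turn a contamination fraction into labels. The exact thresholding rule used by PAL is not stated in this section, so two assumptions are made: higher scores are more anomalous (as in the original Isolation Forest formulation), and the int(contamination * n) highest-scoring samples are labelled as outliers. The function name label_by_contamination is hypothetical.

```python
def label_by_contamination(scores, contamination=0.1):
    """Label the highest-scoring fraction of samples as outliers (-1), the rest as inliers (1)."""
    if not 0 < contamination <= 0.5:
        raise ValueError("contamination should be in the range (0, 0.5]")
    n_outliers = int(contamination * len(scores))   # assumption: round down
    # Sample positions ordered from most to least anomalous.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    outliers = set(ranked[:n_outliers])
    return [-1 if i in outliers else 1 for i in range(len(scores))]


scores = [0.42, 0.45, 0.81, 0.40]                   # illustrative SCORE values
print(label_by_contamination(scores, contamination=0.25))   # [1, 1, -1, 1]
```

With four samples and contamination=0.25, exactly one sample (the one with the highest score) receives the label -1.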

fit_predict(data, key=None, features=None, contamination=None, thread_ratio=None, group_key=None, group_params=None)

Train the Isolation Forest model and return labels for the input data.

Parameters:

data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

contamination : float, optional

The proportion of outliers in the dataset. Should be in the range (0, 0.5].

Defaults to 0.1.

thread_ratio : float, optional

Specifies the ratio of available threads to use, in the range 0 to 1. A value of 0 means a single thread is used, while 1 means all currently available threads are used. Values outside this range are ignored, and the number of threads to use is determined heuristically.

Defaults to -1.

group_key : str, optional

Name of the group column. The data type can be INT or NVARCHAR/VARCHAR. If the data type is INT, only parameters set in group_params are valid.

This parameter is only valid when massive is set to True at class instance initialization.

Defaults to the first column of data if data has no index columns. Otherwise, defaults to the first index column.

group_params : dict, optional

If massive mode is activated (massive is set to True at class instance initialization), the input data is divided into groups, and different parameters can be applied to each group. This parameter specifies the parameter values per group as a dict: each key is a group_key value, and each value is a dict of parameter assignments for that group.

Valid only when massive is set to True at class instance initialization.

Defaults to None.

Returns:

DataFrame 1

Anomaly detection results, structured as follows:

ID, type INTEGER, ID column name.

SCORE, type DOUBLE, Scoring result.

LABEL, type INTEGER, -1 for outliers and 1 for inliers.

DataFrame 2

Error message. Only valid if massive is True when initializing an 'IsolationForest' instance.
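The thread_ratio semantics shared by the constructor, predict, and fit_predict can be illustrated with a short plain-Python sketch. The documentation fixes only the endpoints (0 means one thread, 1 means all available threads) and the out-of-range behavior. How intermediate ratios are rounded is an assumption here, and the function name threads_for_ratio is hypothetical.

```python
def threads_for_ratio(thread_ratio, available):
    """Map thread_ratio to a thread count, or None when the heuristic decides."""
    if thread_ratio is None or not 0 <= thread_ratio <= 1:
        return None                      # out of range (e.g. the default -1): heuristic
    if thread_ratio == 0:
        return 1                         # single thread
    # Assumption: scale the available threads and round down, using at least one.
    return max(1, int(thread_ratio * available))


print(threads_for_ratio(0, 8))     # 1
print(threads_for_ratio(1, 8))     # 8
print(threads_for_ratio(0.5, 8))   # 4
print(threads_for_ratio(-1, 8))    # None
```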

get_model_metrics()

Get the model metrics.

Returns:

DataFrame: The model metrics.

get_score_metrics()

Get the score metrics.

Returns:

DataFrame: The score metrics.

Inherited Methods from PALBase

Besides the methods described above, the IsolationForest class also inherits methods from the PALBase class. Please refer to PAL Base for more details.