IsolationForest
- class hana_ml.algorithms.pal.preprocessing.IsolationForest(n_estimators=None, max_samples=None, max_features=None, bootstrap=None, random_state=None, thread_ratio=None, massive=False, group_params=None)
Isolation Forest generates anomaly scores for each sample.
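For orientation, the anomaly score in the standard Isolation Forest formulation can be sketched in plain Python. This is not hana_ml or PAL code, and whether PAL computes SCORE in exactly this way is an assumption; the function names below are hypothetical.

```python
def average_path_length(n):
    """c(n): average path length of an unsuccessful binary search tree lookup over n >= 2 points."""
    harmonic = sum(1.0 / i for i in range(1, n))  # H(n - 1), computed exactly
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """s = 2 ** (-E[h(x)] / c(n)); values close to 1 suggest an anomaly."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


# A sample whose mean path length equals c(n) scores 0.5 (no clear anomaly).
typical = anomaly_score(average_path_length(256), 256)
# A sample isolated immediately (path length 0) scores 1.0.
isolated = anomaly_score(0.0, 256)
```

Samples that are isolated after only a few splits have short paths and therefore scores closer to 1.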
- Parameters:
- n_estimators : int, optional
Specifies the number of trees to grow.
Defaults to 100.
- max_samples : int, optional
Specifies the number of samples to draw from input to train each tree. If max_samples is larger than the number of samples provided, all samples will be used for all trees.
Defaults to 256.
- max_features : int, optional
Specifies the number of features to draw from input to train each tree. 0 means no sampling.
Defaults to 0.
- bootstrap : bool, optional
Specifies the sampling method.
False: Sampling without replacement.
True: Sampling with replacement.
Defaults to False.
- random_state : int, optional
Specifies the seed for the random number generator.
0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.
Defaults to 0.
- thread_ratio : float, optional
Specifies the proportion of available threads to use, from 0 to 1. A value of 0 uses a single thread, while 1 uses all currently available threads. Values outside this range are ignored, in which case the function heuristically determines the number of threads to use.
Defaults to -1.
- massive : bool, optional
Specifies whether or not to use massive mode.
True: massive mode.
False: single mode.
For parameter settings in massive mode, you can use either group_params or the original parameters. Original parameters apply to all groups. However, if you define any parameter for a group in group_params, none of the original parameter settings apply to that group.
For example, with IsolationForest(massive=True, random_state=2, group_params={'Group_1': {'n_estimators': 50}}), 'n_estimators' is set in group_params for Group_1, so the 'random_state' setting does not apply to Group_1.
Defaults to False.
- group_params : dict, optional
If massive mode is activated (massive is True), input data is divided into different groups, and different parameters can be applied to each group. For example, group_params={'Group_1': {'n_estimators': 50}} sets n_estimators to 50 for Group_1.
Valid only when massive is True.
Defaults to None.
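The override rule for massive mode can be pictured with a small, pure-Python sketch. It is not hana_ml code: the helper `effective_params` is hypothetical and only models the rule stated above, namely that a group with its own entry in group_params ignores all original parameter settings, while every other group uses them.

```python
def effective_params(original, group_params, group):
    """Return the parameter settings that would apply to one group.

    Hypothetical helper modelling the documented rule: if a group has
    an entry in group_params, none of the original (constructor-level)
    settings apply to it; all other groups use the original settings.
    """
    group_params = group_params or {}
    if group in group_params:
        return dict(group_params[group])
    return dict(original)


original = {"random_state": 2}
group_params = {"Group_1": {"n_estimators": 50}}

group_1 = effective_params(original, group_params, "Group_1")
group_2 = effective_params(original, group_params, "Group_2")
print(group_1)  # {'n_estimators': 50}
print(group_2)  # {'random_state': 2}
```

Under this reading, any parameter missing from a group's effective settings would fall back to its documented default rather than to the original setting.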
- Attributes:
- model_ : DataFrame
Model content.
- error_msg_ : DataFrame
Error message. Only valid if massive is True when initializing an 'IsolationForest' instance.
Methods
fit(data[, key, features, group_key]): Fit the model to the training dataset.
fit_predict(data[, key, features, ...]): Train the isolation forest model and return labels for input data.
predict(data[, key, features, ...]): Obtain the anomaly score of each sample based on the given Isolation Forest model.
Examples
>>> isof = IsolationForest(random_state=2, thread_ratio=0)
>>> isof.fit(data=df_fit, key='ID', features=['V000', 'V001'])
>>> res = isof.predict(data=df_predict, key='ID', features=['V000', 'V001'], contamination=0.25)
>>> res.collect()
- fit(data, key=None, features=None, group_key=None)
Fit the model to the training dataset.
- Parameters:
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features : str or a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key, non-group_key (if massive=True) columns.
- group_key : str, optional
The column of group_key. The data type can be INT or NVARCHAR/VARCHAR. This parameter is only valid when massive is True.
Defaults to the first column of data if the index columns of data are not provided. Otherwise, defaults to the first column of index columns.
- Returns:
- A fitted object of class "IsolationForest".
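The default for features described in the fit parameters can be illustrated on a plain list of column names. The helper `default_features` below is hypothetical and does not touch a HANA DataFrame; it only mirrors the documented rule that features defaults to all non-key columns, additionally excluding the group_key column in massive mode.

```python
def default_features(columns, key=None, group_key=None, massive=False):
    """Pick the feature columns used when `features` is not given.

    Hypothetical helper: applies the documented default to a plain
    list of column names instead of a HANA DataFrame.
    """
    excluded = {key}
    if massive:
        # group_key is only excluded in massive mode.
        excluded.add(group_key)
    return [col for col in columns if col not in excluded]


single = default_features(["ID", "V000", "V001"], key="ID")
grouped = default_features(
    ["GROUP_ID", "ID", "V000", "V001"],
    key="ID",
    group_key="GROUP_ID",
    massive=True,
)
print(single)   # ['V000', 'V001']
print(grouped)  # ['V000', 'V001']
```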
- predict(data, key=None, features=None, contamination=None, thread_ratio=None, group_key=None, group_params=None, show_explainer=False, explain_scope=None, top_k_attributions=None)
Obtain the anomaly score of each sample based on the given Isolation Forest model.
- Parameters:
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- contamination : float, optional
The proportion of outliers in the dataset. Should be in the range (0, 0.5].
Defaults to 0.1.
- thread_ratio : float, optional
Specifies the proportion of available threads to use, from 0 to 1. A value of 0 uses a single thread, while 1 uses all currently available threads. Values outside this range are ignored, in which case the function heuristically determines the number of threads to use.
Defaults to -1.
- group_key : str, optional
The column of group_key. Data type can be INT or NVARCHAR/VARCHAR. This parameter is only valid when massive is set as True in class instance initialization.
Defaults to the first column of data if the index columns of data are not provided. Otherwise, defaults to the first column of index columns.
- group_params : dict, optional
If massive mode is activated (massive is set as True in class instance initialization), input data is divided into different groups with different parameters applied. This parameter specifies the parameter values of different groups in a dict format, where keys correspond to group_key values and each value is a dict of parameter assignments. For example, group_params={'Group_1': {'contamination': 0.2}} sets contamination to 0.2 for Group_1.
Valid only when massive is set as True in class instance initialization.
Defaults to None.
- show_explainer : bool, optional
If True, output the Shapley value in the REASON_CODE column.
Defaults to False.
- explain_scope : str, optional
Defines the scope of explanation.
'outliers': Only outliers
'all': All samples
Available when show_explainer is True.
Defaults to 'outliers'.
- top_k_attributions : int, optional
Specifies the number (k) of features to output that contribute most to the model's predictions.
Available when show_explainer is True.
Defaults to 10.
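The effect of top_k_attributions can be sketched as a simple ranking step. The helper and the attribution values below are hypothetical, and ranking by absolute Shapley value is an assumption made for this sketch, not a statement about PAL's internal ordering.

```python
def top_k_attributions(attributions, k=10):
    """Keep the k features with the largest absolute attribution.

    Hypothetical helper; ranking by absolute value is an assumption.
    """
    ranked = sorted(
        attributions.items(),
        key=lambda item: abs(item[1]),
        reverse=True,
    )
    return ranked[:k]


# Made-up attribution values for three features of one sample.
shap = {"V000": -0.42, "V001": 0.10, "V002": 0.31}
top2 = top_k_attributions(shap, k=2)
print(top2)  # [('V000', -0.42), ('V002', 0.31)]
```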
- Returns:
- DataFrame 1
The anomaly detection results, structured as follows:
GROUP_ID, group key column name (only valid if massive is True when initializing an 'IsolationForest' instance).
ID, ID column name.
SCORE, type DOUBLE, scoring result.
LABEL, type INTEGER, -1 for outliers and 1 for inliers.
REASON_CODE, type DOUBLE, Shapley value (available only if show_explainer is True).
- DataFrame 2
Error message. Only valid if massive is True when initializing an 'IsolationForest' instance.
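One way to picture how contamination relates the SCORE and LABEL columns is the plain-Python sketch below. It is not PAL's implementation: the helper is hypothetical, the scores are made up, and it assumes that a higher score means more anomalous and that the number of outliers is ceil(contamination * n).

```python
import math


def label_by_contamination(scores, contamination=0.1):
    """Label the highest-scoring fraction of samples as outliers (-1).

    Hypothetical helper. Assumptions: a higher score means more
    anomalous, and ceil(contamination * n) samples are flagged.
    """
    if not 0 < contamination <= 0.5:
        raise ValueError("contamination should be in the range (0, 0.5]")
    n_outliers = math.ceil(contamination * len(scores))
    # Sample indices ordered from most to least anomalous.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    outliers = set(order[:n_outliers])
    return [-1 if i in outliers else 1 for i in range(len(scores))]


# Made-up scores for eight samples.
scores = [0.41, 0.72, 0.39, 0.44, 0.68, 0.40, 0.43, 0.42]
labels = label_by_contamination(scores, contamination=0.25)
print(labels)  # [1, -1, 1, 1, -1, 1, 1, 1]
```

With contamination=0.25 and eight samples, the two highest-scoring samples receive the label -1 and the rest receive 1.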
- fit_predict(data, key=None, features=None, contamination=None, thread_ratio=None, group_key=None, group_params=None, show_explainer=False, explain_scope=None, top_k_attributions=None)
Train the isolation forest model and return labels for input data.
- Parameters:
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- contamination : float, optional
The proportion of outliers in the dataset. Should be in the range (0, 0.5].
Defaults to 0.1.
- thread_ratio : float, optional
Specifies the proportion of available threads to use, from 0 to 1. A value of 0 uses a single thread, while 1 uses all currently available threads. Values outside this range are ignored, in which case the function heuristically determines the number of threads to use.
Defaults to -1.
- group_key : str, optional
The column of group_key. Data type can be INT or NVARCHAR/VARCHAR. This parameter is only valid when massive is set as True in class instance initialization.
Defaults to the first column of data if the index columns of data are not provided. Otherwise, defaults to the first column of index columns.
- group_params : dict, optional
If massive mode is activated (massive is set as True in class instance initialization), input data is divided into different groups with different parameters applied. This parameter specifies the parameter values of different groups in a dict format, where keys correspond to group_key values and each value is a dict of parameter assignments.
An example is as follows:
>>> mif = IsolationForest(massive=True, random_state=2, group_params={'Group_1': {'n_estimators': 50}})
>>> mif.fit_predict(data=df, key="ID", group_key="GROUP_ID", features=['F1', 'F2'], group_params={'Group_1': {'contamination': 0.2}})
Valid only when massive is set as True in class instance initialization.
Defaults to None.
- show_explainer : bool, optional
If True, output the Shapley value in the REASON_CODE column.
Defaults to False.
- explain_scope : str, optional
Defines the scope of explanation.
'outliers': Only outliers
'all': All samples
Available when show_explainer is True.
Defaults to 'outliers'.
- top_k_attributions : int, optional
Specifies the number (k) of features to output that contribute most to the model's predictions.
Available when show_explainer is True.
Defaults to 10.
- Returns:
- DataFrame 1
The anomaly detection results, structured as follows:
GROUP_ID, group key column name (only valid if massive is True when initializing an 'IsolationForest' instance).
ID, type INTEGER, ID column name.
SCORE, type DOUBLE, scoring result.
LABEL, type INTEGER, -1 for outliers and 1 for inliers.
SHAP_VALUE, type DOUBLE, Shapley value (available only if show_explainer is True).
- DataFrame 2
Error message. Only valid if massive is True when initializing an 'IsolationForest' instance.
Inherited Methods from PALBase
Besides the methods mentioned above, the IsolationForest class also inherits methods from the PALBase class. Please refer to PAL Base for more details.