IsolationForest
- class hana_ml.algorithms.pal.preprocessing.IsolationForest(n_estimators=None, max_samples=None, max_features=None, bootstrap=None, random_state=None, thread_ratio=None, massive=False, group_params=None)
Isolation Forest generates an anomaly score for each sample.
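For orientation, the original Isolation Forest formulation (Liu, Ting and Zhou) defines the score as s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the mean path length of a sample across the trees and c(n) is the average path length for n samples. The sketch below computes that textbook score in plain Python. Whether PAL uses exactly this normalization is an assumption, and the function names are illustrative only.

```python
def average_path_length(n: int) -> float:
    """Textbook normalizer c(n): average path length of an unsuccessful
    binary-search-tree lookup over n samples."""
    if n <= 1:
        return 0.0
    harmonic = sum(1.0 / i for i in range(1, n))  # H(n - 1), computed exactly
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length: float, n: int) -> float:
    """Textbook score s = 2 ** (-E[h(x)] / c(n)).

    Short paths (easily isolated samples) give scores close to 1;
    a path length equal to c(n) gives exactly 0.5.
    """
    c_n = average_path_length(n)
    if c_n == 0.0:
        raise ValueError("need at least two samples per tree")
    return 2.0 ** (-mean_path_length / c_n)
```

Under this formulation, a higher score means a sample was isolated in fewer splits, which is consistent with the outliers receiving the largest SCORE values in the example further down.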
- Parameters:
- n_estimators : int, optional
Specifies the number of trees to grow.
Defaults to 100.
- max_samples : int, optional
Specifies the number of samples to draw from the input to train each tree. If max_samples is larger than the number of samples provided, all samples are used for all trees.
Defaults to 256.
- max_features : int, optional
Specifies the number of features to draw from the input to train each tree. 0 means no sampling.
Defaults to 0.
- bootstrap : bool, optional
Specifies the sampling method.
False: Sampling without replacement.
True: Sampling with replacement.
Defaults to False.
- random_state : int, optional
Specifies the seed for the random number generator.
0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.
Defaults to 0.
- thread_ratio : float, optional
Specifies the ratio of available threads to use.
0: single thread.
0~1: the fraction of available threads to use.
Others: the number of threads is heuristically determined.
Defaults to -1.
- massive : bool, optional
Specifies whether or not to use massive mode.
True: massive mode.
False: single mode.
In massive mode, parameters can be set through group_params, through the original parameters, or both. Original parameters apply to all groups. However, if any parameter is defined for a group in group_params, none of the original parameter settings apply to that group.
For example, if 'n_estimators' is set in group_params for Group_1, the original setting of 'random_state' does not apply to Group_1.
Defaults to False.
- group_params : dict, optional
If massive mode is activated (massive is True), the input data is divided into groups, and different parameters can be applied to each group.
Valid only when massive is True.
Defaults to None.
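The override rule described for massive and group_params can be sketched in plain Python. This models the documented behavior only and is not hana_ml code. The helper name effective_params and the parameter values (random_state=2, n_estimators=50) are hypothetical; only the group name Group_1 and the parameter names come from the description above.

```python
def effective_params(original, group_params, group):
    """Return the settings that apply to one group under the documented rule.

    If the group defines any parameter in group_params, only those values are
    used and every original (instance-level) setting is ignored for that group.
    """
    group_params = group_params or {}
    if group_params.get(group):
        return dict(group_params[group])
    return dict(original)


# Hypothetical values: random_state is set as an original parameter,
# n_estimators is set only for Group_1 through group_params.
original = {"random_state": 2}
group_params = {"Group_1": {"n_estimators": 50}}

group_1 = effective_params(original, group_params, "Group_1")
group_2 = effective_params(original, group_params, "Group_2")
```

With these inputs, Group_1 resolves to its own n_estimators setting and random_state is not applied to it, while Group_2 keeps the original random_state.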
Examples
Input dataframe df:
>>> df.collect()
   ID  V000  V001
0   0  -2.0  -1.0
1   1  -1.0  -1.0
2   2  -1.0  -2.0
3   3   1.0   1.0
4   4   1.0   2.0
5   5   2.0   1.0
6   6   6.0   3.0
7   7  -4.0   7.0
Create an IsolationForest instance:
>>> isof = IsolationForest(random_state=2, thread_ratio=0)
Perform fit on the given data:
>>> isof.fit(data=df, key='ID', features=['V000', 'V001'])
Output:
>>> isof.model_.collect()
   TREE_INDEX                                      MODEL_CONTENT
0           0  {"NS":8,"NF":2,"FX":[0,1],"1":{"SF":0,"SV":5.5...
1           1  {"NS":8,"NF":2,"FX":[0,1],"1":{"SF":0,"SV":5.5...
2           2  {"NS":8,"NF":2,"FX":[1,0],"1":{"SF":0,"SV":4.6...
3           3  {"NS":8,"NF":2,"FX":[1,0],"1":{"SF":0,"SV":4.6...
4           4  {"NS":8,"NF":2,"FX":[1,0],"1":{"SF":0,"SV":5.3...
......
Perform predict on the fitted model:
>>> res = isof.predict(df, key='ID', features=['V000', 'V001'], contamination=0.25)
Output:
>>> res.collect()
   ID     SCORE  LABEL
0   0  0.446897      1
1   1  0.411048      1
2   2  0.498931      1
3   3  0.407796      1
4   4  0.423264      1
5   5  0.443270      1
6   6  0.619513     -1
7   7  0.638874     -1
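To see how contamination relates to the LABEL column in the predict output, the following sketch labels the highest-scoring fraction of samples as outliers. The scores are copied from that output. The thresholding rule (flag the top round(contamination * n) scores) is an assumption made for illustration and is not necessarily how PAL derives its labels; label_by_contamination is a hypothetical helper.

```python
def label_by_contamination(scores, contamination):
    """Label the top `contamination` fraction of scores -1 (outlier), the rest 1."""
    if not 0 < contamination <= 0.5:
        raise ValueError("contamination should be in the range (0, 0.5]")
    n_outliers = round(contamination * len(scores))
    # Sample positions ordered from highest to lowest anomaly score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    outliers = set(ranked[:n_outliers])
    return [-1 if i in outliers else 1 for i in range(len(scores))]


# SCORE values for IDs 0..7, copied from the predict output.
scores = [0.446897, 0.411048, 0.498931, 0.407796,
          0.423264, 0.443270, 0.619513, 0.638874]
labels = label_by_contamination(scores, contamination=0.25)
```

With 8 samples and contamination=0.25, two samples are flagged, which matches the two -1 labels (IDs 6 and 7) in the documented output.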
- Attributes:
- model_ : DataFrame
TREE_INDEX: indicates the tree number.
MODEL_CONTENT: model content.
- error_msg_ : DataFrame
Error message. Only valid if massive is True when initializing an 'IsolationForest' instance.
Methods
fit(data[, key, features, group_key]): Train the forest with the input data.
fit_predict(data[, key, features, ...]): Train the isolation forest model and return labels for the input data.
predict(data[, key, features, ...]): Obtain the anomaly score of each sample based on the given Isolation Forest model.
- fit(data, key=None, features=None, group_key=None)
Train the forest with the input data.
- Parameters:
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features : str or a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- group_key : str, optional
Name of the group column. The data type can be INT or NVARCHAR/VARCHAR. If the data type is INT, only parameters set in group_params are valid.
This parameter is only valid when massive is True.
Defaults to the first column of data if data has no index columns. Otherwise, defaults to the first index column.
- Returns:
- A fitted object of class "IsolationForest".
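The defaulting rules for key and features in fit can be summarized in a small plain-Python sketch. It mirrors the parameter descriptions rather than the hana_ml implementation: resolve_fit_columns and its arguments are hypothetical names, index stands in for the index column(s) of data, and massive mode with group_key is left out.

```python
def resolve_fit_columns(columns, index=None, key=None, features=None):
    """Apply the documented defaults for `key` and `features`.

    columns  -- all column names of the input data, in order
    index    -- index column name, list of index names, or None (assumed input)
    """
    if key is None:
        if isinstance(index, str):
            key = index  # a single index column becomes the key
        elif isinstance(index, (list, tuple)) and len(index) == 1:
            key = index[0]
        # Otherwise key stays None: data is assumed to contain no ID column.
    if features is None:
        features = [c for c in columns if c != key]  # all non-key columns
    elif isinstance(features, str):
        features = [features]
    return key, list(features)
```

For the example data, resolve_fit_columns(["ID", "V000", "V001"], index="ID") selects ID as the key and the two V columns as features, the same columns passed explicitly in the fit call of the Examples section.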
- predict(data, key=None, features=None, contamination=None, thread_ratio=None, group_key=None, group_params=None)
Obtain the anomaly score of each sample based on the given Isolation Forest model.
- Parameters:
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- contamination : float, optional
The proportion of outliers in the data set. Should be in the range (0, 0.5].
Defaults to 0.1.
- thread_ratio : float, optional
Controls the proportion of available threads to be used for prediction.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to -1.
- group_key : str, optional
Name of the group column. The data type can be INT or NVARCHAR/VARCHAR. If the data type is INT, only parameters set in group_params are valid.
This parameter is only valid when massive is set to True at class instance initialization.
Defaults to the first column of data if data has no index columns. Otherwise, defaults to the first index column.
- group_params : dict, optional
If massive mode is activated (massive is set to True at class instance initialization), the input data is divided into groups, and different parameters can be applied to each group. This parameter specifies the parameter values of the groups as a dict, where each key is a value of group_key and each value is a dict of parameter assignments.
Valid only when massive is set to True at class instance initialization.
Defaults to None.
- Returns:
- DataFrame 1
Prediction results, structured as follows:
ID, type INTEGER, ID column name.
SCORE, type DOUBLE, scoring result.
LABEL, type INTEGER, -1 for outliers and 1 for inliers.
- DataFrame 2
Error message. Only valid if massive is True when initializing an 'IsolationForest' instance.
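The thread_ratio semantics described for predict can be illustrated as follows. This is a sketch under assumptions: the rounding rule and the helper name threads_for_ratio are hypothetical, and None is used here only to stand for "PAL determines the number of threads heuristically".

```python
import math
from typing import Optional


def threads_for_ratio(thread_ratio: float, available: int) -> Optional[int]:
    """Map thread_ratio to a thread count, or None for the heuristic case."""
    if thread_ratio == 0:
        return 1  # 0 indicates a single thread
    if 0 < thread_ratio <= 1:
        # Assumed rounding: at least one thread, at most all available threads.
        return max(1, min(available, math.floor(thread_ratio * available)))
    return None  # outside [0, 1], e.g. the default -1: heuristic choice
```

For instance, on a host with 8 available threads, a ratio of 0.5 maps to 4 threads in this sketch, while the default of -1 leaves the choice to PAL.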
- fit_predict(data, key=None, features=None, contamination=None, thread_ratio=None, group_key=None, group_params=None)
Train the isolation forest model and return labels for the input data.
- Parameters:
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- contamination : float, optional
The proportion of outliers in the data set. Should be in the range (0, 0.5].
Defaults to 0.1.
- thread_ratio : float, optional
Controls the proportion of available threads to be used for prediction.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to -1.
- group_key : str, optional
Name of the group column. The data type can be INT or NVARCHAR/VARCHAR. If the data type is INT, only parameters set in group_params are valid.
This parameter is only valid when massive is set to True at class instance initialization.
Defaults to the first column of data if data has no index columns. Otherwise, defaults to the first index column.
- group_params : dict, optional
If massive mode is activated (massive is set to True at class instance initialization), the input data is divided into groups, and different parameters can be applied to each group. This parameter specifies the parameter values of the groups as a dict, where each key is a value of group_key and each value is a dict of parameter assignments.
Valid only when massive is set to True at class instance initialization.
Defaults to None.
- Returns:
- DataFrame 1
Prediction results, structured as follows:
ID, type INTEGER, ID column name.
SCORE, type DOUBLE, scoring result.
LABEL, type INTEGER, -1 for outliers and 1 for inliers.
- DataFrame 2
Error message. Only valid if massive is True when initializing an 'IsolationForest' instance.
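As a small illustration of consuming a result in the documented ID / SCORE / LABEL layout, the sketch below extracts the outlier IDs from plain tuples. The rows reuse the predict output from the Examples section purely as sample data; representing rows as tuples is an assumption made to keep the example free of any database connection.

```python
# Rows in the documented result layout: (ID, SCORE, LABEL).
rows = [
    (0, 0.446897, 1), (1, 0.411048, 1), (2, 0.498931, 1), (3, 0.407796, 1),
    (4, 0.423264, 1), (5, 0.443270, 1), (6, 0.619513, -1), (7, 0.638874, -1),
]

# Keep the rows labeled -1 (outliers), ordered from most to least anomalous.
outliers = sorted((r for r in rows if r[2] == -1), key=lambda r: r[1], reverse=True)
outlier_ids = [r[0] for r in outliers]
inlier_ids = [r[0] for r in rows if r[2] == 1]
```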
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
Inherited Methods from PALBase
In addition to the methods described above, the IsolationForest class inherits methods from the PALBase class. Refer to PAL Base for more details.