IsolationForest
- class hana_ml.algorithms.pal.preprocessing.IsolationForest(n_estimators=None, max_samples=None, max_features=None, bootstrap=None, random_state=None, thread_ratio=None)
Isolation Forest generates the anomaly score of each sample.
- Parameters
- n_estimators : int, optional
Specifies the number of trees to grow.
Defaults to 100.
- max_samples : int, optional
Specifies the number of samples to draw from the input to train each tree. If max_samples is larger than the number of samples provided, all samples are used for all trees.
Defaults to 256.
- max_features : int, optional
Specifies the number of features to draw from the input to train each tree. 0 means no sampling.
Defaults to 0.
- bootstrap : bool, optional
Specifies the sampling method.
False: Sampling without replacement.
True: Sampling with replacement.
Defaults to False.
- random_state : int, optional
Specifies the seed for the random number generator.
0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.
Defaults to 0.
- thread_ratio : float, optional
Specifies the ratio of available threads to use.
0: single thread.
0~1: percentage of available threads.
Others: heuristically determined.
Defaults to -1.
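The interaction of max_samples, bootstrap, and random_state can be sketched in plain Python. This illustrates the documented behaviour only and is not PAL's implementation; the helper name draw_tree_sample and the use of Python's random module are assumptions made for the example. The special case random_state=0, which seeds from the current time, is not modelled.

```python
import random


def draw_tree_sample(n_rows, max_samples=256, bootstrap=False, seed=1):
    """Return the row indices used to train one tree (hypothetical helper)."""
    rng = random.Random(seed)
    # If max_samples is at least the number of rows, all rows are used.
    if max_samples >= n_rows:
        return list(range(n_rows))
    if bootstrap:
        # Sampling with replacement: an index may appear more than once.
        return [rng.randrange(n_rows) for _ in range(max_samples)]
    # Sampling without replacement: every index appears at most once.
    return rng.sample(range(n_rows), max_samples)
```

With the 8-row example data and the default max_samples of 256, every tree would see all 8 rows under this reading.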
Examples
Input dataframe df:
>>> df.collect()
   ID  V000  V001
0   0  -2.0  -1.0
1   1  -1.0  -1.0
2   2  -1.0  -2.0
3   3   1.0   1.0
4   4   1.0   2.0
5   5   2.0   1.0
6   6   6.0   3.0
7   7  -4.0   7.0
Create an IsolationForest object:
>>> isof = IsolationForest(random_state=2, thread_ratio=0)
Perform fit on the given data:
>>> isof.fit(df, key='ID', features=['V000', 'V001'])
Output:
>>> isof.model_.collect()
   TREE_INDEX                                      MODEL_CONTENT
0           0  {"NS":8,"NF":2,"FX":[0,1],"1":{"SF":0,"SV":5.5...
1           1  {"NS":8,"NF":2,"FX":[0,1],"1":{"SF":0,"SV":5.5...
2           2  {"NS":8,"NF":2,"FX":[1,0],"1":{"SF":0,"SV":4.6...
3           3  {"NS":8,"NF":2,"FX":[1,0],"1":{"SF":0,"SV":4.6...
4           4  {"NS":8,"NF":2,"FX":[1,0],"1":{"SF":0,"SV":5.3...
......
Perform predict on the fitted model:
>>> res = isof.predict(df, key='ID', features=['V000', 'V001'], contamination=0.25)
Output:
>>> res.collect()
   ID     SCORE  LABEL
0   0  0.446897      1
1   1  0.411048      1
2   2  0.498931      1
3   3  0.407796      1
4   4  0.423264      1
5   5  0.443270      1
6   6  0.619513     -1
7   7  0.638874     -1
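The labels in this output are consistent with flagging the contamination fraction of samples with the highest scores as outliers: 0.25 x 8 = 2 samples. The sketch below reproduces the LABEL column from the SCORE column under that assumption. The helper label_by_contamination and its rounding rule are hypothetical and may differ from what PAL does internally.

```python
def label_by_contamination(scores, contamination=0.1):
    """Label the highest-scoring fraction as outliers (-1), the rest as inliers (1)."""
    # Assumed rounding rule: nearest integer number of outliers.
    n_outliers = int(round(contamination * len(scores)))
    # Indices ordered from the highest score to the lowest.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    outliers = set(ranked[:n_outliers])
    return [-1 if i in outliers else 1 for i in range(len(scores))]


# SCORE column copied from the example output.
scores = [0.446897, 0.411048, 0.498931, 0.407796,
          0.423264, 0.443270, 0.619513, 0.638874]
labels = label_by_contamination(scores, contamination=0.25)
print(labels)  # [1, 1, 1, 1, 1, 1, -1, -1]
```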
- Attributes
- model_ : DataFrame
TREE_INDEX: type INTEGER, the tree number.
MODEL_CONTENT: type NCLOB, the model content.
Methods
fit(data[, key, features])
Trains the forest with the input data.
fit_predict(data[, key, features, ...])
Fits the model on data and returns the score and label of each sample (1 for inliers, -1 for outliers).
predict(data[, key, features, ...])
Scores each sample in data with the fitted model and labels it as an inlier (1) or an outlier (-1).
- fit(data, key=None, features=None)
Trains the forest with the input data.
- Parameters
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features : str or a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- Returns
- A fitted object of class "IsolationForest".
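The default resolution of key described for fit can be sketched as a small function. This is an illustration of the documented rule only; resolve_key and its index argument (None, a column name, or a list of column names) are hypothetical and do not correspond to a hana_ml function.

```python
def resolve_key(key, index):
    """Return the ID column name, or None if data is treated as having no ID column."""
    if key is not None:
        # An explicitly provided key always wins.
        return key
    if isinstance(index, str):
        # data is indexed by a single column: use it as the key.
        return index
    if isinstance(index, (list, tuple)) and len(index) == 1:
        return index[0]
    # No index, or a multi-column index: assume there is no ID column.
    return None
```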
- predict(data, key=None, features=None, contamination=None, thread_ratio=None)
Scores each sample in data with the fitted model and labels it as an inlier (1) or an outlier (-1).
- Parameters
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- contamination : float, optional
The proportion of outliers in the data set. Should be in the range (0, 0.5].
Defaults to 0.1.
- thread_ratio : float, optional
Controls the proportion of available threads to be used for prediction.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to -1.
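The thread_ratio rules can be read as a three-way case split, sketched below. The function threads_to_use, the explicit available argument, and the rounding down to a whole thread are assumptions for illustration; the heuristic PAL applies to out-of-range values is not documented here, so the sketch returns None for that case.

```python
import math


def threads_to_use(thread_ratio, available):
    """Map thread_ratio to a thread count; None means PAL decides heuristically."""
    if thread_ratio == 0:
        # 0 indicates a single thread.
        return 1
    if 0 < thread_ratio <= 1:
        # A fraction of the available threads (rounding is an assumption).
        return max(1, math.floor(thread_ratio * available))
    # Values outside [0, 1], such as the default -1.
    return None
```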
- Returns
- DataFrame
The scoring result, structured as follows:
ID, type INTEGER, ID column name.
SCORE, type DOUBLE, scoring result.
LABEL, type INTEGER, -1 for outliers and 1 for inliers.
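SCORE is not defined further here. In the standard Isolation Forest formulation, the anomaly score of a sample is 2 ** (-E[h] / c(n)), where E[h] is the sample's average path length over all trees and c(n) is the average path length of an unsuccessful search in a binary search tree built from n points. The sketch below implements that textbook formula; whether PAL computes SCORE in exactly this way is an assumption, and the function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(n):
    """c(n): expected path length of an unsuccessful BST search over n points."""
    if n > 2:
        # Harmonic number H(n - 1) approximated by ln(n - 1) + gamma.
        return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n
    return 1.0 if n == 2 else 0.0


def anomaly_score(mean_path_length, n):
    """Textbook score in (0, 1]; n is the per-tree sample size and must be >= 2."""
    return 2.0 ** (-mean_path_length / average_path_length(n))
```

Under this formula, samples that are isolated after fewer splits get scores closer to 1, which matches the example output, where the two samples labelled -1 have the highest scores.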
- fit_predict(data, key=None, features=None, contamination=None, thread_ratio=None)
Fits the model on data and returns the score and label of each sample (1 for inliers, -1 for outliers).
- Parameters
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- contamination : float, optional
The proportion of outliers in the data set. Should be in the range (0, 0.5].
Defaults to 0.1.
- thread_ratio : float, optional
Controls the proportion of available threads to be used for prediction.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to -1.
- Returns
- DataFrame
The scoring result, structured as follows:
ID, type INTEGER, ID column name.
SCORE, type DOUBLE, scoring result.
LABEL, type INTEGER, -1 for outliers and 1 for inliers.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.