hanaml.RandomForestRegressor {hana.ml.r}    R Documentation

Random Forest for Regression

Description

hanaml.RandomForestRegressor is an R wrapper for PAL Random Decision Trees.

Usage

hanaml.RandomForestRegressor(conn.context, data = NULL,
                              formula = NULL,
                              features = NULL,
                              label = NULL, key = NULL,
                              n.estimators = NULL,
                              max.features = NULL,
                              max.depth = NULL,
                              min.samples.leaf = NULL,
                              split.threshold = NULL,
                              calculate.oob = TRUE,
                              random.state = NULL,
                              thread.ratio = NULL,
                              allow.missing.dependent = TRUE,
                              categorical.variable = NULL,
                              sample.fraction = NULL)

Arguments

conn.context

ConnectionContext
The connection to the SAP HANA system.

data

DataFrame
DataFrame containing the data.

formula

formula type, optional
Formula to be used for model generation, in the form label ~ <feature_list>, e.g. formula = CATEGORY ~ V1 + V2 + V3. Provide either formula or a combination of features and label, but not both.
Defaults to NULL.
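
As an illustration of how a formula corresponds to a label and a feature list, the following base R sketch splits a formula into its two sides. It uses only base R functions (all.vars, terms); how hana.ml.r parses the formula internally is not documented here, so this is an assumption about the mapping rather than a description of the package's implementation. The column names are taken from the Examples section.

```r
# Base R only: split a model formula into its response (label)
# and its predictor terms (features).
f <- CLASS ~ A + B + C + D

# The first variable in a two-sided formula is the response.
label <- all.vars(f)[1]

# The term labels are the predictors on the right-hand side.
features <- attr(terms(f), "term.labels")

print(label)
print(features)
```

Under this reading, formula = CLASS ~ A + B + C + D would correspond to label = "CLASS" together with features = list("A", "B", "C", "D").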

key

character, optional
Name of the ID column of data. If not provided, then data is assumed to have no ID column.

features

list of character, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label

character, optional
Name of the column in data that specifies the dependent variable. Defaults to the last non-ID column if not provided.

n.estimators

integer, optional
Specifies the number of trees in the random forest.
Defaults to 100.

max.features

integer, optional
Specifies the number of randomly selected splitting variables. Should not be larger than the number of input features. Defaults to 'sqrt(p)' (for classification) or 'p/3' (for regression), where p is the number of input features.

max.depth

integer, optional
The maximum depth of a tree.
By default it is unlimited.

min.samples.leaf

integer, optional
Specifies the minimum number of records in a leaf.
Defaults to 5 for regression.

split.threshold

double, optional
Specifies the stop condition: if the improvement value of the best split is less than this value, the tree stops growing.
Defaults to 1e-5.

calculate.oob

logical, optional
If TRUE, calculate the out-of-bag error. Defaults to TRUE.

random.state

integer, optional
Specifies the seed for the random number generator. 0: Uses the current time (in seconds) as the seed. Other values: Uses the specified value as the seed.
Defaults to 0.

thread.ratio

double, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread and 1 indicates up to all available threads. Values between 0 and 1 use up to that percentage of available threads. For values outside this range, the number of threads is heuristically determined.
Defaults to -1 (heuristically determined).
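
The rule above can be restated as a small base R helper. The function name describe_thread_ratio is hypothetical and is not part of hana.ml.r; it only encodes the ranges described in this section and says nothing about how PAL actually picks the thread count in the heuristic case.

```r
# Hypothetical helper (not part of hana.ml.r): describe how a
# thread.ratio value is interpreted according to the text above.
describe_thread_ratio <- function(thread.ratio) {
  if (thread.ratio == 0) {
    "single thread"
  } else if (thread.ratio > 0 && thread.ratio <= 1) {
    sprintf("up to %d%% of available threads",
            as.integer(round(100 * thread.ratio)))
  } else {
    # Any value outside [0, 1], including the default -1.
    "heuristically determined"
  }
}

print(describe_thread_ratio(0))
print(describe_thread_ratio(0.5))
print(describe_thread_ratio(-1))
```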

allow.missing.dependent

logical, optional
Specifies whether a missing target value is allowed. FALSE: Not allowed. An error occurs if a missing target is present. TRUE: Allowed. Records with a missing target are removed.
Defaults to TRUE.

categorical.variable

character or list of characters, optional
Indicates which features should be treated as categorical. By default, the type is detected from the input data: 'string' columns are categorical, while 'integer' and 'double' columns are continuous. This parameter is valid only for integer columns; it is ignored otherwise.

sample.fraction

double, optional
The fraction of data used for training. If there are n records and the sample fraction is r, then n*r records are selected for training.
Defaults to 1.0.

Format

R6Class object.

Value

Returns a "RandomForestRegressor" object.

Note

Using Summary and Print

Summary provides a general summary of the output of the model. Usage: summary(rfr) where rfr is the model generated.

Print provides information on the coefficients and the optional parameter values given by the user. Usage: print(rfr) where rfr is the model generated.

See Also

predict.RandomForestRegressor

Examples

    ## Not run: 
    Input DataFrame df for training:

    > df$Collect()
       ID         A         B         C         D       CLASS
    0   0 -0.965679  1.142985 -0.019274 -1.598807  -23.633813
    1   1  2.249528  1.459918  0.153440 -0.526423  212.532559
    2   2 -0.631494  1.484386 -0.335236  0.354313   26.342585
    3   3 -0.967266  1.131867 -0.684957 -1.397419  -62.563666
    4   4 -1.175179 -0.253179 -0.775074  0.996815 -115.534935
    ......

    Creating RandomForestRegressor instance and generating model:

    > rfr <- hanaml.RandomForestRegressor(conn.context=cc, data = df, random.state=3)


    > rfr$feature.importances$Collect()
       VARIABLE_NAME  IMPORTANCE
    0             A    0.249593
    1             B    0.381879
    2             C    0.291403
    3             D    0.077125

    Input DataFrame for scoring:

    > head(df3$Collect(),5)
        ID         A         B         C         D       CLASS
    0    0  1.081277  0.204114  1.220580 -0.750665   139.10170
    1    1  0.524813 -0.012192 -0.418597  2.946886    52.17203
    2    2 -0.280871  0.100554 -0.343715 -0.118843   -34.69829
    3    3 -0.113992 -0.045573  0.957154  0.090350    51.93602
    4    4  0.287476  1.266895  0.466325 -0.432323   106.63425
    ..

    Performing score() on given DataFrame:

    > rfr$score(data = df3, features = list("A", "B", "C", "D"))
    0.8490768


## End(Not run)

[Package hana.ml.r version 1.0.8 Index]