R: Random Forest for Regression

hanaml.RandomForestRegressor {hana.ml.r}

R Documentation

Random Forest for Regression

Description

hanaml.RandomForestRegressor is an R wrapper for PAL Random Decision Trees.

Usage

hanaml.RandomForestRegressor(conn.context, data = NULL,
                              formula = NULL,
                              features = NULL,
                              label = NULL, key = NULL,
                              n.estimators = NULL,
                              max.features = NULL,
                              max.depth = NULL,
                              min.samples.leaf = NULL,
                              split.threshold = NULL,
                              calculate.oob = TRUE,
                              random.state = NULL,
                              thread.ratio = NULL,
                              allow.missing.dependent = TRUE,
                              categorical.variable = NULL,
                              sample.fraction = NULL)

Arguments

`conn.context`	`ConnectionContext` The connection to the SAP HANA system.
`data`	`DataFrame` DataFrame containing the data.
`formula`	`formula type, optional` Formula to be used for model generation, in the form label~<feature_list>, e.g. formula=CATEGORY~V1+V2+V3. Provide either a formula or a features/label combination, not both. Defaults to NULL.
`key`	`character, optional` Name of the ID column of data. If not provided, then data is assumed to have no ID column.
`features`	`list of character, optional` Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
`label`	`character, optional` Name of the column in data that specifies the dependent variable. Defaults to the last non-ID column if not provided.
`n.estimators`	`integer, optional` Specifies the number of trees in the random forest. Defaults to 100.
`max.features`	`integer, optional` Specifies the number of randomly selected splitting variables. Should not be larger than the number of input features. Defaults to 'sqrt(p)' (for classification) or 'p/3' (for regression), where p is the number of input features.
`max.depth`	`integer, optional` The maximum depth of a tree. By default it is unlimited.
`min.samples.leaf`	`integer, optional` Specifies the minimum number of records in a leaf. Defaults to 5 for regression.
`split.threshold`	`double, optional` Specifies the stop condition: if the improvement value of the best split is less than this value, the tree stops growing. Defaults to 1e-5.
`calculate.oob`	`logical, optional` If TRUE, calculate the out-of-bag error. Defaults to TRUE.
`random.state`	`integer, optional` Specifies the seed for random number generator. 0: Uses the current time (in seconds) as the seed. Others: Uses the specified value as the seed. Defaults to 0.
`thread.ratio`	`double, optional` Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. For values outside this range, the number of threads is heuristically determined. Defaults to -1 (heuristically determined).
`allow.missing.dependent`	`logical, optional` Specifies if a missing target value is allowed. FALSE: Not allowed. An error occurs if a missing target is present. TRUE: Allowed. The datum with a missing target is removed. Defaults to TRUE.
`categorical.variable`	`character or list of characters, optional` Indicates which features should be treated as categorical. By default, 'string' columns are categorical, and 'integer' and 'double' columns are continuous. Valid only for integer columns; ignored otherwise. By default, the type is detected from the input data.
`sample.fraction`	`double, optional` The fraction of data used for training. If there are n records and the sample fraction is r, then n*r records are selected for training. Defaults to 1.0.
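
The two ways of specifying the model described above (a formula, or a features/label combination) might look as follows. This is a minimal sketch, not a definitive usage: the helper names `fit_with_formula` and `fit_with_columns` are hypothetical, `cc` is assumed to be an existing ConnectionContext, and `df` is assumed to be a DataFrame with the columns ID, A, B, C, D and CLASS shown in the Examples section.

```r
# Hypothetical helpers; both assume `cc` (ConnectionContext) and `df`
# (DataFrame with columns ID, A, B, C, D, CLASS) already exist.

# Variant 1: specify the model with a formula (label ~ feature list).
fit_with_formula <- function(cc, df) {
  hana.ml.r::hanaml.RandomForestRegressor(
    conn.context = cc,
    data         = df,
    formula      = CLASS ~ A + B + C + D,
    key          = "ID",
    random.state = 3
  )
}

# Variant 2: specify the same model with explicit features and label.
# Do not combine this with `formula`.
fit_with_columns <- function(cc, df) {
  hana.ml.r::hanaml.RandomForestRegressor(
    conn.context = cc,
    data         = df,
    features     = list("A", "B", "C", "D"),
    label        = "CLASS",
    key          = "ID",
    random.state = 3
  )
}
```

Either helper would be called as `rfr <- fit_with_formula(cc, df)` once a connection and a training DataFrame are available.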

Format

R6Class object.

Value

Returns a "RandomForestRegressor" object with the following values:

model : DataFrame
Trained model content.
feature.importance : DataFrame
The feature importance (the higher, the more important the feature).
oob.error : DataFrame
Out-of-bag error rate or mean squared error for random forest up to indexed tree. Set to NULL if calculate.oob is FALSE.
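
As an illustration of reading these values, the sketch below pulls the model content and the out-of-bag error into local data frames. It assumes that each value is a DataFrame exposing `Collect()`, as used in the Examples section, and that `oob.error` is NULL when the out-of-bag error was not calculated; the helper name `collect_rfr_outputs` is hypothetical.

```r
# Hypothetical helper: fetch the fitted model's outputs for local inspection.
# Assumes `rfr` was returned by hanaml.RandomForestRegressor.
collect_rfr_outputs <- function(rfr) {
  # oob.error is only available when calculate.oob was TRUE.
  oob <- if (is.null(rfr$oob.error)) NULL else rfr$oob.error$Collect()
  list(
    model     = rfr$model$Collect(),
    oob.error = oob
  )
}
```

Checking for NULL first avoids calling `Collect()` on a value that was never produced.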

Note

Using Summary and Print

Summary provides a general summary of the output of the model. Usage: summary(rfr) where rfr is the model generated.

Print provides information on the coefficients and the optional parameter values given by the user. Usage: print(rfr) where rfr is the model generated.

Examples

    ## Not run: 
    Input DataFrame df for training:

    >df$Collect()
       ID         A         B         C         D       CLASS
    0   0 -0.965679  1.142985 -0.019274 -1.598807  -23.633813
    1   1  2.249528  1.459918  0.153440 -0.526423  212.532559
    2   2 -0.631494  1.484386 -0.335236  0.354313   26.342585
    3   3 -0.967266  1.131867 -0.684957 -1.397419  -62.563666
    4   4 -1.175179 -0.253179 -0.775074  0.996815 -115.534935
    ......

    Creating RandomForestRegressor instance and generating model:

    > rfr <- hanaml.RandomForestRegressor(conn.context=cc, data = df, random.state=3)


    > rfr$feature.importances$Collect()
       VARIABLE_NAME  IMPORTANCE
    0             A    0.249593
    1             B    0.381879
    2             C    0.291403
    3             D    0.077125

    Input DataFrame for scoring:

    > head(df3$Collect(),5)
        ID         A         B         C         D       CLASS
    0    0  1.081277  0.204114  1.220580 -0.750665   139.10170
    1    1  0.524813 -0.012192 -0.418597  2.946886    52.17203
    2    2 -0.280871  0.100554 -0.343715 -0.118843   -34.69829
    3    3 -0.113992 -0.045573  0.957154  0.090350    51.93602
    4    4  0.287476  1.266895  0.466325 -0.432323   106.63425
    ..

    Performing score() on given DataFrame:

    > rfr$score(data = df3, features = list("A", "B", "C", "D"))
    0.8490768


## End(Not run)
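
The example above relies on the defaults for most arguments. A call that sets the documented tuning arguments explicitly could be sketched as below. The values are illustrative only, not recommendations; the helper name `fit_tuned_rfr` is hypothetical, and `cc` and `df` are assumed to be the connection and training DataFrame from the example.

```r
# Hypothetical helper; assumes `cc` (ConnectionContext) and `df`
# (DataFrame with columns ID, A, B, C, D, CLASS) already exist.
fit_tuned_rfr <- function(cc, df) {
  hana.ml.r::hanaml.RandomForestRegressor(
    conn.context     = cc,
    data             = df,
    features         = list("A", "B", "C", "D"),
    label            = "CLASS",
    key              = "ID",
    n.estimators     = 200,   # number of trees (default 100)
    max.features     = 2,     # must not exceed the 4 input features
    max.depth        = 10,    # unlimited by default
    min.samples.leaf = 5,     # documented default for regression
    split.threshold  = 1e-5,  # documented default
    calculate.oob    = TRUE,  # keep the out-of-bag error
    random.state     = 3,     # fixed seed, as in the example above
    thread.ratio     = 0.5,   # use up to half of the available threads
    sample.fraction  = 0.8    # train each tree on 80% of the records
  )
}
```

The resulting model can then be scored in the same way as in the example, for instance with `rfr$score(data = df3, features = list("A", "B", "C", "D"))`.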

[Package hana.ml.r version 1.0.8 Index]