R: Random Forest for Classification

hanaml.RandomForestClassifier {hana.ml.r}

R Documentation

Random Forest for Classification

Description

hanaml.RandomForestClassifier is an R wrapper for the PAL Random Decision Trees algorithm.

Usage

hanaml.RandomForestClassifier(conn.context, data = NULL,
                              formula = NULL,
                              features = NULL,
                              label = NULL, key = NULL,
                              n.estimators = NULL,
                              max.features = NULL,
                              max.depth = NULL,
                              min.samples.leaf = NULL,
                              split.threshold = NULL,
                              calculate.oob = TRUE,
                              random.state = NULL,
                              thread.ratio = NULL,
                              allow.missing.dependent = TRUE,
                              categorical.variable = NULL,
                              sample.fraction = NULL,
                              strata = NULL, priors = NULL)

Arguments

`conn.context`	`ConnectionContext` The connection to the SAP HANA system.
`data`	`DataFrame` DataFrame containing the data.
`key`	`character, optional` Name of the ID column of data. If not provided, then data is assumed to have no ID column.
`features`	`list of character, optional` Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
`label`	`character, optional` Name of the column in data that specifies the dependent variable. Defaults to the last non-ID column if not specified.
`formula`	`formula type, optional` Formula to be used for model generation, in the form label~<feature_list>, e.g. formula=CATEGORY~V1+V2+V3. Provide either a formula or a features and label combination, not both. Defaults to NULL.
`n.estimators`	`integer, optional` Specifies the number of trees in the random forest. Defaults to '100'.
`max.features`	`integer, optional` Specifies the number of randomly selected splitting variables. Should not be larger than the number of input features. Defaults to 'sqrt(p)' (for classification) or 'p/3' (for regression), where p is the number of input features.
`max.depth`	`integer, optional` The maximum depth of a tree. By default it is unlimited.
`min.samples.leaf`	`integer, optional` Specifies the minimum number of records in a leaf. Defaults to 1 for classification.
`split.threshold`	`double, optional` Specifies the stop condition: if the improvement value of the best split is less than this value, the tree stops growing. Defaults to 1e-5.
`calculate.oob`	`logical, optional` If TRUE, calculate the out-of-bag error. Defaults to TRUE.
`random.state`	`integer, optional` Specifies the seed for random number generator. 0: Uses the current time (in seconds) as the seed. Others: Uses the specified value as the seed. Defaults to 0.
`thread.ratio`	`double, optional` Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread and 1 indicates up to all available threads. Values between 0 and 1 use that proportion of the available threads. For values outside this range, the number of threads is determined heuristically. Defaults to -1.
`allow.missing.dependent`	`logical, optional` Specifies if a missing target value is allowed. - FALSE: Not allowed. An error occurs if a missing target is present. - TRUE: Allowed. The datum with the missing target is removed. Defaults to TRUE.
`categorical.variable`	`character or list of characters, optional` Names of the integer features that should be treated as categorical. By default, the type is detected from the input data: 'string' columns are categorical, and 'integer' and 'double' columns are continuous. Valid only for integer variables; ignored otherwise.
`sample.fraction`	`double, optional` The fraction of data used for training. Assume there are n pieces of data, sample fraction is r, then n*r data is selected for training. Defaults to '1.0'.
`strata`	`List of tuples: (class, fraction), optional` Strata proportions for stratified sampling. A (class, fraction) tuple specifies that rows with that class should make up the specified fraction of each sample. If the given fractions do not add up to 1, the remaining portion is divided equally between classes with no entry in strata, or between all classes if all classes have an entry in strata. If strata is not provided, bagging is used instead of stratified sampling.
`priors`	`List of tuples: (class, prior_prob), optional` Prior probabilities for classes. A (class, prior_prob) tuple specifies the prior probability of this class. If the given priors do not add up to 1, the remaining portion is divided equally between classes with no entry in priors, or between all classes if all classes have an entry in 'priors'. If priors is not provided, it is determined by the proportion of every class in the training data.
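
The rule that `strata` and `priors` share for fractions that do not add up to 1 can be sketched in base R as follows. This is only an illustration of the rule as described above, not the PAL implementation; `fill_fractions` is a hypothetical helper, and the named-vector input is an assumption made for this sketch rather than the format the wrapper expects.

```r
# Hypothetical helper: completes a partial set of class fractions so that
# the result sums to 1, following the rule described for strata and priors.
fill_fractions <- function(given, classes) {
  stopifnot(all(names(given) %in% classes), sum(given) <= 1)
  remaining <- 1 - sum(given)
  missing   <- setdiff(classes, names(given))

  out <- setNames(numeric(length(classes)), classes)
  out[names(given)] <- given

  if (length(missing) > 0) {
    # Remainder is split equally between classes with no entry.
    out[missing] <- remaining / length(missing)
  } else {
    # Every class has an entry: remainder is split equally between all classes.
    out <- out + remaining / length(classes)
  }
  out
}

# One class specified, two unspecified: A = 0.4, B = 0.3, C = 0.3
print(fill_fractions(c(A = 0.4), c("A", "B", "C")))

# All classes specified but summing to 0.8: each class receives an extra 0.1
print(fill_fractions(c(A = 0.4, B = 0.4), c("A", "B")))
```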

Format

R6Class object.

Value

Returns a "RandomForestClassifier" object with the following attributes:

model : DataFrame
Trained model content.
feature.importance : DataFrame
The feature importance (the higher, the more important the feature).
oob.error : DataFrame
Out-of-bag error rate or mean squared error for the random forest up to the indexed tree. Set to NULL if calculate.oob is FALSE.
confusion.matrix : DataFrame
Confusion matrix used to evaluate the performance of classification algorithms.
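
As a rough illustration of what a confusion matrix and an error rate summarize, the following base R sketch builds both from a pair of label vectors. The `actual` and `predicted` vectors are made-up values for illustration only, and the column layout of the `confusion.matrix` DataFrame returned by hanaml.RandomForestClassifier may differ from the `table()` output shown here.

```r
# Made-up labels for illustration; not output of hanaml.RandomForestClassifier.
classes   <- c("Play", "Do not Play")
actual    <- factor(c("Play", "Do not Play", "Do not Play",
                      "Play", "Play", "Do not Play"), levels = classes)
predicted <- factor(c("Play", "Do not Play", "Play",
                      "Play", "Play", "Do not Play"), levels = classes)

# Rows are the true classes, columns are the predicted classes.
cm <- table(actual = actual, predicted = predicted)

# Error rate: share of records that fall off the diagonal.
error_rate <- 1 - sum(diag(cm)) / sum(cm)

print(cm)
print(error_rate)
```

Here one "Do not Play" record is predicted as "Play", so one of the six records is misclassified.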

Note

Using Summary and Print

Summary provides a general summary of the output of the model. Usage: summary(rfc) where rfc is the model generated.

Print provides information on the coefficients and the optional parameter values given by the user. Usage: print(rfc) where rfc is the model generated.

Examples

## Not run: 
Input DataFrame df for training:
> df$Collect()
    OUTLOOK TEMP HUMIDITY WINDY       CLASS
1     Sunny   75       70   Yes        Play
2     Sunny   80       90   Yes Do not Play
3     Sunny   85       85    No Do not Play
4     Sunny   72       95    No Do not Play
5     Sunny   69       70    No        Play
6  Overcast   72       90   Yes        Play
7  Overcast   83       78    No        Play
8  Overcast   64       65   Yes        Play
9  Overcast   81       75    No        Play
10     Rain   71       80   Yes Do not Play
11     Rain   65       70   Yes Do not Play
12     Rain   75       80    No        Play
13     Rain   68       80    No        Play
14     Rain   70       96    No        Play
Creating RandomForestClassifier instance:
rfc <- hanaml.RandomForestClassifier(conn.context = conn, data = df,
                                     n.estimators=300, max.features=3,
                                     random.state=2, split.threshold=0.00001,
                                     calculate.oob=TRUE,
                                     min.samples.leaf=1, thread.ratio=1.0)

Giving features and label as input to generate a model:
rfc <- hanaml.RandomForestClassifier(conn.context = conn, data = df,
                            key = NULL, n.estimators=300, max.features=3,
                            features = list('TEMP', 'HUMIDITY', 'WINDY'),
                            label = "CLASS",
                            random.state=2, split.threshold=0.00001,
                            calculate.oob=TRUE,
                            min.samples.leaf=1, thread.ratio=1.0)

Giving input to model generation as a formula:
rfc <- hanaml.RandomForestClassifier(conn.context = conn, data = df,
                                     n.estimators=300, max.features=3,
                                     formula=CLASS~TEMP+HUMIDITY+WINDY,
                                     random.state=2, split.threshold=0.00001,
                                     calculate.oob=TRUE,
                                     min.samples.leaf=1, thread.ratio=1.0)

> rfc$feature.importances$Collect()
 VARIABLE_NAME IMPORTANCE
1       OUTLOOK  0.3475185
2          TEMP  0.2770724
3      HUMIDITY  0.2476346
4         WINDY  0.1277744

Input DataFrame for scoring:

> df3$Collect()
  ID  OUTLOOK TEMP HUMIDITY WINDY       CLASS
1  0    Sunny   75       70   Yes        Play
2  1    Sunny   NA       90   Yes Do not Play
3  2    Sunny   85       NA    No Do not Play
4  3    Sunny   72       95    No Do not Play
5  4     <NA>   NA       70  <NA>        Play
6  5 Overcast   72       90   Yes        Play
7  6 Overcast   83       78    No        Play
8  7 Overcast   64       65   Yes Do not Play
9  8 Overcast   81       75    No        Play

Performing score() on given DataFrame:

> rfc$score(df3)
0.8932412


## End(Not run)
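
The formula form and the features/label form describe the same inputs in two ways. The base R sketch below shows how a formula for the training data could be split into a label and a feature list, and how the "formula or features/label, not both" rule could be checked. `split_formula` and `check_inputs` are hypothetical helpers written for illustration; they are not part of hana.ml.r, and the wrapper's own validation may behave differently.

```r
# Hypothetical helper: extract label and feature names from a two-sided formula.
split_formula <- function(formula) {
  stopifnot(inherits(formula, "formula"), length(formula) == 3)
  list(label    = all.vars(formula[[2]]),   # left-hand side
       features = all.vars(formula[[3]]))   # right-hand side
}

# Hypothetical helper: accept either a formula or features/label, never both.
check_inputs <- function(formula = NULL, features = NULL, label = NULL) {
  if (!is.null(formula) && (!is.null(features) || !is.null(label))) {
    stop("Provide either formula or features/label, not both.")
  }
  if (!is.null(formula)) {
    split_formula(formula)
  } else {
    list(label = label, features = unlist(features))
  }
}

from_formula <- check_inputs(formula = CLASS ~ TEMP + HUMIDITY + WINDY)
from_names   <- check_inputs(features = list("TEMP", "HUMIDITY", "WINDY"),
                             label = "CLASS")

# Both calls resolve to the same label and feature names.
print(identical(from_formula, from_names))
```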

[Package hana.ml.r version 1.0.8 Index]