| hanaml.RandomForestClassifier {hana.ml.r} | R Documentation |
hanaml.RandomForestClassifier is an R wrapper for PAL Random Decision Trees.
hanaml.RandomForestClassifier(conn.context, data = NULL,
formula = NULL,
features = NULL,
label = NULL, key = NULL,
n.estimators = NULL,
max.features = NULL,
max.depth = NULL,
min.samples.leaf = NULL,
split.threshold = NULL,
calculate.oob = TRUE,
random.state = NULL,
thread.ratio = NULL,
allow.missing.dependent = TRUE,
categorical.variable = NULL,
sample.fraction = NULL,
strata = NULL, priors = NULL)
conn.context |
|
data |
|
key |
|
features |
|
label |
|
formula |
|
n.estimators |
Defaults to '100'. |
max.features |
Defaults to 'sqrt(p)' (for classification) or 'p/3' (for regression), where p is the number of input features. |
max.depth |
By default it is unlimited. |
min.samples.leaf |
|
split.threshold |
Defaults to 1e-5. |
calculate.oob |
|
random.state |
|
thread.ratio |
Defaults to -1. |
allow.missing.dependent |
|
categorical.variable |
|
sample.fraction |
|
strata |
|
priors |
|
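As a worked illustration of the max.features defaults listed above, the following base R sketch evaluates both expressions for the four input features used in the example data on this page. How PAL rounds a non-integer result is not stated here, so the values are left unrounded.

```r
# Number of input features in the example data:
# OUTLOOK, TEMP, HUMIDITY, WINDY
p <- 4

# Default value of max.features as stated in the argument table
default_classification <- sqrt(p)  # sqrt(p) for classification
default_regression <- p / 3        # p/3 for regression

print(default_classification)
print(default_regression)
```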
R6Class object.
Returns a "RandomForestClassifier" object with the following attributes:
model : DataFrame
Trained model content.
feature.importances : DataFrame
The feature importance (the higher, the more important the feature).
oob.error : DataFrame
Out-of-bag error rate (classification) or mean squared error
(regression) of the random forest up to each indexed tree.
Set to NULL if calculate.oob is FALSE.
confusion.matrix : DataFrame
Confusion matrix used to evaluate the performance of
classification algorithms.
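To illustrate what a confusion matrix contains, here is a minimal base R sketch that tabulates actual against predicted labels. The vectors `actual` and `predicted` are made up for illustration and are not output of the model; the layout of the confusion.matrix DataFrame returned by PAL may differ.

```r
# Hypothetical labels, invented for illustration only
lv <- c("Do not Play", "Play")
actual    <- factor(c("Play", "Play", "Do not Play",
                      "Play", "Do not Play", "Play"), levels = lv)
predicted <- factor(c("Play", "Do not Play", "Do not Play",
                      "Play", "Play", "Play"), levels = lv)

# Rows are actual classes, columns are predicted classes
cm <- table(Actual = actual, Predicted = predicted)
print(cm)

# Share of correct predictions: diagonal counts over all counts
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```

The diagonal cells count correct predictions, and the off-diagonal cells show which classes are confused with each other.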
Using Summary and Print
Summary provides a general summary of the model output. Usage: summary(rfc), where rfc is the generated model.
Print provides information on the coefficients and the optional parameter values given by the user. Usage: print(rfc), where rfc is the generated model.
predict.RandomForestClassifier
## Not run:
Input DataFrame df for training:
> df$Collect()
OUTLOOK TEMP HUMIDITY WINDY CLASS
1 Sunny 75 70 Yes Play
2 Sunny 80 90 Yes Do not Play
3 Sunny 85 85 No Do not Play
4 Sunny 72 95 No Do not Play
5 Sunny 69 70 No Play
6 Overcast 72 90 Yes Play
7 Overcast 83 78 No Play
8 Overcast 64 65 Yes Play
9 Overcast 81 75 No Play
10 Rain 71 80 Yes Do not Play
11 Rain 65 70 Yes Do not Play
12 Rain 75 80 No Play
13 Rain 68 80 No Play
14 Rain 70 96 No Play
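For readers who want to reproduce the training table locally, it can be built as a plain R data.frame, as sketched below. The name `train_local` is hypothetical, and uploading it to SAP HANA to obtain the `df` DataFrame is not shown because that step depends on the connection setup.

```r
# Local copy of the 14-row training table shown above
# (train_local is a hypothetical name, not part of hana.ml.r)
train_local <- data.frame(
  OUTLOOK  = c(rep("Sunny", 5), rep("Overcast", 4), rep("Rain", 5)),
  TEMP     = c(75, 80, 85, 72, 69, 72, 83, 64, 81, 71, 65, 75, 68, 70),
  HUMIDITY = c(70, 90, 85, 95, 70, 90, 78, 65, 75, 80, 70, 80, 80, 96),
  WINDY    = c("Yes", "Yes", "No", "No", "No", "Yes", "No",
               "Yes", "No", "Yes", "Yes", "No", "No", "No"),
  CLASS    = c("Play", "Do not Play", "Do not Play", "Do not Play", "Play",
               "Play", "Play", "Play", "Play",
               "Do not Play", "Do not Play", "Play", "Play", "Play"),
  stringsAsFactors = FALSE
)

# Inspect the column types and the class balance
str(train_local)
print(table(train_local$CLASS))
```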
Creating RandomForestClassifier instance:
rfc <- hanaml.RandomForestClassifier(conn.context = conn, data = df,
n.estimators=300, max.features=3,
random.state=2, split.threshold=0.00001,
calculate.oob=TRUE,
min.samples.leaf=1, thread.ratio=1.0)
Providing features and a label as input to generate a model:
rfc <- hanaml.RandomForestClassifier(conn.context = conn, data = df,
key = NULL, n.estimators=300, max.features=3,
features = list('TEMP', 'HUMIDITY', 'WINDY'),
label = "CLASS",
random.state=2, split.threshold=0.00001,
calculate.oob=TRUE,
min.samples.leaf=1, thread.ratio=1.0)
Providing the model specification as a formula:
rfc <- hanaml.RandomForestClassifier(conn.context = conn, data = df,
n.estimators=300, max.features=3,
formula=CLASS~OUTLOOK+TEMP+HUMIDITY+WINDY,
random.state=2, split.threshold=0.00001,
calculate.oob=TRUE,
min.samples.leaf=1, thread.ratio=1.0)
> rfc$feature.importances$Collect()
VARIABLE_NAME IMPORTANCE
1 OUTLOOK 0.3475185
2 TEMP 0.2770724
3 HUMIDITY 0.2476346
4 WINDY 0.1277744
Input DataFrame for scoring:
> df3$Collect()
ID OUTLOOK TEMP HUMIDITY WINDY CLASS
1 0 Sunny 75 70 Yes Play
2 1 Sunny NA 90 Yes Do not Play
3 2 Sunny 85 NA No Do not Play
4 3 Sunny 72 95 No Do not Play
5 4 <NA> NA 70 <NA> Play
6 5 Overcast 72 90 Yes Play
7 6 Overcast 83 78 No Play
8 7 Overcast 64 65 Yes Do not Play
9 8 Overcast 81 75 No Play
Performing score() on given DataFrame:
> rfc$score(df3)
0.8932412
## End(Not run)