hanaml.DecisionTreeClassifier {hana.ml.r}    R Documentation

Decision Tree Model for Classification

Description

hanaml.DecisionTreeClassifier is an R wrapper for the PAL decision tree algorithm.

Usage

hanaml.DecisionTreeClassifier (conn.context, algorithm,
                              data = NULL,
                              key = NULL,
                              features = NULL,
                              label = NULL,
                              formula = NULL,
                              thread.ratio = NULL,
                              allow.missing.dependent = NULL, percentage = NULL,
                              min.records.of.parent = NULL,
                              min.records.of.leaf = NULL, max.depth = NULL,
                              categorical.variable = NULL,
                              split.threshold = NULL, use.surrogate = NULL,
                              model.format = NULL,
                              discretization.type = NULL,
                              bins = NULL, max.branch = NULL,
                              merge.threshold = NULL,
                              priors = NULL, output.rules = NULL,
                              output.confusion.matrix = NULL)

Arguments

conn.context

ConnectionContext
The connection to the SAP HANA system.

algorithm

character
Algorithm used to grow a decision tree. Valid values are c45, chaid, and cart.

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column of data. If not provided, then data is assumed to have no ID column.

features

list of character, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label

character, optional
Name of the column in data that specifies the dependent variable. Defaults to the last non-ID column if not specified.

formula

formula type, optional
Formula to be used for model generation, in the format label~<feature_list>, e.g. formula = CATEGORY~V1+V2+V3. Provide either a formula or a features/label combination, but not both.
Defaults to NULL.
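As a sketch (assuming an existing ConnectionContext conn and a DataFrame data; the column names CATEGORY, V1, V2, and V3 are hypothetical), the two mutually exclusive call styles look like:

```r
# Style 1: explicit feature and label columns.
dtc <- hanaml.DecisionTreeClassifier(conn, algorithm = 'c45', data = data,
                                     features = list('V1', 'V2', 'V3'),
                                     label = 'CATEGORY')

# Style 2: equivalent R formula; do not combine with features/label.
dtc <- hanaml.DecisionTreeClassifier(conn, algorithm = 'c45', data = data,
                                     formula = CATEGORY~V1+V2+V3)
```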

thread.ratio

double, optional
Controls the proportion of available threads that can be used. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Other values are heuristically determined.
Defaults to -1.

allow.missing.dependent

logical, optional
Specifies whether a missing target value is allowed. FALSE: missing target values are not allowed, and an error occurs if one is present. TRUE: rows with a missing target value are removed.
Defaults to TRUE.

percentage

double, optional
Specifies the percentage of the input data that will be used to build the tree model. The rest of the data will be used for pruning.

Defaults to 1.0.

min.records.of.parent

integer, optional
Specifies the stop condition. If the number of records in one node is less than the specified value, the algorithm stops splitting.

Defaults to 2.

min.records.of.leaf

integer, optional
Specifies the minimum number of records in a leaf node.
Defaults to 1.

max.depth

integer, optional
The maximum depth of a tree. By default it is unlimited.

categorical.variable

character or list of characters, optional
Specifies which integer columns should be treated as categorical. By default, columns of type character are treated as categorical, while columns of type integer or double are treated as continuous. This parameter is valid only for integer columns and is ignored otherwise.
The default value is detected from the input data.

split.threshold

double, optional
Specifies the stop condition for a node.

  • C45 - The information gain ratio of the best split is less than this value.

  • CHAID - The p-value of the best split is greater than or equal to this value.

  • CART - The reduction of Gini index or relative MSE of the best split is less than this value.

The smaller the split.threshold value, the larger a C45 or CART tree grows. Conversely, CHAID grows a larger tree with a larger split.threshold value.

Defaults to 1e-5 for C45 and CART, and 0.05 for CHAID.

discretization.type

character, optional
Specifies the strategy for discretizing continuous attributes. Valid options are mdlpc and equal_freq. Valid only for C45 and CHAID.

Defaults to 'mdlpc'.

bins

list, optional
Specifies the number of bins for discretization as a named list. Each element must be named after a column of data, with its value being the number of bins used to discretize that column.
Only valid when discretization.type is "equal_freq".

Defaults to 10 for each column.
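For instance (a hypothetical sketch; the column names TEMP and HUMIDITY are taken from the Examples section below), the bins argument is an ordinary named R list:

```r
# Named list mapping column names to bin counts for equal-frequency
# discretization; names must match columns of the input DataFrame.
bins <- list(TEMP = 5, HUMIDITY = 8)

# Each element is retrieved by column name:
bins[["TEMP"]]   # 5
names(bins)      # "TEMP" "HUMIDITY"
```

This list would then be passed as bins = bins together with discretization.type = 'equal_freq'.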

max.branch

integer, optional
Specifies the maximum number of branches. Valid only for CHAID.

Defaults to 10.

merge.threshold

double, optional
Specifies the merge condition for CHAID. If the metric value is greater than or equal to the specified value, the algorithm will merge the two branches. Only valid for CHAID.

Defaults to 0.05.

use.surrogate

logical, optional
Indicates whether to use surrogate splits when NULL values are encountered. FALSE: do not use surrogate splits. TRUE: use surrogate splits. Only valid for CART.

Defaults to TRUE.

model.format

character, optional
Specifies the format for storing the tree model. Valid options are json and pmml.

Defaults to 'json'.

output.rules

logical, optional
Specifies whether to output decision rules or not. FALSE will not output decision rules. TRUE will output decision rules.

Defaults to TRUE.

priors

list of (class, prior_probability) pairs, optional
Specifies the prior probability of each class label.
The default value is determined from the data.
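A hypothetical sketch of the priors argument, using the class labels from the Examples section below; each element pairs a class label with its prior probability, and the probabilities should sum to 1:

```r
# List of (class, prior_probability) pairs.
priors <- list(list("Play", 0.6), list("Do not Play", 0.4))

# The second element of each pair is the probability; they sum to 1:
sum(sapply(priors, function(p) p[[2]]))  # 1
```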

output.confusion.matrix

logical, optional
Specifies whether or not to output a confusion matrix. FALSE will not output a confusion matrix. TRUE will output a confusion matrix.

Defaults to TRUE.

Format

R6Class object.

Value

A "DecisionTreeClassifier" object with the following attributes:

Note

Using Summary and Print

Summary provides a general summary of the model output.
Usage: summary(dtc), where dtc is the generated model.

Print provides information on the coefficients and the optional parameter values given by the user.
Usage: print(dtc), where dtc is the generated model.

Examples

## Not run: 
Input DataFrame for training:
> data$Collect()
    OUTLOOK TEMP HUMIDITY WINDY       CLASS
1     Sunny   75       70   Yes        Play
2     Sunny   80       90   Yes Do not Play
3     Sunny   85       85    No Do not Play
4     Sunny   72       95    No Do not Play
5     Sunny   69       70    No        Play
6  Overcast   72       90   Yes        Play
7  Overcast   83       78    No        Play
8  Overcast   64       65   Yes        Play
9  Overcast   81       75    No        Play
10     Rain   71       80   Yes Do not Play
11     Rain   65       70   Yes Do not Play
12     Rain   75       80    No        Play
13     Rain   68       80    No        Play
14     Rain   70       96    No        Play

Creating DecisionTreeClassifier model:
dtc = hanaml.DecisionTreeClassifier(conn, algorithm = 'c45', data = data,
                                    features = list('TEMP', 'HUMIDITY', 'WINDY'),
                                    label = "CLASS", key = NULL,
                                    min.records.of.parent = 2, min.records.of.leaf = 1,
                                    thread.ratio = 0.4, split.threshold = 1e-5,
                                    model.format = 'json', output.rules = TRUE)

Giving input to create a model as a formula:
dtc = hanaml.DecisionTreeClassifier(conn, algorithm = 'c45', data = data,
                                    formula = CLASS~TEMP+HUMIDITY+WINDY, key = "ID",
                                    min.records.of.parent = 2, min.records.of.leaf = 1,
                                    thread.ratio = 0.4, split.threshold = 1e-5,
                                    model.format = 'json', output.rules = TRUE)
> dtc$decision.rules$Collect()
  ROW_INDEX                                                    RULES_CONTENT
0          0                                        (TEMP>=84) => Do not Play
1          1                          (TEMP<84) && (OUTLOOK=Overcast) => Play
2          2          (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY<82.5) => Play
3          3  (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY>=82.5) => Do not Play
4          4        (TEMP<84) && (OUTLOOK=Rain) && (WINDY=Yes) => Do not Play
5          5                (TEMP<84) && (OUTLOOK=Rain) && (WINDY=No) => Play

Input DataFrame for predicting:

> data2$Collect()
  ID   OUTLOOK  HUMIDITY  TEMP WINDY
0   0  Overcast      75.0    70   Yes
1   1      Rain      78.0    70   Yes
2   2     Sunny      66.0    70   Yes
3   3     Sunny      69.0    70   Yes
4   4      Rain       NaN    70   Yes
5   5      None      70.0    70   Yes
6   6       ***      70.0    70   Yes

Performing predict() on given DataFrame:

> result = predict(dtc,data2, verbose=FALSE)
  ID        SCORE  CONFIDENCE
0   0         Play    1.000000
1   1  Do not Play    1.000000
2   2         Play    1.000000
3   3         Play    1.000000
4   4  Do not Play    1.000000
5   5         Play    0.692308
6   6         Play    0.692308

Here:
dtc is the generated model.
data2 is the DataFrame to predict from.

Input DataFrame for scoring:

> data3$Collect()
ID  OUTLOOK TEMP HUMIDITY WINDY       CLASS
 0    Sunny   75       70   Yes        Play
 1    Sunny   NA       90   Yes Do not Play
 2    Sunny   85       NA    No Do not Play
 3    Sunny   72       95    No Do not Play
 4     <NA>   NA       70  <NA>        Play
 5 Overcast   72       90   Yes        Play
 6 Overcast   83       78    No        Play
 7 Overcast   64       65   Yes Do not Play
 8 Overcast   81       75    No        Play

Performing score() on given DataFrame:

> dtc$score(data3)
0.75


## End(Not run)


[Package hana.ml.r version 1.0.8 Index]