hanaml.DecisionTreeClassifier {hana.ml.r}    R Documentation

Decision Tree Model for Classification

Description

hanaml.DecisionTreeClassifier is an R wrapper for the PAL decision tree algorithm.

Usage

hanaml.DecisionTreeClassifier (conn.context, algorithm,
                              data = NULL,
                              key = NULL,
                              features = NULL,
                              label = NULL,
                              formula = NULL,
                              thread.ratio = NULL,
                              allow.missing.dependent = NULL, percentage = NULL,
                              min.records.of.parent = NULL,
                              min.records.of.leaf = NULL, max.depth = NULL,
                              categorical.variable = NULL,
                              split.threshold = NULL, use.surrogate = NULL,
                              model.format = NULL,
                              discretization.type = NULL,
                              bins = NULL, max.branch = NULL,
                              merge.threshold = NULL,
                              priors = NULL, output.rules = NULL,
                              output.confusion.matrix = NULL)

Arguments

conn.context

ConnectionContext
The connection to the SAP HANA system.

algorithm

character
Algorithm used to grow a decision tree. Valid values are c45, chaid, and cart.

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column of data. If not provided, then data is assumed to have no ID column.

features

list of character, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label

character, optional
Name of the column in data that specifies the dependent variable. Defaults to the last non-ID column if not specified.

formula

formula type, optional
Formula to be used for model generation, in the format label~<feature_list>, e.g. formula = CATEGORY~V1+V2+V3. Provide either a formula or a features/label combination, but not both.
Defaults to NULL.
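As a sketch (assuming an existing ConnectionContext conn and a DataFrame data; the column names CATEGORY, V1, V2, and V3 are hypothetical), the two mutually exclusive call styles look like:

```r
# Style 1: explicit feature and label columns.
dtc <- hanaml.DecisionTreeClassifier(conn, algorithm = 'c45', data = data,
                                     features = list('V1', 'V2', 'V3'),
                                     label = 'CATEGORY')

# Style 2: equivalent R formula; do not combine with features/label.
dtc <- hanaml.DecisionTreeClassifier(conn, algorithm = 'c45', data = data,
                                     formula = CATEGORY~V1+V2+V3)
```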

thread.ratio

double, optional
Controls the proportion of available threads that can be used. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Other values are heuristically determined.
Defaults to -1.

allow.missing.dependent

logical, optional
Specifies whether a missing target value is allowed. FALSE: missing target values are not allowed, and an error occurs if one is present. TRUE: rows with a missing target value are removed.
Defaults to TRUE.

percentage

double, optional
Specifies the percentage of the input data that will be used to build the tree model. The rest of the data will be used for pruning.

Defaults to 1.0.

min.records.of.parent

integer, optional
Specifies the stop condition. If the number of records in one node is less than the specified value, the algorithm stops splitting.

Defaults to 2.

min.records.of.leaf

integer, optional
Specifies the minimum number of records in a leaf node.
Defaults to 1.

max.depth

integer, optional
The maximum depth of a tree. By default it is unlimited.

categorical.variable

character or list of characters, optional
Specifies which integer columns should be treated as categorical. By default, columns of type character are treated as categorical, while columns of type integer or double are treated as continuous. This parameter is valid only for integer columns and is ignored otherwise.
The default value is detected from the input data.

split.threshold

double, optional
Specifies the stop condition for a node.

  • C45 - The information gain ratio of the best split is less than this value.

  • CHAID - The p-value of the best split is greater than or equal to this value.

  • CART - The reduction of Gini index or relative MSE of the best split is less than this value.

The smaller the split.threshold value, the larger a C45 or CART tree grows. Conversely, CHAID grows a larger tree with a larger split.threshold value.

Defaults to 1e-5 for C45 and CART, and 0.05 for CHAID.

discretization.type

character, optional
Specifies the strategy for discretizing continuous attributes. Valid options are mdlpc and equal_freq. Valid only for C45 and CHAID.

Defaults to 'mdlpc'.

bins

list, optional
Specifies the number of bins for discretization as a named list. Each element must be named after a column of data, with its value being the number of bins used to discretize that column.
Only valid when discretization.type is "equal_freq".

Defaults to 10 for each column.
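For instance (a hypothetical sketch; the column names TEMP and HUMIDITY are taken from the Examples section below), the bins argument is an ordinary named R list:

```r
# Named list mapping column names to bin counts for equal-frequency
# discretization; names must match columns of the input DataFrame.
bins <- list(TEMP = 5, HUMIDITY = 8)

# Each element is retrieved by column name:
bins[["TEMP"]]   # 5
names(bins)      # "TEMP" "HUMIDITY"
```

This list would then be passed as bins = bins together with discretization.type = 'equal_freq'.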

max.branch

integer, optional
Specifies the maximum number of branches. Valid only for CHAID.

Defaults to 10.

merge.threshold

double, optional
Specifies the merge condition for CHAID. If the metric value is greater than or equal to the specified value, the algorithm will merge the two branches. Only valid for CHAID.

Defaults to 0.05.

use.surrogate

logical, optional
Indicates whether to use surrogate splits when NULL values are encountered. FALSE: do not use surrogate splits. TRUE: use surrogate splits. Only valid for CART.

Defaults to TRUE.

model.format

character, optional
Specifies the format for storing the tree model. Valid options are json and pmml.

Defaults to 'json'.

output.rules

logical, optional
Specifies whether to output decision rules or not. FALSE will not output decision rules. TRUE will output decision rules.

Defaults to TRUE.

priors

list of (class, prior_probability) pairs, optional
Specifies the prior probability of each class label.
The default value is determined from the data.
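A hypothetical sketch of the priors argument, using the class labels from the Examples section below; each element pairs a class label with its prior probability, and the probabilities should sum to 1:

```r
# List of (class, prior_probability) pairs.
priors <- list(list("Play", 0.6), list("Do not Play", 0.4))

# The second element of each pair is the probability; they sum to 1:
sum(sapply(priors, function(p) p[[2]]))  # 1
```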

output.confusion.matrix

logical, optional
Specifies whether or not to output a confusion matrix. FALSE will not output a confusion matrix. TRUE will output a confusion matrix.

Defaults to TRUE.

Format

R6Class object.

Value

A "DecisionTreeClassifier" object with the following attributes:

Note

Using Summary and Print

Summary provides a general summary of the model output.
Usage: summary(dtc), where dtc is the generated model.

Print provides information on the coefficients and the optional parameter values given by the user.
Usage: print(dtc), where dtc is the generated model.

Examples

## Not run: 
Input DataFrame for training:
> data$Collect()
    OUTLOOK TEMP HUMIDITY WINDY       CLASS
1     Sunny   75       70   Yes        Play
2     Sunny   80       90   Yes Do not Play
3     Sunny   85       85    No Do not Play
4     Sunny   72       95    No Do not Play
5     Sunny   69       70    No        Play
6  Overcast   72       90   Yes        Play
7  Overcast   83       78    No        Play
8  Overcast   64       65   Yes        Play
9  Overcast   81       75    No        Play
10     Rain   71       80   Yes Do not Play
11     Rain   65       70   Yes Do not Play
12     Rain   75       80    No        Play
13     Rain   68       80    No        Play
14     Rain   70       96    No        Play

Creating DecisionTreeClassifier model:
dtc = hanaml.DecisionTreeClassifier(conn, algorithm = 'c45', data = data,
                                    features = list('TEMP', 'HUMIDITY', 'WINDY'),
                                    label = "CLASS", key = NULL,
                                    min.records.of.parent = 2, min.records.of.leaf = 1,
                                    thread.ratio = 0.4, split.threshold = 1e-5,
                                    model.format = 'json', output.rules = TRUE)

Giving input to create a model as a formula:
dtc = hanaml.DecisionTreeClassifier(conn, algorithm = 'c45', data = data,
                                    formula = CLASS~TEMP+HUMIDITY+WINDY, key = "ID",
                                    min.records.of.parent = 2, min.records.of.leaf = 1,
                                    thread.ratio = 0.4, split.threshold = 1e-5,
                                    model.format = 'json', output.rules = TRUE)
> dtc$decision.rules$Collect()
  ROW_INDEX                                                    RULES_CONTENT
0          0                                        (TEMP>=84) => Do not Play
1          1                          (TEMP<84) && (OUTLOOK=Overcast) => Play
2          2          (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY<82.5) => Play
3          3  (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY>=82.5) => Do not Play
4          4        (TEMP<84) && (OUTLOOK=Rain) && (WINDY=Yes) => Do not Play
5          5                (TEMP<84) && (OUTLOOK=Rain) && (WINDY=No) => Play

Input DataFrame for predicting:

> data2$Collect()
  ID   OUTLOOK  HUMIDITY  TEMP WINDY
0   0  Overcast      75.0    70   Yes
1   1      Rain      78.0    70   Yes
2   2     Sunny      66.0    70   Yes
3   3     Sunny      69.0    70   Yes
4   4      Rain       NaN    70   Yes
5   5      None      70.0    70   Yes
6   6       ***      70.0    70   Yes

Performing predict() on given DataFrame:

> result = predict(dtc,data2, verbose=FALSE)
  ID        SCORE  CONFIDENCE
0   0         Play    1.000000
1   1  Do not Play    1.000000
2   2         Play    1.000000
3   3         Play    1.000000
4   4  Do not Play    1.000000
5   5         Play    0.692308
6   6         Play    0.692308

Here:
dtc is the generated model.
data2 is the DataFrame to predict from.

Input DataFrame for scoring:

> data3$Collect()
ID  OUTLOOK TEMP HUMIDITY WINDY       CLASS
 0    Sunny   75       70   Yes        Play
 1    Sunny   NA       90   Yes Do not Play
 2    Sunny   85       NA    No Do not Play
 3    Sunny   72       95    No Do not Play
 4     <NA>   NA       70  <NA>        Play
 5 Overcast   72       90   Yes        Play
 6 Overcast   83       78    No        Play
 7 Overcast   64       65   Yes Do not Play
 8 Overcast   81       75    No        Play

Performing score() on given DataFrame:

> dtc$score(data3)
0.75


## End(Not run)


[Package hana.ml.r version 1.0.8 Index]