Decision Tree Model for Classification

hanaml.DecisionTreeClassifier is an R wrapper for the SAP HANA PAL Decision Tree algorithm.

hanaml.DecisionTreeClassifier(
  algorithm = NULL,
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  formula = NULL,
  thread.ratio = NULL,
  allow.missing.dependent = NULL,
  percentage = NULL,
  min.records.of.parent = NULL,
  min.records.of.leaf = NULL,
  max.depth = NULL,
  categorical.variable = NULL,
  split.threshold = NULL,
  use.surrogate = NULL,
  model.format = NULL,
  discretization.type = NULL,
  bins = NULL,
  max.branch = NULL,
  merge.threshold = NULL,
  priors = NULL,
  output.rules = NULL,
  output.confusion.matrix = NULL,
  evaluation.metric = NULL,
  parameter.range = NULL,
  parameter.values = NULL,
  resampling.method = NULL,
  repeat.times = NULL,
  fold.num = NULL,
  param.search.strategy = NULL,
  random.search.times = NULL,
  timeout = NULL,
  progress.indicator.id = NULL
)

Arguments

algorithm

character
Algorithm used to grow a decision tree.
Valid values are "c45", "chaid", and "cart".

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.

formula

formula type, optional
Formula to be used for model generation, in the format label~<feature_list>, e.g. formula=CATEGORY~V1+V2+V3.
You can either give the formula, or a feature and label combination, but do not provide both.
Defaults to NULL.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads used is then heuristically determined.
Defaults to -1.

allow.missing.dependent

logical, optional
Specifies whether a missing target value is allowed. FALSE does not allow missing target values; an error occurs if one is present. TRUE allows missing target values; records with a missing target are removed.
Defaults to TRUE.

percentage

double, optional
Specifies the percentage of the input data that will be used to build the tree model. The rest of the data will be used for pruning.
Defaults to 1.0.

min.records.of.parent

integer, optional
Specifies the stop condition. If the number of records in one node is less than the specified value, the algorithm stops splitting.
Defaults to 2.

min.records.of.leaf

integer, optional
Specifies the minimum number of records that a leaf must contain.
Defaults to 1.

max.depth

integer, optional
The maximum depth of a tree. By default it is unlimited.

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior is dependent on what input is given:

"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.

split.threshold

double, optional
Specifies the stop condition for a node.

"c45" - The information gain ratio of the best split is less than this value.
"chaid" - The p-value of the best split is greater than or equal to this value.
"cart" - The reduction of Gini index or relative MSE of the best split is less than this value.

The smaller the threshold value is, the larger a c45 or cart tree grows. Conversely, chaid grows a larger tree with a larger threshold value.
Defaults to 1e-5 for c45 and cart, and 0.05 for chaid.
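To make the "cart" stop condition concrete, the following sketch computes a textbook Gini index reduction for a single candidate split (WINDY) on the 14-row table from the Examples section and compares it with the threshold. The helper gini() and all variable names are illustrative, and the exact quantity PAL computes internally may differ from this textbook form.

```r
# Textbook Gini impurity of a vector of class labels (illustrative helper).
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# WINDY and CLASS columns of the example table, in row order.
windy <- c("Yes", "Yes", "No", "No", "No", "Yes", "No",
           "Yes", "No", "Yes", "Yes", "No", "No", "No")
label <- c("Play", "Do not Play", "Do not Play", "Do not Play", "Play",
           "Play", "Play", "Play", "Play", "Do not Play",
           "Do not Play", "Play", "Play", "Play")

# Impurity before the split, and size-weighted impurity after it.
parent <- gini(label)
groups <- split(label, windy)
weighted <- sum(sapply(groups, function(g) length(g) / length(label) * gini(g)))

# The split would be rejected only if the reduction fell below the threshold.
reduction <- parent - weighted
split.threshold <- 1e-5
keep.split <- reduction >= split.threshold
```

Here the reduction is 3/98 (about 0.031), well above 1e-5, so under this textbook reading the split would not trigger the stop condition.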

use.surrogate

logical, optional
Indicates whether to use surrogate split when NULL values are encountered.
FALSE does not use surrogate split. TRUE will use a surrogate split. Only valid for cart.
Defaults to TRUE.

model.format

character, optional
Specifies the format in which the tree model is stored. Valid options are "json" and "pmml".
Defaults to "json".

discretization.type

character, optional
Specifies the strategy for discretizing continuous attributes. Valid options are "mdlpc" and "equal.freq". Valid only for c45 and chaid.
Defaults to "mdlpc".

bins

list, optional
Specifies the number of bins for discretization as a named list. Each element must be named after a column, and its value is the number of bins used to discretize that column.
Only valid when discretization type is "equal_freq".
Defaults to 10 for each column.
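As a sketch of the expected shape, the snippet below builds a bins list for two columns of the example table and then illustrates what equal-frequency binning means using base R quantiles. The bin counts are arbitrary, and the quantile-based cut is only an illustration of the idea, not PAL's discretization routine.

```r
# Named list: column name -> number of bins (counts chosen for illustration).
bins <- list(TEMP = 5, HUMIDITY = 4)

# Every element must carry a non-empty column name.
stopifnot(!is.null(names(bins)), all(nzchar(names(bins))))

# TEMP column of the example table.
temp <- c(75, 80, 85, 72, 69, 72, 83, 64, 81, 71, 65, 75, 68, 70)

# Equal-frequency idea: place the cut points at evenly spaced quantiles
# so that each bin holds roughly the same number of records.
# Heavily tied data could yield duplicate break points, which cut() rejects.
breaks <- quantile(temp, probs = seq(0, 1, length.out = bins$TEMP + 1))
temp.binned <- cut(temp, breaks = breaks, include.lowest = TRUE)
table(temp.binned)
```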

max.branch

integer, optional
Specifies the maximum number of branches. Valid only for chaid.
Defaults to 10.

merge.threshold

double, optional
Specifies the merge condition for chaid. If the metric value is greater than or equal to the specified value, the algorithm will merge the two branches. Only valid for chaid.
Defaults to 0.05.

priors

named list/vector of numerics, optional
Specifies the prior probability of every class label (in the form of class.label = probability).
The default value is determined from the data.
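A minimal sketch of the class.label = probability form, using the two class labels from the example table. The probabilities are made up for illustration, and the check that they sum to 1 is an assumption about sensible input rather than a documented requirement.

```r
# Class labels containing spaces must be quoted when used as names.
priors <- c("Play" = 0.6, "Do not Play" = 0.4)

# Sanity checks (assumed, not documented): named, non-negative, summing to 1.
stopifnot(is.numeric(priors),
          !is.null(names(priors)),
          all(priors >= 0),
          isTRUE(all.equal(sum(priors), 1)))
```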

output.rules

logical, optional
Specifies whether to output decision rules or not. FALSE will not output decision rules. TRUE will output decision rules.
Defaults to TRUE.

output.confusion.matrix

logical, optional
Specifies whether or not to output a confusion matrix. FALSE will not output a confusion matrix. TRUE will output a confusion matrix.
Defaults to TRUE.

evaluation.metric

c("error_rate", "nll", "auc"), optional
Specifies the evaluation metric for model evaluation or parameter selection.
Defaults to "error_rate".

parameter.range

list, optional
Specifies range of the following parameters for parameter selection:
min.records.of.leaf, min.records.of.parent, max.depth, split.threshold, max.branch, merge.threshold.
Parameter range should be specified by 3 numbers in the form of c(start, step, end).
Examples:
parameter.range <- list(max.depth = c(10, 1, 20)).
If param.search.strategy is 'random', then step has no effect and thus can be omitted.
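The c(start, step, end) form can be read as an arithmetic sequence. The sketch below expands two such ranges with base R and counts the resulting grid combinations. It assumes that end is inclusive, and the helper expand.range() is hypothetical; how PAL enumerates candidates internally is not specified here.

```r
# Two ranges in the documented c(start, step, end) form.
parameter.range <- list(max.depth = c(10, 1, 20),
                        min.records.of.leaf = c(1, 2, 9))

# Hypothetical helper: expand one range into its candidate values,
# assuming the end value is included.
expand.range <- function(r) seq(from = r[1], to = r[3], by = r[2])

candidates <- lapply(parameter.range, expand.range)

# A grid search would consider every combination of candidate values.
grid <- expand.grid(candidates)
nrow(grid)
```

Under these assumptions max.depth contributes 11 candidates and min.records.of.leaf contributes 5, so the grid has 55 combinations, which is worth keeping in mind when setting timeout.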

parameter.values

list, optional
Specifies values of the following parameters for parameter selection:
discretization.type, min.records.of.leaf, min.records.of.parent, max.depth, split.threshold, max.branch, merge.threshold.

resampling.method

character, optional
Specifies the resampling method for model evaluation or parameter selection.
Valid options include: "cv", "stratified_cv", "bootstrap", "stratified_bootstrap".
If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.

repeat.times

numeric, optional
Specifies the number of repeat times for resampling.
Defaults to 1.

fold.num

integer, optional
Specifies the number of folds for cross-validation (cv). Mandatory and valid only when resampling.method is "cv" or "stratified_cv".

param.search.strategy

c("grid", "random"), optional
Specifies the method used for parameter selection. If not specified, parameter selection is not triggered.

random.search.times

integer, optional
Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid only when param.search.strategy is "random".

timeout

integer, optional
Specifies maximum running time for model evaluation or parameter selection in seconds. No timeout when 0 is specified.

progress.indicator.id

character, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.
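The dependencies among the resampling and search arguments described above can be checked before calling the function. The helper below, check.search.args(), is hypothetical and not part of the package; it only encodes the constraints stated in this section.

```r
# Hypothetical pre-flight check for the documented argument dependencies.
# Returns a character vector of problems (empty when none are found).
check.search.args <- function(resampling.method = NULL, fold.num = NULL,
                              param.search.strategy = NULL,
                              random.search.times = NULL) {
  problems <- character(0)

  # Without a resampling method, neither evaluation nor selection is activated.
  if (is.null(resampling.method) && !is.null(param.search.strategy)) {
    problems <- c(problems,
                  "param.search.strategy has no effect without resampling.method")
  }

  # fold.num is mandatory for the two cross-validation methods.
  uses.cv <- !is.null(resampling.method) &&
    resampling.method %in% c("cv", "stratified_cv")
  if (uses.cv && is.null(fold.num)) {
    problems <- c(problems, "fold.num is mandatory for cv and stratified_cv")
  }

  # random.search.times is mandatory for the random search strategy.
  uses.random <- !is.null(param.search.strategy) &&
    param.search.strategy == "random"
  if (uses.random && is.null(random.search.times)) {
    problems <- c(problems, "random.search.times is mandatory for random search")
  }

  problems
}

# A consistent grid-search setup yields no problems.
check.search.args(resampling.method = "cv", fold.num = 5,
                  param.search.strategy = "grid")
```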

Value

An R6 object of class "DecisionTreeClassifier", with the following attributes and public methods:

Attributes

model: DataFrame
Trained model content.
decision.rules: DataFrame
Rules for decision tree to make decisions.
confusion.matrix: DataFrame
Confusion matrix used to evaluate the performance of classification algorithms.

Methods

CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)

Usage:


   > dtc <- hanaml.DecisionTreeClassifier(algorithm='cart', data=df)
   > dtc$CreateModelState()

Arguments:

model: DataFrame - DataFrame containing the model for parsing. Defaults to self$model.
algorithm: character - Specifies the PAL algorithm associated with model. Defaults to self$pal.algorithm.
func: character - Specifies the functionality for Unified Classification/Regression. Defaults to self$func.
state.description: character - A summary string for the generated model state. Defaults to "ModelState".
force: logical - Specifies whether or not to replace the existing state for model. Defaults to FALSE.

After calling this method, an attribute named state, which contains the parsed info for model, is assigned to the corresponding R6 object.

DeleteModelState(state=NULL)

Usage:
Assuming we have trained a hanaml model and created its model state, like the following:


   > dtc <- hanaml.DecisionTreeClassifier(algorithm='cart', data=df)
   > dtc$CreateModelState()

After using the model state for real-time scoring, we can delete it by calling


   > dtc$DeleteModelState()

Arguments:

state: DataFrame - DataFrame containing the state info. Defaults to self$state.

After calling this method, the specified model state is cleaned up and the associated memory is released.

Examples

Input DataFrame for training:


 > data$Collect()
    OUTLOOK TEMP HUMIDITY WINDY       CLASS
1     Sunny   75       70   Yes        Play
2     Sunny   80       90   Yes Do not Play
3     Sunny   85       85    No Do not Play
4     Sunny   72       95    No Do not Play
5     Sunny   69       70    No        Play
6  Overcast   72       90   Yes        Play
7  Overcast   83       78    No        Play
8  Overcast   64       65   Yes        Play
9  Overcast   81       75    No        Play
10     Rain   71       80   Yes Do not Play
11     Rain   65       70   Yes Do not Play
12     Rain   75       80    No        Play
13     Rain   68       80    No        Play
14     Rain   70       96    No        Play

Call the function:


> dtc <- hanaml.DecisionTreeClassifier(algorithm = "c45", data = data,
                                       features = list("OUTLOOK", "TEMP", "HUMIDITY", "WINDY"),
                                       label = "CLASS", key = NULL,
                                       min.records.of.parent = 2, min.records.of.leaf = 1,
                                       thread.ratio = 0.4, split.threshold = 1e-5,
                                       model.format = "json", output.rules = TRUE)

OR giving input to create a model as a formula:


> dtc <- hanaml.DecisionTreeClassifier(algorithm = "c45", data = data,
                                       formula = CLASS~OUTLOOK+TEMP+HUMIDITY+WINDY, key = NULL,
                                       min.records.of.parent = 2, min.records.of.leaf = 1,
                                       thread.ratio = 0.4, split.threshold = 1e-5,
                                       model.format = "json", output.rules = TRUE)

Output:


> dtc$decision.rules$Collect()
   ROW_INDEX                                                    RULES_CONTENT
1          0                                        (TEMP>=84) => Do not Play
2          1                          (TEMP<84) && (OUTLOOK=Overcast) => Play
3          2          (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY<82.5) => Play
4          3  (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY>=82.5) => Do not Play
5          4        (TEMP<84) && (OUTLOOK=Rain) && (WINDY=Yes) => Do not Play
6          5                (TEMP<84) && (OUTLOOK=Rain) && (WINDY=No) => Play
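To read the rules above, it can help to transcribe them into a plain R function and apply it to individual records. The function apply.rules() is a hand-written transcription of the six rules shown, not something the package generates; scoring in practice would go through the trained model.

```r
# Hand transcription of the six decision rules shown above.
apply.rules <- function(outlook, temp, humidity, windy) {
  # Rule 0: hot days are never played.
  if (temp >= 84) return("Do not Play")
  # Rule 1: overcast days below 84 degrees are always played.
  if (outlook == "Overcast") return("Play")
  # Rules 2 and 3: sunny days depend on humidity.
  if (outlook == "Sunny") {
    return(if (humidity < 82.5) "Play" else "Do not Play")
  }
  # Rules 4 and 5: rainy days depend on wind.
  if (outlook == "Rain") {
    return(if (windy == "Yes") "Do not Play" else "Play")
  }
  # No rule covers any other OUTLOOK value.
  NA_character_
}

# Row 1 of the training table: Sunny, 75, 70, Yes.
apply.rules("Sunny", 75, 70, "Yes")
```

For row 1 this follows the third rule (TEMP<84, OUTLOOK=Sunny, HUMIDITY<82.5) and returns "Play", matching the CLASS column.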
