hanaml.DecisionTreeClassifier.Rd
hanaml.DecisionTreeClassifier is an R wrapper for the SAP HANA PAL Decision Tree algorithm.
hanaml.DecisionTreeClassifier(
algorithm = NULL,
data = NULL,
key = NULL,
features = NULL,
label = NULL,
formula = NULL,
thread.ratio = NULL,
allow.missing.dependent = NULL,
percentage = NULL,
min.records.of.parent = NULL,
min.records.of.leaf = NULL,
max.depth = NULL,
categorical.variable = NULL,
split.threshold = NULL,
use.surrogate = NULL,
model.format = NULL,
discretization.type = NULL,
bins = NULL,
max.branch = NULL,
merge.threshold = NULL,
priors = NULL,
output.rules = NULL,
output.confusion.matrix = NULL,
evaluation.metric = NULL,
parameter.range = NULL,
parameter.values = NULL,
resampling.method = NULL,
repeat.times = NULL,
fold.num = NULL,
param.search.strategy = NULL,
random.search.times = NULL,
timeout = NULL,
progress.indicator.id = NULL
)
algorithm: character
Algorithm used to grow a decision tree.
Valid values are "c45", "chaid", and "cart".
data: DataFrame
DataFrame containing the data.
key: character, optional
Name of the ID column.
If not provided, the data is assumed to have no ID column.
No default value.
features: character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.
label: character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.
formula: formula type, optional
Formula to be used for model generation.
Format: label~<feature_list>
e.g.: formula=CATEGORY~V1+V2+V3
You can provide either the formula or a feature and label combination, but not both.
Defaults to NULL.
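When the feature names are already held in a character vector, the formula can be assembled programmatically instead of typed out. The following is a minimal sketch using base R's reformulate(); the column names are the placeholders from the e.g. line above, not columns of any particular table.

```r
# Build label ~ feature1 + feature2 + ... from character vectors (base R only).
feature.cols <- c("V1", "V2", "V3")   # placeholder feature names
label.col <- "CATEGORY"               # placeholder label name

model.formula <- reformulate(termlabels = feature.cols, response = label.col)
print(model.formula)

# The formula carries the same information as separate features/label
# arguments, which is why only one of the two styles should be supplied.
formula.vars <- all.vars(model.formula)
```

The resulting object can then be passed as the formula argument in place of features and label.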
thread.ratio: double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread and 1 indicates all available threads. Values between 0 and 1 use up to that percentage of available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads used is then determined heuristically.
Defaults to -1.
allow.missing.dependent: logical, optional
Specifies whether a missing target value is allowed.
FALSE: missing target values are not allowed; an error occurs if one is present.
TRUE: missing target values are allowed; records with a missing target are removed.
Defaults to TRUE.
percentage: double, optional
Specifies the percentage of the input data that will be used to build the tree model.
The rest of the data will be used for pruning.
Defaults to 1.0.
min.records.of.parent: integer, optional
Specifies the stop condition. If the number of records in one node is less
than the specified value, the algorithm stops splitting.
Defaults to 2.
min.records.of.leaf: integer, optional
Guarantees the minimum number of records in a leaf.
Defaults to 1.
max.depth: integer, optional
The maximum depth of a tree.
By default it is unlimited.
categorical.variable: character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the column type:
"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.
Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.
split.threshold: double, optional
Specifies the stop condition for a node:
"c45" - The information gain ratio of the best split is less than this value.
"chaid" - The p-value of the best split is greater than or equal to this value.
"cart" - The reduction of Gini index or relative MSE of the best split is less than this value.
The smaller the threshold value is, the larger a c45 or cart tree grows.
In contrast, chaid grows a larger tree with a larger threshold value.
Defaults to 1e-5 for c45 and cart, and 0.05 for chaid.
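To make the c45 stop condition concrete, the sketch below computes the information gain ratio of a split on OUTLOOK for the 14-row weather table shown in the example section, then compares it with the default threshold. It uses the textbook C4.5 definition in base R; how PAL computes the ratio internally is not stated on this page, so treat this as an illustration of the quantity rather than a reproduction of PAL's arithmetic.

```r
# Class counts per OUTLOOK value, taken from the 14-row weather table
# in the example section (NotPlay stands for "Do not Play").
counts <- rbind(
  Sunny    = c(Play = 2, NotPlay = 3),
  Overcast = c(Play = 4, NotPlay = 0),
  Rain     = c(Play = 3, NotPlay = 2)
)

# Shannon entropy in bits; zero counts contribute nothing.
entropy <- function(n) {
  p <- n[n > 0] / sum(n)
  -sum(p * log2(p))
}

n.total <- sum(counts)
parent.entropy <- entropy(colSums(counts))
child.entropy <- sum(rowSums(counts) / n.total * apply(counts, 1, entropy))

info.gain <- parent.entropy - child.entropy   # reduction in entropy
split.info <- entropy(rowSums(counts))        # entropy of the partition sizes
gain.ratio <- info.gain / split.info

# Textbook reading of the stop rule: stop when the ratio is below the threshold.
split.threshold <- 1e-5
keep.splitting <- gain.ratio >= split.threshold
```

Under this definition the ratio for OUTLOOK is far above 1e-5, so a threshold that small rarely stops a split on its own; raising it prunes weak splits earlier.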
use.surrogate: logical, optional
Indicates whether to use a surrogate split when NULL values are encountered.
FALSE does not use a surrogate split.
TRUE uses a surrogate split.
Only valid for cart.
Defaults to TRUE.
model.format: character, optional
Specifies the format in which the tree model is stored.
Valid options are "json" and "pmml".
Defaults to "json".
discretization.type: character, optional
Specifies the strategy for discretizing continuous attributes.
Valid options are "mdlpc" and "equal.freq".
Valid only for c45 and chaid.
Defaults to "mdlpc".
bins: list
Specifies the number of bins for discretization as a list.
Each element in the list must be named, with the name being a column name and the value being the number of bins for discretizing that column.
Only valid when discretization.type is "equal_freq".
Defaults to 10 for each column.
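A short sketch of the expected shape of bins, together with a base R illustration of what equal-frequency discretization means for one column. The TEMP values come from the weather table in the example section; the quantile-based cut points are the generic technique and are only an assumption about what PAL does internally.

```r
# Named list: names are column names, values are the number of bins.
bins <- list(TEMP = 3, HUMIDITY = 4)

# TEMP values from the weather table in the example section.
temp <- c(75, 80, 85, 72, 69, 72, 83, 64, 81, 71, 65, 75, 68, 70)

# Equal-frequency cut points: quantiles at evenly spaced probabilities.
breaks <- quantile(temp, probs = seq(0, 1, length.out = bins$TEMP + 1))
temp.binned <- cut(temp, breaks = breaks, include.lowest = TRUE)

# Each bin should hold roughly the same number of records.
bin.sizes <- table(temp.binned)
print(bin.sizes)
```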
max.branch: integer, optional
Specifies the maximum number of branches.
Valid only for chaid.
Defaults to 10.
merge.threshold: double, optional
Specifies the merge condition for chaid.
If the metric value is greater than or equal to the specified value, the algorithm merges the two branches.
Only valid for chaid.
Defaults to 0.05.
priors: named list/vector of numerics, optional
Specifies the prior probability of every class label (in the form of class.label = probability).
The default value is determined from the data.
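The named-vector form can be sketched as follows, using the two class labels from the weather table in the example section. The empirical frequencies are computed only to show what "determined from the data" could look like; that the supplied probabilities should sum to 1 is an assumption made here, not something this page states.

```r
# Class labels as they appear in the CLASS column of the example table.
class.labels <- c(rep("Play", 9), rep("Do not Play", 5))

# Empirical class frequencies, one plausible data-derived default.
empirical <- prop.table(table(class.labels))
print(empirical)

# Explicit priors in the form class.label = probability.
# Labels containing spaces must be quoted.
priors <- c("Play" = 0.5, "Do not Play" = 0.5)

# Sanity checks before passing priors to the classifier (assumed requirements).
priors.sum.ok <- isTRUE(all.equal(sum(priors), 1))
labels.ok <- setequal(names(priors), names(empirical))
```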
output.rules: logical, optional
Specifies whether to output decision rules or not.
FALSE will not output decision rules.
TRUE will output decision rules.
Defaults to TRUE.
output.confusion.matrix: logical, optional
Specifies whether or not to output a confusion matrix.
FALSE will not output a confusion matrix.
TRUE will output a confusion matrix.
Defaults to TRUE.
evaluation.metric: c("error_rate", "nll", "auc"), optional
Specifies the evaluation metric for model evaluation or parameter selection.
Defaults to "error_rate".
parameter.range: list, optional
Specifies the range of the following parameters for parameter selection: min.records.of.leaf, min.records.of.parent, max.depth, split.threshold, max.branch, merge.threshold.
The parameter range should be specified by 3 numbers in the form of c(start, step, end).
Example:
parameter.range <- list(max.depth = c(10, 1, 20))
If param.search.strategy is "random", then step has no effect and thus can be omitted.
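To see which candidates a c(start, step, end) triple describes, it can be expanded with seq() in base R. This is a sketch; treating the end value as inclusive is an assumption, since this page does not say how PAL handles the upper bound.

```r
# Range specification in the documented c(start, step, end) form.
parameter.range <- list(max.depth = c(10, 1, 20))

# Expand one triple into the explicit candidate values (end assumed inclusive).
expand.range <- function(r) seq(from = r[1], to = r[3], by = r[2])

candidates <- lapply(parameter.range, expand.range)
print(candidates$max.depth)
```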
parameter.values: list, optional
Specifies values of the following parameters for parameter selection: discretization.type, min.records.of.leaf, min.records.of.parent, max.depth, split.threshold, max.branch, merge.threshold.
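When several parameters are listed, the number of combinations grows multiplicatively. The sketch below counts them with expand.grid() in base R, assuming a grid search evaluates the full Cartesian product of the supplied values; the specific values are illustrative.

```r
# Explicit candidate values per parameter (illustrative values).
parameter.values <- list(
  max.depth = c(5, 10, 15),
  min.records.of.leaf = c(1, 5)
)

# One row per parameter combination a full grid would cover.
grid <- expand.grid(parameter.values)
n.candidates <- nrow(grid)
print(grid)
```

Counting the rows first is a cheap way to judge whether a timeout or a random search is the better fit.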
resampling.method: character, optional
Specifies the resampling method for model evaluation or parameter selection.
Valid options include: "cv", "stratified_cv", "bootstrap", "stratified_bootstrap".
If no value is specified for this parameter, neither model evaluation
nor parameter selection is activated.
repeat.times: numeric, optional
Specifies the number of times resampling is repeated.
Defaults to 1.
fold.num: integer, optional
Specifies the fold number for cross-validation (cv).
Mandatory and valid only when resampling.method is "cv" or "stratified_cv".
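For readers unfamiliar with the role of the fold number, here is a generic k-fold assignment in base R for a table of 14 rows. It illustrates plain (non-stratified) cross-validation only and makes no claim about how PAL partitions the data.

```r
set.seed(42)   # fixed seed so the assignment is reproducible

n <- 14        # number of records, as in the example table
fold.num <- 5  # number of folds

# Assign each record to one fold, keeping fold sizes as even as possible.
fold.id <- sample(rep(seq_len(fold.num), length.out = n))
fold.sizes <- table(fold.id)

# In round 1, fold 1 is held out for evaluation and the rest is used to train.
first.test <- which(fold.id == 1)
first.train <- setdiff(seq_len(n), first.test)
```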
param.search.strategy: c("grid", "random"), optional
Specifies the method to activate parameter selection.
If not specified, parameter selection is not triggered.
random.search.times: integer, optional
Specifies the number of times to randomly select candidate parameters for selection.
Mandatory and valid only when param.search.strategy is "random".
timeout: integer, optional
Specifies the maximum running time, in seconds, for model evaluation or parameter selection.
No timeout when 0 is specified.
progress.indicator.id: character, optional
Sets the ID of the progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.
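The resampling and search arguments work together, so it can help to see them assembled in one place. The sketch below builds the argument list for a grid search over max.depth with 5-fold cross-validation. All argument names come from the usage signature above; the values are illustrative, and weather.df is a hypothetical name for a HANA DataFrame. The call itself runs only if both the function and that object exist in the session.

```r
# Arguments for a grid search over max.depth with 5-fold cross-validation.
# Values are illustrative only.
search.args <- list(
  algorithm = "cart",
  label = "CLASS",
  resampling.method = "cv",
  fold.num = 5,
  evaluation.metric = "error_rate",
  param.search.strategy = "grid",
  parameter.values = list(max.depth = c(3, 5, 7)),
  timeout = 600
)

# weather.df is a hypothetical HANA DataFrame; skip the call when the
# hanaml package or the data is not available.
if (exists("hanaml.DecisionTreeClassifier") && exists("weather.df")) {
  dtc <- do.call(hanaml.DecisionTreeClassifier,
                 c(list(data = weather.df), search.args))
}
```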
An R6 object of class "DecisionTreeClassifier", with the following attributes and
public methods:
Attributes
model: DataFrame
Trained model content.
decision.rules: DataFrame
Rules for decision tree to make decisions.
confusion.matrix: DataFrame
Confusion matrix used to evaluate the performance of classification algorithms.
Methods
CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)
Usage:
> dtc <- hanaml.DecisionTreeClassifier(algorithm='cart', data=df)
> dtc$CreateModelState()
Arguments:
model: DataFrame
- DataFrame containing the model for parsing. Defaults to self$model.
algorithm: character
- Specifies the PAL algorithm associated with model. Defaults to self$pal.algorithm.
func: character
- Specifies the functionality for Unified Classification/Regression. Defaults to self$func.
state.description: character
- A summary string for the generated model state. Defaults to "ModelState".
force: logical
- Specifies whether or not to replace the existing state for model. Defaults to FALSE.
After calling this method, an attribute state that contains the parsed info for model is assigned to the corresponding R6 object.
DeleteModelState(state=NULL)
Usage:
Assuming we have trained a hanaml model and created its model state, like the following:
> dtc <- hanaml.DecisionTreeClassifier(algorithm='cart', data=df)
> dtc$CreateModelState()
After using the model state for real-time scoring, we can delete it by calling:
> dtc$DeleteModelState()
Arguments:
state: DataFrame
- DataFrame containing the state info. Defaults to self$state.
After calling this method, the specified model state is cleaned up and the associated memory is released.
Input DataFrame for training:
> data$Collect()
    OUTLOOK TEMP HUMIDITY WINDY       CLASS
1     Sunny   75       70   Yes        Play
2     Sunny   80       90   Yes Do not Play
3     Sunny   85       85    No Do not Play
4     Sunny   72       95    No Do not Play
5     Sunny   69       70    No        Play
6  Overcast   72       90   Yes        Play
7  Overcast   83       78    No        Play
8  Overcast   64       65   Yes        Play
9  Overcast   81       75    No        Play
10     Rain   71       80   Yes Do not Play
11     Rain   65       70   Yes Do not Play
12     Rain   75       80    No        Play
13     Rain   68       80    No        Play
14     Rain   70       96    No        Play
Call the function:
> dtc <- hanaml.DecisionTreeClassifier(algorithm = "c45", data = data,
                                       features = list("OUTLOOK", "TEMP", "HUMIDITY", "WINDY"),
                                       label = "CLASS", key = NULL,
                                       min.records.of.parent = 2, min.records.of.leaf = 1,
                                       thread.ratio = 0.4, split.threshold = 1e-5,
                                       model.format = "json", output.rules = TRUE)
Or, giving the input to create a model as a formula:
> dtc <- hanaml.DecisionTreeClassifier(algorithm = "c45", data = data,
                                       formula = CATEGORY~V1+V2+V3, key = "ID",
                                       min.records.of.parent = 2, min.records.of.leaf = 1,
                                       thread.ratio = 0.4, split.threshold = 1e-5,
                                       model.format = "json", output.rules = TRUE)
Output:
> dtc$decision.rules$Collect()
ROW_INDEX RULES_CONTENT
1 0 (TEMP>=84) => Do not Play
2 1 (TEMP<84) && (OUTLOOK=Overcast) => Play
3 2 (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY<82.5) => Play
4 3 (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY>=82.5) => Do not Play
5 4 (TEMP<84) && (OUTLOOK=Rain) && (WINDY=Yes) => Do not Play
6 5 (TEMP<84) && (OUTLOOK=Rain) && (WINDY=No) => Play
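Once collected into a local data frame, the RULES_CONTENT strings can be split into their condition and predicted class with base R. The sketch below works on three rule strings copied from the output above; the column names of the parsed result (condition, prediction, n.tests) are chosen here for illustration and are not part of the hanaml API.

```r
# Three rule strings copied from the RULES_CONTENT column shown above.
rules <- c(
  "(TEMP>=84) => Do not Play",
  "(TEMP<84) && (OUTLOOK=Overcast) => Play",
  "(TEMP<84) && (OUTLOOK=Rain) && (WINDY=No) => Play"
)

# Split each rule at the arrow into its condition and predicted class.
parts <- strsplit(rules, " => ", fixed = TRUE)
parsed <- data.frame(
  condition = vapply(parts, `[`, character(1), 1),
  prediction = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

# Number of tests along the path, which equals the depth of the leaf.
parsed$n.tests <- lengths(strsplit(parsed$condition, " && ", fixed = TRUE))
print(parsed)
```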