hanaml.DecisionTreeRegressor is an R wrapper for the SAP HANA PAL decision tree algorithm.

hanaml.DecisionTreeRegressor(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  formula = NULL,
  thread.ratio = NULL,
  allow.missing.dependent = NULL,
  percentage = NULL,
  min.records.of.parent = NULL,
  min.records.of.leaf = NULL,
  max.depth = NULL,
  categorical.variable = NULL,
  split.threshold = NULL,
  use.surrogate = NULL,
  model.format = NULL,
  output.rules = NULL,
  evaluation.metric = NULL,
  parameter.range = NULL,
  parameter.values = NULL,
  resampling.method = NULL,
  repeat.times = NULL,
  fold.num = NULL,
  param.search.strategy = NULL,
  random.search.times = NULL,
  timeout = NULL,
  progress.indicator.id = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.

formula

formula type, optional
Formula to be used for model generation, in the form label~<feature_list>, e.g. formula=CATEGORY~V1+V2+V3.
Provide either the formula or a features and label combination, but not both.
Defaults to NULL.
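
As a minimal illustration of the label~<feature_list> convention, the base R sketch below splits a formula into its label and feature names. It only shows how such a formula reads; how hanaml.DecisionTreeRegressor parses the formula internally is not stated here, and CATEGORY, V1, V2 and V3 are placeholder column names.

```r
# Base R only; CATEGORY, V1, V2, V3 are placeholder column names.
f <- CATEGORY ~ V1 + V2 + V3

label.name <- all.vars(f)[1]                     # left-hand side: the label
feature.names <- attr(terms(f), "term.labels")   # right-hand side: the features

print(label.name)
print(feature.names)
```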

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads used is then heuristically determined.
Defaults to -1.

allow.missing.dependent

logical, optional
Specifies whether a missing target value is allowed. FALSE: missing target values are not allowed, and an error occurs if one is present. TRUE: missing target values are allowed, and records with a missing target are removed.
Defaults to TRUE.

percentage

double, optional
Specifies the percentage of the input data that will be used to build the tree model. The rest of the data will be used for pruning.
Defaults to 1.0.

min.records.of.parent

integer, optional
Specifies the stop condition. If the number of records in one node is less than the specified value, the algorithm stops splitting.
Defaults to 2.

min.records.of.leaf

integer, optional
Specifies the minimum number of records required in a leaf.
Defaults to 1.

max.depth

integer, optional
The maximum depth of a tree.
By default it is unlimited.

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the column type:

  • "VARCHAR" and "NVARCHAR": categorical

  • "INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.

split.threshold

double, optional
Specifies the stop condition for a node: splitting stops when the reduction in Gini index or relative MSE achieved by the best split is less than this value in the "cart" algorithm. The smaller the value is, the larger a "cart" tree grows.
Defaults to 1e-5.

use.surrogate

logical, optional
Indicates whether to use a surrogate split when NULL values are encountered. FALSE does not use surrogate split. TRUE uses a surrogate split.
Defaults to TRUE.

model.format

character, optional
Specifies the format in which the tree model is stored. Valid options are "json" and "pmml".
Defaults to "json".

output.rules

logical, optional
Specifies whether to output decision rules or not. FALSE does not output decision rules. TRUE outputs decision rules.
Defaults to TRUE.

evaluation.metric

c("rmse", "mae"), optional
Specifies the evaluation metric for model evaluation or parameter selection.
Defaults to "rmse".

parameter.range

list, optional
Specifies the range of the following parameters for parameter selection:
min.records.of.leaf, min.records.of.parent, max.depth, split.threshold.
Parameter range should be specified by 3 numbers in the form of c(start, step, end).
Examples:
parameter.range <- list(split.threshold = c(1e-5, 2e-5, 1e-4)).
If param.search.strategy is "random", then the step has no effect and thus can be omitted.
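
To see which candidate values a c(start, step, end) triple describes, the sketch below expands the example range with base R's seq(). Whether PAL enumerates grid points in exactly this way (for instance, how it treats an end value that does not fall on a step) is an assumption.

```r
# Expand the example range c(start, step, end) into grid candidates.
rng <- c(1e-5, 2e-5, 1e-4)
candidates <- seq(from = rng[1], to = rng[3], by = rng[2])

print(candidates)
print(length(candidates))
```

With seq(), the end value 1e-4 is not itself a candidate here because it does not fall on a step from the start value.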

parameter.values

list, optional
Specifies the values of the following parameters for parameter selection:
min.records.of.leaf, min.records.of.parent, max.depth, split.threshold.
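
For a grid search, the number of candidate combinations is the product of the number of values per parameter. The sketch below counts them with base R's expand.grid(); the named-list shape of parameter.values mirrors the parameter.range example and is an assumption, as are the specific values.

```r
# Hypothetical candidate values; the list shape is assumed, not confirmed.
parameter.values <- list(max.depth = c(3, 5, 7),
                         min.records.of.leaf = c(1, 5))

# One row per combination of candidate values.
grid <- expand.grid(parameter.values)
print(grid)
print(nrow(grid))
```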

resampling.method

character, optional
Specifies the resampling method for model evaluation or parameter selection.
Valid options include: "cv", "bootstrap".
If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.

repeat.times

numeric, optional
Specifies the number of repeat times for resampling.
Defaults to 1.

fold.num

integer, optional
Specifies the number of folds for cross-validation (cv). Mandatory and valid only when resampling.method is "cv".
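
As a conceptual sketch of what fold.num controls, the base R code below assigns records to folds for cross-validation. PAL performs the partitioning on the server and its exact scheme is not described here; the record count and seed are arbitrary.

```r
# Conceptual k-fold assignment in base R; not the PAL implementation.
n.records <- 14   # arbitrary number of training records
fold.num <- 5

set.seed(1)       # arbitrary seed, for a reproducible shuffle
folds <- sample(rep_len(seq_len(fold.num), n.records))

# Each fold is held out once while the model is trained on the others.
fold.sizes <- as.integer(table(folds))
print(fold.sizes)
```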

param.search.strategy

c("grid", "random"), optional
Specifies the search method used for parameter selection. If not specified, parameter selection is not triggered.

random.search.times

integer, optional
Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid only when param.search.strategy is "random".

timeout

integer, optional
Specifies maximum running time for model evaluation or parameter selection in seconds.
No timeout when 0 is specified.

progress.indicator.id

character, optional
Sets the ID of the progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.
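
Putting the parameter-selection arguments together, the sketch below collects a grid-search configuration in a named list and passes it to the constructor with do.call(). The DataFrame name train.df and the specific values are hypothetical; the call is guarded so the snippet runs even without a HANA connection, and only argument names documented above are used.

```r
# Grid search over split.threshold with 5-fold cross-validation.
search.args <- list(
  resampling.method = "cv",
  fold.num = 5,
  evaluation.metric = "rmse",
  param.search.strategy = "grid",
  parameter.range = list(split.threshold = c(1e-5, 2e-5, 1e-4)),
  timeout = 600   # seconds; hypothetical limit
)

# train.df is a hypothetical hanaml DataFrame; skipped when unavailable.
if (exists("hanaml.DecisionTreeRegressor") && exists("train.df")) {
  dtr <- do.call(hanaml.DecisionTreeRegressor,
                 c(list(data = train.df), search.args))
}
```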

Value

An R6 object of class "DecisionTreeRegressor", with the following attributes and public methods:

Attributes

  • model: DataFrame
    Trained model content.

  • decision.rules: DataFrame
    Rules the decision tree uses to make decisions.

Methods

CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)

Usage:


   > dtr <- hanaml.DecisionTreeRegressor(data=df)
   > dtr$CreateModelState()


Arguments:

  • model: DataFrame
    DataFrame containing the model for parsing. Defaults to self$model.

  • algorithm: character
    Specifies the PAL algorithm associated with the model. Defaults to self$pal.algorithm.

  • func: character
    Specifies the functionality for Unified Classification/Regression. Defaults to self$func.

  • state.description: character
    A summary string for the generated model state.
    Defaults to "ModelState".

  • force: logical
    Specifies whether or not to replace the existing state for the model.
    Defaults to FALSE.

After calling this method, an attribute named state, which contains the parsed info for the model, is assigned to the corresponding R6 object.

DeleteModelState(state=NULL)

Usage:
Assuming we have trained a hanaml model and created its model state, like the following:


   > dtr <- hanaml.DecisionTreeRegressor(data=df)
   > dtr$CreateModelState()


After using the model state for real-time scoring, we can delete the state by calling


   > dtr$DeleteModelState()


Arguments:

  • state: DataFrame
    DataFrame containing the state info. Defaults to self$state.

After calling this method, the specified model state is cleaned up and the associated memory is released.

Examples

Input DataFrame data:


> head(data$Collect(),5)
  OUTLOOK TEMP HUMIDITY WINDY CLASS
1   Sunny   75       70   Yes     1
2   Sunny   80       90   Yes     0
3   Sunny   85       85    No     0
4   Sunny   72       95    No     0
5   Sunny   69       70    No     1

Call the function:


> dtr <- hanaml.DecisionTreeRegressor(data,
                                      features = list("OUTLOOK", "TEMP", "HUMIDITY", "WINDY"),
                                      label = "CLASS",
                                      key = NULL,
                                      min.records.of.parent = 2,
                                      min.records.of.leaf = 1,
                                      thread.ratio = 0.4,
                                      split.threshold = 1e-5,
                                      model.format = "pmml",
                                      output.rules = TRUE)

OR call the function with formula:


> dtr <- hanaml.DecisionTreeRegressor(data,
                                      formula = CLASS~OUTLOOK+TEMP+HUMIDITY+WINDY,
                                      key = NULL,
                                      min.records.of.parent = 2,
                                      min.records.of.leaf = 1,
                                      thread.ratio = 0.4,
                                      split.threshold = 1e-5,
                                      model.format = "pmml",
                                      output.rules = TRUE)