hanaml.HGBTClassifier is an R wrapper for SAP HANA PAL HGBT (Hybrid Gradient Boosting Tree).

hanaml.HGBTClassifier(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  formula = NULL,
  n.estimators = NULL,
  random.state = NULL,
  subsample = NULL,
  max.depth = NULL,
  split.threshold = NULL,
  learning.rate = NULL,
  split.method = NULL,
  sketch.eps = NULL,
  fold.num = NULL,
  min.sample.weight.leaf = NULL,
  min.samples.leaf = NULL,
  max.w.in.split = NULL,
  col.subsample.split = NULL,
  col.subsample.tree = NULL,
  lambda = NULL,
  alpha = NULL,
  adopt.prior = NULL,
  evaluation.metric = NULL,
  reference.metric = NULL,
  parameter.range = NULL,
  parameter.values = NULL,
  resampling.method = NULL,
  repeat.times = NULL,
  param.search.strategy = NULL,
  random.search.times = NULL,
  timeout = NULL,
  progress.indicator.id = NULL,
  calculate.importance = NULL,
  calculate.cm = NULL,
  base.score = NULL,
  thread.ratio = NULL,
  categorical.variable = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.

formula

formula type, optional
Formula to be used for model generation, in the format label~<feature_list>, e.g. formula = CATEGORY~V1+V2+V3.
Provide either the formula, or a features and label combination, but not both.
Defaults to NULL.
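As a sketch, the two specification styles can be contrasted as follows (assuming a DataFrame data with the columns from the example at the end of this page):

```r
# 1. Explicit features/label specification:
hgc1 <- hanaml.HGBTClassifier(data = data,
                              features = c("ATT1", "ATT2", "ATT3", "ATT4"),
                              label = "LABEL")

# 2. Formula interface (do not combine with features/label):
hgc2 <- hanaml.HGBTClassifier(data = data,
                              formula = LABEL~ATT1+ATT2+ATT3+ATT4)
```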

n.estimators

integer, optional
Total iteration number, which is equivalent to the number of trees in the final model.
Defaults to 10.

random.state

integer, optional
The seed for random number generation.
0 - uses the current time as the seed.
Others - uses the specified value as the seed.
Defaults to 0.

subsample

double, optional
The sampling rate of rows (data points).
Defaults to 1.0.

max.depth

integer, optional
The maximum depth of a tree.
Defaults to 6.

split.threshold

double, optional
The minimum loss change value to make a split in tree growth (gamma in the equation).
Defaults to 0.

learning.rate

double, optional
Learning rate of each iteration, must be within the range (0, 1).
Defaults to 0.3.

split.method

('exact', 'sketch', 'sampling'), optional
The method for finding the split point for integral features.
- 'exact': tries all possible split points.
- 'sketch': accounts for the distribution of the sum of hessians.
- 'sampling': samples the split point randomly.
The exact method generally yields the highest test accuracy, but costs the most time. The other two methods are computationally more efficient but may lead to lower test accuracy; they are recommended when the training data set is huge.
Valid only for integer features.
Defaults to 'exact'.

sketch.eps

double, optional
The epsilon of the sketch method. It indicates that the sum of hessians between two split points is not larger than this value; that is, the number of bins is approximately 1/eps.
The smaller this value, the more split points are tried.
Defaults to 0.1.

fold.num

integer, optional
Specifies the fold number for cross validation.
Mandatory and valid only when resampling.method is set to "cv" or "stratified_cv".
No default value.

min.sample.weight.leaf

double, optional
The minimum sum of sample weights (hessians) in a leaf node.
Defaults to 1.0.

min.samples.leaf

integer, optional
The minimum number of data points in a leaf node.
Defaults to 1.

max.w.in.split

double, optional
The maximum weight constraint assigned to each tree node.
Defaults to 0 (i.e. no constraint).

col.subsample.split

double, optional
The fraction of features used for each split, should be within range (0, 1].
Defaults to 1.0.

col.subsample.tree

double, optional
The fraction of features used for each tree growth, should be within range (0, 1]
Defaults to 1.0.

lambda

double, optional
Weight of L2 regularization for the target loss function. Should be within range [0, 1].
Defaults to 1.0.

alpha

double, optional
Weight of L1 regularization for the target loss function.
Defaults to 1.0.

adopt.prior

logical, optional
Indicates whether to adopt the prior distribution as the initial point: the average value for a regression problem, and label frequencies for a classification problem.
Defaults to FALSE.

evaluation.metric

character, optional
Specifies the evaluation metric for model evaluation or parameter selection.
Valid values include: "nll", "error_rate", "auc".
Mandatory if resampling.method is set.
No default value.

reference.metric

character or list of characters, optional
A list of reference metrics.
Any element of the list must be a valid option of evaluation.metric.
No default value.

parameter.range

list, optional
Indicates the range of parameters for selection.
Each element is a vector of numbers with the following structure: c(<begin-value>, <step-size>, <end-value>).
All elements must be named, with names being the following valid parameters for model selection:
n.estimators, max.depth, learning.rate, min.sample.weight.leaf, max.w.in.split, col.subsample.split, col.subsample.tree, lambda, alpha, scale.pos.w, base.score.
A simple example for illustration - a list of two vectors:
list(n.estimators = c(4, 2, 10), learning.rate = c(0.1, 0.3, 1))
Valid only when parameter selection is activated.

parameter.values

list, optional
Specifies sets of parameter values for selection.
Each element must be a vector of valid parameter values.
All elements must be named, with names being the following valid parameters for model selection:
n.estimators, max.depth, learning.rate, min.sample.weight.leaf, max.w.in.split, col.subsample.split, col.subsample.tree, lambda, alpha, scale.pos.w, base.score.
A simple example for illustration - a list of two vectors:
list(n.estimators = c(4, 5, 6), learning.rate = c(2.0, 2.5, 3))
Valid only when parameter selection is activated.

resampling.method

character, optional
Specify resampling method for model evaluation or parameter selection.

  • "cv"

  • "stratified_cv"

  • "bootstrap"

  • "stratified_bootstrap"

If no value is specified for this parameter, then no model evaluation or parameter selection will be activated.
No default value.

repeat.times

integer, optional
Specify repeat times for resampling.
Defaults to 1.

param.search.strategy

character, optional
Specifies a value for this parameter to activate parameter selection.

  • "grid"

  • "random"

If this parameter is not set, then only model evaluation is activated.
No default value.

random.search.times

integer, optional
Specifies the number of times candidate parameters are randomly selected.
Mandatory and valid only when param.search.strategy is set to 'random'.
No default value.
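To tie the resampling and search options together, here is a sketch of a call that activates parameter selection via cross validation plus grid search (the hyperparameter names are from this page; the specific values are illustrative only):

```r
# Illustrative sketch: 5-fold cross validation with grid search.
# evaluation.metric is mandatory once resampling.method is set;
# fold.num is mandatory for "cv".
hgc <- hanaml.HGBTClassifier(data = data,
                             label = "LABEL",
                             resampling.method = "cv",
                             fold.num = 5,
                             evaluation.metric = "error_rate",
                             param.search.strategy = "grid",
                             parameter.range = list(
                               learning.rate = c(0.1, 0.2, 0.9)),
                             parameter.values = list(
                               max.depth = c(4, 6, 8)))
```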

timeout

integer, optional
Specify maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified.
Defaults to 0.

progress.indicator.id

character, optional
Set an ID of progress indicator for model evaluation or parameter selection. No progress indicator will be active if no value is provided.
No default value.

calculate.importance

logical, optional
Determines whether to calculate variable importance.
Defaults to TRUE.

calculate.cm

logical, optional
Determines whether to calculate confusion matrix.
Defaults to TRUE.

base.score

double, optional
Initial prediction score for all instances; acts as a global bias. With a sufficient number of iterations, changing this value has little effect.
Defaults to 0.5 for binary classification; 0 otherwise.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread and 1 indicates all available threads. Values between 0 and 1 use up to that percentage of available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads used is then heuristically determined.
Defaults to -1.

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the column's data type:

  • "VARCHAR" and "NVARCHAR": categorical

  • "INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; omitted otherwise.
No default value.
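As a sketch, an INTEGER-typed column can be forced to categorical treatment (the column name REGION_ID here is hypothetical, not part of the example data on this page):

```r
# Hypothetical: suppose data contains an INTEGER column REGION_ID that
# encodes categories. By default it would be treated as continuous;
# listing it in categorical.variable makes HGBT treat it as categorical.
hgc <- hanaml.HGBTClassifier(data = data,
                             label = "LABEL",
                             categorical.variable = "REGION_ID")
```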

Value

A "HGBTClassifier" object with the following attributes:
model DataFrame

  • ROW_INDEX - model row index

  • TREE_INDEX - tree index (-1 indicates the global information)

  • MODEL_CONTENT - model content

feature.importances DataFrame

  • VARIABLE_NAME - Independent variable name

  • IMPORTANCE - Variable importance

confusion.matrix DataFrame

  • ACTUAL_CLASS - The actual class name

  • PREDICTED_CLASS - The predicted class name

  • COUNT - Number of records

stats DataFrame

  • STAT_NAME - Statistics name

  • STAT_VALUE - Statistics value

cv DataFrame

  • PARM_NAME - parameter name

  • INT_VALUE - integer value

  • DOUBLE_VALUE - double value

  • STRING_VALUE - character value

Examples

Input DataFrame data:

> data$Collect()
    ATT1  ATT2   ATT3  ATT4 LABEL
1   1.0  10.0  100.0   1.0     A
2   1.1  10.1  100.0   1.0     A
3   1.2  10.2  100.0   1.0     A
4   1.3  10.4  100.0   1.0     A
5   1.2  10.3  100.0   1.0     A
6   4.0  40.0  400.0   4.0     B
7   4.1  40.1  400.0   4.0     B

Call the function:

> ghc <- hanaml.HGBTClassifier(data = data,
                              features = c("ATT1", "ATT2", "ATT3", "ATT4"),
                              label = "LABEL",
                              n.estimators = 4, split.threshold = 0,
                              learning.rate = 0.5, fold.num = 5, max.depth = 6,
                              evaluation.metric = "error_rate", reference.metric = c("auc"),
                              resampling.method = "cv", param.search.strategy = "grid",
                              parameter.range = list("learning.rate" = c(0.1, 1.0, 3),
                                                     "n.estimators" = c(4, 3, 10),
                                                     "split.threshold" = c(0.1, 0.3, 1.0)))

Output:

> ghc$stats$Collect()

         STAT_NAME STAT_VALUE
1  ERROR_RATE_MEAN   0.133333
2   ERROR_RATE_VAR  0.0266666
3         AUC_MEAN        0.9

See also