hanaml.HGBTClassifier is an R wrapper for SAP HANA PAL HGBT (Hybrid Gradient Boosting Tree).

hanaml.HGBTClassifier(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  formula = NULL,
  n.estimators = NULL,
  random.state = NULL,
  subsample = NULL,
  max.depth = NULL,
  split.threshold = NULL,
  learning.rate = NULL,
  split.method = NULL,
  sketch.eps = NULL,
  fold.num = NULL,
  min.sample.weight.leaf = NULL,
  min.samples.leaf = NULL,
  max.w.in.split = NULL,
  col.subsample.split = NULL,
  col.subsample.tree = NULL,
  lambda = NULL,
  alpha = NULL,
  adopt.prior = NULL,
  evaluation.metric = NULL,
  reference.metric = NULL,
  parameter.range = NULL,
  parameter.values = NULL,
  resampling.method = NULL,
  repeat.times = NULL,
  param.search.strategy = NULL,
  random.search.times = NULL,
  timeout = NULL,
  progress.indicator.id = NULL,
  calculate.importance = NULL,
  calculate.cm = NULL,
  base.score = NULL,
  thread.ratio = NULL,
  categorical.variable = NULL,
  obj.func = NULL,
  replace.missing = NULL,
  default.missing.direction = NULL,
  feature.grouping = NULL,
  tol.rate = NULL,
  compression = NULL,
  max.bits = NULL,
  model = NULL,
  warm.start = NULL,
  max.bin.num = NULL,
  resource = NULL,
  max.resource = NULL,
  min.resource.rate = NULL,
  reduction.rate = NULL,
  aggressive.elimination = NULL,
  validation.set.rate = NULL,
  stratified.validation.set = NULL,
  tolerant.iter.num = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

features

character or list of characters, optional
Name of feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.

formula

formula type, optional
Formula to be used for model generation, in the format label~<feature_list>, e.g. formula = CATEGORY~V1+V2+V3.
You can either give the formula, or a feature and label combination, but do not provide both.
Defaults to NULL.
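
As a sketch, the two equivalent ways of specifying the target and features could look as follows (the DataFrame df and the columns CATEGORY, V1, V2, V3 are hypothetical):

> # Hypothetical DataFrame df with columns CATEGORY, V1, V2 and V3.
> # Either use the formula interface ...
> hgc <- hanaml.HGBTClassifier(data = df, formula = CATEGORY~V1+V2+V3)
> # ... or an explicit label/features combination (not both).
> hgc <- hanaml.HGBTClassifier(data = df, label = "CATEGORY",
                               features = c("V1", "V2", "V3"))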

n.estimators

integer, optional
Total iteration number, which is equivalent to the number of trees in the final model.
Defaults to 10.

random.state

integer, optional
The seed for random number generation.
0: uses the current time as the seed.
Others: uses the specified value as the seed.
Defaults to 0.

subsample

double, optional
The sampling rate of rows (data points).
Defaults to 1.0.

max.depth

integer, optional
The maximum depth of a tree.
Defaults to 6.

split.threshold

double, optional
The minimum loss change value to make a split in tree growth (gamma in the equation).
Defaults to 0.

learning.rate

double, optional.
Learning rate of each iteration, must be within the range (0, 1).
Defaults to 0.3.

split.method

c("exact", "sketch", "sampling", "histogram"), optional
The method of finding split points for numerical features.

  • "exact": trying all possible points.

  • "sketch": accounting for the distribution of the sum of hessian.

  • "sampling": samples the split point randomly.

  • "histogram": builds histogram upon data and uses it as split point.

The exact method generally achieves the highest test accuracy, but costs the most time. The other three methods have higher computational efficiency but may lead to lower test accuracy; they are intended for cases where the training data set is very large.
Valid only for numerical features.
Defaults to "exact".

sketch.eps

double, optional
The epsilon of the sketch method. It indicates that the sum of Hessian values between two adjacent split points is not larger than this value, so the number of bins is approximately 1/eps.
The smaller this value is, the more split points are tried.
Defaults to 0.1.

fold.num

integer, optional
Specify fold number for cross validation method.
Mandatory and valid only when resampling.method is specified with a valid value that starts with "cv".
No default value.

min.sample.weight.leaf

double, optional
The minimum summation of sample weights (Hessian values) in a leaf node.
Defaults to 1.0.

min.samples.leaf

integer, optional
The minimum number of data points in a leaf node.
Defaults to 1.

max.w.in.split

double, optional
The maximum weight constraint assigned to each tree node.
Defaults to 0 (i.e. no constraint).

col.subsample.split

double, optional
The fraction of features used for each split, should be within range (0, 1].
Defaults to 1.0.

col.subsample.tree

double, optional
The fraction of features used for each tree growth, should be within range (0, 1].
Defaults to 1.0.

lambda

double, optional
Weight of L2 regularization for the target loss function. Should be within range [0, 1].
Defaults to 1.0.

alpha

double, optional
Weight of L1 regularization for the target loss function.
Defaults to 1.0.

adopt.prior

logical, optional
Indicates whether to adopt the prior distribution as the initial point. Specifically, for a classification problem the frequencies of labels are used for initialization.
Defaults to FALSE.

evaluation.metric

character, optional
Specify evaluation metric for model evaluation or parameter selection.
Valid values include: "nll", "error_rate", "auc".
It is mandatory if resampling.method is set.
No default value.

reference.metric

character or list of characters, optional
A list of reference metrics.
Any element of the list must be a valid option of evaluation.metric.
No default value.

parameter.range

list, optional
Indicates the range of parameters for selection.
Each element is a vector of numbers with the following structure: c(<begin-value>, <step-size>, <end-value>).
All elements must be named, with names being the following valid parameters for model selection:
n.estimators, max.depth, learning.rate, min.sample.weight.leaf, max.w.in.split, col.subsample.split, col.subsample.tree, lambda, alpha, scale.pos.w, base.score. Simple example for illustration - a list of two vectors
list(n.estimators = c(4, 2, 10), learning.rate = c(0.1, 0.3, 1))
Valid only when parameter selection is activated.

parameter.values

list, optional
Indicates the values of parameters for selection.
Each element must be a vector of valid parameter values.
All elements must be named, with names being the following valid parameters for model selection:
n.estimators, max.depth, learning.rate, min.sample.weight.leaf, max.w.in.split, col.subsample.split, col.subsample.tree, lambda, alpha, scale.pos.w, base.score. Simple example for illustration - a list of two vectors
list(n.estimators = c(4, 5, 6), learning.rate = c(0.1, 0.2, 0.3))
Valid only when parameter selection is activated.
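
For illustration, a grid-search setup built on such a list might look like the following sketch (the DataFrame df, the label column name and all concrete values are assumptions; see resampling.method and param.search.strategy below):

> hgc <- hanaml.HGBTClassifier(data = df, label = "LABEL",
                               resampling.method = "cv", fold.num = 5,
                               evaluation.metric = "error_rate",
                               param.search.strategy = "grid",
                               parameter.values = list(n.estimators = c(4, 5, 6),
                                                       learning.rate = c(0.1, 0.2, 0.3)))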

resampling.method

character, optional
Specify resampling method for model evaluation or parameter selection.
Valid options include: "cv", "cv_sha", "cv_hyperband", "bootstrap", "bootstrap_sha", "bootstrap_hyperband", "stratified_cv", "stratified_cv_sha", "stratified_cv_hyperband", "stratified_bootstrap", "stratified_bootstrap_sha", "stratified_bootstrap_hyperband".
Resampling methods that end with "sha" (representing successive halving) or "hyperband" are for parameter selection only, not for model evaluation.
If no value is specified for this parameter, then no model evaluation or parameter selection will be activated.
No default value.

repeat.times

integer, optional
Specify repeat times for resampling.
Defaults to 1.

param.search.strategy

character, optional
Specify a value for this parameter to activate parameter selection. Valid options are:

  • "grid"

  • "random"

If this parameter is not set, then only model evaluation is activated.
No default value.

random.search.times

integer, optional
Specifies the number of times to randomly select candidate parameters for selection.
Mandatory and valid when param.search.strategy is set to 'random'.
No default value.
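
A random-search variant might look like this sketch (df and the concrete values are again assumptions):

> hgc <- hanaml.HGBTClassifier(data = df, label = "LABEL",
                               resampling.method = "cv", fold.num = 5,
                               evaluation.metric = "error_rate",
                               param.search.strategy = "random",
                               random.search.times = 10,
                               parameter.range = list(n.estimators = c(4, 2, 10)))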

timeout

integer, optional
Specify maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified.
Defaults to 0.

progress.indicator.id

character, optional
Set an ID of progress indicator for model evaluation or parameter selection. No progress indicator will be active if no value is provided.
No default value.

calculate.importance

logical, optional
Determines whether to calculate variable importance.
Defaults to TRUE.

calculate.cm

logical, optional
Determines whether to calculate confusion matrix.
Defaults to TRUE.

base.score

double, optional
Initial prediction score for all instances. It acts as a global bias; with a sufficient number of iterations, changing this value will not have much effect.
Defaults to 0.5 for binary classification; 0 otherwise.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads used is then determined heuristically.
Defaults to -1.

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the data type of the column:

  • "VARCHAR" and "NVARCHAR": categorical

  • "INTEGER" and "DOUBLE": continuous.

Valid only for variables of INTEGER type; omitted otherwise.
No default value.
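
As a sketch, forcing an INTEGER column to be treated as categorical could look like this (df and the column name GRADE are hypothetical):

> hgc <- hanaml.HGBTClassifier(data = df, label = "LABEL",
                               categorical.variable = "GRADE")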

obj.func

character, optional
Specifies the objective function to optimize, valid options include:

  • "logistic" : Logistic loss function(for binary classification)

  • "hinge" : Hinge loss function(for binary classification)

  • "softmax" : Softmax function(for multi-classification)

Defaults to "logistic" for binary classification, and "softmax" for multi-classification.

replace.missing

logical, optional
Specifies whether or not to replace missing values in a feature by another value.
If set to TRUE, the replacement value is the mean for a continuous feature, and the mode (i.e. the most frequent value) for a categorical feature.
Defaults to TRUE.

default.missing.direction

c("left", "right"), optional
Specifies the default direction that missing values go to during tree splitting.
Defaults to "right".

feature.grouping

logical, optional
Indicates whether or not to group sparse features such that at most one of them contains a significant value in each row.
Defaults to FALSE.

tol.rate

numeric, optional
When feature grouping is applied, features are still merged even if some rows contain more than one significant value, provided that the proportion of such rows does not exceed the value specified in tol.rate.
Defaults to 0.0001.

compression

logical, optional
Indicates whether or not the trained model should be compressed.
Defaults to FALSE.

max.bits

integer, optional
Specifies the maximum number of bits to quantize continuous features.
Equivalent to using 2^max.bits bins. The value must be less than 31.
Defaults to 12.
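
For example, requesting a compressed model with continuous features quantized into 2^10 = 1024 bins could look like this sketch (df is hypothetical):

> hgc <- hanaml.HGBTClassifier(data = df, label = "LABEL",
                               compression = TRUE, max.bits = 10)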

model

DataFrame, optional
The model used for warm start.
Defaults to NULL.

warm.start

logical, optional
When set to TRUE, the provided model is used as the starting point, and additional trees are trained on the new input data.
If no model is provided, an error is raised.
Defaults to FALSE.

max.bin.num

integer, optional
Specifies the maximum number of bins for the histogram method.
Decreasing this number improves running time at the cost of accuracy.
Valid only when split.method is "histogram".
Defaults to 256.

resource

c("data.size", "n.estimators"), optional
Specifies the resource type for the successive-halving (SHA) or hyperband method:

  • "data.size": uses the size of the input data as the resource.

  • "n.estimators": uses the number of trees in the final model as the resource.

Valid only when resampling.method is specified and ends with either "sha" or "hyperband"(e.g. "cv_sha", "bootstrap_hyperband").
Defaults to "data.size".

max.resource

integer, optional
Specifies the maximum number of trees allowed in use for SHA or hyperband method.
Mandatory when resource is set as "n.estimators", and valid when resampling.method is specified and ends with either "sha" or "hyperband"(e.g. "cv_sha", "bootstrap_hyperband").
No default value.

min.resource.rate

numeric, optional
Specifies the minimum resource rate that should be used in SHA or hyperband iteration.
Valid only when resampling.method is specified with a valid option that ends with "sha" or "hyperband"(e.g. "cv_sha", "bootstrap_hyperband").
Defaults to 0.0 if resource is specified as "data.size"(or not specified), and defaults to 1/max.resource when resource is specified as "n.estimators".

reduction.rate

numeric, optional
Specifies reduction rate in SHA or Hyperband method.
Valid when resampling.method is specified and ends with either "sha" or "hyperband"(e.g. "cv_sha", "bootstrap_hyperband").
Defaults to 3.0.

aggressive.elimination

logical, optional
Specifies whether to apply aggressive elimination while using SHA method.

  • FALSE: do not apply aggressive elimination.

  • TRUE: apply aggressive elimination.

Valid only when resampling.method is specified and ends with "sha".
Defaults to FALSE.
Note: Aggressive elimination is applied when the data size and the number of parameter candidates to be searched do not match, i.e. many candidates remain to be searched while the data size has already reached its upper limit. If aggressive elimination is applied, the lower bound of the data size is used multiple times first to reduce the number of candidate parameters.
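
Putting the SHA-related parameters together, a successive-halving setup might look like the following sketch (df and all concrete values are assumptions):

> hgc <- hanaml.HGBTClassifier(data = df, label = "LABEL",
                               resampling.method = "cv_sha", fold.num = 5,
                               evaluation.metric = "error_rate",
                               param.search.strategy = "grid",
                               parameter.range = list(max.depth = c(3, 1, 8)),
                               resource = "n.estimators",
                               max.resource = 20,
                               reduction.rate = 3.0,
                               aggressive.elimination = FALSE)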

validation.set.rate

numeric, optional
Specifies the rate of the validation set sampled from the data set.
If 0.0 is set, then no early stop is applied.
Defaults to 0.0.

stratified.validation.set

logical, optional
Specifies whether to apply stratified method when sampling the validation set.
Valid only when validation.set.rate is greater than 0.0.
Defaults to FALSE.

tolerant.iter.num

integer, optional
Specifies how many consecutive deteriorating iterations must be observed before early stop is applied.
Valid only when validation.set.rate is greater than 0.0.
Defaults to 10.

Value

An R6 object of class "HGBTClassifier" with the following attributes and methods:

Attributes

model: DataFrame

  • ROW_INDEX - model row index

  • TREE_INDEX - tree index (-1 indicates the global information)

  • MODEL_CONTENT - model content

feature.importances: DataFrame

  • VARIABLE_NAME - Independent variable name

  • IMPORTANCE - Variable importance

confusion.matrix: DataFrame

  • ACTUAL_CLASS - The actual class name

  • PREDICTED_CLASS - The predicted class name

  • COUNT - Number of records

stats: DataFrame

  • STAT_NAME - Statistics name

  • STAT_VALUE - Statistics value

cv: DataFrame

  • PARM_NAME - parameter name

  • INT_VALUE - integer value

  • DOUBLE_VALUE - double value

  • STRING_VALUE - character value

Methods

CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)

Usage:


   > hgc <- hanaml.HGBTClassifier(data=df)
   > hgc$CreateModelState()


Arguments:

  • model: DataFrame
    DataFrame containing the model for parsing.
    Defaults to self$model.

  • algorithm: character
    Specifies the PAL algorithm associated with model.
    Defaults to self$pal.algorithm.

  • func: character
    Specifies the functionality for Unified Classification/Regression.
    Valid only for object instance of R6Class "UnifiedClassification" or "UnifiedRegression".
    Defaults to self$func.

  • state.description: character
    A summary string for the generated model state.
    Defaults to "ModelState".

  • force: logical
    Specifies whether or not to replace the existing model state.
    Defaults to FALSE.

After calling this method, an attribute state containing the parsed information of the model is assigned to the corresponding R6 object.

DeleteModelState(state=NULL)

Usage:
Assuming we have trained a hanaml model and created its model state, like the following:


   > hgc <- hanaml.HGBTClassifier(data=df)
   > hgc$CreateModelState()


After using the model state for real-time scoring, we can delete the state by calling:


   > hgc$DeleteModelState()


Arguments:

  • state: DataFrame
    DataFrame containing the state info.
    Defaults to self$state.

After calling this method, the specified model state is cleaned up and the associated memory is released.

Feature Grouping

It is common that a data set contains sparse features, i.e. features in which a large part of the data is insignificant (zero or nearly zero).
A set of features can also be sparse as a whole, meaning that at most one of them contains significant data in each data row. This usually happens with features that measure similar quantities.
For example, 3 features A, B and C may appear as follows:

  A     B      C
  1.1   0.0    0.0
  0.0   0.0    2.5
  0.0   0.0    0.0
  0.0  -10.2   0.0

In the above case, features A, B and C can be grouped into one feature, since in each row at most one datum is registered.
It can both reduce memory usage and accelerate the training process.
Finding the exact sets of features that satisfy the requirement of feature grouping is a complicated problem, so HGBT employs a greedy algorithm that finds such sets approximately.
The requirement on features that can be grouped can also be relaxed, so that a limited number of violations are accepted.

Relevant Parameters: feature.grouping, tol.rate

  • Must specify feature.grouping = TRUE to activate feature grouping.

  • Specify the maximum ratio of rows that can violate the requirement for feature grouping using tol.rate.
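
A minimal sketch of activating feature grouping (the DataFrame df and the chosen tolerance are assumptions):

> hgc <- hanaml.HGBTClassifier(data = df, label = "LABEL",
                               feature.grouping = TRUE,
                               tol.rate = 0.001)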

Histogram Splitting

One optimization HGBT applies while splitting nodes is to use histograms to accelerate the training process.
It is an approximate algorithm that reduces both the time cost and the memory cost.
To be specific, when HGBT tries to split a tree node, it first builds a histogram of that node by putting feature values into bins, and then evaluates splitting points using these bins. Because the number of bins is usually far smaller than the number of data points in the node, this method accelerates the splitting process considerably. Although building the histogram still requires visiting all data in the node, it is a faster process because it only involves scanning the data and adding things up. Another optimization in building histograms is that the histogram of a node can always be built by subtracting the histogram of its sibling from the histogram of its parent. So we can always choose to build the histogram of the node that contains less data, and obtain the histogram of its sibling by subtraction, which costs less time.

Relevant Parameters: split.method, max.bin.num.

  • Set split.method = "histogram" to use histogram splitting.

  • As mentioned before, histogram splitting is an approximate algorithm that does not evaluate all potential splitting points, so the number of bins used when building the histogram becomes important. Parameter max.bin.num serves exactly this purpose. The larger this parameter is set, the more potential splitting points are evaluated, and the more time is needed.
    The default value of max.bin.num is 256. It is suggested to use this default value first, then adjust it according to the fitting result of the model.
    As for categorical features, although histogram splitting cannot be applied to them directly, HGBT combines sparse categories if the number of categories exceeds max.bin.num, thereby reducing the number of categories.
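
A minimal sketch of histogram splitting with a reduced bin number (df is hypothetical):

> hgc <- hanaml.HGBTClassifier(data = df, label = "LABEL",
                               split.method = "histogram",
                               max.bin.num = 128)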

Early Stop

Early stop is a technique to stop model training before the model becomes too complicated and overfits the training data.
Basically, it continuously monitors the generalization performance of the model on a separate dataset called the validation dataset. In HGBT, the validation dataset is obtained by sampling from the input dataset (while the remaining part is used for training).

Relevant Parameters: validation.set.rate, stratified.validation.set (for classification only), and tolerant.iter.num

  • Parameter validation.set.rate determines the sampling rate of the validation dataset from the input data.

  • Parameter stratified.validation.set determines whether or not to apply stratified sampling method w.r.t. class label of the input data when sampling the validation dataset from the input data. This parameter is applicable to classification only.

  • Parameter tolerant.iter.num determines the number of successive deteriorating iterations before early stopping.
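
A minimal sketch of enabling early stop with a stratified 20% validation set (df and the concrete values are assumptions):

> hgc <- hanaml.HGBTClassifier(data = df, label = "LABEL",
                               n.estimators = 100,
                               validation.set.rate = 0.2,
                               stratified.validation.set = TRUE,
                               tolerant.iter.num = 5)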

Examples

Input DataFrame data:


> data$Collect()
    ATT1  ATT2   ATT3  ATT4 LABEL
1   1.0  10.0  100.0   1.0     A
2   1.1  10.1  100.0   1.0     A
3   1.2  10.2  100.0   1.0     A
4   1.3  10.4  100.0   1.0     A
5   1.2  10.3  100.0   1.0     A
6   4.0  40.0  400.0   4.0     B
7   4.1  40.1  400.0   4.0     B

Call the function:


> hgc <- hanaml.HGBTClassifier(data = data,
                               features = c("ATT1", "ATT2", "ATT3", "ATT4"),
                               label = "LABEL",
                               n.estimators = 4, split.threshold = 0,
                               learning.rate = 0.5, fold.num = 5, max.depth = 6,
                               evaluation.metric = "error.rate", reference.metric = c("auc"),
                               parameter.range = list("learning.rate" = c(0.1, 1.0, 3),
                                                       "n.estimators" = c(4, 3, 10),
                                                       "split.threshold" = c(0.1, 0.3, 1.0)))

Output:


> hgc$stats$Collect()

         STAT_NAME STAT_VALUE
1  ERROR_RATE_MEAN   0.133333
2   ERROR_RATE_VAR  0.0266666
3         AUC_MEAN        0.9

If you want to use warm.start, you can provide the trained model via hgc$model:


> hgc2 <- hanaml.HGBTClassifier(data = df.reg,
                       key = "ID",
                       n.estimators = 6,
                       model = hgc$model,
                       warm.start = TRUE)