hanaml.HGBTClassifier is an R wrapper for SAP HANA PAL HGBT (Hybrid Gradient Boosting Tree).

hanaml.HGBTClassifier(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  formula = NULL,
  n.estimators = NULL,
  random.state = NULL,
  subsample = NULL,
  max.depth = NULL,
  split.threshold = NULL,
  learning.rate = NULL,
  split.method = NULL,
  sketch.eps = NULL,
  fold.num = NULL,
  min.sample.weight.leaf = NULL,
  min.samples.leaf = NULL,
  max.w.in.split = NULL,
  col.subsample.split = NULL,
  col.subsample.tree = NULL,
  lambda = NULL,
  alpha = NULL,
  adopt.prior = NULL,
  evaluation.metric = NULL,
  reference.metric = NULL,
  parameter.range = NULL,
  parameter.values = NULL,
  resampling.method = NULL,
  repeat.times = NULL,
  param.search.strategy = NULL,
  random.search.times = NULL,
  timeout = NULL,
  progress.indicator.id = NULL,
  calculate.importance = NULL,
  calculate.cm = NULL,
  base.score = NULL,
  thread.ratio = NULL,
  categorical.variable = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.

formula

formula type, optional
Formula to be used for model generation, in the format label~<feature_list>, e.g. formula = CATEGORY~V1+V2+V3.
Provide either the formula, or a features and label combination, but not both.
Defaults to NULL.
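As a sketch, the two specification styles can be contrasted as follows (assuming a DataFrame data with the columns from the example at the end of this page):

```r
# 1. Explicit features/label specification:
hgc1 <- hanaml.HGBTClassifier(data = data,
                              features = c("ATT1", "ATT2", "ATT3", "ATT4"),
                              label = "LABEL")

# 2. Formula interface (do not combine with features/label):
hgc2 <- hanaml.HGBTClassifier(data = data,
                              formula = LABEL~ATT1+ATT2+ATT3+ATT4)
```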

n.estimators

integer, optional
Total iteration number, which is equivalent to the number of trees in the final model.
Defaults to 10.

random.state

integer, optional
The seed for random number generation.
0 - uses the current time as the seed.
Others - uses the specified value as the seed.
Defaults to 0.

subsample

double, optional
The sampling rate of rows (data points).
Defaults to 1.0.

max.depth

integer, optional
The maximum depth of a tree.
Defaults to 6.

split.threshold

double, optional
The minimum loss change value to make a split in tree growth (gamma in the equation).
Defaults to 0.

learning.rate

double, optional
Learning rate of each iteration, must be within the range (0, 1).
Defaults to 0.3.

split.method

('exact', 'sketch', 'sampling'), optional
The method for finding the split point for integral features.
- 'exact': tries all possible split points.
- 'sketch': accounts for the distribution of the sum of hessians.
- 'sampling': samples the split point randomly.
The exact method generally yields the highest test accuracy, but costs the most time. The other two methods are computationally more efficient but may lead to lower test accuracy; they are recommended when the training data set is huge.
Valid only for integer features.
Defaults to 'exact'.

sketch.eps

double, optional
The epsilon of the sketch method. It indicates that the sum of hessians between two split points is not larger than this value; that is, the number of bins is approximately 1/eps.
The smaller this value, the more split points are tried.
Defaults to 0.1.

fold.num

integer, optional
Specifies the fold number for cross validation.
Mandatory and valid only when resampling.method is set to "cv" or "stratified_cv".
No default value.

min.sample.weight.leaf

double, optional
The minimum sum of sample weights (hessians) in a leaf node.
Defaults to 1.0.

min.samples.leaf

integer, optional
The minimum number of data points in a leaf node.
Defaults to 1.

max.w.in.split

double, optional
The maximum weight constraint assigned to each tree node.
Defaults to 0 (i.e. no constraint).

col.subsample.split

double, optional
The fraction of features used for each split, should be within range (0, 1].
Defaults to 1.0.

col.subsample.tree

double, optional
The fraction of features used for each tree growth, should be within range (0, 1]
Defaults to 1.0.

lambda

double, optional
Weight of L2 regularization for the target loss function. Should be within range [0, 1].
Defaults to 1.0.

alpha

double, optional
Weight of L1 regularization for the target loss function.
Defaults to 1.0.

adopt.prior

logical, optional
Indicates whether to adopt the prior distribution as the initial point: the average value for a regression problem, and label frequencies for a classification problem.
Defaults to FALSE.

evaluation.metric

character, optional
Specifies the evaluation metric for model evaluation or parameter selection.
Valid values include: "nll", "error_rate", "auc".
Mandatory if resampling.method is set.
No default value.

reference.metric

character or list of characters, optional
A list of reference metrics.
Any element of the list must be a valid option of evaluation.metric.
No default value.

parameter.range

list, optional
Indicates the range of parameters for selection.
Each element is a vector of numbers with the following structure: c(<begin-value>, <step-size>, <end-value>).
All elements must be named, with names being the following valid parameters for model selection:
n.estimators, max.depth, learning.rate, min.sample.weight.leaf, max.w.in.split, col.subsample.split, col.subsample.tree, lambda, alpha, scale.pos.w, base.score.
A simple example for illustration - a list of two vectors:
list(n.estimators = c(4, 2, 10), learning.rate = c(0.1, 0.3, 1))
Valid only when parameter selection is activated.

parameter.values

list, optional
Specifies sets of parameter values for selection.
Each element must be a vector of valid parameter values.
All elements must be named, with names being the following valid parameters for model selection:
n.estimators, max.depth, learning.rate, min.sample.weight.leaf, max.w.in.split, col.subsample.split, col.subsample.tree, lambda, alpha, scale.pos.w, base.score.
A simple example for illustration - a list of two vectors:
list(n.estimators = c(4, 5, 6), learning.rate = c(2.0, 2.5, 3))
Valid only when parameter selection is activated.

resampling.method

character, optional
Specify resampling method for model evaluation or parameter selection.

  • "cv"

  • "stratified_cv"

  • "bootstrap"

  • "stratified_bootstrap"

If no value is specified for this parameter, then no model evaluation or parameter selection will be activated.
No default value.

repeat.times

integer, optional
Specify repeat times for resampling.
Defaults to 1.

param.search.strategy

character, optional
Specifies a value for this parameter to activate parameter selection.

  • "grid"

  • "random"

If this parameter is not set, then only model evaluation is activated.
No default value.

random.search.times

integer, optional
Specifies the number of times candidate parameters are randomly selected.
Mandatory and valid only when param.search.strategy is set to 'random'.
No default value.
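To tie the resampling and search options together, here is a sketch of a call that activates parameter selection via cross validation plus grid search (the hyperparameter names are from this page; the specific values are illustrative only):

```r
# Illustrative sketch: 5-fold cross validation with grid search.
# evaluation.metric is mandatory once resampling.method is set;
# fold.num is mandatory for "cv".
hgc <- hanaml.HGBTClassifier(data = data,
                             label = "LABEL",
                             resampling.method = "cv",
                             fold.num = 5,
                             evaluation.metric = "error_rate",
                             param.search.strategy = "grid",
                             parameter.range = list(
                               learning.rate = c(0.1, 0.2, 0.9)),
                             parameter.values = list(
                               max.depth = c(4, 6, 8)))
```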

timeout

integer, optional
Specify maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified.
Defaults to 0.

progress.indicator.id

character, optional
Set an ID of progress indicator for model evaluation or parameter selection. No progress indicator will be active if no value is provided.
No default value.

calculate.importance

logical, optional
Determines whether to calculate variable importance.
Defaults to TRUE.

calculate.cm

logical, optional
Determines whether to calculate confusion matrix.
Defaults to TRUE.

base.score

double, optional
Initial prediction score for all instances; acts as a global bias. With a sufficient number of iterations, changing this value has little effect.
Defaults to 0.5 for binary classification; 0 otherwise.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread and 1 indicates all available threads. Values between 0 and 1 use up to that percentage of available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads used is then heuristically determined.
Defaults to -1.

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the column's data type:

  • "VARCHAR" and "NVARCHAR": categorical

  • "INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; omitted otherwise.
No default value.
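As a sketch, an INTEGER-typed column can be forced to categorical treatment (the column name REGION_ID here is hypothetical, not part of the example data on this page):

```r
# Hypothetical: suppose data contains an INTEGER column REGION_ID that
# encodes categories. By default it would be treated as continuous;
# listing it in categorical.variable makes HGBT treat it as categorical.
hgc <- hanaml.HGBTClassifier(data = data,
                             label = "LABEL",
                             categorical.variable = "REGION_ID")
```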

Value

A "HGBTClassifier" object with the following attributes:
model DataFrame

  • ROW_INDEX - model row index

  • TREE_INDEX - tree index (-1 indicates the global information)

  • MODEL_CONTENT - model content

feature.importances DataFrame

  • VARIABLE_NAME - Independent variable name

  • IMPORTANCE - Variable importance

confusion.matrix DataFrame

  • ACTUAL_CLASS - The actual class name

  • PREDICTED_CLASS - The predicted class name

  • COUNT - Number of records

stats DataFrame

  • STAT_NAME - Statistics name

  • STAT_VALUE - Statistics value

cv DataFrame

  • PARM_NAME - parameter name

  • INT_VALUE - integer value

  • DOUBLE_VALUE - double value

  • STRING_VALUE - character value

Examples

Input DataFrame data:

> data$Collect()
    ATT1  ATT2   ATT3  ATT4 LABEL
1   1.0  10.0  100.0   1.0     A
2   1.1  10.1  100.0   1.0     A
3   1.2  10.2  100.0   1.0     A
4   1.3  10.4  100.0   1.0     A
5   1.2  10.3  100.0   1.0     A
6   4.0  40.0  400.0   4.0     B
7   4.1  40.1  400.0   4.0     B

Call the function:

> ghc <- hanaml.HGBTClassifier(data = data,
                              features = c("ATT1", "ATT2", "ATT3", "ATT4"),
                              label = "LABEL",
                              n.estimators = 4, split.threshold = 0,
                              learning.rate = 0.5, fold.num = 5, max.depth = 6,
                              evaluation.metric = "error_rate", reference.metric = c("auc"),
                              resampling.method = "cv", param.search.strategy = "grid",
                              parameter.range = list("learning.rate" = c(0.1, 1.0, 3),
                                                     "n.estimators" = c(4, 3, 10),
                                                     "split.threshold" = c(0.1, 0.3, 1.0)))

Output:

> ghc$stats$Collect()

         STAT_NAME STAT_VALUE
1  ERROR_RATE_MEAN   0.133333
2   ERROR_RATE_VAR  0.0266666
3         AUC_MEAN        0.9

See also