hanaml.HGBTClassifier {hana.ml.r}    R Documentation

Hybrid Gradient Boosting Tree Classifier

Description

Hybrid Gradient Boosting model for classification.

Usage

hanaml.HGBTClassifier(conn.context,
                         data = NULL,
                         key = NULL,
                         features = NULL,
                         label = NULL,
                         formula = NULL,
                         n.estimators = NULL,
                         random.state = NULL,
                         subsample = NULL,
                         max.depth = NULL,
                         split.threshold = NULL,
                         learning.rate = NULL,
                         split.method = NULL,
                         sketch.eps = NULL,
                         fold.num = NULL,
                         min.sample.weight.leaf = NULL,
                         min.samples.leaf = NULL,
                         max.w.in.split = NULL,
                         col.subsample.split = NULL,
                         col.subsample.tree = NULL,
                         lambda = NULL,
                         alpha = NULL,
                         evaluation.metric = NULL,
                         reference.metric = NULL,
                         parameter.range = NULL,
                         parameter.values = NULL,
                         resampling.method = NULL,
                         repeat.times = NULL,
                         param.search.strategy = NULL,
                         random.search.times = NULL,
                         timeout = NULL,
                         progress.indicator.id = NULL,
                         calculate.importance = NULL,
                         calculate.cm = NULL,
                         base.score = NULL,
                         thread.ratio = NULL,
                         categorical.variable = NULL)

Arguments

conn.context

ConnectionContext
Connection to the SAP HANA system.

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column.
If not provided, it is assumed that the input has no ID column.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-ID, non-label columns.

label

character, optional
Name of the dependent variable.
Defaults to the last column.

formula

formula type, optional
Formula to be used for model generation, in the format label~<feature_list>, e.g. formula = LABEL~V1+V2+V3. Provide either the formula, or a features/label combination, but not both.
Defaults to NULL, i.e. no formula shall be provided.
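As a hypothetical sketch (the column names V1, V2, V3 and LABEL are illustrative), the two equivalent specifications look like this:

```r
# Specify the target and features via a formula ...
hgc1 <- hanaml.HGBTClassifier(conn.context = conn, data = df,
                              formula = LABEL~V1+V2+V3)
# ... or, equivalently, via features and label (never both at once):
hgc2 <- hanaml.HGBTClassifier(conn.context = conn, data = df,
                              features = c("V1", "V2", "V3"),
                              label = "LABEL")
```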

n.estimators

integer, optional
Total iteration number, which is equivalent to the number of trees in the final model.
Defaults to 10.

random.state

integer, optional
The seed for random number generation.
0 - uses the current time as the seed.
Others - uses the given value as the seed.
Defaults to 0.

subsample

double, optional
The sampling rate of rows (data points).
Defaults to 1.0.

max.depth

integer, optional
The maximum depth of a tree.
Defaults to 6.

split.threshold

double, optional
The minimum loss change required to make a split in tree growth (gamma in the equation).
Defaults to 0.

learning.rate

double, optional
Learning rate of each iteration, must be within the range (0, 1).
Defaults to 0.3.

split.method

('exact', 'sketch', 'sampling'), optional
The method for finding the split point of integer features.
- 'exact': tries all possible split points.
- 'sketch': accounts for the distribution of the sum of hessians.
- 'sampling': samples the split point randomly.
The exact method generally gives the highest test accuracy but costs the most time; the other two methods are computationally more efficient, may lead to lower test accuracy, and are intended for huge training data sets.
Valid only for integer features.
Defaults to 'exact'.

sketch.eps

double, optional
The epsilon of the sketch method. The sum of hessians between two adjacent split points is not larger than this value, so the number of bins is approximately 1/eps.
The smaller this value, the more split points are tried.
Defaults to 0.1.

fold.num

integer, optional
Specifies the fold number for the cross-validation method.
Mandatory and valid only when resampling.method is set to 'cv' or 'stratified_cv'.
No default value.

min.sample.weight.leaf

double, optional
The minimum summation of sample weights (hessian) in leaf node.
Defaults to 1.0.

min.samples.leaf

integer, optional
The minimum number of data in a leaf node.
Defaults to 1.

max.w.in.split

double, optional
The maximum weight constraint assigned to each tree node.
Defaults to 0 (i.e. no constraint).

col.subsample.split

double, optional
The fraction of features used for each split, should be within range (0, 1].
Defaults to 1.0.

col.subsample.tree

double, optional
The fraction of features used for each tree growth, should be within range (0, 1]
Defaults to 1.0.

lambda

double, optional
Weight of L2 regularization for the target loss function. Should be within range [0, 1].
Defaults to 1.0.

alpha

double, optional
Weight of L1 regularization for the target loss function.
Defaults to 1.0.

evaluation.metric

character, optional
Specify evaluation metric for model evaluation or parameter selection.
Classification: 'rmse', 'mae', 'nll', 'error_rate', 'auc'.
Regression: 'rmse', 'mae'.
It is mandatory if resampling.method is set.
No default value.

reference.metric

character or list of characters, optional
A list of reference metrics.
Any element of the list must be a valid option of evaluation.metric.
No default value.

parameter.range

list, optional
Indicates the range of parameters for selection.
Each element is a numeric vector with the structure c(<begin-value>, <step-size>, <end-value>). All elements must be named, with names being the following valid parameters for model selection:
n.estimators, max.depth, learning.rate, min.sample.weight.leaf, max.w.in.split, col.subsample.split, col.subsample.tree, lambda, alpha, scale.pos.w, base.score. A simple example with two entries:
list(n.estimators = c(4, 2, 10), learning.rate = c(0.1, 0.3, 1))
Valid only when parameter selection is activated.

parameter.values

list, optional
Indicates the values of parameters for selection.
Each element must be a vector of valid parameter values.
All elements must be named, with names being the following valid parameters for model selection:
n.estimators, max.depth, learning.rate, min.sample.weight.leaf, max.w.in.split, col.subsample.split, col.subsample.tree, lambda, alpha, scale.pos.w, base.score. A simple example with two entries:
list(n.estimators = c(4, 5, 6), learning.rate = c(2.0, 2.5, 3))
Valid only when parameter selection is activated.

resampling.method

character, optional
Specify resampling method for model evaluation or parameter selection.

  • 'cv'

  • 'stratified_cv'

  • 'bootstrap'

  • 'stratified_bootstrap'

If no value is specified for this parameter, then no model evaluation or parameter selection will be activated.
No default value.

repeat.times

integer, optional
Specify repeat times for resampling.
Defaults to 1.

param.search.strategy

character, optional
Specify a value for this parameter to activate parameter selection.

  • 'grid'

  • 'random'

If this parameter is not set, then only model evaluation is activated.
No default value.
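As a sketch (all argument values are illustrative), grid-based parameter selection could be activated together with cross-validation like this:

```r
# Grid search over candidate values, evaluated via 5-fold cross-validation
model <- hanaml.HGBTClassifier(conn.context = conn, data = df,
                               label = "LABEL",
                               resampling.method = "cv", fold.num = 5,
                               evaluation.metric = "error_rate",
                               param.search.strategy = "grid",
                               parameter.values = list(n.estimators = c(4, 6, 8),
                                                       learning.rate = c(0.1, 0.3, 0.5)))
```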

random.search.times

integer, optional
Specifies the number of times to randomly select candidate parameters for selection.
Mandatory and valid only when param.search.strategy is set to 'random'.
No default value.

timeout

integer, optional
Specify maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified.
Defaults to 0.

progress.indicator.id

character, optional
Set an ID of progress indicator for model evaluation or parameter selection. No progress indicator will be active if no value is provided.
No default value.

calculate.importance

logical, optional
Determines whether to calculate variable importance.
Defaults to TRUE.

calculate.cm

logical, optional
Determines whether to calculate the confusion matrix.
Defaults to TRUE.

base.score

double, optional
Initial prediction score for all instances; acts as a global bias. With a sufficient number of iterations, changing this value has little effect.
Defaults to 0.5 for binary classification; 0 otherwise.

thread.ratio

double, optional
The ratio of available threads used for training.
0: single thread;
(0, 1): percentage of available threads;
others: heuristically determined.
Defaults to -1.

categorical.variable

character or list of characters, optional
Indicates which variables are treated as categorical.
The default behaviour is: string - categorical; integer and double - continuous.
Valid only for INTEGER variables; omitted otherwise.
Detected from input data.
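For example, to force an INTEGER column to be treated as categorical (the column name ATT4 is illustrative):

```r
# Treat the integer column ATT4 as categorical rather than continuous
model <- hanaml.HGBTClassifier(conn.context = conn, data = df,
                               label = "LABEL",
                               categorical.variable = c("ATT4"))
```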

Format

An object of class R6ClassGenerator of length 24.

Value

An "HGBTClassifier" object with the following attributes:
model DataFrame

feature.importances DataFrame

confusion.matrix DataFrame

stats DataFrame

cv DataFrame

See Also

predict.HGBTClassifier

Examples

## Not run: 
Input DataFrame for training:

> df <- conn.context$table("PAL_TRAIN_HGBT_DATA_TBL")
> df$Collect()
    ATT1  ATT2   ATT3  ATT4 LABEL
0   1.0  10.0  100.0   1.0     A
1   1.1  10.1  100.0   1.0     A
2   1.2  10.2  100.0   1.0     A
3   1.3  10.4  100.0   1.0     A
4   1.2  10.3  100.0   1.0     A
5   4.0  40.0  400.0   4.0     B
6   4.1  40.1  400.0   4.0     B

Creating an instance of the Hybrid Gradient Boosting classifier and performing the fit:

> ghc = hanaml.HGBTClassifier(conn.context = conn, data = df,
                              features = c('ATT1', 'ATT2', 'ATT3', 'ATT4'),
                              label = 'LABEL',
                              n.estimators = 4, split.threshold = 0,
                              learning.rate = 0.5, fold.num = 5, max.depth = 6,
                              evaluation.metric = 'error_rate', reference.metric = c('auc'),
                              parameter.range = list("learning.rate" = c(0.1, 1.0, 3),
                                                       "n.estimators" = c(4, 3, 10),
                                                       "split.threshold" = c(0.1, 0.3, 1.0)))

> ghc$stats$Collect()

       STAT_NAME     STAT_VALUE
0  ERROR_RATE_MEAN   0.133333
1   ERROR_RATE_VAR  0.0266666
2         AUC_MEAN        0.9

Input DataFrame for predict:

> df <- conn.context$table("PAL_TRAIN_HGBT_PREDICT_TBL")
> df$Collect()
    ID  ATT1  ATT2   ATT3  ATT4
0   1   1.0  10.0  100.0   1.0
1   2   1.1  10.1  100.0   1.0
2   3   1.2  10.2  100.0   1.0
3   4   1.3  10.4  100.0   1.0
4   5   1.2  10.3  100.0   3.0
5   6   4.0  40.0  400.0   3.0
6   7   4.1  40.1  400.0   3.0
7   8   4.2  40.2  400.0   3.0
8   9   4.3  40.4  400.0   3.0
9  10   4.2  40.3  400.0   3.0
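
Performing prediction with the fitted model (a sketch; see predict.HGBTClassifier for the exact argument names):

```r
> result <- predict(ghc, data = df, key = "ID")
> result$Collect()
```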


## End(Not run)

[Package hana.ml.r version 1.0.8 Index]