hanaml.HGBTClassifier {hana.ml.r}    R Documentation

Hybrid Gradient Boosting Tree Classifier

Description

Hybrid Gradient Boosting model for classification.

Usage

hanaml.HGBTClassifier(conn.context,
                         data = NULL,
                         key = NULL,
                         features = NULL,
                         label = NULL,
                         formula = NULL,
                         n.estimators = NULL,
                         random.state = NULL,
                         subsample = NULL,
                         max.depth = NULL,
                         split.threshold = NULL,
                         learning.rate = NULL,
                         split.method = NULL,
                         sketch.eps = NULL,
                         fold.num = NULL,
                         min.sample.weight.leaf = NULL,
                         min.samples.leaf = NULL,
                         max.w.in.split = NULL,
                         col.subsample.split = NULL,
                         col.subsample.tree = NULL,
                         lambda = NULL,
                         alpha = NULL,
                         evaluation.metric = NULL,
                         reference.metric = NULL,
                         parameter.range = NULL,
                         parameter.values = NULL,
                         resampling.method = NULL,
                         repeat.times = NULL,
                         param.search.strategy = NULL,
                         random.search.times = NULL,
                         timeout = NULL,
                         progress.indicator.id = NULL,
                         calculate.importance = NULL,
                         calculate.cm = NULL,
                         base.score = NULL,
                         thread.ratio = NULL,
                         categorical.variable = NULL)

Arguments

conn.context

ConnectionContext
Connection to the SAP HANA system.

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column.
If not provided, it is assumed that the input has no ID column.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-ID, non-label columns.

label

character, optional
Name of the dependent variable.
Defaults to the last column.

formula

formula type, optional
Formula to be used for model generation, in the format label~<feature_list>, e.g. formula = LABEL~V1+V2+V3. Provide either the formula, or a features/label combination, but not both.
Defaults to NULL, i.e. no formula shall be provided.
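As a hypothetical sketch (the column names V1, V2, V3 and LABEL are illustrative), the two equivalent specifications look like this:

```r
# Specify the target and features via a formula ...
hgc1 <- hanaml.HGBTClassifier(conn.context = conn, data = df,
                              formula = LABEL~V1+V2+V3)
# ... or, equivalently, via features and label (never both at once):
hgc2 <- hanaml.HGBTClassifier(conn.context = conn, data = df,
                              features = c("V1", "V2", "V3"),
                              label = "LABEL")
```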

n.estimators

integer, optional
Total iteration number, which is equivalent to the number of trees in the final model.
Defaults to 10.

random.state

integer, optional
The seed for random number generation.
0 - uses the current time as the seed.
Others - uses the given value as the seed.
Defaults to 0.

subsample

double, optional
The sampling rate of rows (data points).
Defaults to 1.0.

max.depth

integer, optional
The maximum depth of a tree.
Defaults to 6.

split.threshold

double, optional
The minimum loss change required to make a split in tree growth (gamma in the equation).
Defaults to 0.

learning.rate

double, optional
Learning rate of each iteration, must be within the range (0, 1).
Defaults to 0.3.

split.method

('exact', 'sketch', 'sampling'), optional
The method for finding the split point of integer features.
- 'exact': tries all possible split points.
- 'sketch': accounts for the distribution of the sum of hessians.
- 'sampling': samples the split point randomly.
The exact method generally gives the highest test accuracy but costs the most time; the other two methods are computationally more efficient, may lead to lower test accuracy, and are intended for huge training data sets.
Valid only for integer features.
Defaults to 'exact'.

sketch.eps

double, optional
The epsilon of the sketch method. The sum of hessians between two adjacent split points is not larger than this value, so the number of bins is approximately 1/eps.
The smaller this value, the more split points are tried.
Defaults to 0.1.

fold.num

integer, optional
Specifies the fold number for the cross-validation method.
Mandatory and valid only when resampling.method is set to 'cv' or 'stratified_cv'.
No default value.

min.sample.weight.leaf

double, optional
The minimum summation of sample weights (hessian) in leaf node.
Defaults to 1.0.

min.samples.leaf

integer, optional
The minimum number of data in a leaf node.
Defaults to 1.

max.w.in.split

double, optional
The maximum weight constraint assigned to each tree node.
Defaults to 0 (i.e. no constraint).

col.subsample.split

double, optional
The fraction of features used for each split, should be within range (0, 1].
Defaults to 1.0.

col.subsample.tree

double, optional
The fraction of features used for each tree growth, should be within range (0, 1]
Defaults to 1.0.

lambda

double, optional
Weight of L2 regularization for the target loss function. Should be within range [0, 1].
Defaults to 1.0.

alpha

double, optional
Weight of L1 regularization for the target loss function.
Defaults to 1.0.

evaluation.metric

character, optional
Specify evaluation metric for model evaluation or parameter selection.
Classification: 'rmse', 'mae', 'nll', 'error_rate', 'auc'.
Regression: 'rmse', 'mae'.
It is mandatory if resampling.method is set.
No default value.

reference.metric

character or list of characters, optional
A list of reference metrics.
Any element of the list must be a valid option of evaluation.metric.
No default value.

parameter.range

list, optional
Indicates the range of parameters for selection.
Each element is a numeric vector with the structure c(<begin-value>, <step-size>, <end-value>). All elements must be named, with names being the following valid parameters for model selection:
n.estimators, max.depth, learning.rate, min.sample.weight.leaf, max.w.in.split, col.subsample.split, col.subsample.tree, lambda, alpha, scale.pos.w, base.score. A simple example with two entries:
list(n.estimators = c(4, 2, 10), learning.rate = c(0.1, 0.3, 1))
Valid only when parameter selection is activated.

parameter.values

list, optional
Indicates the values of parameters for selection.
Each element must be a vector of valid parameter values.
All elements must be named, with names being the following valid parameters for model selection:
n.estimators, max.depth, learning.rate, min.sample.weight.leaf, max.w.in.split, col.subsample.split, col.subsample.tree, lambda, alpha, scale.pos.w, base.score. A simple example with two entries:
list(n.estimators = c(4, 5, 6), learning.rate = c(2.0, 2.5, 3))
Valid only when parameter selection is activated.

resampling.method

character, optional
Specify resampling method for model evaluation or parameter selection.

  • 'cv'

  • 'stratified_cv'

  • 'bootstrap'

  • 'stratified_bootstrap'

If no value is specified for this parameter, then no model evaluation or parameter selection will be activated.
No default value.

repeat.times

integer, optional
Specify repeat times for resampling.
Defaults to 1.

param.search.strategy

character, optional
Specify a value for this parameter to activate parameter selection.

  • 'grid'

  • 'random'

If this parameter is not set, then only model evaluation is activated.
No default value.
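As a sketch (all argument values are illustrative), grid-based parameter selection could be activated together with cross-validation like this:

```r
# Grid search over candidate values, evaluated via 5-fold cross-validation
model <- hanaml.HGBTClassifier(conn.context = conn, data = df,
                               label = "LABEL",
                               resampling.method = "cv", fold.num = 5,
                               evaluation.metric = "error_rate",
                               param.search.strategy = "grid",
                               parameter.values = list(n.estimators = c(4, 6, 8),
                                                       learning.rate = c(0.1, 0.3, 0.5)))
```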

random.search.times

integer, optional
Specifies the number of times to randomly select candidate parameters for selection.
Mandatory and valid only when param.search.strategy is set to 'random'.
No default value.

timeout

integer, optional
Specify maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified.
Defaults to 0.

progress.indicator.id

character, optional
Set an ID of progress indicator for model evaluation or parameter selection. No progress indicator will be active if no value is provided.
No default value.

calculate.importance

logical, optional
Determines whether to calculate variable importance.
Defaults to TRUE.

calculate.cm

logical, optional
Determines whether to calculate the confusion matrix.
Defaults to TRUE.

base.score

double, optional
Initial prediction score for all instances; acts as a global bias. With a sufficient number of iterations, changing this value has little effect.
Defaults to 0.5 for binary classification; 0 otherwise.

thread.ratio

double, optional
The ratio of available threads used for training.
0: single thread;
(0, 1): percentage of available threads;
others: heuristically determined.
Defaults to -1.

categorical.variable

character or list of characters, optional
Indicates which variables are treated as categorical.
The default behaviour is: string - categorical; integer and double - continuous.
Valid only for INTEGER variables; omitted otherwise.
Detected from input data.
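For example, to force an INTEGER column to be treated as categorical (the column name ATT4 is illustrative):

```r
# Treat the integer column ATT4 as categorical rather than continuous
model <- hanaml.HGBTClassifier(conn.context = conn, data = df,
                               label = "LABEL",
                               categorical.variable = c("ATT4"))
```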

Format

An object of class R6ClassGenerator of length 24.

Value

An "HGBTClassifier" object with the following attributes:
model DataFrame

feature.importances DataFrame

confusion.matrix DataFrame

stats DataFrame

cv DataFrame

See Also

predict.HGBTClassifier

Examples

## Not run: 
Input DataFrame for training:

> df <- conn.context$table("PAL_TRAIN_HGBT_DATA_TBL")
> df$Collect()
    ATT1  ATT2   ATT3  ATT4 LABEL
0   1.0  10.0  100.0   1.0     A
1   1.1  10.1  100.0   1.0     A
2   1.2  10.2  100.0   1.0     A
3   1.3  10.4  100.0   1.0     A
4   1.2  10.3  100.0   1.0     A
5   4.0  40.0  400.0   4.0     B
6   4.1  40.1  400.0   4.0     B

Creating an instance of the Hybrid Gradient Boosting classifier and performing the fit:

> ghc = hanaml.HGBTClassifier(conn.context = conn, data = df,
                              features = c('ATT1', 'ATT2', 'ATT3', 'ATT4'),
                              label = 'LABEL',
                              n.estimators = 4, split.threshold = 0,
                              learning.rate = 0.5, fold.num = 5, max.depth = 6,
                              evaluation.metric = 'error_rate', reference.metric = c('auc'),
                              parameter.range = list("learning.rate" = c(0.1, 1.0, 3),
                                                       "n.estimators" = c(4, 3, 10),
                                                       "split.threshold" = c(0.1, 0.3, 1.0)))

> ghc$stats$Collect()

       STAT_NAME     STAT_VALUE
0  ERROR_RATE_MEAN   0.133333
1   ERROR_RATE_VAR  0.0266666
2         AUC_MEAN        0.9

Input DataFrame for predict:

> df <- conn.context$table("PAL_TRAIN_HGBT_PREDICT_TBL")
> df$Collect()
    ID  ATT1  ATT2   ATT3  ATT4
0   1   1.0  10.0  100.0   1.0
1   2   1.1  10.1  100.0   1.0
2   3   1.2  10.2  100.0   1.0
3   4   1.3  10.4  100.0   1.0
4   5   1.2  10.3  100.0   3.0
5   6   4.0  40.0  400.0   3.0
6   7   4.1  40.1  400.0   3.0
7   8   4.2  40.2  400.0   3.0
8   9   4.3  40.4  400.0   3.0
9  10   4.2  40.3  400.0   3.0
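
Performing prediction with the fitted model (a sketch; see predict.HGBTClassifier for the exact argument names):

```r
> result <- predict(ghc, data = df, key = "ID")
> result$Collect()
```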


## End(Not run)

[Package hana.ml.r version 1.0.8 Index]