hanaml.HGBTRegressor.Rd
hanaml.HGBTRegressor is an R wrapper for SAP HANA PAL HGBT.
hanaml.HGBTRegressor(
data = NULL,
key = NULL,
features = NULL,
label = NULL,
formula = NULL,
n.estimators = NULL,
random.state = NULL,
subsample = NULL,
max.depth = NULL,
split.threshold = NULL,
learning.rate = NULL,
split.method = NULL,
sketch.eps = NULL,
fold.num = NULL,
min.sample.weight.leaf = NULL,
min.samples.leaf = NULL,
max.w.in.split = NULL,
col.subsample.split = NULL,
col.subsample.tree = NULL,
lambda = NULL,
alpha = NULL,
adopt.prior = NULL,
evaluation.metric = NULL,
reference.metric = NULL,
parameter.range = NULL,
parameter.values = NULL,
resampling.method = NULL,
repeat.times = NULL,
param.search.strategy = NULL,
random.search.times = NULL,
timeout = NULL,
progress.indicator.id = NULL,
calculate.importance = NULL,
base.score = NULL,
thread.ratio = NULL,
categorical.variable = NULL,
obj.func = NULL,
tweedie.power = NULL,
replace.missing = NULL,
default.missing.direction = NULL,
feature.grouping = NULL,
tol.rate = NULL,
compression = NULL,
max.bits = NULL,
model = NULL,
warm.start = NULL,
max.bin.num = NULL,
resource = NULL,
max.resource = NULL,
min.resource.rate = NULL,
reduction.rate = NULL,
aggressive.elimination = NULL,
validation.set.rate = NULL,
tolerant.iter.num = NULL
)
DataFrame
DataFrame containing the data.
character, optional
Name of the ID column.
If not provided, the data is assumed to have no ID column.
No default value.
character or list of characters, optional
Name of feature columns.
If not provided, it defaults to all non-key, non-label columns of data.
character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.
formula type, optional
Formula to be used for model generation.
format = label~<feature_list>
e.g.: formula=CATEGORY~V1+V2+V3
You can either give the formula,
or a feature and label combination, but do not provide both.
Defaults to NULL.
integer, optional
Total iteration number, which is equivalent to the number of trees in the final model.
Defaults to 10.
integer, optional
The seed for random number generation.
0: use the current time as the seed.
Other values: use the specified value as the seed.
Defaults to 0.
double, optional
The sampling rate of rows (data points).
Defaults to 1.0.
integer, optional
The maximum depth of a tree.
Defaults to 6.
double, optional
The minimum loss change value to make a split in tree growth (gamma in the equation).
Defaults to 0.
double, optional
Learning rate of each iteration, must be within the range (0, 1).
Defaults to 0.3.
c("exact", "sketch", "sampling", "histogram"), optional
The method to finding split point for integeral features.
"exact": trying all possible points.
"sketch": accounting for the distribution of the sum of hessian.
"sampling": samples the split point randomly.
"histogram": builds histogram upon data and uses it as split point.
The exact method comparably has the highest test accuracy, but costs more time.
On the other hand, the other three methods have relative higher computational efficiency but might
lead to lower test accuracy, and are considered to be adopted as the training data set is huge.
Valid only for numerical features.
Defaults to "exact".
double, optional
The epsilon of the sketch method.
It indicates that the sum of hessian between two split points is not larger than this value.
That is, the number of bins is approximately 1/eps.
The smaller this value is, the more split points are tried.
Defaults to 0.1.
integer, optional
Specifies the fold number for the cross-validation method.
Mandatory and valid only when resampling.method
is set to "cv", "cv_sha" or
"cv_hyperband".
No default value.
double, optional
The minimum summation of sample weights (hessian) in leaf node.
Defaults to 1.0.
integer, optional
The minimum number of data in a leaf node.
Defaults to 1.
double, optional
The maximum weight constraint assigned to each tree node.
Defaults to 0 (i.e. no constraint).
double, optional
The fraction of features used for each split,
should be within range (0, 1].
Defaults to 1.0.
double, optional
The fraction of features used for each tree
growth, should be within range (0, 1]
Defaults to 1.0.
double, optional
Weight of L2 regularization for the target loss function.
Should be within range [0, 1].
Defaults to 1.0.
double, optional
Weight of L1 regularization for the target loss function.
Defaults to 1.0.
logical, optional
Indicates whether to adopt the prior distribution as the initial point.
To be specific, for a regression problem the average value is used.
base.score is ignored if this parameter is set to TRUE.
Defaults to FALSE.
character, optional
Specify evaluation metric for model evaluation or parameter selection.
Valid values include: "rmse", "mae".
It is mandatory if resampling.method
is set.
No default value.
character or list of characters, optional
A list of reference metrics.
Any element of the list must be a valid option of evaluation.metric.
No default value.
list, optional
Indicates the range of parameters for selection.
Each element is a vector of numbers with the following structure:
c(<begin-value>, <step-size>, <end-value>).
All elements must be named, with names being the following valid parameters for model selection:
n.estimators, max.depth, learning.rate,
min.sample.weight.leaf, max.w.in.split, col.subsample.split, col.subsample.tree,
lambda, alpha, scale.pos.w, base.score.
Simple example for illustration - a list of two vectors:
list(n.estimators = c(4, 2, 10),
learning.rate = c(0.1, 0.3, 1))
Valid only when parameter selection is activated.
list, optional
Indicates the values of parameters selection.
Each element must be a vector of valid parameter values.
All elements must be named, with names being the following valid parameters for model selection:
n.estimators, max.depth, learning.rate,
min.sample.weight.leaf, max.w.in.split, col.subsample.split, col.subsample.tree,
lambda, alpha, scale.pos.w, base.score.
Simple example for illustration - a list of two vectors
list(n.estimators = c(4, 5, 6),
learning.rate = c(2.0, 2.5, 3))
Valid only when parameter selection is activated.
character, optional
Specify resampling method for model evaluation or parameter selection.
Valid options include: "cv", "cv_sha", "cv_hyperband", "bootstrap",
"bootstrap_sha", "bootstrap_hyperband".
Resampling methods that end with "sha" (short for successive halving) or "hyperband" are
for parameter selection only, not for model evaluation.
If no value is specified for this parameter,
then no model evaluation nor parameter selection will be activated.
No default value.
integer, optional
Specify repeat times for resampling.
Defaults to 1.
character, optional
Specifies the search strategy for parameter selection; setting a value activates parameter selection.
Valid options include:
"grid"
"random"
If this parameter is not set, then only model evaluation is activated.
No default value.
integer, optional
Specifies the number of times to randomly select candidate parameters for selection.
Mandatory and valid when param.search.strategy
is set to 'random'.
No default value.
integer, optional
Specify maximum running time for model evaluation or parameter selection,
in seconds. No timeout when 0 is specified.
Defaults to 0.
character, optional
Set an ID of progress indicator for model evaluation or parameter selection.
No progress indicator will be active if no value is provided.
No default value.
logical, optional
Determines whether to calculate variable importance.
Defaults to TRUE.
double, optional
Initial prediction score for all instances.
Global bias for a sufficient number
of iterations (changing this value will not have much effect).
Defaults to 0.5 for binary classification; 0 otherwise.
double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads. Values between 0 and 1 will use up to
that percentage of available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads
used is then heuristically determined.
Defaults to -1.
character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the column type:
"VARCHAR" and "NVARCHAR": categorical;
"INTEGER" and "DOUBLE": continuous.
Valid only for variables of "INTEGER" type; omitted otherwise.
No default value.
character, optional
Specifies the objective function to optimize, valid options include:
"se" : Squared error loss function
"sle" : Squared log error loss function
"pseduo.huber" : Pseudo Huber error loss function
"gamma" : Gamma objective function
"tweedie" : Tweedie objective function
Defaults to "se".
numeric, optional
Specifies the power value for tweedie objective function, with valid range [1.0, 2.0].
Valid only when obj.func
is "tweedie".
Defaults to 1.5.
logical, optional
Specifies whether or not to replace missing values by another value in a feature.
If set as TRUE, then the replacement value is the mean value for a continuous feature, and
the mode (i.e. most frequent) value for a categorical feature.
Defaults to TRUE.
c("left", "right"), optional
Specifies the default direction where missing value will go to while tree splitting.
Defaults to "right".
logical, optional
Indicates whether or not to group sparse features that contain at most one significant value in each row.
Defaults to FALSE.
numeric, optional
While applying feature grouping, features with rows containing more
than one significant value are still merged, provided the rate of such rows does not exceed the value specified
in tol.rate
.
Defaults to 0.0001.
logical, optional
Indicates whether or not the trained model should be compressed.
Defaults to FALSE.
integer, optional
Specifies the maximum number of bits to quantize continuous features.
Equivalent to using 2^max.bits bins.
The value must be less than 31.
Defaults to 12.
DataFrame, optional
The model used for warm start.
Defaults to NULL.
logical, optional
When set to TRUE, use the provided model and train additional trees onto the existing model with new input data.
If no model is provided, an error will be raised.
Defaults to FALSE.
integer, optional
Specifies the maximum bin number for histogram method.
Decreasing this number gains better performance in terms of running time at a cost of accuracy.
Valid only when split.method
is "histogram".
Defaults to 256.
c("data.size", "n.estimators"), optional
Specifies the resource type for successive-halving(SHA) or hyperband method:
"data.size": size of the input data as resource type.
"n.estimators": number of trees in the final estimator as resource type.
Valid only when resampling.method
is specified and ends with either "sha"
or "hyperband"(e.g. "cv_sha", "bootstrap_hyperband").
Defaults to "data.size".
integer, optional
Specifies the maximum number of trees allowed in use for SHA or hyperband method.
Mandatory when resource
is set as "n.estimators", and valid when resampling.method
is specified and ends with either "sha" or "hyperband"(e.g. "cv_sha",
"bootstrap_hyperband").
No default value.
numeric, optional
Specifies the minimum resource rate that should be used in SHA or hyperband iteration.
Valid only when resampling.method
is specified with a valid option that ends with "sha"
or "hyperband"(e.g. "cv_sha", "bootstrap_hyperband").
Defaults to 0.0 if resource
is specified as "data.size"(or not specified), and defaults
to 1/max.resource
when resource
is specified as "n.estimators".
numeric, optional
Specifies reduction rate in SHA or Hyperband method.
Valid when resampling.method
is specified and ends with either "sha" or "hyperband"(e.g. "cv_sha",
"bootstrap_hyperband").
Defaults to 3.0.
logical, optional
Specifies whether to apply aggressive elimination while using SHA method.
FALSE: do not apply aggressive elimination.
TRUE: apply aggressive elimination.
Valid only when resampling.method
is specified and ends with "sha".
Defaults to FALSE.
Note: Aggressive elimination happens when the data size and the number of parameter candidates
to be searched do not match, and there are still many parameter candidates to be searched
while the data size reaches its upper limit. If aggressive elimination is applied,
the lower bound of the data size limit will be used multiple times first
to reduce the number of parameter candidates.
numeric, optional
Specifies the rate of the validation set sampled from the data set.
If 0.0 is set, then no early stop will be applied.
Defaults to 0.0.
integer, optional
Specifies how many consecutive deteriorated iterations should be observed before applying early stop.
Valid only when validation.set.rate
is greater than 0.0.
Defaults to 10.
An R6 object of class "HGBTRegressor" with the following attributes and methods:
Attributes
model: DataFrame
ROW_INDEX
- model row index
TREE_INDEX
- tree index (-1 indicates the global information)
MODEL_CONTENT
- model content
feature.importances: DataFrame
VARIABLE_NAME
- Independent variable name
IMPORTANCE
- Variable importance
stats: DataFrame
STAT_NAME
- Statistics name
STAT_VALUE
- Statistics value
cv: DataFrame
PARM_NAME
- parameter name
INT_VALUE
- integer value
DOUBLE_VALUE
- double value
STRING_VALUE
- character value
Methods
CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)
Usage:
> hgr <- hanaml.HGBTRegressor(data=df)
> hgr$CreateModelState()
Arguments:
model: DataFrame
DataFrame containing the model for parsing.
Defaults to self$model
.
algorithm: character
Specifies the PAL algorithm associated with model
.
Defaults to self$pal.algorithm
.
func: character
Specifies the functionality for Unified Classification/Regression.
Valid only for object instance of R6Class "UnifiedClassification" or "UnifiedRegression".
Defaults to self$func
.
state.description: character
A summary string for the generated model state.
Defaults to "ModelState".
force: logical
Specifies whether or not to replace the existing state for model
.
Defaults to FALSE.
After calling this method, an attribute state
that contains the parsed info for model
shall be assigned
to the corresponding R6 object.
DeleteModelState(state=NULL)
Usage:
Assuming we have trained a hanaml
model and created its model state, like the following:
> hgr <- hanaml.HGBTRegressor(data=df)
> hgr$CreateModelState()
After using the model state for real-time scoring, we can delete the state by calling:
> hgr$DeleteModelState()
Arguments:
state: DataFrame
DataFrame containing the state info.
Defaults to self$state
.
After calling this method, the specified model state shall be cleaned up and associated memory be released.
Input DataFrame data for training:
> data$Collect()
ATT1 ATT2 ATT3 ATT4 TARGET
1 19.76 6235.0 100.00 100.00 25.10
2 17.85 46230.0 43.67 84.53 19.23
3 19.96 7360.0 65.51 81.57 21.42
4 16.80 28715.0 45.16 93.33 18.11
5 18.20 21934.0 49.20 83.07 19.24
6 16.71 1337.0 74.84 94.99 19.31
7 18.81 17881.0 70.66 92.34 20.07
Call the function:
> hgr <- hanaml.HGBTRegressor(data,
features = c("ATT1","ATT2","ATT3", "ATT4"),
label = "TARGET",
n.estimators = 20, split.threshold = 0.75,
split.method = "exact", learning.rate = 0.75,
fold.num = 5, max.depth = 6,
evaluation.metric = "rmse", reference.metric = c("mae"),
parameter.range = list("learning.rate" = c(0.25, 1.0, 4),
"n.estimators" = c(10, 1, 20),
"split.threshold" = c(0.0, 0.2, 1.0)))
Output:
> hgr$feature.importances$Collect()
VARIABLE_NAME IMPORTANCE
1 ATT1 0.744019
2 ATT2 0.164429
3 ATT3 0.078935
4 ATT4 0.012617
If you want to use warm start, you can provide a previously trained model such as hgr$model:
> hgr2 <- hanaml.HGBTRegressor(data = df.reg,
key = "ID",
n.estimators = 6,
model = hgr$model,
warm.start = TRUE)
It is common that a data set contains sparse features, which means a large part of their data is insignificant (zero or nearly zero).
A set of features can also be sparse, which means at most one of them contains significant data in each data row.
This usually happens with features that measure similar things.
For example, 3 features A, B and C may appear as follows:
A    | B     | C
1.1  | 0.0   | 0.0
0.0  | 0.0   | 2.5
0.0  | 0.0   | 0.0
0.0  | -10.2 | 0.0
In the above case, features A, B and C can be grouped into one feature in which only one datum is registered for each row.
This can both reduce memory usage and accelerate the training process.
Finding the exact sets of features that satisfy the requirement of feature grouping is quite complicated.
HGBT employs a greedy algorithm that can find such sets approximately.
The grouping requirement can also be relaxed so that some violations are accepted.
Relevant Parameters: feature.grouping, tol.rate
Set feature.grouping
= TRUE to activate feature grouping.
Use tol.rate
to specify the maximum ratio of rows that may violate the requirement for feature grouping,
as illustrated in the sketch below.
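The following is a minimal sketch of activating feature grouping; the DataFrame df and its label column "TARGET" are assumed here for illustration only:
> hgr.fg <- hanaml.HGBTRegressor(data = df,
                                 label = "TARGET",
                                 n.estimators = 20,
                                 feature.grouping = TRUE,  # group sparse features
                                 tol.rate = 0.001)         # tolerate up to 0.1% of rows violating the grouping requirement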
One optimization HGBT uses while splitting nodes is a histogram-based method that accelerates the training process.
It is an approximate algorithm that reduces both the time cost and the memory cost.
To be specific, when HGBT tries to split a node in a tree,
it first builds the histogram of that node by putting feature values into bins,
then evaluates splitting points based on these bins.
Because the number of bins is usually much smaller than the number of data points in the node,
this method can accelerate the splitting process considerably. Though building the histogram
still requires visiting all data in the node, it is a faster process because it only involves scanning and accumulating values.
Another optimization in building the histogram is that the histogram of one node can always be built by subtracting the histogram
of its sibling from the histogram of its parent.
So, we can always choose to build the histogram of the node that contains less data and obtain the histogram of its sibling by subtraction,
which costs less time.
Relevant Parameters: split.method, max.bin.num
Set split.method
= "histogram" to use histogram splitting, as shown in the sketch below.
As mentioned before, histogram splitting is an approximate algorithm that does not evaluate all potential splitting points,
so setting the number of bins used when building the histogram becomes important.
Parameter max.bin.num
is exactly for this purpose. The bigger this parameter is set,
the more potential splitting points are evaluated, and the more time is needed.
The default value of max.bin.num
is 256. It is suggested to use this default value first,
then adjust it according to the fitting result of the model.
For categorical features, though histogram splitting cannot be applied to them directly,
HGBT will combine sparse categories if the number of categories is more than max.bin.num
,
thereby reducing the number of categories.
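The following is a minimal sketch of configuring histogram splitting; the DataFrame df and its label column "TARGET" are assumed here for illustration only:
> hgr.hist <- hanaml.HGBTRegressor(data = df,
                                   label = "TARGET",
                                   n.estimators = 50,
                                   split.method = "histogram",  # histogram-based split finding
                                   max.bin.num = 128)           # fewer bins: faster training, possibly lower accuracy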
Early stop is a technique to stop model training before the model gets too complicated
and overfits the training data.
Basically, it continuously monitors the generalization performance of the model on an independent dataset
called the validation dataset. In HGBT, the validation dataset is obtained by sampling from
the input dataset (while the remaining part is used for training).
Relevant Parameters: validation.set.rate, stratified.validation.set
(for classification only),
and tolerant.iter.num
Parameter validation.set.rate
determines the sampling rate of the validation dataset
from the input data.
Parameter stratified.validation.set
determines whether or not to apply stratified sampling
method w.r.t. class label of the input data when sampling the validation dataset from the input data.
This parameter is applicable to classification only.
Parameter tolerant.iter.num
determines the number of successive deteriorating iterations
before early stopping.
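The following is a minimal sketch of configuring early stopping for regression; the DataFrame df and its label column "TARGET" are assumed here for illustration only:
> hgr.es <- hanaml.HGBTRegressor(data = df,
                                 label = "TARGET",
                                 n.estimators = 100,
                                 validation.set.rate = 0.2,  # hold out 20% of the input data for validation
                                 tolerant.iter.num = 5)      # stop after 5 consecutive deteriorating iterations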