HybridGradientBoostingRegressor
- class hana_ml.algorithms.pal.trees.HybridGradientBoostingRegressor(n_estimators=None, random_state=None, subsample=None, max_depth=None, split_threshold=None, learning_rate=None, split_method=None, sketch_eps=None, fold_num=None, min_sample_weight_leaf=None, min_samples_leaf=None, max_w_in_split=None, col_subsample_split=None, col_subsample_tree=None, lamb=None, alpha=None, adopt_prior=None, evaluation_metric=None, cv_metric=None, ref_metric=None, calculate_importance=None, thread_ratio=None, resampling_method=None, param_search_strategy=None, repeat_times=None, timeout=None, progress_indicator_id=None, random_search_times=None, param_range=None, cross_validation_range=None, param_values=None, obj_func=None, tweedie_power=None, replace_missing=None, default_missing_direction=None, feature_grouping=None, tol_rate=None, compression=None, max_bits=None, max_bin_num=None, resource=None, max_resource=None, reduction_rate=None, min_resource_rate=None, aggressive_elimination=None, validation_set_rate=None, stratified_validation_set=None, tolerant_iter_num=None, fg_min_zero_rate=None, huber_slope=None, base_score=None, validation_set_metric=None)
Hybrid Gradient Boosting model for regression.
- Parameters:
- n_estimators : int, optional
Specifies the number of trees in Gradient Boosting.
Defaults to 10.
- split_method : {'exact', 'sketch', 'sampling', 'histogram'}, optional
The method used to find split points for numeric features.
'exact': the exact method, trying all possible points;
'sketch': the sketch method, accounting for the distribution of the sum of hessians;
'sampling': samples the split points randomly;
'histogram': builds a histogram on the data and uses it to find split points.
Defaults to 'exact'.
- random_state : int, optional
The seed for random number generation.
0: uses the current time as the seed;
Others: uses the specified value as the seed.
Defaults to 0.
- max_depth : int, optional
The maximum depth of a tree.
Defaults to 6.
- split_threshold : float, optional
Specifies the stopping condition: if the improvement value of the best split is less than this value, then the tree stops growing.
Defaults to 0.
- learning_rate : float, optional
Learning rate of each iteration, must be within the range (0, 1].
Defaults to 0.3.
- subsample : float, optional
The fraction of samples to be used for fitting each base learner.
Defaults to 1.0.
- fold_num : int, optional
The k value for k-fold cross-validation.
Mandatory and valid only when resampling_method is set to 'cv', 'cv_sha' or 'cv_hyperband'.
- sketch_eps : float, optional
Sets an upper limit on the sum of sample weights between two split points in the sketch method.
Roughly speaking, the smaller this value is, the more split points are tried.
Defaults to 0.1.
- min_sample_weight_leaf : float, optional
The minimum sum of sample weights in a leaf node.
Defaults to 1.0.
- min_samples_leaf : int, optional
The minimum number of data points in a leaf node.
Defaults to 1.
- max_w_in_split : float, optional
The maximum weight constraint assigned to each tree node.
Defaults to 0 (i.e. no constraint).
- col_subsample_split : float, optional
The fraction of features used for each split, should be within range (0, 1].
Defaults to 1.0.
- col_subsample_tree : float, optional
The fraction of features used for each tree growth, should be within range (0, 1].
Defaults to 1.0.
- lamb : float, optional
Weight of L2 regularization for the target loss function.
Should be within range (0, 1].
Defaults to 1.0.
- alpha : float, optional
Weight of L1 regularization for the target loss function.
Defaults to 1.0.
- adopt_prior : bool, optional
For regression problems, this parameter specifies whether or not to use the average value of the training data as the initial prediction score.
base_score is ignored if this parameter is set to True.
Defaults to False.
- evaluation_metric : {'rmse', 'mae'}, optional
The evaluation metric used for model evaluation or parameter selection.
Mandatory if resampling_method is set.
- cv_metric : {'rmse', 'mae'}, optional (deprecated)
Same as evaluation_metric.
Will be deprecated in a future release.
- ref_metric : str or a list of str, optional
Specifies a reference metric or a list of reference metrics.
Any reference metric must be a valid option of evaluation_metric.
No default value.
- thread_ratio : float, optional
Specifies the ratio of available threads to use, from 0 to 1. A value of 0 indicates a single thread, while 1 means all available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to -1.
- calculate_importance : bool, optional
Determines whether to calculate variable importance.
Defaults to True.
- resampling_method : {'cv', 'cv_sha', 'cv_hyperband', 'bootstrap', 'bootstrap_sha', 'bootstrap_hyperband'}, optional
Specifies the resampling method for model evaluation or parameter selection.
If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.
No default value.
Note
Resampling methods that end with 'sha' or 'hyperband' are used for parameter selection only, not for model evaluation.
- param_search_strategy : {'grid', 'random'}, optional
The search strategy for parameters.
Mandatory if resampling_method is specified and ends with 'sha'.
Defaults to 'random' and cannot be changed if resampling_method is specified and ends with 'hyperband'; otherwise it has no default value, and parameter selection cannot be carried out if not specified.
- repeat_times : int, optional
Specifies the repeat times for resampling.
Defaults to 1.
- random_search_times : int, optional
Specifies the number of times to randomly select candidate parameters in parameter selection.
Mandatory and valid only when param_search_strategy is 'random'.
No default value.
- timeout : int, optional
Specifies the maximum running time for model evaluation/parameter selection, in seconds.
Defaults to 0, which means no timeout.
- progress_indicator_id : str, optional
Sets an ID of the progress indicator for model evaluation or parameter selection.
No progress indicator will be active if no value is provided.
No default value.
- param_range : dict or ListOfTuples, optional
Specifies the range of parameters involved in parameter selection.
Valid only when resampling_method and param_search_strategy are both specified.
If the input is a list of tuples, then each tuple must be a pair, with the first element being a parameter name of str type, and the second being a list of numbers with the following structure:
[<begin-value>, <step-size>, <end-value>].
<step-size> can be omitted if param_search_strategy is 'random'.
Otherwise, if the input is a dict, then the key of each element must specify a parameter name, while the value of each element specifies the range of that parameter.
Supported parameters for range specification: n_estimators, max_depth, learning_rate, min_sample_weight_leaf, max_w_in_split, col_subsample_split, col_subsample_tree, lamb, alpha, base_score.
A simple example for illustration:
[('n_estimators', [4, 3, 10]), ('learning_rate', [0.1, 0.3, 1.0])],
or
{'n_estimators': [4, 3, 10], 'learning_rate': [0.1, 0.3, 1.0]}.
No default value.
- cross_validation_range : list of tuples, optional (deprecated)
Same as param_range.
Will be deprecated in a future release.
- param_values : dict or ListOfTuples, optional
Specifies the values of parameters involved in parameter selection.
Valid only when resampling_method and param_search_strategy are both specified.
If the input is a list of tuples, then each tuple must be a pair, with the first element being a parameter name of str type, and the second being a list of values for that parameter.
Otherwise, if the input is a dict, then the key of each element must specify a parameter name, while the value of each element specifies a list of values for that parameter.
Supported parameters for value specification are the same as those valid for range specification; see param_range.
A simple example for illustration:
[('n_estimators', [4, 7, 10]), ('learning_rate', [0.1, 0.4, 0.7, 1.0])],
or
{'n_estimators' : [4, 7, 10], 'learning_rate' : [0.1, 0.4, 0.7, 1.0]}.
No default value.
See the Examples section below for an illustration of random-search parameter selection.
- obj_func : str, optional
Specifies the objective function to optimize, with valid options listed as follows:
'se' : Squared error
'ae' : Absolute error (with an iteratively reweighted least-squares solver)
'sle' : Squared-log error
'huber' : Huber loss function
'pseudo-huber' : Pseudo-Huber loss function
'gamma' : Gamma objective function
'tweedie' : Tweedie objective function
Defaults to 'se'.
- tweedie_power : float, optional
Specifies the power for the Tweedie objective function, with valid range [1.0, 2.0].
Only valid when obj_func is 'tweedie'.
Defaults to 1.5. See the Examples section below for an illustration.
- replace_missing : bool, optional
Specifies whether or not to replace a missing value by another value in the feature.
If True, the replacement value is the mean value for a continuous feature, and the mode (i.e. the most frequent value) for a categorical feature.
Defaults to True.
- default_missing_direction : {'left', 'right'}, optional
Defines the default direction that missing values go to during tree splitting.
Defaults to 'right'.
- feature_grouping : bool, optional
Specifies whether or not to group sparse features that contain only one significant value in each row.
Defaults to False.
- tol_rate : float, optional
When feature grouping is enabled, features are still merged if some rows contain more than one significant value. This parameter specifies the rate of such rows allowed.
Valid only when feature_grouping is set to True.
Defaults to 0.0001.
- compression : bool, optional
Indicates whether or not the trained model should be compressed.
Defaults to False.
- max_bits : int, optional
Specifies the maximum number of bits to quantize continuous features, which is equivalent to using \(2^{max\_bits}\) bins.
Valid only when compression is set to True, and must be less than 31.
Defaults to 12.
- max_bin_num : int, optional
Specifies the maximum bin number for the histogram method.
Decreasing this number improves running-time performance at the cost of some accuracy degradation.
Only valid when split_method is set to 'histogram'.
Defaults to 256.
- resource : str, optional
Specifies the resource type used in the successive-halving (SHA) and hyperband algorithms for parameter selection.
Currently there are two valid options: 'n_estimators' and 'data_size'.
Mandatory and valid only when resampling_method is set to one of the following: 'cv_sha', 'bootstrap_sha', 'cv_hyperband', 'bootstrap_hyperband'.
Defaults to 'data_size'.
- max_resource : int, optional
Specifies the maximum number of estimators that should be used in the SHA or hyperband method.
Mandatory when resource is set to 'n_estimators', and invalid if resampling_method does not take one of the following values: 'cv_sha', 'bootstrap_sha', 'cv_hyperband', 'bootstrap_hyperband'.
- reduction_rate : float, optional
Specifies the reduction rate in the SHA or hyperband method.
For each round, the available parameter candidate size is divided by the value of this parameter, so a valid value must be greater than 1.0.
Valid only when resampling_method takes one of the following values: 'cv_sha', 'bootstrap_sha', 'cv_hyperband', 'bootstrap_hyperband'.
Defaults to 3.0. See the Examples section below for an SHA illustration.
- min_resource_rate : float, optional
Specifies the minimum resource rate that should be used in SHA or hyperband iterations.
Valid only when resampling_method takes one of the following values: 'cv_sha', 'cv_hyperband', 'bootstrap_sha', 'bootstrap_hyperband'.
Defaults to:
0.0 if resource is set to 'data_size' (the default value);
1/max_resource if resource is set to 'n_estimators'.
- aggressive_elimination : bool, optional
Specifies whether to apply aggressive elimination when using the SHA method.
Aggressive elimination happens when the data size and the size of the parameter search space do not match, and many parameter sets remain to be searched while the data size has already reached its upper limit. If applied, the lower bound of the data-size limit is used multiple times first to reduce the number of parameter sets.
Valid only when resampling_method is 'cv_sha' or 'bootstrap_sha'.
Defaults to False.
- validation_set_rate : float, optional
Specifies the sampling rate of the validation set for model evaluation in early stopping.
Valid range is [0, 1).
A positive value must be specified to activate early stopping (see the Examples section below).
Defaults to 0.
- stratified_validation_set : bool, optional
Specifies whether or not to apply stratified sampling when getting the validation set for early stopping.
Valid only when validation_set_rate is specified with a positive value.
Defaults to False.
- tolerant_iter_num : int, optional
Specifies the number of successive deteriorating iterations allowed before early stopping.
Valid only when validation_set_rate is specified with a positive value.
Defaults to 10.
- fg_min_zero_rate : float, optional
Specifies the minimum zero rate that is used to indicate sparse columns for feature grouping.
Valid only when feature_grouping is True.
Defaults to 0.5.
- huber_slope : float, optional
Specifies the slope parameter in the Huber loss function or pseudo-Huber loss function.
The value must be greater than 0.
Valid only when obj_func is set to 'huber' or 'pseudo-huber'.
Defaults to 1.0.
- base_score : float, optional
Specifies the initial prediction score of the training data.
Ignored if adopt_prior is set to True.
The default value depends on the choice of obj_func specified.
- validation_set_metric : str, optional
Specifies the metric used to evaluate the validation dataset for early stopping. Valid options are listed as follows:
'rmse'
'mae'
If not specified, the value of obj_func will be used.
Examples
>>> hgbr = HybridGradientBoostingRegressor(
...     n_estimators=20, split_threshold=0.75,
...     split_method='exact', learning_rate=0.75,
...     fold_num=5, max_depth=6,
...     resampling_method='cv',
...     param_search_strategy='grid',
...     evaluation_metric='rmse', ref_metric=['mae'],
...     param_range=[('learning_rate', [0.01, 0.25, 1.0]),
...                  ('n_estimators', [10, 1, 20]),
...                  ('split_threshold', [0.01, 0.25, 1.0])])
Perform fit():
>>> hgbr.fit(data=df_train, features=['F1','F2'], label='TARGET')
Perform predict():
>>> res = hgbr.predict(data=df_predict, key='ID', verbose=False)
>>> res.collect()
Perform score():
>>> res = hgbr.score(data=df_score)
>>> res.collect()
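The following additional snippets are illustrative sketches rather than verified outputs; df_train and the column names 'F1', 'F2' and 'TARGET' are hypothetical placeholders for an existing hana_ml DataFrame and its columns.
Train with a Tweedie objective:
>>> hgbr_tw = HybridGradientBoostingRegressor(obj_func='tweedie',
...                                           tweedie_power=1.5,
...                                           n_estimators=50)
>>> hgbr_tw.fit(data=df_train, features=['F1', 'F2'], label='TARGET')
Random-search parameter selection via param_values:
>>> hgbr_rs = HybridGradientBoostingRegressor(
...     resampling_method='cv', fold_num=5,      # fold_num is mandatory for 'cv'
...     evaluation_metric='rmse',                # mandatory once resampling_method is set
...     param_search_strategy='random',
...     random_search_times=8,                   # mandatory for 'random' search
...     param_values={'n_estimators': [10, 20, 30],
...                   'learning_rate': [0.1, 0.3, 0.5]})
>>> hgbr_rs.fit(data=df_train, features=['F1', 'F2'], label='TARGET')
>>> hgbr_rs.selected_param_.collect()
Successive-halving (SHA) parameter selection, using 'data_size' as the resource:
>>> hgbr_sha = HybridGradientBoostingRegressor(
...     resampling_method='cv_sha', fold_num=5,
...     evaluation_metric='rmse', param_search_strategy='grid',
...     resource='data_size', reduction_rate=3.0,
...     param_values={'max_depth': [4, 6, 8]})
>>> hgbr_sha.fit(data=df_train, features=['F1', 'F2'], label='TARGET')
Early stopping on a 20% validation split:
>>> hgbr_es = HybridGradientBoostingRegressor(
...     n_estimators=200,
...     validation_set_rate=0.2,       # positive value activates early stopping
...     tolerant_iter_num=10,          # stop after 10 deteriorating iterations
...     validation_set_metric='rmse')
>>> hgbr_es.fit(data=df_train, features=['F1', 'F2'], label='TARGET')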
- Attributes:
- model_ : DataFrame
Model content.
- feature_importances_ : DataFrame
The feature importance values (the higher, the more important the feature).
- stats_ : DataFrame
Statistics.
- selected_param_ : DataFrame
Best parameters obtained from parameter selection.
Methods
create_model_state([model, function, ...]) : Create a PAL model state.
delete_model_state([state]) : Delete a PAL model state.
fit(data[, key, features, label, ...]) : Fit the model to the training dataset.
get_model_metrics() : Get the model metrics.
get_score_metrics() : Get the score metrics.
predict(data[, key, features, verbose, ...]) : Predict dependent variable values based on a fitted model.
score(data[, key, features, label, ...]) : Return the regression score based on the specified score type.
set_model_state(state) : Set the model state by state information.
- create_model_state(model=None, function=None, pal_funcname='PAL_HGBT', state_description=None, force=False)
Create PAL model state.
- Parameters:
- model : DataFrame, optional
Specifies the model for the AFL state.
Defaults to self.model_.
- function : str, optional
Specifies the function in the unified API.
A placeholder parameter, not effective for HGBT.
- pal_funcname : int or str, optional
PAL function name.
Defaults to 'PAL_HGBT'.
- state_description : str, optional
Description of the state as a model container.
Defaults to None.
- force : bool, optional
If True, the existing state will be deleted.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- state : DataFrame, optional
Specifies the state.
Defaults to self.state.
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state : DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values according to NAME.
If state is a dict, the keys must include STATE_ID, HINT, HOST and PORT.
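A minimal sketch of the model-state workflow, assuming hgbr is an already-fitted instance (df_predict and the key column 'ID' are hypothetical):
>>> hgbr.create_model_state()    # materialize the fitted model as a PAL_HGBT state
>>> res = hgbr.predict(data=df_predict, key='ID')    # later calls can reuse the state
>>> hgbr.delete_model_state()    # release the state when it is no longer needed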
- fit(data, key=None, features=None, label=None, categorical_variable=None, warm_start=None)
Fit the model to the training dataset.
- Parameters:
- dataDataFrame
Training data.
- key : str, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- warm_start : bool, optional
When set to True, reuses the model_ of the current object to fit, adding more trees to the existing model; otherwise a new model is fitted from scratch (see the example below).
Defaults to False.
- Returns:
- A fitted object of class "HybridGradientBoostingRegressor".
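A short sketch of warm-start fitting, assuming an existing hana_ml DataFrame df_train with hypothetical columns 'F1', 'F2' and 'TARGET':
>>> hgbr = HybridGradientBoostingRegressor(n_estimators=10)
>>> hgbr.fit(data=df_train, features=['F1', 'F2'], label='TARGET')
>>> # the second call reuses model_ and adds trees instead of retraining from scratch
>>> hgbr.fit(data=df_train, features=['F1', 'F2'], label='TARGET',
...          warm_start=True)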
- predict(data, key=None, features=None, verbose=None, thread_ratio=None, missing_replacement=None)
Predict dependent variable values based on a fitted model.
- Parameters:
- dataDataFrame
Independent variable values to predict for.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : a list of str, optional
Names of the feature columns. If not provided, it defaults to all non-ID columns.
- missing_replacement : str, optional
The missing-value replacement strategy:
'feature_marginalized': marginalise each missing feature out independently;
'instance_marginalized': marginalise all missing features in an instance as a whole, corresponding to each category.
Defaults to 'feature_marginalized'.
- verbose : bool, optional (deprecated)
If True, outputs all classes and the corresponding confidences for each data point.
Not valid for regression problems and will be removed in a future release.
- Returns:
- DataFrame
DataFrame of scores and confidences, structured as follows:
ID column, with the same name and type as the ID column of data.
SCORE, type DOUBLE, representing the predicted values.
CONFIDENCE, type DOUBLE, all None for regression prediction.
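For instance, a sketch of prediction with instance-wise marginalization of missing values (df_predict and the key column 'ID' are hypothetical):
>>> res = hgbr.predict(data=df_predict, key='ID',
...                    missing_replacement='instance_marginalized')
>>> res.collect()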
- score(data, key=None, features=None, label=None, missing_replacement=None, score_type=None, tweedie_power=None)
Returns the regression score based on specified score type.
- Parameters:
- dataDataFrame
Data on which to assess model performance.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column.
- missing_replacement : str, optional
The missing-value replacement strategy:
'feature_marginalized': marginalise each missing feature out independently;
'instance_marginalized': marginalise all missing features in an instance as a whole, corresponding to each category.
Defaults to 'feature_marginalized'.
- score_type : {'r2', 'r2-log', 'ae', 'gamma', 'tweedie'}, optional
Specifies the type of regression score to be computed.
'r2' : R2 score
'r2-log' : R2 log score
'ae' : absolute error score
'gamma' : gamma score
'tweedie' : Tweedie score
The default value depends on the value of obj_func specified when training the model:
obj_func = 'se' : defaults to 'r2';
obj_func = 'sle' : defaults to 'r2-log';
obj_func = 'huber' or 'pseudo-huber' : defaults to 'ae';
obj_func = 'gamma' : defaults to 'gamma';
obj_func = 'tweedie' : defaults to 'tweedie'.
- tweedie_power : float, optional
Specifies the power parameter for Tweedie regression, with valid range (1.0, 2.0).
Valid only when score_type is 'tweedie'.
Defaults to 1.5 if self.tweedie_power is None; otherwise defaults to self.tweedie_power.
- Returns:
- float
The regression score calculated based on the given data.
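For instance, a sketch of computing an R2 score on hold-out data (df_score and the key column 'ID' are hypothetical):
>>> r2 = hgbr.score(data=df_score, key='ID', score_type='r2')
>>> print(r2)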
Inherited Methods from PALBase
Besides those methods mentioned above, the HybridGradientBoostingRegressor class also inherits methods from PALBase class, please refer to PAL Base for more details.