HybridGradientBoostingClassifier

class hana_ml.algorithms.pal.trees.HybridGradientBoostingClassifier(n_estimators=None, random_state=None, subsample=None, max_depth=None, split_threshold=None, learning_rate=None, split_method=None, sketch_eps=None, fold_num=None, min_sample_weight_leaf=None, min_samples_leaf=None, max_w_in_split=None, col_subsample_split=None, col_subsample_tree=None, lamb=None, alpha=None, base_score=None, adopt_prior=None, evaluation_metric=None, cv_metric=None, ref_metric=None, calculate_importance=None, calculate_cm=None, thread_ratio=None, resampling_method=None, param_search_strategy=None, repeat_times=None, timeout=None, progress_indicator_id=None, random_search_times=None, param_range=None, cross_validation_range=None, param_values=None, obj_func=None, replace_missing=None, default_missing_direction=None, feature_grouping=None, tol_rate=None, compression=None, max_bits=None, max_bin_num=None, resource=None, max_resource=None, reduction_rate=None, min_resource_rate=None, aggressive_elimination=None, validation_set_rate=None, stratified_validation_set=None, tolerant_iter_num=None, fg_min_zero_rate=None)

Hybrid Gradient Boosting trees model for classification.

Parameters
n_estimators : int, optional

Specifies the number of trees in Gradient Boosting.

Defaults to 10.

split_method : {'exact', 'sketch', 'sampling', 'histogram'}, optional

The method for finding the split point of numerical features.

  • 'exact': the exact method, trying all possible points

  • 'sketch': the sketch method, accounting for the distribution of the sum of hessian

  • 'sampling': samples the split point randomly

  • 'histogram': builds histogram upon data and uses it as split point

Defaults to 'exact'.
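
For example, a histogram-based split search with a reduced bin count could be configured as follows (a minimal sketch; max_bin_num is described further below):

>>> from hana_ml.algorithms.pal.trees import HybridGradientBoostingClassifier
>>> hgbc = HybridGradientBoostingClassifier(split_method='histogram',
...                                         max_bin_num=64)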

random_state : int, optional

The seed for random number generation.

  • 0: current time as seed.

  • Others : the seed.

max_depth : int, optional

The maximum depth of a tree.

Defaults to 6.

split_threshold : float, optional

Specifies the stopping condition: if the improvement value of the best split is less than this value, then the tree stops growing.

learning_rate : float, optional

Learning rate of each iteration, must be within the range (0, 1].

Defaults to 0.3.

subsample : float, optional

The fraction of samples to be used for fitting each base learner.

Defaults to 1.0.

fold_num : int, optional

The k value for k-fold cross-validation.

Mandatory and valid only when resampling_method is set to one of the following: 'cv', 'cv_sha', 'cv_hyperband', 'stratified_cv', 'stratified_cv_sha', 'stratified_cv_hyperband'.

sketch_eps : float, optional

The epsilon value for the sketch method, which sets an upper limit on the sum of sample weights between two split points.

The smaller this value is, the more split points are tried.

min_sample_weight_leaf : float, optional

The minimum summation of sample weights in a leaf node.

Defaults to 1.0.

min_samples_leaf : int, optional

The minimum number of data points in a leaf node.

Defaults to 1.

max_w_in_split : float, optional

The maximum weight constraint assigned to each tree node.

Defaults to 0 (i.e. no constraint).

col_subsample_split : float, optional

The fraction of features used for each split, should be within range (0, 1].

Defaults to 1.0.

col_subsample_tree : float, optional

The fraction of features used for each tree growth, should be within range (0, 1].

Defaults to 1.0.

lamb : float, optional

L2 regularization weight for the target loss function. Should be within range (0, 1].

Defaults to 1.0.

alpha : float, optional

Weight of L1 regularization for the target loss function.

Defaults to 1.0.

base_score : float, optional

Initial prediction score for all instances. This is the global bias; given a sufficient number of iterations, changing this value has little effect.

Defaults to 0.5.

adopt_prior : bool, optional

Indicates whether to adopt the prior distribution as the initial point.

Frequencies of class labels are used for classification problems.

base_score is ignored if this parameter is set to True.

Defaults to False.

evaluation_metric : {'rmse', 'mae', 'nll', 'error_rate', 'auc'}, optional

The metric used for model evaluation or parameter selection.

Mandatory if resampling_method is set.

cv_metric : {'rmse', 'mae', 'nll', 'error_rate', 'auc'}, optional (deprecated)

Same as evaluation_metric.

Will be deprecated in a future release.

ref_metric : str or list of str, optional

Specifies a reference metric or a list of reference metrics. Any reference metric must be a valid option of evaluation_metric.

Defaults to ['error_rate'].

thread_ratio : float, optional

The ratio of available threads used for training.

  • 0: single thread;

  • (0,1]: percentage of available threads;

  • others : heuristically determined.

Defaults to -1.

calculate_importance : bool, optional

Determines whether to calculate variable importance.

Defaults to True.

calculate_cm : bool, optional

Determines whether to calculate the confusion matrix.

Defaults to True.

resampling_method : str, optional

Specifies the resampling method for model evaluation or parameter selection.

Valid options include: 'cv', 'stratified_cv', 'bootstrap', 'stratified_bootstrap', 'cv_sha', 'stratified_cv_sha', 'bootstrap_sha', 'stratified_bootstrap_sha', 'cv_hyperband', 'stratified_cv_hyperband', 'bootstrap_hyperband', 'stratified_bootstrap_hyperband'.

If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.

No default value.

Note

Resampling methods that end with 'sha' or 'hyperband' are used for parameter selection only, not for model evaluation.

param_search_strategy : {'grid', 'random'}, optional

The search strategy for parameter selection.

Mandatory if resampling_method is specified and ends with 'sha'.

Defaults to 'random' and cannot be changed if resampling_method is specified and ends with 'hyperband'; otherwise, there is no default value, and parameter selection cannot be carried out if this parameter is not specified.

repeat_times : int, optional

Specifies the repeat times for resampling.

Defaults to 1.

random_search_times : int, optional

Specifies the number of times to randomly select candidate parameters in parameter selection.

Mandatory and valid only when param_search_strategy is set to 'random'.

timeout : int, optional

Specifies the maximum running time for model evaluation/parameter selection, in seconds.

Defaults to 0, which means no timeout.

progress_indicator_id : str, optional

Sets the ID of the progress indicator for model evaluation or parameter selection.

No progress indicator will be active if no value is provided.

param_range : dict or ListOfTuples, optional

Specifies the range of parameters involved for parameter selection.

Valid only when resampling_method and param_search_strategy are both specified.

If the input is a list of tuples, then each tuple must be a pair, with the first element being a parameter name of str type, and the second being a list of numbers with the following structure:

[<begin-value>, <step-size>, <end-value>].

<step-size> can be omitted if param_search_strategy is 'random'.

Otherwise, if the input is a dict, then the key of each element must specify a parameter name, while the value of each element specifies the range of that parameter.

Supported parameters for range specification: n_estimators, max_depth, learning_rate, min_sample_weight_leaf, max_w_in_split, col_subsample_split, col_subsample_tree, lamb, alpha, scale_pos_w, base_score.

A simple example for illustration:

[('n_estimators', [4, 3, 10]), ('learning_rate', [0.1, 0.3, 1.0])],

or

{'n_estimators': [4, 3, 10], 'learning_rate' : [0.1, 0.3, 1.0]}.

No default value.
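
For instance, a grid search over such ranges could be set up as follows (a sketch using the dict form; the resampling and metric settings are illustrative):

>>> hgbc = HybridGradientBoostingClassifier(
...     resampling_method='cv', fold_num=5,
...     evaluation_metric='error_rate',
...     param_search_strategy='grid',
...     param_range={'n_estimators': [4, 3, 10],
...                  'learning_rate': [0.1, 0.3, 1.0]})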

cross_validation_range : list of tuples, optional (deprecated)

Same as param_range.

Will be deprecated in a future release.

param_values : dict or ListOfTuples, optional

Specifies the values of parameters involved for parameter selection.

Valid only when resampling_method and param_search_strategy are both specified.

If the input is a list of tuples, then each tuple must be a pair, with the first element being a parameter name of str type, and the second being a list of values for that parameter.

Otherwise, if the input is a dict, then the key of each element must specify a parameter name, while the value of each element specifies the list of values of that parameter.

Supported parameters for value specification are the same as those valid for range specification; see param_range.

A simple example for illustration:

[('n_estimators', [4, 7, 10]), ('learning_rate', [0.1, 0.4, 0.7, 1.0])],

or

{'n_estimators' : [4, 7, 10], 'learning_rate' : [0.1, 0.4, 0.7, 1.0]}.

No default value.
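
For instance, a random search over such candidate values could be set up as follows (a sketch; random_search_times is required for the 'random' strategy):

>>> hgbc = HybridGradientBoostingClassifier(
...     resampling_method='stratified_cv', fold_num=5,
...     evaluation_metric='error_rate',
...     param_search_strategy='random', random_search_times=8,
...     param_values={'n_estimators': [4, 7, 10],
...                   'learning_rate': [0.1, 0.4, 0.7, 1.0]})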

obj_func : str, optional

Specifies the objective function to optimize, with valid options listed as follows:

  • 'logistic' : The objective function for logistic regression (for binary classification)

  • 'hinge' : The Hinge loss function (for binary classification)

  • 'softmax' : The softmax function for multi-class classification

Defaults to 'logistic' for binary classification, and 'softmax' for multi-class classification.

replace_missing : bool, optional

Specifies whether or not to replace missing values in a feature with another value.

If True, the replacement value is the mean value for a continuous feature, and the mode (i.e. the most frequent value) for a categorical feature.

Defaults to True.

default_missing_direction : {'left', 'right'}, optional

Defines the default direction that missing values take during tree splitting.

Defaults to 'right'.

feature_grouping : bool, optional

Specifies whether or not to group sparse features that contain only one significant value in each row.

Defaults to False.

tol_rate : float, optional

When feature grouping is enabled, features may still be merged even if some rows contain more than one significant value. This parameter specifies the maximum rate of such rows allowed.

Valid only when feature_grouping is set to True.

Defaults to 0.0001.

compression : bool, optional

Indicates whether or not the trained model should be compressed.

Defaults to False.

max_bits : int, optional

Specifies the maximum number of bits to quantize continuous features, which is equivalent to using \(2^{max\_bits}\) bins.

Valid only when compression is set to True, and must be less than 31.

Defaults to 12.

max_bin_num : int, optional

Specifies the maximum number of bins for the histogram method.

Decreasing this number improves performance in terms of running time, at the cost of some accuracy degradation.

Only valid when split_method is set to 'histogram'.

Defaults to 256.

resource : str, optional

Specifies the resource type used in the successive-halving (SHA) and Hyperband algorithms for parameter selection.

Currently there are two valid options: 'n_estimators' and 'data_size'.

Mandatory and valid only when resampling_method is set to one of the following values: 'cv_sha', 'stratified_cv_sha', 'bootstrap_sha', 'stratified_bootstrap_sha', 'cv_hyperband', 'stratified_cv_hyperband', 'bootstrap_hyperband', 'stratified_bootstrap_hyperband'.

Defaults to 'data_size'.

max_resource : int, optional

Specifies the maximum number of estimators that should be used in the SHA or Hyperband method.

Mandatory when resource is set to 'n_estimators', and invalid if resampling_method does not take one of the following values: 'cv_sha', 'stratified_cv_sha', 'bootstrap_sha', 'stratified_bootstrap_sha', 'cv_hyperband', 'stratified_cv_hyperband', 'bootstrap_hyperband', 'stratified_bootstrap_hyperband'.

reduction_rate : float, optional

Specifies the reduction rate in the SHA or Hyperband method.

In each round, the available parameter candidate size is divided by the value of this parameter; thus, a valid value must be greater than 1.0.

Valid only when resampling_method takes one of the following values: 'cv_sha', 'stratified_cv_sha', 'bootstrap_sha', 'stratified_bootstrap_sha', 'cv_hyperband', 'stratified_cv_hyperband', 'bootstrap_hyperband', 'stratified_bootstrap_hyperband'.

Defaults to 3.0.

min_resource_rate : float, optional

Specifies the minimum resource rate that should be used in SHA or Hyperband iteration.

Valid only when resampling_method takes one of the following values: 'cv_sha', 'stratified_cv_sha', 'bootstrap_sha', 'stratified_bootstrap_sha', 'cv_hyperband', 'stratified_cv_hyperband', 'bootstrap_hyperband', 'stratified_bootstrap_hyperband'.

Defaults to:

  • 0.0 if resource is set to 'data_size' (i.e. the default value),

  • 1/max_resource if resource is set to 'n_estimators'.

aggressive_elimination : bool, optional

Specifies whether to apply aggressive elimination when using the SHA method.

Aggressive elimination happens when the data size and the number of parameter candidates to be searched do not match: many candidates remain to be searched while the data size has already reached its upper limit. If aggressive elimination is applied, the lower bound of the data-size limit is used multiple times first, in order to reduce the number of candidates.

Valid only when resampling_method is set to one of the following: 'cv_sha', 'stratified_cv_sha', 'bootstrap_sha', 'stratified_bootstrap_sha'.

Defaults to False.
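
Putting the SHA-related parameters together, a successive-halving search might look as follows (a sketch with illustrative values):

>>> hgbc = HybridGradientBoostingClassifier(
...     resampling_method='cv_sha', fold_num=5,
...     evaluation_metric='error_rate',
...     param_search_strategy='grid',
...     resource='n_estimators', max_resource=50,
...     reduction_rate=3.0, aggressive_elimination=True,
...     param_values={'learning_rate': [0.1, 0.4, 0.7, 1.0]})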

validation_set_rate : float, optional

Specifies the sampling rate of the validation set for model evaluation in early stopping.

Valid range is [0, 1).

A positive value must be specified to activate early stopping.

Defaults to 0.

stratified_validation_set : bool, optional

Specifies whether or not to apply stratified sampling for getting the validation set for early stopping.

Valid only when validation_set_rate is specified with a positive value.

Defaults to False.

tolerant_iter_num : int, optional

Specifies the number of successive deteriorating iterations before early stopping.

Valid only when validation_set_rate is specified with a positive value.

Defaults to 10.
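
For example, early stopping on a stratified 20% validation split could be activated as follows (a sketch with illustrative values):

>>> hgbc = HybridGradientBoostingClassifier(
...     n_estimators=100,
...     validation_set_rate=0.2,
...     stratified_validation_set=True,
...     tolerant_iter_num=5)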

fg_min_zero_rate : float, optional

Specifies the minimum zero rate that is used to indicate sparse columns for feature grouping.

Valid only when feature_grouping is True.

Defaults to 0.5.

Examples

Input dataframe for training:

>>> df.head(7).collect()
   ATT1  ATT2   ATT3  ATT4 LABEL
0   1.0  10.0  100.0   1.0     A
1   1.1  10.1  100.0   1.0     A
2   1.2  10.2  100.0   1.0     A
3   1.3  10.4  100.0   1.0     A
4   1.2  10.3  100.0   1.0     A
5   4.0  40.0  400.0   4.0     B
6   4.1  40.1  400.0   4.0     B

Creating an instance of Hybrid Gradient Boosting Classifier:

>>> hgbc = HybridGradientBoostingClassifier(
...           n_estimators = 4, split_threshold=0,
...           learning_rate=0.5, fold_num=5, max_depth=6,
...           evaluation_metric = 'error_rate', ref_metric=['auc'],
...           param_range=[('learning_rate',[0.1, 0.45, 1.0]),
...                        ('n_estimators', [4, 3, 10]),
...                        ('split_threshold', [0.1, 0.45, 1.0])])

Performing fit() on the given dataframe:

>>> hgbc.fit(df, features=['ATT1', 'ATT2', 'ATT3', 'ATT4'], label='LABEL')
>>> hgbc.stats_.collect()
         STAT_NAME STAT_VALUE
0  ERROR_RATE_MEAN   0.133333
1   ERROR_RATE_VAR  0.0266666
2         AUC_MEAN        0.9

Input dataframe for predict:

>>> df_predict.collect()
   ID  ATT1  ATT2   ATT3  ATT4
0   1   1.0  10.0  100.0   1.0
1   2   1.1  10.1  100.0   1.0
2   3   1.2  10.2  100.0   1.0
3   4   1.3  10.4  100.0   1.0
4   5   1.2  10.3  100.0   3.0
5   6   4.0  40.0  400.0   3.0
6   7   4.1  40.1  400.0   3.0
7   8   4.2  40.2  400.0   3.0
8   9   4.3  40.4  400.0   3.0
9  10   4.2  40.3  400.0   3.0

Performing predict() on the given dataframe:

>>> result = hgbc.predict(df_predict, key='ID', verbose=False)
>>> result.collect()
   ID SCORE  CONFIDENCE
0   1     A    0.852674
1   2     A    0.852674
2   3     A    0.852674
3   4     A    0.852674
4   5     A    0.751394
5   6     B    0.703119
6   7     B    0.703119
7   8     B    0.703119
8   9     B    0.830549
9  10     B    0.703119

Attributes
model_ : DataFrame

Trained model content.

feature_importances_ : DataFrame

The feature importance (the higher, the more important the feature).

confusion_matrix_ : DataFrame

Confusion matrix used to evaluate the performance of classification algorithm.

stats_ : DataFrame

Statistics info.

selected_param_ : DataFrame

Best choice of parameter selected.

Methods

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

fit(data[, key, features, label, ...])

Train the model on input data.

predict(data[, key, features, verbose, ...])

Predict labels based on the trained HGBT classifier.

score(data[, key, features, label, ...])

Returns the mean accuracy on the given test data and labels.

set_model_state(state)

Set the model state by state information.

fit(data, key=None, features=None, label=None, categorical_variable=None, warm_start=None)

Train the model on input data.

Parameters
data : DataFrame

Training data.

key : str, optional

Name of the ID column.

If key is not provided, then:

  • if data is indexed by a single column, then key defaults to that index column;

  • otherwise, it is assumed that data contains no ID column.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the name of the last non-ID column.

categorical_variable : str or list of str, optional

Indicates INTEGER variable(s) that should be treated as categorical.

Valid only for INTEGER variables, omitted otherwise.

Note

By default INTEGER variables are treated as numerical.

warm_start : bool, optional

When set to True, reuses the model_ of the current object to fit and adds more trees to the existing model; otherwise, fits a new model.

Defaults to False.

Returns
Fitted object.
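
For example, continuing training on an already fitted model (a sketch: df and df_more are hypothetical hana_ml DataFrames, and 'REGION_ID' is a hypothetical INTEGER column treated as categorical):

>>> hgbc.fit(df, key='ID', label='LABEL', categorical_variable='REGION_ID')
>>> # reuse hgbc.model_ and add more trees to the existing model
>>> hgbc.fit(df_more, key='ID', label='LABEL', warm_start=True)
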
predict(data, key=None, features=None, verbose=None, thread_ratio=None, missing_replacement=None)

Predict labels based on the trained HGBT classifier.

Parameters
data : DataFrame

Independent variable values to predict for.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

missing_replacement : str, optional

The missing replacement strategy:

  • 'feature_marginalized': marginalise each missing feature out independently.

  • 'instance_marginalized': marginalise all missing features in an instance as a whole corresponding to each category.

Defaults to 'feature_marginalized'.

verbose : bool, optional

If True, output all classes and the corresponding confidences for each data point.

Defaults to False.

Returns
DataFrame

DataFrame of score and confidence, structured as follows:

  • ID column, with same name and type as data's ID column.

  • SCORE, type NVARCHAR, representing the predicted class labels.

  • CONFIDENCE, type DOUBLE, representing the confidence of a class label assignment.
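
To obtain the confidences of all classes rather than only the assigned one, enable verbose (a sketch reusing df_predict from the Examples section above):

>>> result = hgbc.predict(df_predict, key='ID', verbose=True)
>>> result.collect()  # one confidence entry per class for each data point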

score(data, key=None, features=None, label=None, missing_replacement=None)

Returns the mean accuracy on the given test data and labels.

Parameters
data : DataFrame

Data on which to assess model performance.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the name of the last non-ID column.

missing_replacement : str, optional

The missing replacement strategy:

  • 'feature_marginalized': marginalise each missing feature out independently.

  • 'instance_marginalized': marginalise all missing features in an instance as a whole corresponding to each category.

Defaults to 'feature_marginalized'.

Returns
float

Mean accuracy on the given test data and labels.
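
For example (a sketch: df_test is a hypothetical labeled DataFrame with the same columns as the training data):

>>> accuracy = hgbc.score(df_test, key='ID', label='LABEL')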

create_model_state(model=None, function=None, pal_funcname='PAL_HGBT', state_description=None, force=False)

Create PAL model state.

Parameters
model : DataFrame, optional

Specifies the model for the AFL state.

Defaults to self.model_.

function : str, optional

Specifies the function in the unified API.

A placeholder parameter, not effective for HGBT.

pal_funcname : int or str, optional

PAL function name.

Defaults to 'PAL_HGBT'.

state_description : str, optional

Description of the state as a model container.

Defaults to None.

force : bool, optional

If True, deletes the existing state.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters
state : DataFrame, optional

Specifies the state.

Defaults to self.state.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

Parameters
state : DataFrame or dict

If state is DataFrame, it has the following structure:

  • NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.

  • VALUE: VARCHAR(1000), the values according to NAME.

If state is dict, the key must have STATE_ID, HINT, HOST and PORT.
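
A minimal sketch of the model-state workflow, assuming hgbc has been fitted as in the Examples section (the state attribute is set by create_model_state; see delete_model_state above):

>>> hgbc.create_model_state(force=True)   # create the state from self.model_
>>> hgbc.set_model_state(hgbc.state)      # restore from the state information
>>> hgbc.delete_model_state()             # defaults to deleting self.state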

Inherited Methods from PALBase

Besides those methods mentioned above, the HybridGradientBoostingClassifier class also inherits methods from PALBase class, please refer to PAL Base for more details.