Miscellaneous Topics

Early Stop in HGBT

Early stop is a technique that stops the training process before the model becomes too complex and overfits the training data.

Basically, the input dataset is split into two parts: training data and validation data. The algorithm trains the model on the training data while monitoring the model's performance on the validation data, and the generalization performance is evaluated by the specified loss function on the validation data.

Relevant Parameters in Hybrid Gradient Boosting Tree (HGBT) Models

  • validation_set_rate

  • stratified_validation_set

  • tolerant_iter_num
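A minimal sketch of enabling early stopping in HGBT, assuming a HANA dataframe df with key column 'ID' and label column 'LABEL' (these names and the parameter values are illustrative, not prescriptive):

>>> from hana_ml.algorithms.pal.trees import HybridGradientBoostingClassifier
>>> hgbc = HybridGradientBoostingClassifier(
                validation_set_rate=0.2,         # hold out 20% of the input data for validation
                stratified_validation_set=True,  # keep class proportions in the validation split
                tolerant_iter_num=5)             # stop if the validation loss shows no improvement for 5 consecutive iterations
>>> hgbc.fit(data=df, key='ID', label='LABEL')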

Feature Grouping in HGBT

It is common for a dataset to contain sparse features, i.e. features in which a large portion of the values is insignificant (zero or nearly zero).

A collection of features can also be group-sparse, which means that in each data row at most one of them contains significant data. This usually happens with features that measure similar quantities.

For example, 3 features A, B, C that appear as follows can be placed into the same group:

  A      B      C
  1.1    0.0    0.0
  0.0    0.0    2.5
  0.0    0.0    0.0
  0.0  -10.2    0.0

The complexity of finding the exact sets of features that satisfy the feature-grouping requirement is very high. Therefore, HGBT employs a greedy algorithm that finds such sets approximately.

The grouping requirement can also be relaxed so that a small number of violations is accepted.

Relevant Parameters in Hybrid Gradient Boosting Tree (HGBT) Models

  • feature_grouping

  • tolerant_rate
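A minimal sketch of enabling feature grouping, assuming a HANA dataframe df with key column 'ID' and target column 'TARGET' (these names and the tolerant_rate value are illustrative):

>>> from hana_ml.algorithms.pal.trees import HybridGradientBoostingRegressor
>>> hgbr = HybridGradientBoostingRegressor(
                feature_grouping=True,  # group up group-sparse features with the greedy algorithm
                tolerant_rate=0.0001)   # tolerance for violations of the grouping requirement
>>> hgbr.fit(data=df, key='ID', label='TARGET')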

Histogram Splitting in HGBT

In the training process, a tree structure grows deeper by recursively splitting its leaf nodes. Node splitting can be optimized in many different ways; one typical approach is to use histograms. This is an approximate algorithm that reduces both the time cost and the memory cost, and it is implemented in Hybrid Gradient Boosting Tree (HGBT) models to accelerate the training process.

To be more specific, when HGBT tries to split a tree node, it first builds a histogram of that node by putting feature values into bins, and then evaluates splitting points based on these bins. Because the number of bins is usually much smaller than the number of data points in the node, this method can accelerate the splitting process significantly. Although building the histogram requires visiting all data points in the node, it is still much faster because it only involves scanning the data and adding values up. Another optimization is that the histogram of a node can always be built by subtracting the histogram of its sibling from the histogram of its parent. So, we can always choose to build the histogram of the child node that contains less data, and obtain the histogram of its sibling by subtraction, which costs even less time.
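The following toy sketch (plain NumPy, not the HGBT internals) illustrates both ideas: per-bin statistics are accumulated in a single pass over the node's data, and the sibling's histogram is obtained by subtraction:

>>> import numpy as np
>>> grad = np.array([0.5, -1.2, 0.3, 0.8, -0.7, 0.1])                      # per-point statistics in the parent node
>>> bins = np.digitize([0.1, 0.4, 0.9, 0.2, 0.6, 0.7], [0.25, 0.5, 0.75])  # bin index of each point's feature value
>>> left = np.array([True, True, False, True, False, False])               # points sent to the smaller child
>>> parent_hist = np.bincount(bins, weights=grad, minlength=4)             # one scan-and-add pass over all points
>>> left_hist = np.bincount(bins[left], weights=grad[left], minlength=4)   # build only the smaller child's histogram
>>> right_hist = parent_hist - left_hist                                   # sibling histogram by subtraction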

Relevant Parameters for Histogram Splitting

The following parameters are relevant for histogram splitting in HGBT models (i.e. HybridGradientBoostingClassifier and HybridGradientBoostingRegressor) in the hana_ml.algorithms.pal package:

  • split_method : Set split_method = 'histogram' if you want to use histogram splitting for acceleration.

  • max_bin_num

Note

  • As mentioned before, the histogram splitting method is an approximate algorithm that does not evaluate all potential splitting points, which makes the number of bins used when building the histogram an important setting. The bigger the value of max_bin_num, the more potential splitting points are evaluated, and the more time is needed. The default value of max_bin_num is 256. It is suggested to use this default value first, and then adjust it according to the fitting result of the model.

  • When it comes to categorical features, though the histogram splitting method cannot be applied to them directly, HGBT will combine sparse categories if the number of categories is more than max_bin_num, thereby reducing the number of categories.
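A minimal sketch of activating histogram splitting, assuming a HANA dataframe df with key column 'ID' and label column 'LABEL' (these names are illustrative):

>>> from hana_ml.algorithms.pal.trees import HybridGradientBoostingClassifier
>>> hgbc = HybridGradientBoostingClassifier(
                split_method='histogram',  # histogram-based node splitting for acceleration
                max_bin_num=256)           # number of bins used when building histograms
>>> hgbc.fit(data=df, key='ID', label='LABEL')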

Model Compression for Random Decision Trees

In order to attain better predictive performance, the random decision trees (RDT) method usually requires a large number of sub-learners, each typically large and deep, and the model size tends to grow with the size of the problem as well as the size of the training data.

Therefore, a model compression technique is introduced to reduce the size of the model with minimal loss of accuracy. This technique mainly involves the quantization of split values for continuous features, as well as the quantization of fitting values in leaves (for regression only).

Relevant Parameters

The following parameters are relevant for model compression in random decision trees models (i.e. RDTClassifier and RDTRegressor) in the hana_ml.algorithms.pal package:

  • compression : Set as True to activate model compression.

  • quantize_rate : If the largest frequency of a continuous feature's split values is less than the value specified in quantize_rate, the quantization method will be used to quantize the split values of that feature.

  • max_bits : Sets the maximum number of bits to quantize continuous features, which is equivalent to using \(2^{max\_bits}\) bins. Reducing the number of bins may affect the precision of split values and the accuracy in prediction.

  • fittings_quantization : Set as True to activate the quantization of fitting values (the values in leaves) in regression problems. This is recommended for large datasets.
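A minimal sketch of activating model compression for RDT regression, assuming a HANA dataframe df with key column 'ID' and target column 'TARGET' (these names and the parameter values are illustrative):

>>> from hana_ml.algorithms.pal.trees import RDTRegressor
>>> rdt = RDTRegressor(
                compression=True,            # activate model compression
                quantize_rate=0.005,         # quantize split values whose largest frequency is below this rate
                max_bits=12,                 # at most 2**12 bins for split-value quantization
                fittings_quantization=True)  # also quantize the fitting values in leaves (regression only)
>>> rdt.fit(data=df, key='ID', label='TARGET')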

Model Compression for Support Vector Machine

Support Vector Machine (SVM) models can be large if the number of support vectors is large. In this case, model compression can be applied, which aims to reduce the size of the model with minimal loss of accuracy. This technique mainly involves the quantization of the values of support vectors.

Relevant Parameters

  • compression : Set as True to activate model compression.

  • max_bits : Specifies the maximum number of bits to quantize values of support vectors, which is equivalent to using \(2^{max\_bits}\) bins for support vector quantization. Reducing this number could degrade the precision of support vectors and the accuracy in prediction.

  • max_quantization_iter : Specifies the maximum number of iterations in the quantization process. If the specified value is too small, the quantization process may fail to converge, which will further affect the accuracy of the compressed model.
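A minimal sketch of activating model compression for SVM classification, assuming a HANA dataframe df with key column 'ID' and label column 'LABEL' (these names, the kernel choice and the parameter values are illustrative):

>>> from hana_ml.algorithms.pal.svm import SVC
>>> svc = SVC(kernel='rbf',
              compression=True,            # activate model compression
              max_bits=12,                 # at most 2**12 bins for support vector quantization
              max_quantization_iter=1000)  # iteration cap for the quantization process
>>> svc.fit(data=df, key='ID', label='LABEL')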

Seasonalities in Additive Model Forecast

Auto-detected Seasonalities

In additive model forecast for time-series modeling, seasonalities are represented by partial Fourier sums that estimate periodic effects.
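Concretely, a seasonal component with period \(P\) (in days) and Fourier order \(N\) takes the general form of the partial Fourier sum

\[s(t) = \sum_{n=1}^{N}\left(a_n\cos\frac{2\pi nt}{P} + b_n\sin\frac{2\pi nt}{P}\right),\]

where the coefficients \(a_n\) and \(b_n\) are fitted from the data.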

For natural time-series, there are three auto-detected types of seasonalities to consider: yearly, weekly and daily, such that

  • yearly seasonality is fitted if the time series spans more than a year of data

  • weekly seasonality is fitted if the time series is more than one week long

  • daily seasonality will be considered only for sub-daily time-series data

The number of terms in the partial Fourier sum determines how quickly the seasonality can change (approximately, the degree of fluctuation in the time series). For reference, the default numbers of Fourier terms for yearly and weekly seasonalities are 10 and 3, respectively. Increasing the Fourier order allows the seasonality to fit higher-frequency changes, but can also lead to overfitting. In most cases, the default values are appropriate.

Relevant Parameters for Auto-detected Seasonalities

  • yearly_seasonality

  • weekly_seasonality

  • daily_seasonality

  • seasonality_mode

  • seasonality_prior_scale

Customized Seasonalities

In addition to the three auto-detected seasonalities, users can add other seasonalities (e.g. monthly, quarterly and hourly) with the parameter seasonality. It accepts a list of strings in JSON format, where each JSON string should contain a name, the period of the seasonality in days, the Fourier order, and optional prior scale and mode elements for the seasonality.

Relevant Parameters for Customized Seasonalities

  • seasonality : Adds customized seasonalities to the model in JSON format, including NAME, PERIOD, FOURIER_ORDER, PRIOR_SCALE, and MODE elements, where PRIOR_SCALE and MODE are optional.

Example

If we want to model the given time-series data with yearly, daily and monthly seasonalities, then:

>>> from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
>>> amf = AdditiveModelForecast(growth='linear',
                                yearly_seasonality='true',
                                daily_seasonality='true',
                                weekly_seasonality='false',  # disable auto-modeling of weekly seasonality
                                seasonality='{"NAME": "MONTHLY", "PERIOD":30, "FOURIER_ORDER":5}')  # add a customized monthly seasonality
>>> amf.fit(data=df)