Miscellaneous Topics
Early Stop in HGBT
Early stopping is a technique that stops the training process before the model becomes too complex and overfits the training data.
Basically, the input dataset is split into two parts: training data and validation data. The algorithm trains the model on the training data while continuously monitoring the model's performance on the validation data. The generalization performance is evaluated by the specified loss function on the validation dataset.
Relevant Parameters in Hybrid Gradient Boosting Tree (HGBT) Models
validation_set_rate
stratified_validation_set
tolerant_iter_num
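For illustration, a minimal sketch of enabling early stopping in a HybridGradientBoostingClassifier might look as follows; the DataFrame df_train, the column names 'ID' and 'LABEL', and the concrete parameter values are illustrative assumptions rather than required settings:

>>> from hana_ml.algorithms.pal.trees import HybridGradientBoostingClassifier
>>> hgbc = HybridGradientBoostingClassifier(
        n_estimators=100,
        validation_set_rate=0.2,         # hold out 20% of the input data as the validation set
        stratified_validation_set=True,  # keep class proportions in the validation set
        tolerant_iter_num=5)             # stop if the validation loss shows no improvement for 5 iterations
>>> hgbc.fit(data=df_train, key='ID', label='LABEL')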
Feature Grouping in HGBT
It is common that a dataset contains sparse features, which means a large part of their data is insignificant (zero or nearly zero).
A collection of features can also be group-sparse, which means that in each data row at most one of them contains significant data. This usually happens for features that measure similar things.
For example, 3 features A, B, C that appear as follows can be placed into the same group:
| A | B | C |
|---|---|---|
| 1.1 | 0.0 | 0.0 |
| 0.0 | 2.5 | 0.0 |
| 0.0 | 0.0 | -10.2 |
| 0.0 | 0.0 | 0.0 |
The complexity of finding the exact sets of features that satisfy the requirement of feature grouping is very high. Therefore, HGBT employs a greedy algorithm that finds such sets approximately.
The requirement for features to be grouped can also be relaxed so that a certain amount of violations is accepted.
Relevant Parameters in Hybrid Gradient Boosting Tree (HGBT) Models
feature_grouping
tolerant_rate
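As a sketch, feature grouping can be switched on through these two parameters; the choice of regressor, the DataFrame df_train, the column names, and the tolerant_rate value below are illustrative assumptions:

>>> from hana_ml.algorithms.pal.trees import HybridGradientBoostingRegressor
>>> hgbr = HybridGradientBoostingRegressor(
        feature_grouping=True,  # greedily group up group-sparse features
        tolerant_rate=0.0001)   # fraction of violating entries tolerated within a group
>>> hgbr.fit(data=df_train, key='ID', label='TARGET')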
Histogram Splitting in HGBT
In the training process, a tree structure grows deeper by recursively splitting its leaf nodes. Node splitting can be optimized in many different ways; one typical approach is to use histograms. This is an approximate algorithm that reduces both the time cost and the memory cost, and it is implemented in Hybrid Gradient Boosting Tree (HGBT) models to accelerate the training process.
To be more specific, when HGBT tries to split a node in a tree, it first builds a histogram of that node by putting feature values into bins, and then evaluates splitting points by these bins. Because the number of bins is usually much smaller than the number of data points in the node, this method accelerates the splitting process significantly. Although building the histogram requires visiting all data points in the node, it is still much faster because it only involves scanning the data and adding values up. Another optimization is that the histogram of a node can always be built by subtracting the histogram of its sibling from the histogram of its parent. Therefore, we can always choose to build the histogram of the node that contains fewer data points and obtain the histogram of its sibling by subtraction, which costs even less time.
Relevant Parameters for Histogram Splitting
The following parameters are relevant for histogram splitting in HGBT models (i.e. HybridGradientBoostingClassifier and HybridGradientBoostingRegressor) in the hana_ml.algorithms.pal package:
split_method
: Set split_method = 'histogram' if you want to use histogram splitting for acceleration.
max_bin_num
Note
As mentioned before, the histogram splitting method is an approximate algorithm that does not evaluate all potential splitting points, so the number of bins used when building the histogram becomes an important setting. The bigger the value of max_bin_num, the more potential splitting points are evaluated, and the more time is needed. The default value of max_bin_num is 256. It is suggested to use this default value first, and then adjust it according to the fitting result of the model.
When it comes to categorical features, the histogram splitting method cannot be applied to them directly; instead, HGBT combines sparse categories if the number of categories is more than max_bin_num, thereby reducing the number of categories.
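A minimal sketch of activating histogram splitting is shown below; the classifier choice, the DataFrame df_train, and the column names are illustrative assumptions:

>>> from hana_ml.algorithms.pal.trees import HybridGradientBoostingClassifier
>>> hgbc = HybridGradientBoostingClassifier(
        split_method='histogram',  # use histogram-based, approximate node splitting
        max_bin_num=256)           # default bin number; adjust according to the fitting result
>>> hgbc.fit(data=df_train, key='ID', label='LABEL')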
Model Compression for Random Decision Trees
In order to attain better predictive performance, the random decision trees (RDT) method usually requires a large number of sub-learners that are large and deep, and the model size tends to grow with the size of the problem as well as the size of the training data.
Therefore, a model compression technique is introduced to reduce the size of the model with minimum loss of accuracy. This technique mainly involves the quantization of split values for continuous features, as well as the quantization of fitting values in leaves (for regression only).
Relevant Parameters
The following parameters are relevant for model compression in random decision trees models (i.e. RDTClassifier and RDTRegressor) in the hana_ml.algorithms.pal package:
compression
: Set as True to activate model compression
quantize_rate
: If the largest frequency of the continuous split values is less than the value specified in quantize_rate, the quantization method will be used to quantize the split values of the continuous feature.
max_bits
: Specifies the maximum number of bits used to quantize continuous features, which is equivalent to using \(2^{max\_bits}\) bins. Reducing the number of bins may affect the precision of split values and the accuracy in prediction.
fittings_quantization
: Set as True to activate the quantization of fitting values (the values in leaves) in regression problems. This is recommended for large datasets.
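Putting these parameters together, a sketch of compressing a regression RDT model could look like the following; the DataFrame df_train, the column names, and the concrete parameter values are illustrative assumptions:

>>> from hana_ml.algorithms.pal.trees import RDTRegressor
>>> rdt = RDTRegressor(
        compression=True,            # activate model compression
        quantize_rate=0.005,         # quantize split values whose largest frequency falls below this rate
        max_bits=12,                 # quantize continuous features with at most 2**12 bins
        fittings_quantization=True)  # also quantize fitting values in leaves (regression only)
>>> rdt.fit(data=df_train, key='ID', label='TARGET')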
Model Compression for Support Vector Machine
Support Vector Machine (SVM) models can be large if the number of support vectors is large. In this case, model compression can be applied, which aims to reduce the size of the model with minimum loss of accuracy. This technique mainly involves the quantization of the values of support vectors.
Relevant Parameters
compression
: Set as True to activate model compression
max_bits
: Specifies the maximum number of bits used to quantize the values of support vectors, which is equivalent to using \(2^{max\_bits}\) bins for support vector quantization. Reducing this number could degrade the precision of support vectors and the accuracy in prediction.
max_quantization_iter
: Specifies the maximum number of iterations in the quantization process. If the specified value is too small, the quantization process may fail to converge, which will further affect the accuracy of the compressed model.
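As a sketch, model compression for an SVM classifier can be enabled as follows; the SVC class from hana_ml.algorithms.pal.svm is used here, and the DataFrame df_train, the column names, and the parameter values are illustrative assumptions:

>>> from hana_ml.algorithms.pal.svm import SVC
>>> svc = SVC(
        compression=True,            # activate compression of support vectors
        max_bits=12,                 # quantize support vector values with at most 2**12 bins
        max_quantization_iter=1000)  # upper bound on iterations of the quantization process
>>> svc.fit(data=df_train, key='ID', label='LABEL')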
Seasonalities in Additive Model Forecast
Auto-detected Seasonalities
In additive model forecast for time-series modeling, seasonalities are represented by partial Fourier sums that estimate periodic effects.
For natural time-series, there are three auto-detected types of seasonality to consider: yearly, weekly and daily, such that
yearly seasonality is fitted if the time series spans more than a year of data,
weekly seasonality is fitted if the time series spans more than one week,
daily seasonality is considered only for sub-daily time-series data.
The number of terms in the partial Fourier sum determines how quickly the seasonality can change (approximately the degree of fluctuation in the time-series). For reference, the default numbers of Fourier terms for yearly and weekly seasonalities are 10 and 3, respectively. Increasing the Fourier order allows the seasonality to fit higher-frequency changes, but can also lead to overfitting. In most cases, the default values are appropriate.
Relevant Parameters for Auto-detected Seasonalities
yearly_seasonality
weekly_seasonality
daily_seasonality
seasonality_mode
seasonality_prior_scale
Customized Seasonalities
In addition to the three auto-detected seasonalities, users can add other seasonalities (e.g. monthly, quarterly and hourly) with the parameter seasonality.
It accepts a list of strings in JSON format, where each JSON string should contain a name, the period of the seasonality in days, the Fourier order, and the optional elements prior scale and mode of the seasonality.
Relevant Parameters for Customized Seasonalities
seasonality
: Adds a customized seasonality to the model in JSON format, including the NAME, PERIOD, FOURIER_ORDER, PRIOR_SCALE, and MODE elements. PRIOR_SCALE and MODE are optional.
Example
If we want to model the given time-series data with yearly, daily and monthly seasonalities, then:
>>> from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
>>> amf = AdditiveModelForecast(growth='linear',
                                yearly_seasonality='true',
                                daily_seasonality='true',
                                weekly_seasonality='false',  # disable auto-modeling of weekly seasonality
                                seasonality='{"NAME": "MONTHLY", "PERIOD":30, "FOURIER_ORDER":5}')  # add customized monthly seasonality
>>> amf.fit(data=df)