Model Evaluation and Parameter Selection
A number of algorithms in hana-ml support model evaluation and parameter selection.
Key Relevant Parameters for Model Evaluation and Parameter Selection
resampling_method
evaluation_metric
fold_num
repeat_times
search_strategy
random_search_times
Resampling Methods
You can use resampling methods to perform model evaluation and optimal parameter selection for a model.
In the hana_ml.algorithms.pal package (and in SAP HANA PAL), two kinds of resampling methods are provided:
Cross Validation (CV): Divides the training data into k folds as evenly as possible. Each time, one fold is held out as the evaluation dataset, while the remaining k-1 folds are used as the training dataset. Any cross-validation based resampling method is specified by giving the parameter resampling_method a value that contains 'cv' as a substring (e.g. 'cv', 'stratified_cv').
Bootstrap: Randomly samples the same number of entities as the original dataset, with replacement, and uses these resampled entities as training data, while the remaining entities serve as evaluation data. Any bootstrap based resampling method is specified by giving the parameter resampling_method a value that contains 'bootstrap' as a substring (e.g. 'bootstrap', 'stratified_bootstrap').
For classification problems, to make the class-label distributions of the training and evaluation datasets similar to that of the original dataset, both resampling methods provide stratified versions (e.g. 'cv' → 'stratified_cv', 'bootstrap_sha' → 'stratified_bootstrap_sha').
To reduce variability, you can perform multiple rounds of evaluation (by specifying the parameter repeat_times) and combine the results. This is especially recommended when bootstrap based resampling methods are applied.
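For example, a cross-validation based model evaluation can be configured directly on an estimator, as in the sketch below. The sketch assumes the UnifiedClassification interface and an existing SAP HANA DataFrame df_train with an 'ID' key column and a 'LABEL' column; other estimators expose the same concepts, though keyword spellings may differ slightly between classes.

```python
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification

# Stratified 5-fold cross validation, repeated twice to reduce variability.
# df_train is assumed to be an existing SAP HANA DataFrame.
hgbt = UnifiedClassification(
    func='HybridGradientBoostingTree',
    resampling_method='stratified_cv',   # any CV-based method: 'cv', 'stratified_cv', ...
    evaluation_metric='error_rate',
    fold_num=5,                          # number of folds k
    repeat_times=2)                      # multiple rounds of evaluation

hgbt.fit(data=df_train, key='ID', label='LABEL')
```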
Search Strategies
When parameter selection is activated, two search strategies are available:
Grid: Each parameter to be selected has a fixed set of candidate values, defined either as a discrete value set or as a range with a step size. Every possible combination of the parameters is evaluated, and the combination with the best result is chosen.
Random: Each parameter to be selected has either a discrete value set or a continuous range. In each round, a value is randomly drawn for every parameter and the resulting combination is evaluated. In this case, you must specify the number of rounds (parameter random_search_times). The combination with the best result is chosen.
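For example, a grid search over discrete candidate sets can be combined with cross validation as in the sketch below. It again assumes the UnifiedClassification interface and the hypothetical DataFrame df_train; the candidate values are illustrative only.

```python
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification

# Grid search: every combination of the candidate values below is evaluated
# with 5-fold cross validation, and the best one is used for the final fit.
hgbt_grid = UnifiedClassification(
    func='HybridGradientBoostingTree',
    resampling_method='cv',
    evaluation_metric='error_rate',
    fold_num=5,
    param_search_strategy='grid',        # 'random' would also require random_search_times
    param_values={'learning_rate': [0.1, 0.4, 0.7, 1.0],
                  'n_estimators': [4, 6, 8, 10]})

hgbt_grid.fit(data=df_train, key='ID', label='LABEL')
```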
Successive Halving and Hyperband for Parameter Selection
To accelerate parameter selection, a general idea called early stopping is applied.
A key observation in parameter selection is that most of the candidates in the search space perform poorly, yet evaluating them costs just as much as evaluating the good ones. This leads to an intuitive optimization: filter out the unpromising candidates at relatively low cost and focus the resources on the more promising ones.
Successive halving (SHA) is based on this idea. It begins by evaluating all candidates with a low resource configuration and then reduces the number of candidates by a certain rate. It then increases the resource assigned to each surviving candidate and repeats the previous step. This process continues until only a single candidate is left.
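The following sketch only illustrates the schedule idea (it is not hana-ml code); the number of candidates, the reduction rate, and the dataset size are made up for the example.

```python
# Conceptual illustration of a successive-halving schedule (not hana-ml code).
# With a reduction rate of 3, each round keeps about one third of the
# candidates and triples the resource (here: number of training rows).
candidates, resource = 27, 400
reduction_rate, max_resource = 3, 10000

round_no = 0
while candidates > 1:
    print(f"round {round_no}: {candidates} candidates, {resource} rows each")
    candidates = max(1, candidates // reduction_rate)
    resource = min(max_resource, resource * reduction_rate)
    round_no += 1
print(f"final round: 1 candidate trained with {resource} rows")
```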
The definition of resource in successive halving can vary between algorithms. The default resource is the size of the dataset. In addition, each algorithm can define its own specific resource, which is usually one of its regular parameters, such as n_estimators in HybridGradientBoostingClassifier.
Some parameters are by nature sensitive to successive halving. A classic example is learning_rate, which directly influences the speed of convergence: if the resource is the number of iterations of an iterative learning algorithm, the early rounds with few iterations are unfair to candidates with a small learning rate.
Because of its nature, successive halving does not always pick the best candidate with respect to the specified evaluation metric, but it is highly likely to pick a good one.
With successive halving, it is quite hard to determine in advance how much resource a specific parameter search task needs. Hyperband resolves this issue to some extent: it tries several configurations of resource and search space, and then calls successive halving on each of them to detect the best candidate. Currently, hyperband supports only the random search strategy.
Key Relevant Parameters
resampling_method: Specifies a resampling method whose name ends with 'sha' or 'hyperband' to activate the SHA/hyperband method for parameter selection.
resource: Specifies the resource type. The available options are algorithm dependent.
max_resource: The maximum resource that can be used in an iteration (usually the resource used in the last iteration).
min_resource_rate: The minimum resource, specified as a rate, that should be used in an iteration (usually indicating the resource used in the first iteration).
reduction_rate: The rate at which candidates are eliminated in each iteration, which is also the rate at which the resource assigned to each remaining candidate is increased.
aggressive_elimination: Balances the reduction of candidates against the increase of resource when their quantities do not match. This happens when there is still a bunch of candidates to be searched while the resource has already reached its upper limit. If aggressive elimination is applied, the lower bound of the resource is used multiple times at first to reduce the number of candidates to a suitable level.
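Putting these parameters together, an SHA-accelerated parameter selection could be configured as in the sketch below. It again assumes the UnifiedClassification interface and the DataFrame df_train; the resampling method value 'cv_sha', the choice of n_estimators as the resource, and all numeric values are illustrative only and should be checked against the documentation of the specific algorithm class.

```python
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification

# SHA-accelerated parameter selection: a resampling method whose name ends
# with 'sha' activates successive halving; '*_hyperband' methods work
# analogously but require the random search strategy.
hgbt_sha = UnifiedClassification(
    func='HybridGradientBoostingTree',
    resampling_method='cv_sha',          # assumed '*_sha' value; check the class docs
    evaluation_metric='error_rate',
    fold_num=5,
    param_search_strategy='grid',
    param_values={'learning_rate': [0.1, 0.4, 0.7, 1.0],
                  'max_depth': [4, 6, 8, 10]},
    resource='n_estimators',             # algorithm-specific resource (illustrative)
    max_resource=100,                    # resource of the last SHA iteration
    min_resource_rate=0.1,               # resource of the first iteration, as a rate
    reduction_rate=3,                    # keep roughly 1/3 of candidates per iteration
    aggressive_elimination=True)

hgbt_sha.fit(data=df_train, key='ID', label='LABEL')
```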