Model Evaluation and Parameter Selection

Algorithms in hana-ml that support model evaluation and parameter selection:

hana_ml.algorithms.pal.recommender.FRM

hana_ml.algorithms.pal.recommender.ALS

hana_ml.algorithms.pal.linear_model.LinearRegression

hana_ml.algorithms.pal.regression.PolynomialRegression

hana_ml.algorithms.pal.linear_model.LogisticRegression

hana_ml.algorithms.pal.neural_network.MLPClassifier

hana_ml.algorithms.pal.neural_network.MLPRegressor

hana_ml.algorithms.pal.naive_bayes.NaiveBayes

hana_ml.algorithms.pal.trees.DecisionTreeClassifier

hana_ml.algorithms.pal.trees.DecisionTreeRegressor

hana_ml.algorithms.pal.svm.SVC

hana_ml.algorithms.pal.svm.SVR

hana_ml.algorithms.pal.svm.SVRanking

hana_ml.algorithms.pal.svm.OneClassSVM

hana_ml.algorithms.pal.neighbors.KNNClassifier

hana_ml.algorithms.pal.neighbors.KNNRegressor

hana_ml.algorithms.pal.trees.HybridGradientBoostingClassifier

hana_ml.algorithms.pal.trees.HybridGradientBoostingRegressor

hana_ml.algorithms.pal.regression.GLM

Key Relevant Parameters for Model Evaluation and Parameter Selection

resampling_method

evaluation_metric

fold_num

repeat_times

search_strategy

random_search_times

Resampling Methods

You can use resampling methods to perform model selection and to select optimal parameters for the model.

In the hana_ml.algorithms.pal package (as in SAP HANA PAL), two kinds of resampling methods are provided:

Cross Validation (CV): Divides the training data into k folds as evenly as possible. Each time, one fold is left out as the evaluation dataset, while the data in the remaining k-1 folds is used as the training dataset. Any cross-validation-based resampling method can be specified by setting parameter resampling_method to a value that contains 'cv' as a substring (e.g. 'cv', 'stratified_cv').

Bootstrap: Randomly samples, with replacement, as many entities as the original dataset contains and uses them as training data, while the entities that were not sampled serve as evaluation data. Any bootstrap-based resampling method can be specified by setting parameter resampling_method to a value that contains 'bootstrap' as a substring (e.g. 'bootstrap', 'stratified_bootstrap').
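The two resampling schemes above can be illustrated with a minimal pure-Python sketch. It is not the PAL implementation: the helper names k_fold_splits and bootstrap_split are hypothetical, rows are represented only by their indices, and no shuffling or stratification is applied.

```python
import random

def k_fold_splits(n_rows, fold_num):
    """Yield (train_idx, eval_idx) pairs; fold sizes differ by at most one row."""
    indices = list(range(n_rows))
    base, extra = divmod(n_rows, fold_num)
    start = 0
    for fold in range(fold_num):
        # The first `extra` folds take one additional row each.
        size = base + (1 if fold < extra else 0)
        eval_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, eval_idx
        start += size

def bootstrap_split(n_rows, rng):
    """Draw n_rows indices with replacement for training; unsampled rows are for evaluation."""
    train_idx = [rng.randrange(n_rows) for _ in range(n_rows)]
    chosen = set(train_idx)
    eval_idx = [i for i in range(n_rows) if i not in chosen]
    return train_idx, eval_idx

# Illustrative data size and seed (assumptions, not defaults of any library).
folds = list(k_fold_splits(10, 3))
boot_train, boot_eval = bootstrap_split(10, random.Random(0))
```

Because bootstrap draws with replacement, the training indices contain duplicates and the size of the evaluation set varies from draw to draw, which is one reason why repeating the evaluation helps.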

For classification problems, both resampling methods provide a stratified version so that the training and evaluation datasets have a class label distribution similar to that of the original dataset (e.g. 'cv' \(\rightarrow\) 'stratified_cv', 'bootstrap_sha' \(\rightarrow\) 'stratified_bootstrap_sha').

To reduce variability, you can perform multiple rounds of evaluation (by specifying parameter repeat_times) and combine the results. This is especially recommended when bootstrap-based resampling methods are applied.

Search Strategies

When parameter selection is activated, two search strategies are available:

Grid: Each parameter to be selected has a fixed list of candidates, defined either as a value set or as a range with a step. Every possible combination of all parameters is evaluated, and the one with the best result is chosen.

Random: Each parameter to be selected has either a discrete value set or a range. In each round, a value is drawn for every parameter and the resulting combination is evaluated. In this case, you must specify the number of rounds (parameter random_search_times). The combination with the best result is chosen.
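The difference between the two strategies can be sketched in plain Python as follows. This only shows how candidate combinations might be generated; the helper names, the (start, step, stop) ordering of expand_range, and the parameter names learning_rate and max_depth are illustrative assumptions, not the hana-ml API.

```python
import itertools
import random

def expand_range(start, step, stop):
    """Expand a range with step into explicit candidates (stop inclusive)."""
    values = []
    k = 0
    # Small tolerance so that floating-point error does not drop the last value.
    while start + k * step <= stop + 1e-12:
        values.append(round(start + k * step, 10))
        k += 1
    return values

def grid_candidates(value_sets):
    """Grid: every combination of the per-parameter candidate lists."""
    names = list(value_sets)
    combos = itertools.product(*(value_sets[n] for n in names))
    return [dict(zip(names, combo)) for combo in combos]

def random_candidates(space, times, rng):
    """Random: `times` draws; a tuple is a (low, high) range, a list is a value set."""
    drawn = []
    for _ in range(times):
        candidate = {}
        for name, spec in space.items():
            if isinstance(spec, tuple):
                candidate[name] = rng.uniform(*spec)
            else:
                candidate[name] = rng.choice(spec)
        drawn.append(candidate)
    return drawn

grid = grid_candidates({
    "learning_rate": expand_range(0.1, 0.1, 0.3),  # 0.1, 0.2, 0.3
    "max_depth": [3, 5],
})
rand = random_candidates(
    {"learning_rate": (0.01, 0.3), "max_depth": [3, 5, 7]},
    times=4,
    rng=random.Random(0),
)
```

The grid grows multiplicatively with every added parameter (here 3 x 2 = 6 combinations), whereas the cost of random search is fixed by the number of rounds.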

Successive Halving and Hyperband for Parameter Selection

To accelerate parameter selection, a general idea called early stopping can be applied.

A key observation in parameter selection is that most candidates in the search space perform poorly, yet every candidate consumes the same amount of resource. This leads to an intuitive optimization: filter out unpromising candidates at relatively low cost and focus on the more promising ones.

Successive halving (SHA) is based on this idea. It begins by evaluating all candidates with a low resource configuration and then reduces the number of candidates by a certain rate. It then increases the resource configuration and repeats the previous step. This process continues until only a single candidate is left.

The definition of resource in successive halving can vary between algorithms. The default resource is the size of the dataset. In addition, each algorithm can define its own specific resource, usually one of its regular parameters, such as n_estimators in HybridGradientBoostingClassifier.

Some parameters are sensitive to successive halving by nature. A classic example is learning_rate, which directly influences the speed of convergence. If the resource is the number of iterations in an iterative learning algorithm, candidates with a small learning rate are at an unfair disadvantage, because they may not converge within the small budgets of the early rounds.

Because of how successive halving works, it does not always pick the best candidate w.r.t. the specified evaluation metric, but it is highly likely to pick a good one.
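The successive halving loop, and the way it can discard a slowly converging candidate, can be sketched as follows. This is a toy illustration rather than the PAL procedure: the score function is a hypothetical learning curve, and the candidate names, rates and ceilings are made up for the example.

```python
import math

def score(candidate, resource):
    # Hypothetical learning curve: the score approaches `ceiling`,
    # and does so faster for a larger `rate`.
    return candidate["ceiling"] * (1.0 - math.exp(-candidate["rate"] * resource))

def successive_halving(candidates, evaluate, min_resource, reduction_rate):
    """Return (winner, history); history holds (resource, n_candidates) per round."""
    survivors = list(candidates)
    resource = min_resource
    history = []
    while len(survivors) > 1:
        history.append((resource, len(survivors)))
        ranked = sorted(survivors, key=lambda c: evaluate(c, resource), reverse=True)
        # Keep the best 1/reduction_rate of the candidates ...
        survivors = ranked[:max(1, len(ranked) // reduction_rate)]
        # ... and give each survivor reduction_rate times more resource.
        resource *= reduction_rate
    return survivors[0], history

candidates = [
    {"name": "fast_low", "rate": 1.0, "ceiling": 0.80},
    {"name": "mid", "rate": 0.5, "ceiling": 0.85},
    {"name": "slow_best", "rate": 0.01, "ceiling": 0.95},
    {"name": "poor", "rate": 1.0, "ceiling": 0.50},
]
winner, history = successive_halving(candidates, score, min_resource=1, reduction_rate=2)
best_at_full_budget = max(candidates, key=lambda c: score(c, 1000))
```

In this toy setup the candidate that would score best with a large budget converges too slowly to survive the first round, so the loop returns a good but not optimal candidate.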

It is quite hard to determine in advance how much resource successive halving needs for a specific parameter search task. Hyperband can resolve this issue to some extent. Basically, hyperband tries several configurations of resource and search space and calls successive halving on each to find the best one. Currently, it only supports the random search strategy.
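As a rough illustration of what "several configurations regarding resource and search space" can look like, the sketch below enumerates brackets using the formulas from the original Hyperband algorithm. Whether PAL chooses its configurations in exactly this way is an assumption; the function name and the example values are hypothetical.

```python
def hyperband_brackets(max_resource, min_resource, reduction_rate):
    """Return (n_candidates, start_resource) for each successive halving run."""
    # Largest s such that min_resource * reduction_rate**s <= max_resource.
    s_max = 0
    while min_resource * reduction_rate ** (s_max + 1) <= max_resource:
        s_max += 1
    brackets = []
    for s in range(s_max, -1, -1):
        # Integer ceiling of (s_max + 1) * reduction_rate**s / (s + 1).
        n_candidates = -(-(s_max + 1) * reduction_rate ** s // (s + 1))
        start_resource = max_resource / reduction_rate ** s
        brackets.append((n_candidates, start_resource))
    return brackets

brackets = hyperband_brackets(max_resource=81, min_resource=1, reduction_rate=3)
```

Each bracket trades breadth for depth: the first one starts many randomly drawn candidates on a small resource, while the last one evaluates a few candidates on the full resource right away.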

Key Relevant Parameters

resampling_method: Specify a resampling method that ends with 'sha' or 'hyperband' to activate the SHA/Hyperband method for parameter selection.

resource: Specifies the resource type. Options are algorithm dependent.

max_resource: Maximum resource that should be used in an iteration (usually the resource used in the last iteration).

min_resource_rate: Minimum resource, expressed as a rate, that should be used in an iteration (usually the resource used in the first iteration).

reduction_rate: The rate at which candidates are eliminated in each iteration, which is also the rate at which the resource assigned to each candidate increases.

aggressive_elimination: Balances the reduction of candidates against the increase of resource when the two do not match. This happens when many candidates still remain to be searched but the resource has already reached its upper limit. If aggressive elimination is applied, the lower bound of the resource is first used for multiple iterations to reduce the number of candidates to a certain level.
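How these parameters interact can be sketched with a simplified schedule calculator. It rests on several assumptions that are not confirmed by this page: the minimum resource is taken to be max_resource * min_resource_rate, candidates are reduced by integer division, and the exact schedule used by PAL may differ. The function name halving_schedule is hypothetical.

```python
def halving_schedule(n_candidates, max_resource, min_resource_rate,
                     reduction_rate, aggressive_elimination=False):
    """Return a list of (n_candidates, resource) pairs, one per iteration."""
    min_resource = max_resource * min_resource_rate  # assumed interpretation
    # Resource levels available between the lower and the upper limit.
    levels = []
    resource = min_resource
    while resource <= max_resource:
        levels.append(resource)
        resource *= reduction_rate
    # Iterations needed to cut the candidates down to a single one.
    needed, n = 0, n_candidates
    while n > 1:
        n = max(1, n // reduction_rate)
        needed += 1
    # With aggressive elimination, missing iterations are run first
    # at the lower resource limit.
    extra = max(0, needed - len(levels)) if aggressive_elimination else 0
    schedule, n = [], n_candidates
    for resource in [min_resource] * extra + levels:
        if n <= 1:
            break
        schedule.append((n, resource))
        n = max(1, n // reduction_rate)
    return schedule

plain = halving_schedule(32, 100, 0.25, 2)
aggressive = halving_schedule(32, 100, 0.25, 2, aggressive_elimination=True)
```

In this example the plain schedule reaches the maximum resource with 8 candidates still in play, while the aggressive schedule repeats the lowest resource level so that only 2 candidates are evaluated at the maximum resource.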