Parameters for Missing Value Handling in HANA DataFrame
strategy
: {'non', 'most_frequent-mean', 'most_frequent-median', 'most_frequent-zero', 'most_frequent-als', 'delete'}, optional
Specifies the overall imputation strategy for the input training data.
'non' : No imputation for all columns.
'most_frequent-mean' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its mean.
'most_frequent-median' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its median.
'most_frequent-zero' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in all numerical columns by zeros.
'most_frequent-als' : Replacing missing values in any categorical column by its most frequently observed value, and filling the missing values in all numerical columns via a matrix completion technique called alternating least squares.
'delete' : Delete all rows with missing values.
Defaults to 'most_frequent-mean'.
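To make the default concrete, the following is a minimal pure-Python sketch of what the 'most_frequent-mean' rule describes: missing categorical values take the most frequent observed value, and missing numerical values take the column mean. The function name `impute_most_frequent_mean`, the list-of-dicts table, and the column names are hypothetical and used only for illustration. The actual imputation runs inside HANA, not in Python.

```python
from collections import Counter
from statistics import mean


def impute_most_frequent_mean(rows, categorical, numerical):
    """Illustrative only: fill None values as 'most_frequent-mean' describes.

    Assumes every listed column has at least one observed (non-None) value.
    """
    fill = {}
    for col in categorical:
        observed = [r[col] for r in rows if r[col] is not None]
        # Most frequently observed value of the categorical column.
        fill[col] = Counter(observed).most_common(1)[0][0]
    for col in numerical:
        observed = [r[col] for r in rows if r[col] is not None]
        # Mean of the observed values of the numerical column.
        fill[col] = mean(observed)
    return [
        {c: (fill[c] if v is None and c in fill else v) for c, v in r.items()}
        for r in rows
    ]


# Hypothetical toy table with one missing value per column.
rows = [
    {"COLOR": "red", "PRICE": 10.0},
    {"COLOR": None, "PRICE": 20.0},
    {"COLOR": "red", "PRICE": None},
    {"COLOR": "blue", "PRICE": 30.0},
]
filled = impute_most_frequent_mean(rows, ["COLOR"], ["PRICE"])
```

Swapping `mean` for `statistics.median`, or for a constant zero, would give the analogous sketch for 'most_frequent-median' and 'most_frequent-zero'.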
strategy_by_col
: ListOfTuples, optional
Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation.
Each tuple in the list should contain at least two elements, such that:
the 1st element is the name of a column;
the 2nd element is the imputation strategy of that column, valid strategies include: 'non', 'delete', 'most_frequent', 'categorical_const', 'mean', 'median', 'numerical_const', 'als'.
If the imputation strategy is 'categorical_const' or 'numerical_const', then a 3rd element must be included in the tuple, which specifies the constant value to be used to substitute the detected missing values in the column.
An example for illustration: [('V1', 'categorical_const', '0'), ('V5','median')]
Defaults to None.
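The tuple rules above can be checked before a value is passed on. The helper below is a hypothetical client-side sketch, not part of any library: it only encodes the constraints stated in this section (at least two elements, a valid strategy name, and a third element for the two constant strategies).

```python
VALID_COLUMN_STRATEGIES = {
    'non', 'delete', 'most_frequent', 'categorical_const',
    'mean', 'median', 'numerical_const', 'als',
}
CONST_STRATEGIES = {'categorical_const', 'numerical_const'}


def find_strategy_by_col_problems(strategy_by_col):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for entry in strategy_by_col:
        if len(entry) < 2:
            problems.append(f"{entry!r}: needs a column name and a strategy")
            continue
        column, strategy = entry[0], entry[1]
        if strategy not in VALID_COLUMN_STRATEGIES:
            problems.append(f"{column}: unknown strategy {strategy!r}")
        elif strategy in CONST_STRATEGIES and len(entry) < 3:
            problems.append(f"{column}: {strategy!r} needs a constant value")
    return problems


# The example from the text passes; a constant strategy without a value does not.
ok = find_strategy_by_col_problems(
    [('V1', 'categorical_const', '0'), ('V5', 'median')]
)
bad = find_strategy_by_col_problems([('V2', 'numerical_const')])
```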
Note
The following parameters all have the prefix 'als_' and take effect only when 'als' is selected as an imputation strategy in either strategy or strategy_by_col. They configure the alternating least squares (ALS) model used for data imputation.
als_factors
: int, optional
Length of factor vectors in the ALS model.
It should be less than the number of numerical columns, so that the imputation results would be meaningful.
Defaults to 3.
als_lambda
: float, optional
L2 regularization applied to the factors in the ALS model.
Should be non-negative.
Defaults to 0.01.
als_maxit
: int, optional
Maximum number of iterations for solving the ALS model.
Defaults to 20.
als_randomstate
: int, optional
Specifies the seed of the random number generator used in training the ALS model:
0: Uses the current time as the seed,
Others: Uses the specified value as the seed.
Defaults to 0.
als_exit_threshold
: float, optional
Specifies a threshold for stopping the training of the ALS model. If the improvement in the cost function of the ALS model between consecutive checks is less than this value, training stops.
0 means the objective value is not checked while the algorithm runs, so training continues until the maximum number of iterations is reached.
Defaults to 0.
als_exit_interval
: int, optional
Specifies the number of iterations between consecutive checks of the cost function for the ALS model, to see whether the threshold set by als_exit_threshold has been reached.
Defaults to 5.
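One way to read the interaction between als_maxit, als_exit_threshold, and als_exit_interval is sketched below. This is an assumption about the stopping rule based only on the descriptions above, not HANA's actual implementation. The function `als_stop_iteration` and the `costs` list (one cost value per iteration) are hypothetical.

```python
def als_stop_iteration(costs, maxit=20, exit_threshold=0.0, exit_interval=5):
    """Return the iteration at which training would stop (illustrative only).

    costs[i] is the assumed cost-function value after iteration i + 1.
    """
    limit = min(maxit, len(costs))
    last_checked = None
    for it in range(1, limit + 1):
        # A threshold of 0 disables checking, so only maxit ends the loop.
        if exit_threshold > 0 and it % exit_interval == 0:
            cost = costs[it - 1]
            improvement = None if last_checked is None else last_checked - cost
            if improvement is not None and improvement < exit_threshold:
                return it
            last_checked = cost
    return limit


# Hypothetical cost curve that flattens out: 100, 50, 33.3, 25, 20, ...
costs = [100.0 / k for k in range(1, 21)]
no_check = als_stop_iteration(costs)                       # threshold 0
early = als_stop_iteration(costs, exit_threshold=5.0)      # check every 5
earlier = als_stop_iteration(costs, exit_threshold=5.0, exit_interval=2)
```

In this sketch a smaller interval notices the plateau sooner, at the price of evaluating the cost function more often.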
als_linsolver
: {'cholesky', 'cg'}, optional
Linear system solver for the ALS model.
'cholesky' is usually much faster.
'cg' is recommended when als_factors is large.
Defaults to 'cholesky'.
als_cg_maxit
: int, optional
Specifies the maximum number of iterations for the cg algorithm.
Takes effect only when 'cg' is the chosen linear system solver for ALS.
Defaults to 3.
als_centering
: bool, optional
Whether to center the data by column before training the ALS model.
Defaults to True.
als_scaling
: bool, optional
Whether to scale the data by column before training the ALS model.
Defaults to True.
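The constraints stated for the ALS parameters (factor length below the number of numerical columns, non-negative regularization, one of two solvers) can be collected into a small pre-flight check. The helper `find_als_setting_problems` is hypothetical and lives purely on the client side. Its defaults mirror the ones documented above.

```python
def find_als_setting_problems(n_numerical_columns, als_factors=3,
                              als_lambda=0.01, als_linsolver='cholesky'):
    """Return a list of problems with the given ALS settings (illustrative)."""
    problems = []
    if als_factors >= n_numerical_columns:
        problems.append(
            "als_factors should be less than the number of numerical columns"
        )
    if als_lambda < 0:
        problems.append("als_lambda should be non-negative")
    if als_linsolver not in ('cholesky', 'cg'):
        problems.append("als_linsolver must be 'cholesky' or 'cg'")
    return problems


# With ten numerical columns the documented defaults pass the check.
defaults_ok = find_als_setting_problems(n_numerical_columns=10)
# With only three numerical columns the default of 3 factors is too large.
too_many_factors = find_als_setting_problems(n_numerical_columns=3)
```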