Imputer
- class hana_ml.algorithms.pal.preprocessing.Imputer(strategy=None, als_factors=None, als_lambda=None, als_maxit=None, als_randomstate=None, als_exit_threshold=None, als_exit_interval=None, als_linsolver=None, als_cg_maxit=None, als_centering=None, als_scaling=None, thread_ratio=None)
Missing value imputation for DataFrame.
- Parameters
- strategy : {'non', 'most_frequent-mean', 'most_frequent-median', 'most_frequent-zero', 'most_frequent-als', 'delete'}, optional
Specifies the overall imputation strategy.
'non' : No imputation for any column.
'most_frequent-mean' : Replaces missing values in a categorical column by its most frequently observed value, and missing values in a numerical column by its mean.
'most_frequent-median' : Replaces missing values in a categorical column by its most frequently observed value, and missing values in a numerical column by its median.
'most_frequent-zero' : Replaces missing values in a categorical column by its most frequently observed value, and missing values in a numerical column by zero.
'most_frequent-als' : Replaces missing values in a categorical column by its most frequently observed value, and fills missing values in numerical columns via a matrix completion technique called alternating least squares (ALS).
'delete' : Deletes all rows with missing values.
Defaults to 'most_frequent-mean'.
- thread_ratio : float, optional
Controls the proportion of available threads to use.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use up to that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.0.
Note
The following parameters are all prefixed with 'als_', and are relevant only when the ALS-based imputation strategy is chosen. They configure the alternating-least-squares (ALS) model used for data imputation; a construction sketch follows the parameter list below.
- als_factors : int, optional
Length of factor vectors in the ALS model.
It should be less than the number of numerical columns for the imputation results to be meaningful.
Defaults to 3.
- als_lambda : float, optional
L2 regularization applied to the factors in the ALS model.
Should be non-negative.
Defaults to 0.01.
- als_maxit : int, optional
Maximum number of iterations for solving the ALS model.
Defaults to 20.
- als_randomstate : int, optional
Specifies the seed of the random number generator used in the training of the ALS model:
0: Uses the current time as the seed.
Others: Uses the specified value as the seed.
Defaults to 0.
- als_exit_threshold : float, optional
Specifies the threshold for stopping the training of the ALS model. If the improvement of the cost function between consecutive checks is less than this value, the training process exits.
0 means the objective value is not checked while running the algorithm, and training stops only when the maximum number of iterations has been reached.
Defaults to 0.
- als_exit_interval : int, optional
Specifies the number of iterations between consecutive checks of the cost function for the ALS model, so that one can see whether the pre-specified als_exit_threshold has been reached.
Defaults to 5.
- als_linsolver : {'cholesky', 'cg'}, optional
Linear system solver for the ALS model.
'cholesky' is usually much faster.
'cg' is recommended when als_factors is large.
Defaults to 'cholesky'.
- als_cg_maxit : int, optional
Specifies the maximum number of iterations for the cg algorithm.
Invoked only when 'cg' is the chosen linear system solver for ALS.
Defaults to 3.
- als_centering : bool, optional
Whether to center the data by column before training the ALS model.
Defaults to True.
- als_scaling : bool, optional
Whether to scale the data by column before training the ALS model.
Defaults to True.
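For orientation, a minimal construction sketch (the parameter values below are illustrative only, not recommendations) showing how the ALS-related parameters are passed to the constructor together with the overall strategy:

>>> from hana_ml.algorithms.pal.preprocessing import Imputer
>>> impute = Imputer(strategy='most_frequent-als',
...                  als_factors=2,            # keep below the number of numerical columns
...                  als_lambda=0.01,          # L2 regularization of the factors
...                  als_maxit=20,             # maximum ALS iterations
...                  als_randomstate=1,        # fixed seed for reproducibility
...                  als_exit_threshold=1e-4,  # stop when the cost improves less than this
...                  als_exit_interval=5,      # check the cost every 5 iterations
...                  als_linsolver='cg',       # 'cg' recommended when als_factors is large
...                  als_cg_maxit=3,           # CG iterations, used only with 'cg'
...                  als_centering=True,
...                  als_scaling=True,
...                  thread_ratio=0.3)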
Examples
Input DataFrame df:
>>> df.head(5).collect()
   V0   V1 V2   V3   V4    V5
0  10  0.0  D  NaN  1.4  23.6
1  20  1.0  A  0.4  1.3  21.8
2  50  1.0  C  NaN  1.6  21.9
3  30  NaN  B  0.8  1.7  22.6
4  10  0.0  A  0.2  NaN   NaN
Create an Imputer instance using the 'most_frequent-mean' strategy and call fit_transform:
>>> impute = Imputer(strategy='most_frequent-mean')
>>> result = impute.fit_transform(df, categorical_variable=['V1'],
...                               strategy_by_col=[('V1', 'categorical_const', '0')])
>>> result.head(5).collect()
   V0  V1 V2        V3        V4         V5
0  10   0  D  0.507692  1.400000  23.600000
1  20   1  A  0.400000  1.300000  21.800000
2  50   1  C  0.507692  1.600000  21.900000
3  30   0  B  0.800000  1.700000  22.600000
4  10   0  A  0.200000  1.469231  20.646154
The statistics/model content collected from the input DataFrame:
>>> impute.model_.head(5).collect()
            STAT_NAME                   STAT_VALUE
0  V0.NUMBER_OF_NULLS                            3
1  V0.IMPUTATION_TYPE                         MEAN
2    V0.IMPUTED_VALUE                           24
3  V1.NUMBER_OF_NULLS                            2
4  V1.IMPUTATION_TYPE  SPECIFIED_CATEGORICAL_VALUE
The statistics/model content collected above can be applied to impute another DataFrame with the same data structure, e.g. the following DataFrame with missing values:
>>> df1.collect()
   ID    V0   V1    V2   V3   V4    V5
0   0  20.0  1.0     B  NaN  1.5  21.7
1   1  40.0  1.0  None  0.6  1.2  24.3
2   2   NaN  0.0     D  NaN  1.8  22.6
3   3  50.0  NaN     C  0.7  1.1   NaN
4   4  20.0  1.0     A  0.3  NaN  20.6
With the fitted Imputer instance impute, one can impute the missing values of df1 via the following lines of code, and then check the result:
>>> result1, statistics = impute.transform(data=df1, key='ID')
>>> result1.collect()
   ID  V0  V1 V2        V3        V4         V5
0   0  20   1  B  0.507692  1.500000  21.700000
1   1  40   1  A  0.600000  1.200000  24.300000
2   2  24   0  D  0.507692  1.800000  22.600000
3   3  50   0  C  0.700000  1.100000  20.646154
4   4  20   1  A  0.300000  1.469231  20.600000
Create an Imputer instance using another strategy, e.g. the ALS-based strategy, and then call fit_transform:
>>> impute = Imputer(strategy='als', als_factors=2, als_randomstate=1)
>>> result2 = impute.fit_transform(data=df, categorical_variable=['V1'])
Output:
>>> result2.head(5).collect()
   V0  V1 V2        V3        V4         V5
0  10   0  D  0.306957  1.400000  23.600000
1  20   1  A  0.400000  1.300000  21.800000
2  50   1  C  0.930689  1.600000  21.900000
3  30   0  B  0.800000  1.700000  22.600000
4  10   0  A  0.200000  1.333668  21.371753
- Attributes
- model_ : DataFrame
Statistics/model content.
Methods
fit_transform(data[, key, ...])
Impute the missing values of a DataFrame, return the result, and collect the related statistics/model info for imputation.
transform(data[, key, thread_ratio])
Impute the missing values of a DataFrame using statistics/model info collected from another DataFrame.
- fit_transform(data, key=None, categorical_variable=None, strategy_by_col=None)
Impute the missing values of a DataFrame, return the result, and collect the related statistics/model info for imputation.
- Parameters
- data : DataFrame
Input data with missing values.
- key : str, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- categorical_variable : str or a list of str, optional
Names of columns with INTEGER data type that should actually be treated as categorical.
By default, columns of INTEGER and DOUBLE type are all treated as numerical, while columns of VARCHAR or NVARCHAR type are treated as categorical.
- strategy_by_col : ListOfTuples, optional
Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation.
Each tuple in the list should contain at least two elements, such that:
the 1st element is the name of a column;
the 2nd element is the imputation strategy for that column; valid strategies include: 'non', 'delete', 'most_frequent', 'categorical_const', 'mean', 'median', 'numerical_const', 'als'.
If the imputation strategy is 'categorical_const' or 'numerical_const', then a 3rd element must be included in the tuple, which specifies the constant value to be used to substitute the detected missing values in the column.
- An example for illustration:
[('V1', 'categorical_const', '0'),
('V5', 'median')]
A fuller usage sketch follows the Returns section below.
- Returns
- DataFrame
Imputed result using the specified strategy, with the same data structure, i.e. the same column names and data types as data.
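As a complement to the parameter descriptions above, a minimal sketch of a fit_transform call combining key, categorical_variable and strategy_by_col (the column names and per-column strategies are taken from the illustrations above and are placeholders, not recommendations):

>>> impute = Imputer(strategy='most_frequent-mean')
>>> result = impute.fit_transform(data=df1, key='ID',
...                               categorical_variable=['V1'],
...                               strategy_by_col=[('V1', 'categorical_const', '0'),
...                                                ('V5', 'median')])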
- transform(data, key=None, thread_ratio=None)
The function imputes missing values of a DataFrame using statistics/model info collected from another DataFrame.
- Parameters
- data : DataFrame
Input DataFrame.
- key : str, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- thread_ratio : float, optional
Controls the proportion of available threads to use.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use up to that percentage of available threads.
Values outside this range tell HANA PAL to heuristically determine the number of threads to use.
Defaults to 0.0.
- Returns
- DataFrame
Imputation result, structured the same as data.
- DataFrame
Statistics for the imputation result, structured as:
STAT_NAME: type NVARCHAR(256), statistics name.
STAT_VALUE: type NVARCHAR(5000), statistics value.
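For illustration, a sketch that reuses the fitted instance impute from the Examples section to impute new data; the thread_ratio value here is arbitrary:

>>> result1, statistics = impute.transform(data=df1, key='ID', thread_ratio=0.5)
>>> statistics.collect()  # STAT_NAME / STAT_VALUE pairs describing the imputation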
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
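A minimal sketch, assuming impute has already been fitted (e.g. via fit_transform above); both properties are read directly and the returned procedure text can be inspected or persisted:

>>> print(impute.fit_hdbprocedure)      # generated SQLScript procedure for the fit step
>>> print(impute.predict_hdbprocedure)  # generated SQLScript procedure for the predict step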