hanaml.Imputer {hana.ml.r}    R Documentation

Imputer

Description

Missing value imputation for DataFrame.

Usage

hanaml.Imputer(conn.context, data = NULL, key = NULL, strategy = NULL,
               strategy.by.col = NULL, als.factors = NULL,
               als.lambda = NULL, als.maxit = NULL,
               als.randomstate = NULL, als.exit.threshold = NULL,
               als.exit.interval = NULL, als.linsolver = NULL,
               als.cg.maxit = NULL,
               als.centering = NULL, als.scaling = NULL,
               categorical.variable = NULL,
               thread.ratio = NULL)

Arguments

conn.context

ConnectionContext
Database connection object.

data

DataFrame
Dataset used for training.

key

character, optional
Name of the ID column.

strategy

character, optional

The overall imputation strategy. The choices mostly concern numerical columns. For categorical columns, missing values that are neither left untouched nor deleted are replaced by the most frequent value of their column by default.

  • "non": Does nothing. Leave all columns untouched.

  • "mean": For numerical columns, fills all missing values with the mean; for categorical columns, fills all missing values with the most frequent value.

  • "median": For numerical columns, fills all missing values with the median; for categorical columns, fills all missing values with the most frequent value.

  • "zero": For numerical columns, fills all missing values with zeros; for categorical columns, fills all missing values with the most frequent value.

  • "als": For numerical columns, fills each missing value with the value imputed by a matrix completion model trained with the alternating least squares (ALS) method; for categorical columns, fills all missing values with the most frequent value.

  • "delete": Deletes all rows with missing values. The entire row in table will be deleted.

Defaults to 'mean'.
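
The per-type behaviour of the overall strategies can be pictured with a small base R sketch. It runs on a local data.frame rather than a HANA DataFrame, and impute_local is a hypothetical helper written for illustration only; it is not part of hana.ml.r, and the real imputation is carried out inside the database. How ties between equally frequent values are broken is an assumption here (the first value in sorted order wins).

```r
# Hypothetical local helper: mimics the documented semantics of the
# "non", "delete", "mean", "median" and "zero" strategies on a data.frame.
impute_local <- function(df, strategy = "mean") {
  if (strategy == "non") return(df)
  if (strategy == "delete") {
    return(df[stats::complete.cases(df), , drop = FALSE])
  }
  for (col in names(df)) {
    x <- df[[col]]
    if (!anyNA(x)) next
    if (is.numeric(x)) {
      fill <- switch(strategy,
                     mean   = mean(x, na.rm = TRUE),
                     median = stats::median(x, na.rm = TRUE),
                     zero   = 0,
                     stop("unsupported strategy: ", strategy))
    } else {
      # Categorical column: use the most frequent observed value.
      counts <- table(x)
      fill <- names(counts)[which.max(counts)]
    }
    x[is.na(x)] <- fill
    df[[col]] <- x
  }
  df
}

toy <- data.frame(num = c(1, NA, 4, 7),
                  cat = c("A", "B", NA, "A"),
                  stringsAsFactors = FALSE)

by_mean   <- impute_local(toy, "mean")    # num[2] becomes 4, cat[3] becomes "A"
by_median <- impute_local(toy, "median")  # num[2] becomes 4
by_zero   <- impute_local(toy, "zero")    # num[2] becomes 0
by_delete <- impute_local(toy, "delete")  # keeps rows 1 and 4 only
```

Note that every strategy except "non" and "delete" treats the categorical column the same way, which matches the description above.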

strategy.by.col

list, optional
Specifies the imputation strategy for a set of columns, overriding the overall strategy for those columns.
Elements of this list must be named. Each name must be a column name, and each value must be either the imputation strategy applied to that column or the replacement for all missing values within that column.
Valid column imputation strategies are:
"mean", "median", "als", "non", "delete", "most_frequent".
The first five strategies are applicable to numerical columns, while the final three strategies are applicable to categorical columns.
An illustrative example:
strategy.by.col = list(V1 = 0, V5 = "median"), which means that for column V1 all missing values are replaced by the constant 0, while for column V5 all missing values are replaced by the median of the available values in that column.
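
To make the override rule concrete, the following base R sketch resolves what would be applied to each column. The helper resolve_plan is hypothetical and exists only for this illustration; it is not part of hana.ml.r, and the validation it performs is an assumption about what a sensible check could look like.

```r
# Hypothetical helper: shows how a named strategy.by.col list overrides the
# overall strategy column by column. Not part of hana.ml.r.
resolve_plan <- function(columns, strategy = "mean", strategy.by.col = list()) {
  stopifnot(is.list(strategy.by.col))
  if (length(strategy.by.col) > 0) {
    nms <- names(strategy.by.col)
    if (is.null(nms) || any(nms == "")) stop("all elements must be named")
    unknown <- setdiff(nms, columns)
    if (length(unknown) > 0) {
      stop("unknown column(s): ", paste(unknown, collapse = ", "))
    }
  }
  known <- c("mean", "median", "als", "non", "delete", "most_frequent")
  plan <- lapply(columns, function(col) {
    override <- strategy.by.col[[col]]
    if (is.null(override)) {
      list(kind = "strategy", value = strategy)   # fall back to overall strategy
    } else if (is.character(override) && override %in% known) {
      list(kind = "strategy", value = override)   # per-column strategy
    } else {
      list(kind = "constant", value = override)   # literal replacement value
    }
  })
  names(plan) <- columns
  plan
}

plan <- resolve_plan(columns = c("V0", "V1", "V5"),
                     strategy = "mean",
                     strategy.by.col = list(V1 = 0, V5 = "median"))
# V0 falls back to "mean", V1 gets the constant 0, V5 uses "median".
```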

als.factors

integer, optional
Length of factor vectors in the ALS model. It should be less than the number of numerical columns, so that the imputation results would be meaningful.

Defaults to 3.

als.lambda

numeric, optional
L2 regularization applied to the factors in the ALS model.
Should be non-negative.

Defaults to 0.01.

als.maxit

integer, optional
Maximum number of iterations for solving the ALS model.
Defaults to 20.
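
For intuition about what the factor length, the L2 regularization and the iteration limit control, here is a minimal base R sketch of matrix completion by alternating least squares. It is an illustration under simplifying assumptions, not the PAL implementation: als_complete is a hypothetical function, it skips centering and scaling, always solves the small linear systems directly, and runs for a fixed number of iterations.

```r
# Minimal ALS matrix-completion sketch (illustration only).
# X: numeric matrix with NA marking the missing cells.
als_complete <- function(X, factors = 3, lambda = 0.01, maxit = 20, seed = 1) {
  set.seed(seed)
  n <- nrow(X)
  p <- ncol(X)
  obs <- !is.na(X)
  U <- matrix(stats::runif(n * factors), n, factors)  # one factor vector per row
  V <- matrix(stats::runif(p * factors), p, factors)  # one factor vector per column
  ridge <- lambda * diag(factors)                     # L2 penalty term
  for (it in seq_len(maxit)) {
    # Fix V and solve a small ridge regression for each row factor.
    for (i in seq_len(n)) {
      j <- which(obs[i, ])
      if (length(j) == 0) next
      Vj <- V[j, , drop = FALSE]
      U[i, ] <- solve(crossprod(Vj) + ridge, crossprod(Vj, X[i, j]))
    }
    # Fix U and solve for each column factor.
    for (j in seq_len(p)) {
      i <- which(obs[, j])
      if (length(i) == 0) next
      Ui <- U[i, , drop = FALSE]
      V[j, ] <- solve(crossprod(Ui) + ridge, crossprod(Ui, X[i, j]))
    }
  }
  # Keep observed cells as they are; fill only the missing ones.
  filled <- X
  approx <- U %*% t(V)
  filled[!obs] <- approx[!obs]
  filled
}

truth <- outer(1:6, c(1, 2, 3))  # exact rank-1 matrix
X <- truth
X[2, 3] <- NA                    # true value is 6
X[5, 1] <- NA                    # true value is 5
filled <- als_complete(X, factors = 1, lambda = 0.01, maxit = 50)
```

Because the toy matrix has rank 1, a single factor is enough to recover the two hidden cells closely. With real data the factor length has to stay below the number of numerical columns, as noted for als.factors.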

als.randomstate

integer, optional

Specifies the seed of the random number generator used when training the ALS model:

  • 0: Uses the current time as the seed.

  • Others: Uses the specified value as the seed.

Defaults to 0.

als.exit.threshold

numeric, optional
Specifies the threshold for stopping the training of the ALS model early. If the improvement of the cost function between consecutive checks is less than this value, training exits.
0 means the objective value is not checked while the algorithm runs, so training stops only when the maximum number of iterations has been reached.
Defaults to 0.

als.exit.interval

integer, optional
Specifies the number of iterations between consecutive checks of the cost function for the ALS model, which determine whether the threshold set by als.exit.threshold has been reached.
Defaults to 5.
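
The interplay of als.maxit, als.exit.threshold and als.exit.interval can be sketched in base R as follows. The helper als_stop_iteration is hypothetical and works on a precomputed cost trajectory; the exact check schedule (first check at iteration als.exit.interval, each check compared with the previous one) is an assumption, and PAL may differ in detail.

```r
# Hypothetical helper: returns the iteration at which training would stop,
# given cost[i] = value of the cost function after iteration i.
als_stop_iteration <- function(cost, maxit = 20, exit.threshold = 0,
                               exit.interval = 5) {
  maxit <- min(maxit, length(cost))
  if (exit.threshold <= 0) return(maxit)   # no checking: run until maxit
  last_checked <- NULL
  for (it in seq_len(maxit)) {
    if (it %% exit.interval != 0) next     # only check every exit.interval steps
    if (!is.null(last_checked) &&
        last_checked - cost[it] < exit.threshold) {
      return(it)                           # improvement too small: stop early
    }
    last_checked <- cost[it]
  }
  maxit
}

cost <- 100 * 0.5^(1:20)  # toy cost that halves on every iteration

stop_default <- als_stop_iteration(cost)                      # 20
stop_early   <- als_stop_iteration(cost, exit.threshold = 1)  # 15
stop_dense   <- als_stop_iteration(cost, exit.threshold = 1,
                                   exit.interval = 2)         # 10
```

A shorter interval detects stagnation sooner at the price of evaluating the cost function more often.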

als.linsolver

character, optional
Linear system solver for the ALS model, either "cholsky" or "cg".
"cholsky" is usually much faster. "cg" is recommended when als.factors is large.
Defaults to 'cholsky'.

als.cg.maxit

integer, optional
Specifies the maximum number of iterations for the conjugate gradient (cg) algorithm. Used only when "cg" is the chosen linear system solver for ALS.
Defaults to 3.

als.centering

logical, optional
Whether to center the data by column before training the ALS model.
Defaults to TRUE.

als.scaling

logical, optional
Whether to scale the data by column before training the ALS model.
Defaults to TRUE.
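
As a rough picture of column-wise centering and scaling in the presence of missing values, base R's scale() can be used on a local matrix. This is only an analogy: which statistics PAL uses internally (for example sample versus population standard deviation) is not stated here, so treat the exact numbers as an assumption.

```r
X <- cbind(a = c(1, 2, NA, 3),
           b = c(10, NA, 30, 50))

# Center and scale each column; NA cells are ignored when computing the
# column mean and standard deviation, and stay NA in the result.
Z <- scale(X, center = TRUE, scale = TRUE)

# The statistics are kept as attributes, so values can be mapped back to
# the original units after a model has filled in the gaps.
center <- attr(Z, "scaled:center")
spread <- attr(Z, "scaled:scale")
back <- sweep(sweep(Z, 2, spread, `*`), 2, center, `+`)
```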

categorical.variable

character or list of characters, optional
Names of columns with INTEGER data type that should actually be treated as categorical.
By default, columns of INTEGER and DOUBLE type are treated as numerical, while columns of VARCHAR or NVARCHAR type are treated as categorical.

thread.ratio

numeric, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.0.

Format

R6Class object.

Value

An "Imputer" object. As the example below shows, the imputed data is available as a DataFrame through its result attribute.

Note

The parameters with the prefix 'als' take effect only when 'als' is the overall imputation strategy. They configure the alternating least squares (ALS) model used for data imputation.

See Also

transform.Imputer

Examples

## Not run: 
 Input DataFrame data for training:
 > data$Collect()
   V0  V1   V2   V3   V4   V5
1  10   0    D   NA  1.4 23.6
2  20   1    A  0.4  1.3 21.8
3  50   1    C   NA  1.6 21.9
4  30  NA    B  0.8  1.7 22.6
5  10   0    A  0.2   NA   NA
6  10   0 <NA>  0.5  1.8 19.7
7  NA   0    C  0.5   NA 17.8
8  10   1    A  0.6  1.6 24.9
9  20  NA    D  0.9  1.7 22.2
10 30   1    D  0.4  1.3   NA
11 50   0 <NA>  0.3  1.2 16.4
12 NA   1    B  0.7  1.2 19.3
13 30   1    A  0.2  1.1 21.7
14 30   0    D   NA   NA   NA
15 NA   1    C  0.5  1.8 18.6
16 20   0    A  0.6  1.4 17.9

 Train the model; an "Imputer" object is returned:
 > imputer <- hanaml.Imputer(conn, data, strategy = "mean",
                             categorical.variable = "V1",
                             strategy.by.col = list(V1 = 0))
Expected output:
> imputer$result$Collect()
    V0  V1 V2     V3               V4                 V5
1   10  0  D  0.5076923076923077  1.4                23.6
2   20  1  A  0.4                 1.3                21.8
3   50  1  C  0.5076923076923077  1.6                21.9
4   30  0  B  0.8                 1.7                22.6
5   10  0  A  0.2                 1.4692307692307693 20.646153846153844
6   10  0  A  0.5                 1.8                19.7
7   24  0  C  0.5                 1.4692307692307693 17.8
8   10  1  A  0.6                 1.6                24.9
9   20  0  D  0.9                 1.7                22.2
10  30  1  D  0.4                 1.3                20.646153846153844
11  50  0  A  0.3                 1.2                16.4
12  24  1  B  0.7                 1.2                19.3
13  30  1  A  0.2                 1.1                21.7
14  30  0  D  0.5076923076923077  1.4692307692307693 20.646153846153844
15  24  1  C  0.5                 1.8                18.6
16  20  0  A  0.6                 1.4                17.9

## End(Not run)
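
The fill values in the expected output above can be cross-checked locally in base R without a database connection. The vectors below are copied from columns V0, V2 and V3 of the input data. That the integer column V0 is filled with the mean truncated to a whole number is inferred from the output (24 rather than 24.6); the exact rounding rule used by PAL is an assumption.

```r
v0 <- c(10, 20, 50, 30, 10, 10, NA, 10, 20, 30, 50, NA, 30, 30, NA, 20)
v2 <- c("D", "A", "C", "B", "A", NA, "C", "A",
        "D", "D", NA, "B", "A", "D", "C", "A")
v3 <- c(NA, 0.4, NA, 0.8, 0.2, 0.5, 0.5, 0.6,
        0.9, 0.4, 0.3, 0.7, 0.2, NA, 0.5, 0.6)

# Numerical column, strategy "mean": mean of the 13 available values.
fill_v3 <- mean(v3, na.rm = TRUE)

# Integer column: the mean 320 / 13 is cut down to a whole number.
fill_v0 <- as.integer(mean(v0, na.rm = TRUE))

# Categorical column: most frequent value ("A" occurs five times).
fill_v2 <- names(which.max(table(v2)))
```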

[Package hana.ml.r version 1.0.8 Index]