R: Imputer

hanaml.Imputer {hana.ml.r}

R Documentation

Imputer

Description

Missing value imputation for DataFrame.

Usage

hanaml.Imputer(conn.context, data = NULL, key = NULL, strategy = NULL,
               strategy.by.col = NULL, als.factors = NULL,
               als.lambda = NULL, als.maxit = NULL,
               als.randomstate = NULL, als.exit.threshold = NULL,
               als.exit.interval = NULL, als.linsolver = NULL,
               als.cg.maxit = NULL,
               als.centering = NULL, als.scaling = NULL,
               categorical.variable = NULL,
               thread.ratio = NULL)

Arguments

`conn.context`	`ConnectionContext` Database connection object.
`data`	`DataFrame` Dataset used for training.
`key`	`character, optional` Name of the ID column.
`strategy`	`character, optional` The overall imputation strategy. The choices mostly concern numerical columns. For categorical columns, missing values that are not left untouched or deleted are replaced by the most frequent value of their column by default. `"non"`: Does nothing and leaves all columns untouched. `"mean"`: For numerical columns, fills all missing values with the mean; for categorical columns, fills all missing values with the most frequent value. `"median"`: For numerical columns, fills all missing values with the median; for categorical columns, fills all missing values with the most frequent value. `"zero"`: For numerical columns, fills all missing values with zeros; for categorical columns, fills all missing values with the most frequent value. `"als"`: For numerical columns, fills each missing value with the value imputed by a matrix completion model trained with the alternating least squares method; for categorical columns, fills all missing values with the most frequent value. `"delete"`: Deletes every row that contains a missing value; the entire row is removed from the table. Defaults to `"mean"`.
`strategy.by.col`	`list, optional` Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation. Elements of this list must be named. The names must be column names, while each value should be either the imputation strategy applied to that column or the replacement for all missing values within that column. Valid column imputation strategies are: "mean", "median", "als", "non", "delete", "most_frequent". The first five strategies are applicable to numerical columns, while the final three strategies are applicable to categorical columns. An illustrative example: strategy.by.col = list(V1 = 0, V5 = "median"), which means that for column V1 all missing values are replaced by the constant 0, while for column V5 all missing values are replaced by the median of all available values in that column.
`als.factors`	`integer, optional` Length of the factor vectors in the ALS model. It should be less than the number of numerical columns so that the imputation results are meaningful. Defaults to 3.
`als.lambda`	`double, optional` L2 regularization applied to the factors in the ALS model. Should be non-negative. Defaults to 0.01.
`als.maxit`	`integer, optional` Maximum number of iterations for solving the ALS model. Defaults to 20.
`als.randomstate`	`integer, optional` Specifies the seed of the random number generator used in the training of the ALS model. `0`: Uses the current time as the seed. `Others`: Uses the specified value as the seed. Defaults to 0.
`als.exit.threshold`	`integer, optional` Specifies the threshold for stopping the training of the ALS model. If the improvement of the cost function of the ALS model is less than this value between consecutive checks, the training process exits. A value of 0 means the objective value is never checked while the algorithm runs, and training stops only when the maximum number of iterations has been reached. Defaults to 0.
`als.exit.interval`	`integer, optional` Specifies the number of iterations between consecutive checks of the cost function for the ALS model, so that one can see whether the pre-specified `als.exit.threshold` has been reached. Defaults to 5.
`als.linsolver`	`character, optional` Linear system solver for the ALS model, either "cholsky" or "cg". "cholsky" is usually much faster. "cg" is recommended when `als.factors` is large. Defaults to 'cholsky'.
`als.cg.maxit`	`integer, optional` Specifies the maximum number of iterations for the cg algorithm. Used only when "cg" is the chosen linear system solver for ALS. Defaults to 3.
`als.centering`	`logical, optional` Whether to center the data by column before training the ALS model. Defaults to TRUE.
`als.scaling`	`logical, optional` Whether to scale the data by column before training the ALS model. Defaults to TRUE.
`categorical.variable`	`character or list of characters, optional` Names of columns with INTEGER data type that should actually be treated as categorical. By default, columns of INTEGER and DOUBLE type are all treated as numerical, while columns of VARCHAR or NVARCHAR type are treated as categorical.
`thread.ratio`	`double, optional` Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.0.
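
As a rough local illustration of how `strategy` and `strategy.by.col` combine, the sketch below mimics the documented rules on an ordinary R data.frame. It does not call PAL or `hanaml.Imputer`. The helper names `most_frequent`, `impute_column` and `impute_local` are hypothetical, only character columns are treated as categorical, the "als" and "delete" strategies are left out, and treating any value that is not a known strategy name as a replacement constant is an assumption.

```r
# Local illustration only; not the PAL implementation behind hanaml.Imputer.

# Most frequent non-missing value of a column (type is preserved).
most_frequent <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Apply one rule (a strategy name or a replacement constant) to one column.
impute_column <- function(x, rule) {
  known <- c("mean", "median", "zero", "most_frequent", "non")
  if (!(is.character(rule) && rule %in% known)) {
    x[is.na(x)] <- rule                 # assumption: anything else is a constant
    return(x)
  }
  if (rule == "non") return(x)          # leave the column untouched
  if (!is.numeric(x)) rule <- "most_frequent"   # categorical fallback
  fill <- switch(rule,
                 mean = mean(x, na.rm = TRUE),
                 median = median(x, na.rm = TRUE),
                 zero = 0,
                 most_frequent = most_frequent(x))
  x[is.na(x)] <- fill
  x
}

# Start from the overall strategy, then let per-column entries override it.
impute_local <- function(df, strategy = "mean", strategy.by.col = list()) {
  rules <- setNames(rep(list(strategy), ncol(df)), names(df))
  if (length(strategy.by.col) > 0) {
    rules[names(strategy.by.col)] <- strategy.by.col
  }
  for (col in names(df)) {
    df[[col]] <- impute_column(df[[col]], rules[[col]])
  }
  df
}

toy <- data.frame(V1 = c(1, NA, 3),
                  V2 = c("a", NA, "a"),
                  V5 = c(10, 20, NA),
                  stringsAsFactors = FALSE)
out <- impute_local(toy, strategy = "mean",
                    strategy.by.col = list(V1 = 0, V5 = "median"))
print(out)
```

In this toy run V1 receives the constant 0, V5 receives its median, and V2 falls back to its most frequent value because the overall "mean" strategy does not apply to a categorical column.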

Format

R6Class object.

Value

An "Imputer" object with the following attributes:

result : DataFrame
Has the same column structure (number of columns, column names, and column types) as the table on which the model is trained.
model : DataFrame
Statistics/model content.

Note

The parameters with the prefix 'als' take effect only when 'als' is the overall imputation strategy. They configure the alternating least squares (ALS) model used for data imputation.
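
For intuition about what the 'als' parameters control, here is a minimal base R sketch of matrix completion by alternating least squares. It is not the PAL algorithm: the function name `als_impute` is hypothetical, centering and scaling are omitted, the linear systems are solved with `solve()` rather than a selectable solver, and the non-negative random start is an assumption that suits this small positive example. The arguments loosely mirror `als.factors`, `als.lambda`, `als.maxit`, `als.exit.threshold`, `als.exit.interval` and `als.randomstate`.

```r
# Sketch only: fill NA entries of a numeric matrix with a low-rank ALS fit.
als_impute <- function(X, factors = 1, lambda = 0.01, maxit = 20,
                       exit.threshold = 0, exit.interval = 5, seed = 1) {
  set.seed(seed)
  n <- nrow(X); m <- ncol(X)
  obs <- !is.na(X)                               # mask of observed entries
  U <- matrix(0, n, factors)                     # row factors
  V <- matrix(runif(m * factors), m, factors)    # column factors, random start
  ridge <- lambda * diag(factors)                # L2 penalty term
  # Regularized squared error over the observed entries only.
  cost <- function() {
    sum((X - U %*% t(V))[obs]^2) + lambda * (sum(U^2) + sum(V^2))
  }
  last <- cost()
  for (it in seq_len(maxit)) {
    for (i in seq_len(n)) {                      # ridge regression per row
      j <- which(obs[i, ])
      Vj <- V[j, , drop = FALSE]
      U[i, ] <- solve(crossprod(Vj) + ridge, crossprod(Vj, X[i, j]))
    }
    for (j in seq_len(m)) {                      # ridge regression per column
      i <- which(obs[, j])
      Ui <- U[i, , drop = FALSE]
      V[j, ] <- solve(crossprod(Ui) + ridge, crossprod(Ui, X[i, j]))
    }
    # Check the cost only every exit.interval iterations; 0 disables checks.
    if (exit.threshold > 0 && it %% exit.interval == 0) {
      now <- cost()
      if (last - now < exit.threshold) break
      last <- now
    }
  }
  filled <- X
  filled[!obs] <- (U %*% t(V))[!obs]             # only missing cells change
  filled
}

# Rank-1 example: entry [2, 3] of outer(1:4, 1:3) is 6 before it is removed.
X <- outer(1:4, c(1, 2, 3))
X[2, 3] <- NA
filled <- als_impute(X, factors = 1)
print(filled)
```

On this example the filled entry should land close to the true value 6, pulled slightly toward zero by the penalty. Observed entries are never modified.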

Examples

## Not run: 
 Input DataFrame data for training:
 > data$Collect()
   V0    V1    V2      V3    V4    V5
1  10    0     D       NA    1.4   23.6
2  20    1     A       0.4   1.3   21.8
3  50    1     C       NULL  1.6   21.9
4  30    NULL  B       0.8   1.7   22.6
5  10    0     A       0.2   NULL  NULL
6  10    0     <NULL>  0.5   1.8   19.7
7  NULL  0     C       0.5   NULL  17.8
8  10    1     A       0.6   1.6   24.9
9  20    NULL  D       0.9   1.7   22.2
10 30    1     D       0.4   1.3   NULL
11 50    0     <NULL>  0.3   1.2   16.4
12 NULL  1     B       0.7   1.2   19.3
13 30    1     A       0.2   1.1   21.7
14 30    0     D       NULL  NULL  NULL
15 NULL  1     C       0.5   1.8   18.6
16 20    0     A       0.6   1.4   17.9

 Train the model; an "Imputer" object is returned:
 > imputer <- hanaml.Imputer(conn, data, strategy = "mean",
                             categorical.variable = "V1",
                             strategy.by.col = list(V1 = 0))
Expected output:
> imputer$result$Collect()
    V0  V1 V2     V3               V4                 V5
1   10  0  D  0.5076923076923077  1.4                23.6
2   20  1  A  0.4                 1.3                21.8
3   50  1  C  0.5076923076923077  1.6                21.9
4   30  0  B  0.8                 1.7                22.6
5   10  0  A  0.2                 1.4692307692307693 20.646153846153844
6   10  0  A  0.5                 1.8                19.7
7   24  0  C  0.5                 1.4692307692307693 17.8
8   10  1  A  0.6                 1.6                24.9
9   20  0  D  0.9                 1.7                22.2
10  30  1  D  0.4                 1.3                20.646153846153844
11  50  0  A  0.3                 1.2                16.4
12  24  1  B  0.7                 1.2                19.3
13  30  1  A  0.2                 1.1                21.7
14  30  0  D  0.5076923076923077  1.4692307692307693 20.646153846153844
15  24  1  C  0.5                 1.8                18.6
16  20  0  A  0.6                 1.4                17.9

## End(Not run)
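
The fill values in the expected output can be checked locally with base R, without a HANA connection. The sketch below copies the columns from the example (with every missing entry written as NA) and computes the mean of each numerical column and the most frequent value of V2. The variable name `fills` is illustrative. The mean of V0 is 320/13, about 24.6, while the expected output shows 24; that looks like truncation for an integer-typed column, which is an inference from the example rather than documented behavior.

```r
# Columns copied from the example above; NA marks every missing entry.
V0 <- c(10, 20, 50, 30, 10, 10, NA, 10, 20, 30, 50, NA, 30, 30, NA, 20)
V2 <- c("D", "A", "C", "B", "A", NA, "C", "A",
        "D", "D", NA, "B", "A", "D", "C", "A")
V3 <- c(NA, 0.4, NA, 0.8, 0.2, 0.5, 0.5, 0.6,
        0.9, 0.4, 0.3, 0.7, 0.2, NA, 0.5, 0.6)
V4 <- c(1.4, 1.3, 1.6, 1.7, NA, 1.8, NA, 1.6,
        1.7, 1.3, 1.2, 1.2, 1.1, NA, 1.8, 1.4)
V5 <- c(23.6, 21.8, 21.9, 22.6, NA, 19.7, 17.8, 24.9,
        22.2, NA, 16.4, 19.3, 21.7, NA, 18.6, 17.9)

fills <- list(
  V0 = mean(V0, na.rm = TRUE),           # 320 / 13
  V2 = names(which.max(table(V2))),      # most frequent category
  V3 = mean(V3, na.rm = TRUE),           # 6.6 / 13
  V4 = mean(V4, na.rm = TRUE),           # 19.1 / 13
  V5 = mean(V5, na.rm = TRUE)            # 268.4 / 13
)
print(fills)
```

V1 is not recomputed here because the example replaces its missing values with the constant 0 through `strategy.by.col`.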

[Package hana.ml.r version 1.0.8 Index]