Imputer

hanaml.Imputer is an R wrapper for SAP HANA PAL Missing Value Handling. It imputes missing values in a DataFrame.

hanaml.Imputer(
  data = NULL,
  key = NULL,
  strategy = NULL,
  strategy.by.col = NULL,
  als.factors = NULL,
  als.lambda = NULL,
  als.maxit = NULL,
  als.randomstate = NULL,
  als.exit.threshold = NULL,
  als.exit.interval = NULL,
  als.linsolver = NULL,
  als.cg.maxit = NULL,
  als.centering = NULL,
  als.scaling = NULL,
  categorical.variable = NULL,
  thread.ratio = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

strategy

character, optional
Specifies the overall imputation strategy. Valid values are:

"non": Does nothing; leaves all columns untouched.
"most_frequent.mean": For numerical columns, fills all missing values with the mean; for categorical columns, fills all missing values with the most frequent value.
"most_frequent.median": For numerical columns, fills all missing values with the median; for categorical columns, fills all missing values with the most frequent value.
"most_frequent.zero": For numerical columns, fills all missing values with zeros; for categorical columns, fills all missing values with the most frequent value.
"most_frequent.als": For numerical columns, fills each missing value with the value imputed by a matrix completion model trained using the alternating least squares method; for categorical columns, fills all missing values with the most frequent value.
"delete": Deletes every row that contains a missing value.

Defaults to "most_frequent.mean".
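The semantics of the default "most_frequent.mean" strategy can be sketched locally in base R. This is only an illustration of the described behavior on an ordinary data.frame; impute_mean_mode is a hypothetical helper, not part of hana.ml.r, and the actual imputation is performed by PAL inside SAP HANA.

```r
# Local base-R illustration only; hanaml.Imputer itself runs inside SAP HANA.
# impute_mean_mode() is a hypothetical helper, not part of hana.ml.r.
impute_mean_mode <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      # Numerical column: fill with the mean of the observed values.
      df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
    } else {
      # Categorical column: fill with the most frequent observed value.
      counts <- table(df[[col]])
      df[[col]][missing] <- names(counts)[which.max(counts)]
    }
  }
  df
}

toy <- data.frame(
  X = c(1.0, NA, 3.0),
  G = c("A", "A", NA),
  stringsAsFactors = FALSE
)
filled <- impute_mean_mode(toy)
```

Here the missing entry of X becomes the mean of 1 and 3, and the missing entry of G becomes "A", the most frequent value.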

strategy.by.col

list, optional
Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation.
Elements of this list must be named. The names must be column names, while each value should either be the imputation strategy applied to that column, or the replacement for all missing values within that column.
Valid column imputation strategies are listed as follows:
"mean", "median", "als", "non", "delete", "most_frequent".
The first five strategies are applicable to numerical columns, while the final three strategies are applicable to categorical columns.
An illustrative example:

strategy.by.col = list(V1 = 0, V5 = "median")

This means that all missing values in column V1 are replaced by the constant 0, while all missing values in column V5 are replaced by the median of the available values in that column.
No default value.
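As a rough illustration of how each named element can be read, the following base-R sketch classifies a value as either a column strategy or a replacement constant. describe_override is a hypothetical helper, and the rule it uses (a character value matching one of the listed strategy names is a strategy, anything else is a constant) is an assumption; the documentation does not state how the function itself distinguishes the two cases.

```r
# Hypothetical helper for illustration; not part of hana.ml.r.
column_strategies <- c("mean", "median", "als", "non", "delete", "most_frequent")

describe_override <- function(strategy.by.col) {
  # The list must be fully named, with column names as the names.
  stopifnot(is.list(strategy.by.col),
            !is.null(names(strategy.by.col)),
            all(names(strategy.by.col) != ""))
  vapply(strategy.by.col, function(value) {
    if (is.character(value) && value %in% column_strategies) {
      paste("strategy:", value)          # assumed: known name means a strategy
    } else {
      paste("constant:", format(value))  # assumed: anything else is a constant
    }
  }, character(1))
}

overrides <- describe_override(list(V1 = 0, V5 = "median"))
```

For the example list, V1 is read as a constant replacement and V5 as the "median" strategy.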

als.factors

integer, optional
Length of factor vectors in the ALS model. It should be less than the number of numerical columns, so that the imputation results would be meaningful.
Defaults to 3.

als.lambda

double, optional
L2 regularization applied to the factors in the ALS model.
Should be non-negative.
Defaults to 0.01.

als.maxit

integer, optional
Maximum number of iterations for solving the ALS model.
Defaults to 20.

als.randomstate

integer, optional
Specifies the seed for random number generation in the ALS model.

0: Uses the current time as the seed.
Others: Uses the specified value as the seed.

Defaults to 0.

als.exit.threshold

integer, optional
Specifies a threshold for stopping the training of the ALS model. If the improvement of the cost function of the ALS model between consecutive checks is less than this value, the training process exits.
0 means the objective value is not checked while the algorithm runs, so training stops only when the maximum number of iterations has been reached.
Defaults to 0.

als.exit.interval

integer, optional
Specifies the number of iterations between consecutive checks of the cost function for the ALS model, which determine whether the als.exit.threshold criterion has been met.
Defaults to 5.
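The interaction between als.exit.threshold, als.exit.interval and als.maxit can be sketched with a toy cost sequence. This is a hypothetical base-R illustration, not hana.ml.r or PAL code; in particular it assumes that "improvement" means the absolute decrease in cost between two consecutive checks, which the documentation does not specify.

```r
# Hypothetical illustration of the documented exit rule; not hana.ml.r code.
# costs[i] stands for the ALS cost after iteration i; length(costs) plays
# the role of als.maxit.
als_stop_iteration <- function(costs, exit.threshold = 0, exit.interval = 5) {
  maxit <- length(costs)
  if (exit.threshold <= 0) return(maxit)  # no checking: run to the iteration cap
  last_checked <- NULL
  for (i in seq_len(maxit)) {
    if (i %% exit.interval != 0) next     # check only every exit.interval iterations
    if (!is.null(last_checked) &&
        last_checked - costs[i] < exit.threshold) {
      return(i)                           # improvement too small: stop here
    }
    last_checked <- costs[i]
  }
  maxit
}

costs <- c(10, 8, 6, 5, 4.5, 4.3, 4.2, 4.15, 4.12, 4.1,
           4.09, 4.085, 4.08, 4.078, 4.077,
           4.0765, 4.076, 4.0758, 4.0756, 4.0755)
stop_at <- als_stop_iteration(costs, exit.threshold = 0.1, exit.interval = 5)
```

With these toy costs, the check at iteration 10 still sees an improvement of 0.4, while the check at iteration 15 sees only 0.023, so training would stop there. With the default threshold of 0, all 20 iterations run.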

als.linsolver

list("cholesky", "cg"), optional
Linear system solver for the ALS model; either "cholesky" or "cg".
"cholesky" is usually much faster, while "cg" is recommended when als.factors is large.
Defaults to "cholesky".

als.cg.maxit

integer, optional
Specifies the maximum number of iterations for the cg algorithm. Takes effect only when "cg" is the chosen linear system solver for ALS.
Defaults to 3.

als.centering

logical, optional
Whether to center the data by column before training the ALS model.
Defaults to TRUE.

als.scaling

logical, optional
Whether to scale the data by column before training the ALS model.
Defaults to TRUE.

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior is dependent on what input is given:

"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.

Value

An "Imputer" object with the following attributes:

result : DataFrame
Imputed data with the same column structure (number of columns, column names, and column types) as the table on which the model is trained.
model : DataFrame
Statistics and model content.

Note

The parameters with the prefix "als" take effect only when "als" is the imputation strategy. These parameters configure the alternating least squares (ALS) model used for data imputation.
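A call that actually exercises the "als" parameters might look like the sketch below. It is wrapped in a hypothetical helper, impute_with_als, so that nothing is executed until a hana.ml.r DataFrame is supplied; running it requires a live SAP HANA connection. The parameter values are illustrative choices, not recommendations from the source.

```r
# Sketch only: calling this requires the hana.ml.r package and a live
# SAP HANA connection. `data` is assumed to be a hana.ml.r DataFrame.
# impute_with_als() is a hypothetical wrapper, not part of hana.ml.r.
impute_with_als <- function(data) {
  hanaml.Imputer(
    data = data,
    strategy = "most_frequent.als",  # ALS for numerical columns
    als.factors = 2,                 # keep below the number of numerical columns
    als.lambda = 0.01,               # L2 regularization (documented default)
    als.maxit = 20,                  # iteration cap (documented default)
    als.randomstate = 1,             # fixed seed instead of the current time
    als.linsolver = "cholesky",
    als.centering = TRUE,
    als.scaling = TRUE
  )
}
```

With any strategy that does not involve "als", the als.* arguments above would have no effect.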

Examples

Input DataFrame data:


 > data$Collect()
   V0    V1     V2     V3     V4     V5
1  10     0      D     NA    1.4   23.6
2  20     1      A    0.4    1.3   21.8
3  50     1      C     NA    1.6   21.9
4  30    NA      B    0.8    1.7   22.6
5  10     0      A    0.2     NA     NA
6  10     0   <NA>    0.5    1.8   19.7
7  NA     0      C    0.5     NA   17.8
8  10     1      A    0.6    1.6   24.9
9  20    NA      D    0.9    1.7   22.2
10 30     1      D    0.4    1.3     NA
11 50     0   <NA>    0.3    1.2   16.4
12 NA     1      B    0.7    1.2   19.3
13 30     1      A    0.2    1.1   21.7
14 30     0      D     NA     NA     NA
15 NA     1      C    0.5    1.8   18.6
16 20     0      A    0.6    1.4   17.9

Train the model; an "Imputer" object named ip is returned:


 > ip <- hanaml.Imputer(data = data,
                        strategy = "most_frequent.mean",
                        categorical.variable = "V1",
                        strategy.by.col = list(V1 = 0))

Output:


> ip$result$Collect()
    V0  V1 V2                 V3                  V4                  V5
1   10  0  D  0.5076923076923077                 1.4                23.6
2   20  1  A                 0.4                 1.3                21.8
3   50  1  C  0.5076923076923077                 1.6                21.9
4   30  0  B                 0.8                 1.7                22.6
5   10  0  A                 0.2  1.4692307692307693  20.646153846153844
6   10  0  A                 0.5                 1.8                19.7
7   24  0  C                 0.5  1.4692307692307693                17.8
8   10  1  A                 0.6                 1.6                24.9
9   20  0  D                 0.9                 1.7                22.2
10  30  1  D                 0.4                 1.3  20.646153846153844
11  50  0  A                 0.3                 1.2                16.4
12  24  1  B                 0.7                 1.2                19.3
13  30  1  A                 0.2                 1.1                21.7
14  30  0  D  0.5076923076923077  1.4692307692307693  20.646153846153844
15  24  1  C                 0.5                 1.8                18.6
16  20  0  A                 0.6                 1.4                17.9
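The imputed values in this output can be checked by hand in base R. The vectors below are the observed (non-missing) entries of each column, copied from the input table; the check is a local sketch and does not call hana.ml.r. The integer column V0 shows 24 although the mean of its observed values is about 24.6, which appears consistent with truncation to an integer; that reading is an inference from the output, not something the documentation states.

```r
# Observed (non-missing) values copied from the input table in the Examples.
v0 <- c(10, 20, 50, 30, 10, 10, 10, 20, 30, 50, 30, 30, 20)
v2 <- c("D", "A", "C", "B", "A", "C", "A", "D", "D", "B", "A", "D", "C", "A")
v3 <- c(0.4, 0.8, 0.2, 0.5, 0.5, 0.6, 0.9, 0.4, 0.3, 0.7, 0.2, 0.5, 0.6)
v4 <- c(1.4, 1.3, 1.6, 1.7, 1.8, 1.6, 1.7, 1.3, 1.2, 1.2, 1.1, 1.8, 1.4)
v5 <- c(23.6, 21.8, 21.9, 22.6, 19.7, 17.8, 24.9, 22.2, 16.4, 19.3, 21.7, 18.6, 17.9)

# Numerical columns: mean of the observed values.
fill_v3 <- mean(v3)
fill_v4 <- mean(v4)
fill_v5 <- mean(v5)

# Integer column V0: the mean, truncated (assumed from the output shown).
fill_v0 <- floor(mean(v0))

# Categorical column V2: most frequent observed value.
fill_v2 <- names(which.max(table(v2)))
```

The missing values of V1 are filled with 0 because of the strategy.by.col override rather than with the most frequent value.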
