hanaml.Imputer.Rd
hanaml.Imputer is an R wrapper for SAP HANA PAL Missing Value Handling. It imputes missing values in a DataFrame.
hanaml.Imputer(
data = NULL,
key = NULL,
strategy = NULL,
strategy.by.col = NULL,
als.factors = NULL,
als.lambda = NULL,
als.maxit = NULL,
als.randomstate = NULL,
als.exit.threshold = NULL,
als.exit.interval = NULL,
als.linsolver = NULL,
als.cg.maxit = NULL,
als.centering = NULL,
als.scaling = NULL,
categorical.variable = NULL,
thread.ratio = NULL
)
data : DataFrame
DataFrame containing the data.
key : character
Name of the ID column.
strategy : character, optional
"non"
: Does nothing. Leave all columns untouched.
"most_frequent.mean"
: For numerical columns, fills all missing values with the mean; for
categorical columns, fills all missing values with the most frequent value.
"most_frequent.median"
: For numerical columns, fills all missing values with the median;
for categorical columns, fills all missing values with the most frequent value.
"most_frequent.zero"
: For numerical columns, fills all missing values with zeros;
for categorical columns, fills all missing values with the most frequent value.
"most_frequent.als"
: For numerical columns, fills each missing value with the value imputed by a
matrix completion model trained using the alternating least squares (ALS) method;
for categorical columns, fills all missing values with the most frequent value.
"delete"
: Deletes every row that contains a missing value.
The entire row is removed from the table.
Defaults to "most_frequent.mean".
strategy.by.col : list, optional
Specifies the imputation strategy for a set of columns, which
overrides the overall strategy for data imputation.
Elements of this list must be named. The names must be column names,
while each value should either be the imputation strategy applied to that column,
or the replacement for all missing values within that column.
Valid column imputation strategies are listed as follows:
"mean", "median", "als", "non", "delete", "most_frequent".
The first five strategies are applicable to numerical columns, while the final three
strategies are applicable to categorical columns.
An illustrative example:
strategy.by.col = list(V1 = 0, V5 = "median"),
which means that for column V1, all missing values are replaced by the constant 0,
while for column V5, all missing values are replaced by the median of all
available values in that column.
No default value.
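The rules above for strategy.by.col can be sketched as a small local check in base R. This is only an illustration: check_strategy_by_col is a hypothetical helper, not part of the package, and treating any value that is not a known strategy name as a constant replacement is an assumption based on the description above.

```r
# Strategy names listed in the documentation above.
valid_strategies <- c("mean", "median", "als", "non", "delete", "most_frequent")

# Hypothetical helper: checks that the list is fully named, that the names are
# known columns, and classifies each entry as a strategy or a constant.
check_strategy_by_col <- function(strategy.by.col, columns) {
  stopifnot(is.list(strategy.by.col))
  nms <- names(strategy.by.col)
  if (is.null(nms) || any(nms == "")) {
    stop("all elements of strategy.by.col must be named")
  }
  unknown <- setdiff(nms, columns)
  if (length(unknown) > 0) {
    stop("unknown columns: ", paste(unknown, collapse = ", "))
  }
  vapply(strategy.by.col, function(v) {
    is_strategy <- is.character(v) && length(v) == 1 && v %in% valid_strategies
    if (is_strategy) "strategy" else "constant"
  }, character(1))
}

# V1 gets the constant 0; V5 uses the "median" strategy.
kinds <- check_strategy_by_col(list(V1 = 0, V5 = "median"),
                               columns = paste0("V", 0:5))
print(kinds)
```

An unnamed list such as list(0) is rejected by this sketch, mirroring the requirement that every element be named.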
als.factors : integer, optional
Length of factor vectors in the ALS model.
It should be less than the number of numerical columns,
so that the imputation results are meaningful.
Defaults to 3.
als.lambda : double, optional
L2 regularization applied to the factors in the ALS model.
Should be non-negative.
Defaults to 0.01.
als.maxit : integer, optional
Maximum number of iterations for solving the ALS model.
Defaults to 20.
als.randomstate : integer, optional
0
: Uses the current time as the seed.
Others
: Uses the specified value as the seed.
Defaults to 0.
als.exit.threshold : integer, optional
Specifies a threshold for stopping the training of the ALS model.
If the improvement of the cost function of the ALS model
is less than this value between consecutive checks, then
the training process exits.
0 means the objective value is not checked while the
algorithm runs, so training stops only when the maximum number of
iterations has been reached.
Defaults to 0.
als.exit.interval : integer, optional
Specifies the number of iterations between consecutive checks of
the cost function for the ALS model, so that one can see whether the
pre-specified als.exit.threshold is reached.
Defaults to 5.
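The interplay between als.exit.threshold and als.exit.interval can be sketched with a generic loop in base R. This is a reading of the two descriptions above, not PAL's actual implementation: iterate_with_exit_check and its step argument are hypothetical, and the exact bookkeeping PAL uses between checks is an assumption.

```r
# Hypothetical sketch: `step(it)` performs one training iteration and
# returns the current cost. Returns the iteration at which training stops.
iterate_with_exit_check <- function(step, maxit = 20,
                                    exit.threshold = 0, exit.interval = 5) {
  last_cost <- Inf
  for (it in seq_len(maxit)) {
    cost <- step(it)
    # A threshold of 0 disables checking, so the loop runs to maxit.
    if (exit.threshold > 0 && it %% exit.interval == 0) {
      # Stop when the improvement since the previous check is too small.
      if (last_cost - cost < exit.threshold) return(it)
      last_cost <- cost
    }
  }
  maxit
}

# Toy cost sequence 1/it: the improvement between iterations 5 and 10 is 0.1.
stopped_at <- iterate_with_exit_check(function(it) 1 / it,
                                      exit.threshold = 0.15)
no_check <- iterate_with_exit_check(function(it) 1 / it,
                                    exit.threshold = 0)
```

With the toy cost, the first check at iteration 5 only records the cost; the second check at iteration 10 sees an improvement of 0.1, which is below 0.15, so the loop stops there.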
als.linsolver : {"cholesky", "cg"}, optional
Linear system solver for the ALS model; either "cholesky" or "cg".
"cholesky" is usually much faster.
"cg" is recommended when als.factors is large.
Defaults to "cholesky".
als.cg.maxit : integer, optional
Specifies the maximum number of iterations for the cg algorithm.
Used only when "cg" is the chosen linear system solver for ALS.
Defaults to 3.
als.centering : logical, optional
Whether to center the data by column before training the ALS model.
Defaults to TRUE.
als.scaling : logical, optional
Whether to scale the data by column before training the ALS model.
Defaults to TRUE.
categorical.variable : character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the column type:
"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.
Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.
thread.ratio : double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads.
Values between 0 and 1 will use up to
that percentage of available threads. Values outside this
range are ignored.
Defaults to 0.
An "Imputer" object with the following attributes:
result : DataFrame
Has the same column structure (number of columns, column names, and column
types) as the table on which the model is trained.
model : DataFrame
Statistics/model content.
The parameters with the prefix "als" take effect only when "als" is the overall imputation strategy. They configure the alternating least squares (ALS) model used for data imputation.
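To make the role of the "als" parameters more concrete, here is a minimal base R sketch of matrix completion by alternating least squares. It is not PAL's implementation: als_impute and its argument names are hypothetical stand-ins that loosely mirror als.factors, als.lambda, als.maxit, als.randomstate, als.centering and als.scaling, and it always solves the small linear systems directly with solve().

```r
# Hypothetical sketch of ALS matrix completion on a numeric matrix with NAs.
als_impute <- function(x, factors = 1, lambda = 0.01, maxit = 20, seed = 1,
                       centering = TRUE, scaling = TRUE) {
  set.seed(seed)
  observed <- !is.na(x)
  n <- nrow(x)
  p <- ncol(x)
  # Center and scale each column using its observed entries only.
  mu <- if (centering) colMeans(x, na.rm = TRUE) else rep(0, p)
  sdev <- if (scaling) apply(x, 2, sd, na.rm = TRUE) else rep(1, p)
  z <- sweep(sweep(x, 2, mu, "-"), 2, sdev, "/")
  # Random initial row factors (u) and column factors (v).
  u <- matrix(rnorm(n * factors, sd = 0.1), n, factors)
  v <- matrix(rnorm(p * factors, sd = 0.1), p, factors)
  ridge <- lambda * diag(factors)  # L2 regularization term
  for (it in seq_len(maxit)) {
    # Fix v, solve a ridge regression for each row factor.
    for (i in seq_len(n)) {
      o <- observed[i, ]
      if (any(o)) {
        vo <- v[o, , drop = FALSE]
        u[i, ] <- solve(crossprod(vo) + ridge, crossprod(vo, z[i, o]))
      }
    }
    # Fix u, solve a ridge regression for each column factor.
    for (j in seq_len(p)) {
      o <- observed[, j]
      uo <- u[o, , drop = FALSE]
      v[j, ] <- solve(crossprod(uo) + ridge, crossprod(uo, z[o, j]))
    }
  }
  # Undo scaling and centering, then fill only the missing cells.
  fitted <- sweep(sweep(u %*% t(v), 2, sdev, "*"), 2, mu, "+")
  x[!observed] <- fitted[!observed]
  x
}

# Toy data: a rank-1 matrix with two entries removed.
x_obs <- outer(1:6, c(1, 2, 4))
x_obs[2, 3] <- NA
x_obs[5, 1] <- NA
completed <- als_impute(x_obs, factors = 1)
```

Only the missing cells are replaced; observed values are returned unchanged. Keeping factors below the number of columns follows the guidance given for als.factors.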
Input DataFrame data:
> data$Collect()
V0 V1 V2 V3 V4 V5
1 10 0 D NA 1.4 23.6
2 20 1 A 0.4 1.3 21.8
3 50 1 C NA 1.6 21.9
4 30 NA B 0.8 1.7 22.6
5 10 0 A 0.2 NA NA
6 10 0 <NA> 0.5 1.8 19.7
7 NA 0 C 0.5 NA 17.8
8 10 1 A 0.6 1.6 24.9
9 20 NA D 0.9 1.7 22.2
10 30 1 D 0.4 1.3 NA
11 50 0 <NA> 0.3 1.2 16.4
12 NA 1 B 0.7 1.2 19.3
13 30 1 A 0.2 1.1 21.7
14 30 0 D NA NA NA
15 NA 1 C 0.5 1.8 18.6
16 20 0 A 0.6 1.4 17.9
Train the model; an "Imputer" object called ip is returned:
> ip <- hanaml.Imputer(data = data,
                       strategy = "most_frequent.mean",
                       categorical.variable = "V1",
                       strategy.by.col = list(V1 = 0))
Output:
> ip$result$Collect()
V0 V1 V2 V3 V4 V5
1 10 0 D 0.5076923076923077 1.4 23.6
2 20 1 A 0.4 1.3 21.8
3 50 1 C 0.5076923076923077 1.6 21.9
4 30 0 B 0.8 1.7 22.6
5 10 0 A 0.2 1.4692307692307693 20.646153846153844
6 10 0 A 0.5 1.8 19.7
7 24 0 C 0.5 1.4692307692307693 17.8
8 10 1 A 0.6 1.6 24.9
9 20 0 D 0.9 1.7 22.2
10 30 1 D 0.4 1.3 20.646153846153844
11 50 0 A 0.3 1.2 16.4
12 24 1 B 0.7 1.2 19.3
13 30 1 A 0.2 1.1 21.7
14 30 0 D 0.5076923076923077 1.4692307692307693 20.646153846153844
15 24 1 C 0.5 1.8 18.6
16 20 0 A 0.6 1.4 17.9
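As a local cross-check of the output above, the following base R sketch applies the same rules to an ordinary data.frame: a constant for V1, the most frequent value for character columns, and the mean for numerical columns. impute_mean_mode is a hypothetical helper, not part of the package. Truncating the mean for the integer column V0 is an assumption inferred from the output shown (24 rather than 24.6).

```r
data <- data.frame(
  V0 = as.integer(c(10, 20, 50, 30, 10, 10, NA, 10, 20, 30, 50, NA, 30, 30, NA, 20)),
  V1 = c(0, 1, 1, NA, 0, 0, 0, 1, NA, 1, 0, 1, 1, 0, 1, 0),
  V2 = c("D", "A", "C", "B", "A", NA, "C", "A", "D", "D", NA, "B", "A", "D", "C", "A"),
  V3 = c(NA, 0.4, NA, 0.8, 0.2, 0.5, 0.5, 0.6, 0.9, 0.4, 0.3, 0.7, 0.2, NA, 0.5, 0.6),
  V4 = c(1.4, 1.3, 1.6, 1.7, NA, 1.8, NA, 1.6, 1.7, 1.3, 1.2, 1.2, 1.1, NA, 1.8, 1.4),
  V5 = c(23.6, 21.8, 21.9, 22.6, NA, 19.7, 17.8, 24.9, 22.2, NA, 16.4, 19.3, 21.7, NA, 18.6, 17.9),
  stringsAsFactors = FALSE
)

# Hypothetical helper mirroring strategy = "most_frequent.mean" plus
# per-column constants (the role played by strategy.by.col above).
impute_mean_mode <- function(df, constants = list()) {
  for (col in names(df)) {
    x <- df[[col]]
    missing <- is.na(x)
    if (!any(missing)) next
    if (col %in% names(constants)) {
      fill <- constants[[col]]              # constant replacement
    } else if (is.character(x)) {
      obs <- x[!missing]
      ux <- unique(obs)
      fill <- ux[which.max(tabulate(match(obs, ux)))]  # most frequent value
    } else {
      fill <- mean(x, na.rm = TRUE)         # mean of the available values
      # Assumption: integer columns keep their type, so the mean is truncated.
      if (is.integer(x)) fill <- as.integer(trunc(fill))
    }
    df[[col]][missing] <- fill
  }
  df
}

imputed <- impute_mean_mode(data, constants = list(V1 = 0))
print(imputed)
```

The filled cells agree with the result table: 24 for V0, 0 for V1, "A" for V2, and the column means for V3, V4 and V5.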