Time-series Missing Value Handling

This is an R wrapper for SAP PAL procedure PAL_IMPUTE_TIME_SERIES.

hanaml.ImputeTS(
  data = NULL,
  key = NULL,
  categorical.variable = NULL,
  imputation.type = NULL,
  base.algorithm = NULL,
  col.imputation.type = NULL,
  alpha = NULL,
  extrapolation = NULL,
  smooth.width = NULL,
  auxiliary.normalitytest = NULL,
  thread.ratio = NULL
)

Arguments

data

DataFrame
Specifies the input time-series data for missing value handling.

key

str
Specifies the name of the column in data that represents the order of the time series.

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior is dependent on what input is given:

"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.

imputation.type

str, optional
Specifies the overall imputation type (i.e. strategy) for all columns in data (excluding the key column).

"non": Does nothing. Leaves all columns untouched.
"most_frequent.allzero": For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by zero.
"most_frequent.mean": For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by its mean.
"most_frequent.median": For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by median.
"most_frequent.sma": For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via simple moving average method.
"most_frequent.lma": For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via linear moving average method.
"most_frequent.ema": For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via exponential moving average method.
"most_frequent.linterp": For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via linear interpolation.
"most_frequent.sinterp": For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via spline interpolation.
"most_frequent.seadec": For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via seasonal decompose.
"most_frequent.locf": For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via last observation carried forward.
"most_frequent.nocb": For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via next observation carried backward.

The prefix "most_frequent" can be omitted for simplicity. For example, "most_frequent.linterp" can be written simply as "linterp" when specifying the imputation type.
Defaults to "most_frequent.mean".
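
To make the numerical strategies above more concrete, the following base R sketch fills the gaps of a small toy vector using a few of them. It runs locally and does not call PAL; the vector and all variable names are illustrative assumptions, and PAL's exact handling of edge cases (such as gaps at the ends of the series) may differ.

```r
# Toy numeric series with two interior gaps (values are illustrative).
v   <- c(1, NA, NA, 4, 6)
idx <- seq_along(v)
obs <- !is.na(v)

# "mean" / "median": replace every gap with one column-level statistic.
fill_mean   <- ifelse(obs, v, mean(v, na.rm = TRUE))
fill_median <- ifelse(obs, v, median(v, na.rm = TRUE))

# "linterp": straight line between the neighbouring observed points.
fill_linterp <- approx(idx[obs], v[obs], xout = idx)$y

# "locf": carry the last observed value forward.
# (Assumes the first element is observed, as it is here.)
last_obs  <- cummax(ifelse(obs, idx, 0L))
fill_locf <- v[last_obs]

# "nocb": carry the next observed value backward.
# (Assumes the last element is observed, as it is here.)
next_obs  <- rev(cummin(rev(ifelse(obs, idx, length(v) + 1L))))
fill_nocb <- v[next_obs]

print(rbind(fill_mean, fill_median, fill_linterp, fill_locf, fill_nocb))
```

In this toy case the two gaps become 2 and 3 under linear interpolation, 1 and 1 under locf, and 4 and 4 under nocb.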

col.imputation.type

list, optional
Specifies the column-wise imputation type that overwrites the overall imputation type.
Should be a named list such that the name of each element corresponds to a column name in data, while the element value corresponds to a valid column imputation type.
Valid column imputation types include:

"allzero" : Fill all missing values by zero.
"mean" : Fill all missing values by the mean of the column.
"median" : Fill all missing values by the median of the column.
"sma" : Fill all missing values via simple moving average method.
"lma" : Fill all missing values via linear moving average method.
"ema" : Fill all missing values via exponential moving average method.
"linterp" : Fill all missing values via linear interpolation.
"sinterp" : Fill all missing values via spline interpolation.
"locf" : Fill all missing values via last observation carried forward.
"nocb" : Fill all missing values via next observation carried backward.

Among the valid imputation types, "non" applies to both numerical and categorical columns, "most_frequent" applies to categorical columns only, while the rest apply to numerical columns only. If the input is not one of these options, it is treated as a constant value that replaces all missing values in that column.
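
As a rough illustration of how a column-wise setting takes precedence over the overall one, the sketch below builds a named list in the shape described above and resolves the effective type for each column. The helper resolve_types is hypothetical and not part of the package; the column names V and X are assumed for illustration, and the unknown-column check is an assumption rather than documented behaviour.

```r
# Hypothetical helper (not part of hana.ml.r): work out which imputation
# type each non-key column ends up with.
resolve_types <- function(columns, overall, col.types = list()) {
  # Illustrative check only; the real function may react differently.
  unknown <- setdiff(names(col.types), columns)
  if (length(unknown) > 0) {
    stop("unknown column(s): ", paste(unknown, collapse = ", "))
  }
  # Start from the overall type for every column ...
  resolved <- setNames(rep(overall, length(columns)), columns)
  # ... then let the column-wise entries take precedence.
  if (length(col.types) > 0) {
    resolved[names(col.types)] <- unlist(col.types)
  }
  resolved
}

# Named list: element name = column name, element value = imputation type.
col.imputation.type <- list(V = "locf")

effective <- resolve_types(c("V", "X"), "most_frequent.linterp",
                           col.imputation.type)
print(effective)
```

Here column V would use "locf", while X keeps the overall setting.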

alpha

numeric, optional
Specifies the criterion for the autocorrelation coefficient.
Valid values range from 0 to 1.
A larger value indicates a stricter requirement for seasonality.
Defaults to 0.2.

extrapolation

logical, optional
Specifies whether or not to extrapolate the endpoints of the time-series data.
Defaults to FALSE.

smooth.width

integer, optional
Specifies the width of the moving average applied to non-seasonal data, where 0 indicates linear fitting to extract trends.
Effective only for data columns that are to be imputed via seasonal decompose.

auxiliary.normalitytest

logical, optional
Specifies whether or not to use a normality test to identify model types.
Defaults to FALSE.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.

base.algorithm

str, optional
Specifies the base imputation algorithm for seasonal decompose.
Applicable only to numerical data columns that are to be imputed by seasonal decompose.
Valid options include:

"allzero" : Fill all missing values by zero.
"mean" : Fill all missing values by the mean of the column.
"median" : Fill all missing values by the median of the column.
"sma" : Fill all missing values via simple moving average method.
"lma" : Fill all missing values via linear moving average method.
"ema" : Fill all missing values via exponential moving average method.
"linterp" : Fill all missing values via linear interpolation.
"sinterp" : Fill all missing values via spline interpolation.
"locf" : Fill all missing values via last observation carried forward.
"nocb" : Fill all missing values via next observation carried backward.

Value

An "ImputeTS" object with the following attributes:

result : DataFrame
The imputed data, with the same column structure (number of columns, column names, and column types) as the input data.
model : DataFrame
Statistics and model content.

Examples

Input time-series data for imputation:


> data$Collect()
   ID    V     X
1   0  0.1     A
2   1  0.3     A
3   2   NA     A
4   3  0.7  <NA>
5   4  0.9     B
6   5  1.1     B

Setting up a proper imputation strategy to fill in all missing values:


> imp <- hanaml.ImputeTS(data, key = 'ID', imputation.type = 'most_frequent.linterp')
> imp$result$Collect()
   ID    V     X
1   0  0.1     A
2   1  0.3     A
3   2  0.5     A
4   3  0.7     A
5   4  0.9     B
6   5  1.1     B
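
The result above can be cross-checked locally. The following base R sketch rebuilds the example table as an ordinary data.frame (named local_df here purely for illustration) and applies linear interpolation to V and most-frequent filling to X. It does not call PAL and is only meant to show what "most_frequent.linterp" amounts to on this small input.

```r
# Local copy of the example input (an ordinary data.frame, not a HANA DataFrame).
local_df <- data.frame(
  ID = 0:5,
  V  = c(0.1, 0.3, NA, 0.7, 0.9, 1.1),
  X  = c("A", "A", "A", NA, "B", "B"),
  stringsAsFactors = FALSE
)

# Numerical column: linear interpolation over the key column.
obs_v <- !is.na(local_df$V)
local_df$V <- approx(local_df$ID[obs_v], local_df$V[obs_v],
                     xout = local_df$ID)$y

# Categorical column: fill with the most frequent observed value.
counts <- table(local_df$X)  # table() ignores NA by default
local_df$X[is.na(local_df$X)] <- names(counts)[which.max(counts)]

print(local_df)
```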
