ImputeTS

hana_ml.algorithms.pal.preprocessing.ImputeTS(imputation_type=None, base_algorithm=None, alpha=None, extrapolation=None, smooth_width=None, auxiliary_normalitytest=None, thread_ratio=None)

Imputation of multi-dimensional time-series data. This is the Python wrapper for PAL procedure PAL_IMPUTE_TIME_SERIES.

Parameters
imputation_type : str, optional

Specifies the overall imputation type for all columns of the time-series data. Valid options include:

  • 'non' : Does nothing; all columns are left untouched.

  • 'most_frequent-allzero' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by zero.

  • 'most_frequent-mean' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by its mean.

  • 'most_frequent-median' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by median.

  • 'most_frequent-sma' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via simple moving average method.

  • 'most_frequent-lma' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via linear moving average method.

  • 'most_frequent-ema' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via exponential moving average method.

  • 'most_frequent-linterp' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via linear interpolation.

  • 'most_frequent-sinterp' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via spline interpolation.

  • 'most_frequent-seadec' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via seasonal decompose.

  • 'most_frequent-locf' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via last observation carried forward.

  • 'most_frequent-nocb' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via next observation carried backward.

The prefix 'most_frequent-' can be omitted for simplicity, e.g. 'linterp' is equivalent to 'most_frequent-linterp'.

Defaults to 'most_frequent-mean'.
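To make the difference between a few of these strategies concrete, the following is a minimal pure-Python sketch of what 'locf', 'nocb', 'linterp', 'mean' and the categorical most-frequent fill do to a single column. It is an illustration only: the helper names (fill_locf, fill_nocb, and so on) are hypothetical and are not part of hana_ml, and the handling of gaps at the ends of the series is an assumption, not a statement of what PAL_IMPUTE_TIME_SERIES does.

```python
from collections import Counter

def fill_locf(values):
    """Last observation carried forward; leading gaps stay None (assumption)."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

def fill_nocb(values):
    """Next observation carried backward: LOCF applied to the reversed series."""
    return fill_locf(values[::-1])[::-1]

def fill_linterp(values):
    """Linear interpolation between the nearest observed neighbours.
    Gaps before the first or after the last observation stay None."""
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out

def fill_mean(values):
    """Replace every gap with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def fill_most_frequent(values):
    """Replace every gap with the most frequent observed value (categorical)."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

series = [1.0, None, None, 4.0, None, 6.0]   # numerical column with gaps
labels = ["a", None, "b", "a", None]         # categorical column with gaps

print(fill_locf(series))
print(fill_nocb(series))
print(fill_linterp(series))
print(fill_most_frequent(labels))
```

On this toy column, 'locf' and 'nocb' produce step-shaped fills that differ only in which neighbour they copy, while 'linterp' draws a straight line across each gap.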

base_algorithm : str, optional

Specifies the base imputation algorithm for seasonal decompose. Applicable only to numerical data columns that are to be imputed by seasonal decompose. Valid options include:

  • 'allzero' : Fill all missing values by zero.

  • 'mean' : Fill all missing values by the mean of the column.

  • 'median' : Fill all missing values by the median of the column.

  • 'sma' : Fill all missing values via simple moving average method.

  • 'lma' : Fill all missing values via linear moving average method.

  • 'ema' : Fill all missing values via exponential moving average method.

  • 'linterp' : Fill all missing values via linear interpolation.

  • 'sinterp' : Fill all missing values via spline interpolation.

  • 'locf' : Fill all missing values via last observation carried forward.

  • 'nocb' : Fill all missing values via next observation carried backward.

Defaults to 'mean'.
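The three moving-average options ('sma', 'lma', 'ema') can be thought of as the same neighbourhood average with different weights. The sketch below assumes one common convention: equal weights for simple, weights of 1/(d+1) for linear, and 0.5**d for exponential, where d is the distance from the gap. The window size, the weights, and the function name moving_average_fill are all assumptions made for illustration; the weights and window that PAL actually uses are not stated in this section.

```python
def moving_average_fill(values, window=2, weighting="simple"):
    """Fill each None with a weighted average of the observed values that lie
    within `window` positions on either side of the gap (illustrative only)."""
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        num = den = 0.0
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        for j in range(lo, hi):
            if values[j] is None:
                continue  # other gaps in the window carry no information
            d = abs(i - j)
            if weighting == "simple":
                w = 1.0                # every neighbour counts equally
            elif weighting == "linear":
                w = 1.0 / (d + 1)      # assumed: 1/2, 1/3, ... with distance
            elif weighting == "exponential":
                w = 0.5 ** d           # assumed: 1/2, 1/4, ... with distance
            else:
                raise ValueError(f"unknown weighting: {weighting!r}")
            num += w * values[j]
            den += w
        if den > 0:
            out[i] = num / den         # gaps with no observed neighbours stay None
    return out

series = [2.0, 4.0, None, 8.0, 16.0]
for weighting in ("simple", "linear", "exponential"):
    print(weighting, moving_average_fill(series, window=2, weighting=weighting)[2])
```

Under these assumed weights, the weighted variants pull the filled value closer to the immediate neighbours (4.0 and 8.0) than the simple average does.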

alpha : float, optional

Specifies the criterion for the autocorrelation coefficient. The value range is (0, 1). A larger value indicates stricter requirement for seasonality.

Defaults to 0.2.
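As a rough illustration of why a larger alpha is a stricter test, the sketch below computes a plain sample autocorrelation coefficient at a candidate lag and compares it with alpha. How PAL chooses candidate lags and exactly how it applies the threshold are not described here, so the functions autocorrelation and looks_seasonal are hypothetical helpers, not hana_ml API.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation coefficient of `values` at the given lag."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values)
    if var == 0:
        return 0.0  # a constant series has no seasonal signal
    cov = sum((values[t] - mean) * (values[t + lag] - mean)
              for t in range(n - lag))
    return cov / var

def looks_seasonal(values, lag, alpha=0.2):
    """Assumed rule: treat the series as seasonal at `lag` when the
    autocorrelation coefficient exceeds alpha."""
    return autocorrelation(values, lag) > alpha

series = [1.0, 2.0, 3.0, 4.0] * 6          # repeats every 4 points

print(looks_seasonal(series, lag=4))             # strong correlation at the true period
print(looks_seasonal(series, lag=2))             # negative correlation at half the period
print(looks_seasonal(series, lag=4, alpha=0.9))  # a stricter alpha rejects the same series
```

Raising alpha from 0.2 to 0.9 turns the same lag-4 evidence from "seasonal" into "not seasonal", which is the sense in which a larger value is a stricter requirement.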

extrapolation : bool, optional

Specifies whether or not to extrapolate the endpoints of the time-series data.

Defaults to False.

smooth_width : int, optional

Specifies the width of the moving average applied to non-seasonal data, where 0 indicates linear fitting to extract trends.

Effective only for data columns that are to be imputed via seasonal decompose.

auxiliary_normalitytest : bool, optional

Specifies whether to use a normality test to identify model types.

Defaults to False.

thread_ratio : float, optional

Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, in which case the function determines the number of threads heuristically.

Defaults to 1.

Attributes
model_ : DataFrame

A column-wise time-series imputation model stored in statistics format, i.e. with stat names and stat values.

result_ : DataFrame

The imputation result, structured the same as the data used for obtaining the time-series imputation model, with all missing values filled.

Examples

>>> imp = ImputeTS(imputation_type='most_frequent-linterp')
>>> res = imp.fit_transform(data=df, key='ID')
>>> res.collect()
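A slightly fuller sketch, showing how the seasonal-decompose parameters documented above might be combined. It uses only the constructor arguments, the fit_transform(data=..., key=...) call and the model_ attribute described on this page. The parameter values, the SEADEC_PARAMS name and the impute_with_seadec wrapper are illustrative assumptions, and `df` is assumed to be a hana_ml DataFrame on a live SAP HANA connection with an 'ID' key column, as in the example above.

```python
# Illustrative settings only; tune them for real data.
SEADEC_PARAMS = {
    "imputation_type": "seadec",   # short form of 'most_frequent-seadec'
    "base_algorithm": "linterp",   # base imputation used by seasonal decompose
    "alpha": 0.3,                  # must lie in (0, 1); larger is stricter
    "extrapolation": True,         # also extrapolate the series endpoints
    "smooth_width": 0,             # 0 means linear fitting to extract trends
    "thread_ratio": 0.5,           # use about half of the available threads
}

def impute_with_seadec(df, key="ID"):
    """Run ImputeTS with SEADEC_PARAMS on a hana_ml DataFrame.

    Hypothetical convenience wrapper; it requires the hana_ml package and a
    live SAP HANA connection behind `df`, so it is defined here but not called.
    """
    from hana_ml.algorithms.pal.preprocessing import ImputeTS

    imp = ImputeTS(**SEADEC_PARAMS)
    res = imp.fit_transform(data=df, key=key)
    # res holds the imputed data; imp.model_ holds the column-wise statistics.
    return res, imp.model_
```

Calling `res, model = impute_with_seadec(df)` followed by `res.collect()` and `model.collect()` would then bring both tables to the client for inspection.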