ImputeTS

class hana_ml.algorithms.pal.preprocessing.ImputeTS(imputation_type=None, base_algorithm=None, alpha=None, extrapolation=None, smooth_width=None, auxiliary_normalitytest=None, thread_ratio=None)

Imputation of multi-dimensional time-series data. This is the Python wrapper for the PAL procedure PAL_IMPUTE_TIME_SERIES.

Parameters:

imputation_type : str, optional

Specifies the overall imputation type for all columns of the time-series data. Valid options include:

'non' : Does nothing. Leave all columns untouched.

'most_frequent-allzero' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by zero.

'most_frequent-mean' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by its mean.

'most_frequent-median' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by median.

'most_frequent-sma' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via simple moving average method.

'most_frequent-lma' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via linear moving average method.

'most_frequent-ema' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by exponential moving average method.

'most_frequent-linterp' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via linear interpolation.

'most_frequent-sinterp' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via spline interpolation.

'most_frequent-seadec' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via seasonal decompose.

'most_frequent-locf' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via last observation carried forward.

'most_frequent-nocb' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via next observation carried backward.

The prefix 'most_frequent-' can be omitted for simplicity, e.g. 'linterp' is equivalent to 'most_frequent-linterp'.

Defaults to 'most_frequent-mean'.
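
The naming scheme above can be sketched in plain Python. The helper below is hypothetical (it is not part of hana_ml) and only illustrates how an imputation type string decomposes into a strategy for categorical columns and a strategy for numerical columns, including the case where the 'most_frequent-' prefix is omitted.

```python
# Hypothetical helper, not part of hana_ml: splits an imputation_type string
# into (categorical_strategy, numerical_strategy) following the list above.
NUMERICAL_STRATEGIES = {
    "allzero", "mean", "median", "sma", "lma", "ema",
    "linterp", "sinterp", "seadec", "locf", "nocb",
}
PREFIX = "most_frequent-"


def split_imputation_type(imputation_type="most_frequent-mean"):
    """Return (categorical_strategy, numerical_strategy)."""
    if imputation_type == "non":
        # 'non' leaves every column untouched.
        return ("non", "non")
    if imputation_type.startswith(PREFIX):
        numerical = imputation_type[len(PREFIX):]
    else:
        # The prefix may be omitted, so treat the whole string as the
        # numerical strategy.
        numerical = imputation_type
    if numerical not in NUMERICAL_STRATEGIES:
        raise ValueError(f"unknown imputation type: {imputation_type!r}")
    return ("most_frequent", numerical)


print(split_imputation_type("linterp"))
print(split_imputation_type("most_frequent-locf"))
print(split_imputation_type())
```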

base_algorithm : str, optional

Specifies the base imputation algorithm for seasonal decompose. Applicable only to numerical data columns that are to be imputed by seasonal decompose. Valid options include:

'allzero' : Fill all missing values by zero.

'mean' : Fill all missing values by the mean of the column.

'median' : Fill all missing values by the median of the column.

'sma' : Fill all missing values via simple moving average method.

'lma' : Fill all missing values via linear moving average method.

'ema' : Fill all missing values via exponential moving average method.

'linterp' : Fill all missing values via linear interpolation.

'sinterp' : Fill all missing values via spline interpolation.

'locf' : Fill all missing values via last observation carried forward.

'nocb' : Fill all missing values via next observation carried backward.

Defaults to 'mean'.
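
For intuition about two of the options listed above, 'locf' and 'nocb', here is a minimal plain-Python sketch. It is an illustration only, not the PAL implementation; in particular, leaving leading gaps (for 'locf') and trailing gaps (for 'nocb') unfilled is an assumption of this sketch.

```python
def locf(values):
    """Last observation carried forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def nocb(values):
    """Next observation carried backward; trailing gaps stay None."""
    # Carrying backward is carrying forward on the reversed series.
    return locf(values[::-1])[::-1]


series = [None, 1.0, None, None, 4.0, None]
print(locf(series))  # [None, 1.0, 1.0, 1.0, 4.0, 4.0]
print(nocb(series))  # [1.0, 1.0, 4.0, 4.0, 4.0, None]
```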

alpha : float, optional

Specifies the criterion for the autocorrelation coefficient. The value range is (0, 1). A larger value indicates stricter requirement for seasonality.

Defaults to 0.2.
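
To make the role of such a threshold concrete, the sketch below computes a sample autocorrelation coefficient at a given lag and compares it with alpha. The exact seasonality test used by PAL is not described here, so the function and the decision rule are assumptions for illustration only.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation coefficient of `values` at the given lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    numer = sum(
        (values[t] - mean) * (values[t + lag] - mean) for t in range(n - lag)
    )
    return numer / denom


alpha = 0.2  # documented default threshold
series = [1.0, 2.0, 3.0, 4.0] * 6  # toy series that repeats every 4 steps

for lag in (2, 4):
    r = autocorrelation(series, lag)
    # Assumed rule: a lag is a seasonality candidate only if r exceeds alpha.
    verdict = "seasonal candidate" if r > alpha else "rejected"
    print(lag, round(r, 3), verdict)
```

Raising alpha toward 1 makes fewer lags pass the check, which matches the statement that a larger value is a stricter requirement for seasonality.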

extrapolation : bool, optional

Specifies whether or not to extrapolate the endpoints of the time-series data.

Defaults to False.

smooth_width : int, optional

Specifies the width of the moving average applied to non-seasonal data, where 0 indicates linear fitting to extract trends.

Effective only for data columns that are to be imputed via seasonal decompose.

auxiliary_normalitytest : bool, optional

Specifies whether to use a normality test to identify model types.

Defaults to False.

thread_ratio : float, optional

Specifies the ratio of available threads to use for time-series imputation.

0: single thread

0~1: percentage of available threads to use

Others: heuristically determined

Defaults to 1.

Examples

Input time-series data for imputation:

>>> data.collect()
   ID    V     X
0   0  0.1     A
1   1  0.3     A
2   2  NaN     A
3   3  0.7  None
4   4  0.9     B
5   5  1.1     B

Setting up a proper imputation strategy to fill in all missing values:

>>> imp = ImputeTS(imputation_type='most_frequent-linterp')
>>> res = imp.fit_transform(data, key='ID')
>>> res.collect()
   ID    V     X
0   0  0.1     A
1   1  0.3     A
2   2  0.5     A
3   3  0.7     A
4   4  0.9     B
5   5  1.1     B
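
The result above can be reproduced by hand. The following plain-Python sketch is an illustration, not the PAL implementation: it fills the categorical column X with its most frequent value and the numerical column V by linear interpolation over ID. Both helper functions are hypothetical and assume the keys are already sorted.

```python
from collections import Counter


def fill_most_frequent(values):
    """Replace None with the most common non-missing value."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def fill_linear(keys, values):
    """Linearly interpolate interior None entries over the key axis."""
    known = [(k, v) for k, v in zip(keys, values) if v is not None]
    out = list(values)
    for i, (k, v) in enumerate(zip(keys, values)):
        if v is not None:
            continue
        before = [p for p in known if p[0] < k]
        after = [p for p in known if p[0] > k]
        if before and after:  # endpoint gaps are left untouched here
            (k0, v0), (k1, v1) = before[-1], after[0]
            out[i] = v0 + (v1 - v0) * (k - k0) / (k1 - k0)
    return out


ids = [0, 1, 2, 3, 4, 5]
v = [0.1, 0.3, None, 0.7, 0.9, 1.1]
x = ["A", "A", "A", None, "B", "B"]

# Round only to hide floating-point noise in the printed values.
v_filled = [None if val is None else round(val, 6) for val in fill_linear(ids, v)]
x_filled = fill_most_frequent(x)
print(v_filled)
print(x_filled)
```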

Attributes:

model_ (DataFrame): A column-wise time-series imputation model stored in statistics format, i.e. with stat names and stat values.
result_ (DataFrame): The imputation result, structured the same as the data used for obtaining the time-series imputation model, with all missing values filled.

Methods

`fit`(data[, key, features, ...])	Fit function for Time-series Imputation.
`fit_transform`(data[, key, features, ...])	Impute the input data and return the imputation result.
`transform`(data[, key, features, ...])	Impute time-series data using a fitted model.

fit(data, key=None, features=None, categorical_variable=None, col_imputation_type=None)

Fit function for Time-series Imputation.

Parameters:

data : DataFrame

DataFrame containing the time-series data for imputation.

key : str, optional

Specifies the name of the time-stamp column of data that represents data ordering.

Data type of the column could be INTEGER, DATE or SECONDDATE.

Mandatory if data is not indexed by a single column.

Defaults to index column of data if not provided.

features : str or ListOfStrings, optional

Specifies the names of the columns in data that are to be imputed.

Defaults to all non-key columns of data.

categorical_variable : str or ListOfStrings, optional

Specifies INTEGER column(s) that should be treated as categorical.

By default all INTEGER columns are treated as numerical.

col_imputation_type : ListOfTuples or dict, optional

Specifies the column-wise imputation type that overwrites the generic imputation type in class initialization.

Should be a list of tuples, where each tuple contains 2 elements:

1st element : the column name
2nd element : the imputation type or value. Imputation type could be one of the following:
- 'non' : Does nothing.
- 'most_frequent' : Fill all missing values with the most frequently observed value.
- 'allzero' : Fill all missing values by zero.
- 'mean' : Fill all missing values by the mean of the column.
- 'median' : Fill all missing values by the median of the column.
- 'sma' : Fill all missing values via simple moving average method.
- 'lma' : Fill all missing values via linear moving average method.
- 'ema' : Fill all missing values via exponential moving average method.
- 'linterp' : Fill all missing values via linear interpolation.
- 'sinterp' : Fill all missing values via spline interpolation.
- 'locf' : Fill all missing values via last observation carried forward.
- 'nocb' : Fill all missing values via next observation carried backward.
Among the above options, 'non' applies to both numerical and categorical columns, 'most_frequent' applies to categorical columns only, while the rest apply to numerical columns only. If the input goes beyond the above list of options, it will be treated as a constant value for the universal replacement of all missing values in the column.
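
The structure of col_imputation_type can be sketched as follows. The helper describe_column_rule is hypothetical and only mirrors the rule stated above: a recognized name selects a strategy, and anything else is treated as a constant replacement value. The column names V, X and W are illustrative.

```python
NAMED_TYPES = {
    "non", "most_frequent", "allzero", "mean", "median", "sma",
    "lma", "ema", "linterp", "sinterp", "locf", "nocb",
}


def describe_column_rule(column, rule):
    """Hypothetical helper: classify one (column, rule) pair."""
    if isinstance(rule, str) and rule in NAMED_TYPES:
        return f"{column}: impute with strategy '{rule}'"
    # Anything outside the named options acts as a constant fill value.
    return f"{column}: replace every missing value with constant {rule!r}"


# Each tuple is (column name, imputation type or constant value).
col_imputation_type = [("V", "locf"), ("X", "most_frequent"), ("W", 0.5)]

for column, rule in col_imputation_type:
    print(describe_column_rule(column, rule))

# With a live SAP HANA connection, the list would be passed as documented:
# imp.fit(data, key="ID", col_imputation_type=col_imputation_type)
```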

Returns:

A fitted object of class ImputeTS.

fit_transform(data, key=None, features=None, categorical_variable=None, col_imputation_type=None)

Impute the input data and return the imputation result.

Parameters:

data : DataFrame

DataFrame containing the time-series data for imputation.

key : str, optional

Specifies the name of the time-stamp column of data that represents data ordering.

Data type of the column could be INTEGER, DATE or SECONDDATE.

Mandatory if data is not indexed by a single column.

Defaults to index column of data if not provided.

features : str or a list of str, optional

Specifies the names of the columns in data that are to be imputed.

Defaults to all non-key columns of data.

categorical_variable : str or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

By default all INTEGER columns are treated as numerical.

col_imputation_type : ListOfTuples, optional

Specifies the column-wise imputation type that overwrites the generic imputation type in class initialization.

Should be a list of tuples, where each tuple contains 2 elements:

1st element : the column name
2nd element : the imputation type or value. Imputation type could be one of the following:
- 'non' : Does nothing.
- 'most_frequent' : Fill all missing values with the most frequently observed value.
- 'allzero' : Fill all missing values by zero.
- 'mean' : Fill all missing values by the mean of the column.
- 'median' : Fill all missing values by the median of the column.
- 'sma' : Fill all missing values via simple moving average method.
- 'lma' : Fill all missing values via linear moving average method.
- 'ema' : Fill all missing values via exponential moving average method.
- 'linterp' : Fill all missing values via linear interpolation.
- 'sinterp' : Fill all missing values via spline interpolation.
- 'locf' : Fill all missing values via last observation carried forward.
- 'nocb' : Fill all missing values via next observation carried backward.

Among the above options, 'non' applies to both numerical and categorical columns, 'most_frequent' applies to categorical columns only, while the rest apply to numerical columns only. If the input goes beyond the above list of options, it will be treated as a constant value for the universal replacement of all missing values in the column.

Returns:

DataFrame: The imputed result of data.

transform(data, key=None, features=None, thread_ratio=None, model=None)

Impute time-series data using a fitted model.

Parameters:

data : DataFrame

DataFrame containing the time-series data for imputation by model.

key : str, optional

Specifies the name of the time-stamp column of data that represents data ordering.

Data type of the column could be INTEGER, DATE or SECONDDATE.

Mandatory if data is not indexed by a single column.

Defaults to index column of data if not provided.

features : str or a list of str, optional

Specifies the names of the columns in data that are to be imputed by model.

Defaults to all non-key columns of data.

thread_ratio : float, optional

Specifies the ratio of available threads to use for time-series imputation by model.

0: single thread

0~1: percentage of available threads to use

Others: heuristically determined

Defaults to 1.

model : DataFrame, optional

Specifies the model for time-series imputation.

Defaults to `self.model_`.

Returns:

DataFrame: The imputed result of data by model.
DataFrame: Statistics, storing the imputation types of all selected feature columns in data.
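
The fit-then-transform workflow might look like the sketch below. It assumes that transform returns the two DataFrames listed above as a pair, and it uses a stand-in class (StubImputer, hypothetical) so that the flow can be run without a SAP HANA connection; with a real connection an ImputeTS instance and hana_ml DataFrames would take the place of the stub and the placeholder strings.

```python
def fit_then_transform(imputer, train_data, new_data, key):
    """Fit on one dataset, then impute another with the fitted model."""
    imputer.fit(train_data, key=key)
    # Assumption: the two documented return values arrive as a pair,
    # the imputed data first and the statistics second.
    imputed, stats = imputer.transform(new_data, key=key)
    return imputed, stats


class StubImputer:
    """Stand-in for ImputeTS so the sketch runs without SAP HANA."""

    def fit(self, data, key=None):
        self.model_ = {"fitted_on": data}
        return self

    def transform(self, data, key=None, model=None):
        return f"imputed({data})", f"stats({data})"


imputed, stats = fit_then_transform(StubImputer(), "train", "new", key="ID")
print(imputed)
print(stats)
```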

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods mentioned above, the ImputeTS class also inherits methods from the PALBase class. Please refer to PAL Base for more details.