ImputeTS
- class hana_ml.algorithms.pal.preprocessing.ImputeTS(imputation_type=None, base_algorithm=None, alpha=None, extrapolation=None, smooth_width=None, auxiliary_normalitytest=None, thread_ratio=None)
Imputation of multi-dimensional time-series data. This is the Python wrapper for PAL procedure PAL_IMPUTE_TIME_SERIES.
- Parameters
- imputation_type : str, optional
Specifies the overall imputation type for all columns of the time-series data. Valid options include:
'non' : Does nothing. Leaves all columns untouched.
'most_frequent-allzero' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by zero.
'most_frequent-mean' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by the mean of the column.
'most_frequent-median' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by median.
'most_frequent-sma' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via simple moving average method.
'most_frequent-lma' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via linear moving average method.
'most_frequent-ema' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by exponential moving average method.
'most_frequent-linterp' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via linear interpolation.
'most_frequent-sinterp' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via spline interpolation.
'most_frequent-seadec' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via seasonal decompose.
'most_frequent-locf' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via last observation carried forward.
'most_frequent-nocb' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via next observation carried backward.
The prefix 'most_frequent-' can be omitted for simplicity.
Defaults to 'most_frequent-mean'.
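The short and full spellings of an option are meant to be equivalent. The sketch below illustrates that equivalence with a hypothetical helper, normalize_imputation_type, which is not part of hana_ml; it only expands a short name into the full 'most_frequent-<method>' form and applies the documented default.

```python
# Hypothetical helper (not a hana_ml function): expands the short form of an
# imputation type into the full 'most_frequent-<method>' form listed above.
NUMERIC_METHODS = {
    "allzero", "mean", "median", "sma", "lma", "ema",
    "linterp", "sinterp", "seadec", "locf", "nocb",
}
PREFIX = "most_frequent-"


def normalize_imputation_type(value=None):
    """Return the full option name for a short or full imputation type."""
    if value is None:
        return PREFIX + "mean"  # documented default
    if value == "non":
        return value  # 'non' has no prefixed form
    method = value[len(PREFIX):] if value.startswith(PREFIX) else value
    if method not in NUMERIC_METHODS:
        raise ValueError(f"unknown imputation type: {value!r}")
    return PREFIX + method


print(normalize_imputation_type("linterp"))             # most_frequent-linterp
print(normalize_imputation_type("most_frequent-locf"))  # most_frequent-locf
print(normalize_imputation_type())                      # most_frequent-mean
```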
- base_algorithm : str, optional
Specifies the base imputation algorithm for seasonal decompose. Applicable only to numerical data columns that are to be imputed by seasonal decompose. Valid options include:
'allzero' : Fill all missing values by zero.
'mean' : Fill all missing values by the mean of the column.
'median' : Fill all missing values by the median of the column.
'sma' : Fill all missing values via simple moving average method.
'lma' : Fill all missing values via linear moving average method.
'ema' : Fill all missing values via exponential moving average method.
'linterp' : Fill all missing values via linear interpolation.
'sinterp' : Fill all missing values via spline interpolation.
'locf' : Fill all missing values via last observation carried forward.
'nocb' : Fill all missing values via next observation carried backward.
Defaults to 'mean'.
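Two of the options listed above, 'locf' and 'nocb', differ only in the direction in which an observed value is propagated. The following is a minimal pure-Python sketch of that difference, not the PAL implementation; the function names are illustrative, and leaving leading or trailing gaps unfilled is an assumption of this sketch, since endpoint handling is not described here.

```python
def locf(values):
    """Last observation carried forward; leading gaps stay None."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def nocb(values):
    """Next observation carried backward; trailing gaps stay None."""
    # Carrying backward is carrying forward on the reversed series.
    return locf(values[::-1])[::-1]


series = [None, 1.0, None, None, 4.0, None]
print(locf(series))  # [None, 1.0, 1.0, 1.0, 4.0, 4.0]
print(nocb(series))  # [1.0, 1.0, 4.0, 4.0, 4.0, None]
```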
- alpha : float, optional
Specifies the criterion for the autocorrelation coefficient. The value range is (0, 1). A larger value indicates a stricter requirement for seasonality.
Defaults to 0.2.
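To give a feel for what a threshold on an autocorrelation coefficient means, the sketch below computes the sample autocorrelation at a given lag and compares it with alpha. How PAL actually tests for seasonality is not described here, so the simple "coefficient greater than alpha" rule and both function names are assumptions made for illustration.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation coefficient of `values` at the given lag."""
    n = len(values)
    if not 0 < lag < n:
        return 0.0
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        return 0.0  # constant series: no measurable correlation
    num = sum((values[i] - mean) * (values[i + lag] - mean)
              for i in range(n - lag))
    return num / denom


def looks_seasonal(values, lag, alpha=0.2):
    """Assumed rule: seasonal at `lag` if the coefficient exceeds alpha."""
    return autocorrelation(values, lag) > alpha


series = [1.0, 2.0, 3.0, 2.0] * 6  # repeats every 4 points
print(looks_seasonal(series, lag=4))             # True
print(looks_seasonal(series, lag=4, alpha=0.9))  # False: stricter threshold
print(looks_seasonal(series, lag=2))             # False
```

With a larger alpha the same series no longer passes, which matches the description of a larger value being a stricter requirement.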
- extrapolation : bool, optional
Specifies whether or not to extrapolate the endpoints of the time-series data.
Defaults to False.
- smooth_width : int, optional
Specifies the width of the moving average applied to non-seasonal data, where 0 indicates linear fitting to extract trends.
Effective only for data columns that are to be imputed via seasonal decompose.
- auxiliary_normalitytest : bool, optional
Specifies whether to use a normality test to identify model types.
Defaults to False.
- thread_ratio : float, optional
Specifies the ratio of available threads to use for time-series imputation.
0: single thread
0~1: percentage
Others: heuristically determined
Defaults to 1.
Examples
Input time-series data for imputation:
>>> data.collect()
   ID    V     X
0   0  0.1     A
1   1  0.3     A
2   2  NaN     A
3   3  0.7  None
4   4  0.9     B
5   5  1.1     B
Setting up a proper imputation strategy to fill in all missing values:
>>> imp = ImputeTS(imputation_type='most_frequent-linterp')
>>> res = imp.fit_transform(data, key='ID')
>>> res.collect()
   ID    V  X
0   0  0.1  A
1   1  0.3  A
2   2  0.5  A
3   3  0.7  A
4   4  0.9  B
5   5  1.1  B
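The filled values in this example can be checked by hand. The sketch below reproduces them in plain Python, without a HANA connection; it is not the PAL implementation, the function names are illustrative, and it assumes evenly spaced keys (as with ID 0 to 5 here) and fills interior gaps only.

```python
from collections import Counter


def linear_interpolate(values):
    """Fill interior None gaps on an evenly spaced grid by linear interpolation."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out


def fill_most_frequent(values):
    """Replace None by the most frequently observed value."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return list(values)  # nothing observed, nothing to fill with
    mode = counts.most_common(1)[0][0]
    return [mode if v is None else v for v in values]


v = [0.1, 0.3, None, 0.7, 0.9, 1.1]   # numerical column V
x = ["A", "A", "A", None, "B", "B"]   # categorical column X

print([round(val, 6) for val in linear_interpolate(v)])
# [0.1, 0.3, 0.5, 0.7, 0.9, 1.1]
print(fill_most_frequent(x))
# ['A', 'A', 'A', 'A', 'B', 'B']
```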
- Attributes
- model_ : DataFrame
A column-wise time-series imputation model stored in statistics format, i.e. with stat names and stat values.
- result_ : DataFrame
The imputation result, structured the same as the data used for obtaining the time-series imputation model, with all missing values filled.
Methods
fit(data[, key, features, ...])
Fit function for Time-series Imputation.
fit_transform(data[, key, features, ...])
Impute the input data and return the imputation result.
transform(data[, key, features, ...])
Impute TS data using model info.
- fit(data, key=None, features=None, categorical_variable=None, col_imputation_type=None)
Fit function for Time-series Imputation.
- Parameters
- data : DataFrame
DataFrame containing the time-series data for imputation.
- key : str, optional
Specifies the name of the time-stamp column of data that represents data ordering. The data type of the column can be INTEGER, DATE or SECONDDATE.
Mandatory if data is not indexed by a single column.
Defaults to the index column of data if not provided.
- features : str or ListOfStrings, optional
Specifies the names of the columns in data that are to be imputed.
Defaults to all non-key columns of data.
- categorical_variable : str or ListOfStrings, optional
Specifies INTEGER column(s) that should be treated as categorical.
By default all INTEGER columns are treated as numerical.
- col_imputation_type : ListOfTuples or dict, optional
Specifies the column-wise imputation type that overwrites the generic imputation type in class initialization.
Should be list of tuples, where each tuple contains 2 elements:
1st element : the column name
2nd element : the imputation type or value. Imputation type could be one of the following:
'non' : Does nothing.
'most_frequent' : Fill all missing values by the most frequently observed value.
'allzero' : Fill all missing values by zero.
'mean' : Fill all missing values by the mean of the column.
'median' : Fill all missing values by the median of the column.
'sma' : Fill all missing values via simple moving average method.
'lma' : Fill all missing values via linear moving average method.
'ema' : Fill all missing values via exponential moving average method.
'linterp' : Fill all missing values via linear interpolation.
'sinterp' : Fill all missing values via spline interpolation.
'locf' : Fill all missing values via last observation carried forward.
'nocb' : Fill all missing values via next observation carried backward.
Among the above options, 'non' applies to both numerical and categorical columns, 'most_frequent' applies to categorical columns only, while the rest apply to numerical columns only. If the input goes beyond the above list of options, it will be treated as a constant value for the universal replacement of all missing values in the column.
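The rule for the second tuple element can be sketched as a small classifier: a string from the list above is read as an imputation type, anything else as a constant fill value. The helper resolve_column_spec and the column names are hypothetical and serve only to illustrate the rule; they are not part of hana_ml.

```python
# Option names taken from the list above.
NAMED_TYPES = {
    "non", "most_frequent", "allzero", "mean", "median", "sma",
    "lma", "ema", "linterp", "sinterp", "locf", "nocb",
}


def resolve_column_spec(spec):
    """Classify the 2nd tuple element as a named type or a constant value."""
    if isinstance(spec, str) and spec in NAMED_TYPES:
        return ("type", spec)
    return ("constant", spec)


# Hypothetical columns: V uses a named type, X and W get constant values.
col_imputation_type = [("V", "locf"), ("X", "unknown"), ("W", 0.5)]
plan = {col: resolve_column_spec(spec) for col, spec in col_imputation_type}
print(plan)
# {'V': ('type', 'locf'), 'X': ('constant', 'unknown'), 'W': ('constant', 0.5)}
```

Note that a misspelled type name would silently fall into the constant branch under this rule, so option strings are worth double-checking.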
- Returns
- A fitted object of class ImputeTS.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
- fit_transform(data, key=None, features=None, categorical_variable=None, col_imputation_type=None)
Impute the input data and return the imputation result.
- Parameters
- data : DataFrame
DataFrame containing the time-series data for imputation.
- key : str, optional
Specifies the name of the time-stamp column of data that represents data ordering. The data type of the column can be INTEGER, DATE or SECONDDATE.
Mandatory if data is not indexed by a single column.
Defaults to the index column of data if not provided.
- features : str or ListOfStrings, optional
Specifies the names of the columns in data that are to be imputed.
Defaults to all non-key columns of data.
- categorical_variable : str or ListOfStrings, optional
Specifies INTEGER column(s) that should be treated as categorical.
By default all INTEGER columns are treated as numerical.
- col_imputation_type : ListOfTuples, optional
Specifies the column-wise imputation type that overwrites the generic imputation type in class initialization.
Should be list of tuples, where each tuple contains 2 elements:
1st element : the column name
2nd element : the imputation type or value. Imputation type could be one of the following:
'non' : Does nothing.
'most_frequent' : Fill all missing values by the most frequently observed value.
'allzero' : Fill all missing values by zero.
'mean' : Fill all missing values by the mean of the column.
'median' : Fill all missing values by the median of the column.
'sma' : Fill all missing values via simple moving average method.
'lma' : Fill all missing values via linear moving average method.
'ema' : Fill all missing values via exponential moving average method.
'linterp' : Fill all missing values via linear interpolation.
'sinterp' : Fill all missing values via spline interpolation.
'locf' : Fill all missing values via last observation carried forward.
'nocb' : Fill all missing values via next observation carried backward.
Among the above options, 'non' applies to both numerical and categorical columns, 'most_frequent' applies to categorical columns only, while the rest apply to numerical columns only. If the input goes beyond the above list of options, it will be treated as a constant value for the universal replacement of all missing values in the column.
- Returns
- DataFrame
The imputed result of data.
- transform(data, key=None, features=None, thread_ratio=None, model=None)
Impute TS data using model info.
- Parameters
- data : DataFrame
DataFrame containing the time-series data for imputation by model.
- key : str, optional
Specifies the name of the time-stamp column of data that represents data ordering. The data type of the column can be INTEGER, DATE or SECONDDATE.
Mandatory if data is not indexed by a single column.
Defaults to the index column of data if not provided.
- features : str or ListOfStrings, optional
Specifies the names of the columns in data that are to be imputed by model.
Defaults to all non-key columns of data.
- thread_ratio : float, optional
Specifies the ratio of available threads to use for time-series imputation by model.
0: single thread
0~1: percentage
Others: heuristically determined
Defaults to 1.
- model : DataFrame, optional
Specifies the model for time-series imputation.
Defaults to self.model_.
- Returns
- DataFrame
The imputed result of data by model.
- DataFrame
Statistics, storing the imputation types of all selected feature columns in data.
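The split between fitting and transforming can be pictured as learning one statistic per column and later applying it to other data. The sketch below shows that idea in plain Python for two simple cases (mean and most frequent value). It is not how PAL stores its model: the functions, the "_FILL" stat-name suffix, and the column names are all hypothetical.

```python
from collections import Counter


def fit_stats(columns, categorical):
    """Learn one fill value per column: mode if categorical, mean otherwise."""
    model = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if name in categorical:
            fill = Counter(present).most_common(1)[0][0]
        else:
            fill = sum(present) / len(present)
        model.append((name + "_FILL", fill))  # hypothetical stat name
    return model


def transform_with_stats(columns, model):
    """Fill None entries in `columns` using the stored (name, value) pairs."""
    fills = {stat[: -len("_FILL")]: value for stat, value in model}
    return {
        name: [fills[name] if v is None else v for v in values]
        for name, values in columns.items()
    }


train = {"V": [1.0, None, 3.0], "X": ["A", "A", None]}
model = fit_stats(train, categorical={"X"})
print(model)  # [('V_FILL', 2.0), ('X_FILL', 'A')]

new = {"V": [None, 5.0], "X": [None, "B"]}
print(transform_with_stats(new, model))
# {'V': [2.0, 5.0], 'X': ['A', 'B']}
```

Keeping the learned values in a separate structure is what lets a model fitted on one data set be passed explicitly to a later transform call.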