ImputeTS
- class hana_ml.algorithms.pal.preprocessing.ImputeTS(imputation_type=None, base_algorithm=None, alpha=None, extrapolation=None, smooth_width=None, auxiliary_normalitytest=None, thread_ratio=None)
Imputation of multi-dimensional time-series data. This is the Python wrapper for PAL procedure PAL_IMPUTE_TIME_SERIES.
- Parameters:
- imputation_type : str, optional
Specifies the overall imputation type for all columns of the time-series data. Valid options include:
'non' : Does nothing. Leave all columns untouched.
'most_frequent-allzero' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by zero.
'most_frequent-mean' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by its mean.
'most_frequent-median' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by median.
'most_frequent-sma' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via simple moving average method.
'most_frequent-lma' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via linear moving average method.
'most_frequent-ema' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by exponential moving average method.
'most_frequent-linterp' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via linear interpolation.
'most_frequent-sinterp' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via spline interpolation.
'most_frequent-seadec' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via seasonal decompose.
'most_frequent-locf' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via last observation carried forward.
'most_frequent-nocb' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via next observation carried backward.
The prefix 'most_frequent-' can be omitted for simplicity.
Defaults to 'most_frequent-mean'.
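To make the default 'most_frequent-mean' concrete, here is a minimal pure-Python sketch of the two rules on toy columns. It is not the PAL implementation: the helper names and the sample data are made up for illustration, and the way ties between equally frequent categories are broken here (first value seen) is an assumption of this sketch.

```python
from collections import Counter
from statistics import mean


def fill_most_frequent(values):
    """Fill missing entries (None) with the most frequently observed value."""
    observed = [v for v in values if v is not None]
    # Counter.most_common breaks ties by first occurrence (sketch assumption).
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def fill_mean(values):
    """Fill missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    avg = mean(observed)
    return [avg if v is None else v for v in values]


# Hypothetical toy columns: one categorical, one numerical.
weather = ["sunny", None, "rain", "sunny", None]
temperature = [20.0, None, 24.0, None, 22.0]

filled_weather = fill_most_frequent(weather)
filled_temperature = fill_mean(temperature)
print(filled_weather)
print(filled_temperature)
```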
- base_algorithm : str, optional
Specifies the base imputation algorithm for seasonal decompose. Applicable only to numerical data columns that are to be imputed by seasonal decompose. Valid options include:
'allzero' : Fill all missing values by zero.
'mean' : Fill all missing values by the mean of the column.
'median' : Fill all missing values by the median of the column.
'sma' : Fill all missing values via simple moving average method.
'lma' : Fill all missing values via linear moving average method.
'ema' : Fill all missing values via exponential moving average method.
'linterp' : Fill all missing values via linear interpolation.
'sinterp' : Fill all missing values via spline interpolation.
'locf' : Fill all missing values via last observation carried forward.
'nocb' : Fill all missing values via next observation carried backward.
Defaults to 'mean'.
- alpha : float, optional
Specifies the criterion for the autocorrelation coefficient. The value range is (0, 1). A larger value indicates stricter requirement for seasonality.
Defaults to 0.2.
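The following sketch illustrates why a larger alpha is a stricter seasonality requirement. It computes a textbook sample autocorrelation coefficient and compares it with a threshold. How PAL selects candidate lags and computes the coefficient internally is not stated here, so the function, the toy series and the comparison are assumptions for illustration only.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation coefficient of `values` at the given lag."""
    n = len(values)
    avg = sum(values) / n
    denom = sum((v - avg) ** 2 for v in values)
    numer = sum((values[t] - avg) * (values[t + lag] - avg) for t in range(n - lag))
    return numer / denom


ALPHA = 0.2  # the documented default threshold

# Hypothetical series that repeats every 4 observations.
series = [1.0, 2.0, 3.0, 4.0] * 5

r4 = autocorrelation(series, 4)  # lag equal to the true period
r2 = autocorrelation(series, 2)  # lag that does not match the period

print(r4 > ALPHA)  # the matching lag passes the threshold
print(r2 > ALPHA)  # the non-matching lag does not
```

Raising the threshold toward 1 would reject weaker periodic patterns that a threshold of 0.2 accepts.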
- extrapolation : bool, optional
Specifies whether or not to extrapolate the endpoints of the time-series data.
Defaults to False.
- smooth_width : int, optional
Specifies the width of the moving average applied to non-seasonal data, where 0 indicates linear fitting to extract trends.
Effective only to data columns that are to be imputed via seasonal decompose.
- auxiliary_normalitytest : bool, optional
Specifies whether or not to use a normality test to identify model types.
Defaults to False.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 1.
Examples
>>> imp = ImputeTS(imputation_type='most_frequent-linterp')
>>> res = imp.fit_transform(data=df, key='ID')
>>> res.collect()
- Attributes:
- model_ : DataFrame
A column-wise time-series imputation model stored in statistics format, i.e. with stat names and stat values.
- result_ : DataFrame
The imputation result, structured the same as the data used for obtaining the time-series imputation model, with all missing values filled.
Methods
fit(data[, key, features, ...])
Fit function for time-series imputation.
fit_transform(data[, key, features, ...])
Impute the input data and return the imputation result.
transform(data[, key, features, ...])
Impute TS data using model info.
- fit(data, key=None, features=None, categorical_variable=None, col_imputation_type=None)
Fit function for Time-series Imputation.
- Parameters:
- data : DataFrame
DataFrame containing the time-series data for imputation.
- key : str, optional
Specifies the name of the time-stamp column of `data` that represents data ordering. Data type of the column could be INTEGER, DATE or SECONDDATE.
Mandatory if `data` is not indexed by a single column.
Defaults to the index column of `data` if not provided.
- features : str or a list of str, optional
Specifies the names of the columns in `data` that are to be imputed.
Defaults to all non-key columns of `data`.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- col_imputation_type : ListOfTuples or dict, optional
Specifies the column-wise imputation type that overwrites the generic imputation type in class initialization.
Should be list of tuples, where each tuple contains 2 elements:
1st element : the column name
2nd element : the imputation type or value. Imputation type could be one of the following:
'non' : Does nothing.
'most_frequent' : Fill all missing values by the most frequently observed value.
'allzero' : Fill all missing values by zero.
'mean' : Fill all missing values by the mean of the column.
'median' : Fill all missing values by the median of the column.
'sma' : Fill all missing values via simple moving average method.
'lma' : Fill all missing values via linear moving average method.
'ema' : Fill all missing values via exponential moving average method.
'linterp' : Fill all missing values via linear interpolation.
'sinterp' : Fill all missing values via spline interpolation.
'locf' : Fill all missing values via last observation carried forward.
'nocb' : Fill all missing values via next observation carried backward.
Among the above options, 'non' applies to both numerical and categorical columns, 'most_frequent' applies to categorical columns only, while the rest apply to numerical columns only. If the input goes beyond the above list of options, it will be treated as a constant value for the universal replacement of all missing values in the column.
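As a rough illustration of what some of these per-column options mean, the sketch below implements 'locf', 'nocb', 'linterp' and constant replacement on a plain Python list, using None for missing values. It is not the PAL implementation: it assumes equally spaced observations, and leaving leading or trailing gaps unfilled is a simplification of this sketch rather than documented PAL behavior.

```python
def locf(values):
    """Last observation carried forward; leading gaps stay missing here."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def nocb(values):
    """Next observation carried backward; trailing gaps stay missing here."""
    return locf(values[::-1])[::-1]


def linterp(values):
    """Linear interpolation between known neighbours (equal spacing assumed)."""
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out


def fill_constant(values, constant):
    """Replace every missing entry with the same constant value."""
    return [constant if v is None else v for v in values]


# Hypothetical column with gaps at the start, middle and end.
series = [None, 1.0, None, None, 4.0, None]

print(locf(series))
print(nocb(series))
print(linterp(series))
print(fill_constant(series, 0.0))
```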
- Returns:
- A fitted object of class "ImputeTS".
- fit_transform(data, key=None, features=None, categorical_variable=None, col_imputation_type=None)
Impute the input data and return the imputation result.
- Parameters:
- data : DataFrame
DataFrame containing the time-series data for imputation.
- key : str, optional
Specifies the name of the time-stamp column of `data` that represents data ordering. Data type of the column could be INTEGER, DATE or SECONDDATE.
Mandatory if `data` is not indexed by a single column.
Defaults to the index column of `data` if not provided.
- features : str or a list of str, optional
Specifies the names of the columns in `data` that are to be imputed.
Defaults to all non-key columns of `data`.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- col_imputation_type : ListOfTuples, optional
Specifies the column-wise imputation type that overwrites the generic imputation type in class initialization.
Should be list of tuples, where each tuple contains 2 elements:
1st element : the column name
2nd element : the imputation type or value. Imputation type could be one of the following:
'non' : Does nothing.
'most_frequent' : Fill all missing values by the most frequently observed value.
'allzero' : Fill all missing values by zero.
'mean' : Fill all missing values by the mean of the column.
'median' : Fill all missing values by the median of the column.
'sma' : Fill all missing values via simple moving average method.
'lma' : Fill all missing values via linear moving average method.
'ema' : Fill all missing values via exponential moving average method.
'linterp' : Fill all missing values via linear interpolation.
'sinterp' : Fill all missing values via spline interpolation.
'locf' : Fill all missing values via last observation carried forward.
'nocb' : Fill all missing values via next observation carried backward.
Among the above options, 'non' applies to both numerical and categorical columns, 'most_frequent' applies to categorical columns only, while the rest apply to numerical columns only. If the input goes beyond the above list of options, it will be treated as a constant value for the universal replacement of all missing values in the column.
- Returns:
- DataFrame
The imputed result of `data`.
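A possible way to combine the generic imputation type with column-wise overrides is sketched below. The import path and the keyword arguments follow the signatures documented on this page; the column names SALES and REGION, the key ID and the helper `impute_sales` are hypothetical. The import is placed inside the function so that the snippet loads even where hana_ml is not installed; calling it requires a hana_ml DataFrame backed by a live SAP HANA connection.

```python
# Column-wise overrides as a list of (column name, imputation type) tuples.
# Column names are hypothetical and chosen only for illustration.
COL_IMPUTATION_TYPE = [
    ("SALES", "linterp"),         # numerical column: linear interpolation
    ("REGION", "most_frequent"),  # categorical column: most frequent value
]


def impute_sales(df, key="ID"):
    """Impute `df`, overriding the generic type for the columns listed above."""
    # Lazy import: hana_ml is only needed when the function is actually called.
    from hana_ml.algorithms.pal.preprocessing import ImputeTS

    imp = ImputeTS(imputation_type="most_frequent-mean")
    return imp.fit_transform(
        data=df,
        key=key,
        col_imputation_type=COL_IMPUTATION_TYPE,
    )
```

Columns not named in the list fall back to the generic type given at class initialization.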
- transform(data, key=None, features=None, thread_ratio=None, model=None)
Impute TS data using model info.
- Parameters:
- data : DataFrame
DataFrame containing the time-series data for imputation by model.
- key : str, optional
Specifies the name of the time-stamp column of `data` that represents data ordering. Data type of the column could be INTEGER, DATE or SECONDDATE.
Mandatory if `data` is not indexed by a single column.
Defaults to the index column of `data` if not provided.
- features : str or a list of str, optional
Specifies the names of the columns in `data` that are to be imputed by model.
Defaults to all non-key columns of `data`.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 1.
- model : DataFrame, optional
Specifies the model for time-series imputation.
Defaults to self.`model_`.
- Returns:
- DataFrame
The imputed result of `data` by model.
- DataFrame
Statistics, storing the imputation types of all selected feature columns in `data`.
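The fit-then-transform workflow can be sketched as a small helper. It assumes that `transform` returns the two DataFrames listed above as a pair, in that order; the helper name and its arguments are hypothetical. The imputer is passed in as an argument, so the helper itself does not depend on hana_ml.

```python
def fit_then_transform(imputer, train_df, new_df, key):
    """Fit an imputation model on `train_df`, then apply it to `new_df`.

    `imputer` is expected to expose fit(data, key) and transform(data, key),
    as the ImputeTS class documented on this page does.
    """
    imputer.fit(data=train_df, key=key)
    # Assumption: transform returns (imputed result, statistics) as a pair.
    result, stats = imputer.transform(data=new_df, key=key)
    return result, stats


# Intended use with hana_ml installed and a live SAP HANA connection:
#   from hana_ml.algorithms.pal.preprocessing import ImputeTS
#   result, stats = fit_then_transform(
#       ImputeTS(imputation_type='linterp'), train_df, new_df, key='ID')
```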
Inherited Methods from PALBase
Besides the methods mentioned above, the ImputeTS class also inherits methods from the PALBase class. Please refer to PAL Base for more details.