ImputeTS
- class hana_ml.algorithms.pal.preprocessing.ImputeTS(imputation_type=None, base_algorithm=None, alpha=None, extrapolation=None, smooth_width=None, auxiliary_normalitytest=None, thread_ratio=None)
Imputation of multi-dimensional time-series data. This is the Python wrapper for PAL procedure PAL_IMPUTE_TIME_SERIES.
- Parameters:
- imputation_type : str, optional
Specifies the overall imputation type for all columns of the time-series data. Valid options include:
'non' : Does nothing. Leave all columns untouched.
'most_frequent-allzero' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by zero.
'most_frequent-mean' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by its mean.
'most_frequent-median' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by median.
'most_frequent-sma' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via simple moving average method.
'most_frequent-lma' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via linear moving average method.
'most_frequent-ema' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values by exponential moving average method.
'most_frequent-linterp' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via linear interpolation.
'most_frequent-sinterp' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via spline interpolation.
'most_frequent-seadec' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via seasonal decompose.
'most_frequent-locf' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via last observation carried forward.
'most_frequent-nocb' : For any categorical column, fill all missing values by the value that appears most often in that column; while for any numerical column, fill all missing values via next observation carried backward.
The prefix 'most_frequent-' can be omitted for simplicity.
Defaults to 'most_frequent-mean'.
- base_algorithm : str, optional
Specifies the base imputation algorithm for seasonal decompose. Applicable only to numerical data columns that are to be imputed by seasonal decompose. Valid options include:
'allzero' : Fill all missing values by zero.
'mean' : Fill all missing values by the mean of the column.
'median' : Fill all missing values by the median of the column.
'sma' : Fill all missing values via simple moving average method.
'lma' : Fill all missing values via linear moving average method.
'ema' : Fill all missing values via exponential moving average method.
'linterp' : Fill all missing values via linear interpolation.
'sinterp' : Fill all missing values via spline interpolation.
'locf' : Fill all missing values via last observation carried forward.
'nocb' : Fill all missing values via next observation carried backward.
Defaults to 'mean'.
- alpha : float, optional
Specifies the criterion for the autocorrelation coefficient. The value range is (0, 1). A larger value indicates stricter requirement for seasonality.
Defaults to 0.2.
- extrapolation : bool, optional
Specifies whether or not to extrapolate the endpoints of the time-series data.
Defaults to False.
- smooth_width : int, optional
Specifies the width of the moving average applied to non-seasonal data, where 0 indicates linear fitting to extract trends.
Effective only for data columns that are to be imputed via seasonal decompose.
- auxiliary_normalitytest : bool, optional
Specifies whether to use a normality test to identify model types.
Defaults to False.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside the range are ignored, and the function then determines the number of threads heuristically.
Defaults to 1.
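The role of alpha as a threshold on an autocorrelation coefficient can be pictured with a small plain-Python sketch. This is an illustration only, not PAL's implementation: the helper names `autocorrelation` and `looks_seasonal`, and the rule "accept a period if its autocorrelation exceeds alpha", are assumptions made for this example.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(series)")
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        return 0.0  # constant series: nothing to correlate
    numer = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numer / denom


def looks_seasonal(series, period, alpha=0.2):
    """Hypothetical rule: accept `period` only if its autocorrelation exceeds alpha."""
    return autocorrelation(series, period) > alpha


# A toy series that repeats every 4 observations.
series = [1.0, 2.0, 3.0, 4.0] * 5
print(autocorrelation(series, 4))           # strong correlation at the true period
print(looks_seasonal(series, 4, alpha=0.2))
print(looks_seasonal(series, 4, alpha=0.9))  # a larger alpha is a stricter test
```

Under this assumed rule, raising alpha makes it harder for a candidate period to qualify, which matches the statement that a larger value is a stricter requirement for seasonality.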
- Attributes:
- model_ : DataFrame
A column-wise time-series imputation model stored in statistics format, i.e. with stat names and stat values.
- result_ : DataFrame
The imputation result, structured the same as the data used for obtaining the time-series imputation model, with all missing values filled.
Methods
fit(data[, key, features, ...]) : Fit function for time-series imputation.
fit_transform(data[, key, features, ...]) : Impute the input data and return the imputation result.
transform(data[, key, features, ...]) : Impute time-series data using model info.
Examples
>>> imp = ImputeTS(imputation_type='most_frequent-linterp')
>>> res = imp.fit_transform(data=df, key='ID')
>>> res.collect()
- fit(data, key=None, features=None, categorical_variable=None, col_imputation_type=None)
Fit function for Time-series Imputation.
- Parameters:
- data : DataFrame
DataFrame containing the time-series data for imputation.
- key : str, optional
Specifies the name of the time-stamp column of data that represents data ordering.
Data type of the column could be INTEGER, DATE or SECONDDATE.
Mandatory if data is not indexed by a single column.
Defaults to index column of data if not provided.
- features : str or a list of str, optional
Specifies the names of the columns in data that are to be imputed.
Defaults to all non-key columns of data.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- col_imputation_type : ListOfTuples or dict, optional
Specifies the column-wise imputation type that overrides the generic imputation type set at class initialization.
Should be a list of tuples, where each tuple contains 2 elements:
1st element : the column name
2nd element : the imputation type or value. Imputation type could be one of the following:
'non' : Does nothing.
'most_frequent' : Fill all missing values by the most frequently observed value.
'allzero' : Fill all missing values by zero.
'mean' : Fill all missing values by the mean of the column.
'median' : Fill all missing values by the median of the column.
'sma' : Fill all missing values via simple moving average method.
'lma' : Fill all missing values via linear moving average method.
'ema' : Fill all missing values via exponential moving average method.
'linterp' : Fill all missing values via linear interpolation.
'sinterp' : Fill all missing values via spline interpolation.
'locf' : Fill all missing values via last observation carried forward.
'nocb' : Fill all missing values via next observation carried backward.
Among the above options, 'non' applies to both numerical and categorical columns, 'most_frequent' applies to categorical columns only, while the rest apply to numerical columns only. If the input goes beyond the above list of options, it will be treated as a constant value for the universal replacement of all missing values in the column.
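A column-wise specification can be sketched as follows. The column names (TEMPERATURE, STATUS, PRESSURE) and the helper `describe_col_imputation` are hypothetical and exist only for this illustration; the helper simply mirrors the rule stated above that an entry outside the listed options is treated as a constant fill value. Treating a dict as a mapping from column name to type or value is also an assumption.

```python
# Named column-wise imputation types, as listed for col_imputation_type.
NAMED_TYPES = {
    "non", "most_frequent", "allzero", "mean", "median", "sma",
    "lma", "ema", "linterp", "sinterp", "locf", "nocb",
}


def describe_col_imputation(col_imputation_type):
    """Label each (column, type_or_value) pair as a named type or a constant."""
    if isinstance(col_imputation_type, dict):
        # Assumption: a dict maps column name -> imputation type or value.
        col_imputation_type = list(col_imputation_type.items())
    described = []
    for column, type_or_value in col_imputation_type:
        kind = "named" if type_or_value in NAMED_TYPES else "constant"
        described.append((column, type_or_value, kind))
    return described


# Hypothetical columns: one interpolated, one categorical, one constant-filled.
col_imputation_type = [
    ("TEMPERATURE", "linterp"),
    ("STATUS", "most_frequent"),
    ("PRESSURE", 101.3),
]

for column, value, kind in describe_col_imputation(col_imputation_type):
    print(column, value, kind)

# With an existing ImputeTS instance `imp` and HANA DataFrame `df` (both assumed):
# imp.fit(data=df, key='ID', col_imputation_type=col_imputation_type)
```

Columns not named in the list would keep the generic imputation_type chosen when the ImputeTS object was created.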
- Returns:
- A fitted object of class "ImputeTS".
- fit_transform(data, key=None, features=None, categorical_variable=None, col_imputation_type=None)
Impute the input data and return the imputation result.
- Parameters:
- data : DataFrame
DataFrame containing the time-series data for imputation.
- key : str, optional
Specifies the name of the time-stamp column of data that represents data ordering.
Data type of the column could be INTEGER, DATE or SECONDDATE.
Mandatory if data is not indexed by a single column.
Defaults to index column of data if not provided.
- features : str or a list of str, optional
Specifies the names of the columns in data that are to be imputed.
Defaults to all non-key columns of data.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- col_imputation_type : ListOfTuples, optional
Specifies the column-wise imputation type that overrides the generic imputation type set at class initialization.
Should be a list of tuples, where each tuple contains 2 elements:
1st element : the column name
2nd element : the imputation type or value. Imputation type could be one of the following:
'non' : Does nothing.
'most_frequent' : Fill all missing values by the most frequently observed value.
'allzero' : Fill all missing values by zero.
'mean' : Fill all missing values by the mean of the column.
'median' : Fill all missing values by the median of the column.
'sma' : Fill all missing values via simple moving average method.
'lma' : Fill all missing values via linear moving average method.
'ema' : Fill all missing values via exponential moving average method.
'linterp' : Fill all missing values via linear interpolation.
'sinterp' : Fill all missing values via spline interpolation.
'locf' : Fill all missing values via last observation carried forward.
'nocb' : Fill all missing values via next observation carried backward.
Among the above options, 'non' applies to both numerical and categorical columns, 'most_frequent' applies to categorical columns only, while the rest apply to numerical columns only. If the input goes beyond the above list of options, it will be treated as a constant value for the universal replacement of all missing values in the column.
- Returns:
- DataFrame
The imputed result of data.
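To make the meaning of a few of the named imputation types concrete, here is a plain-Python sketch of 'locf', 'nocb' and 'linterp' on a list in which None marks a missing value. It illustrates the general techniques only and is not PAL's implementation: it assumes equally spaced observations, and it leaves gaps at the ends of the series unfilled, which may differ from what PAL does.

```python
def fill_locf(values):
    """Last observation carried forward; leading gaps stay None in this sketch."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def fill_nocb(values):
    """Next observation carried backward; trailing gaps stay None in this sketch."""
    return fill_locf(values[::-1])[::-1]


def fill_linterp(values):
    """Linear interpolation between neighbouring observations, no extrapolation."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


series = [None, 1.0, None, None, 4.0, None]
print(fill_locf(series))
print(fill_nocb(series))
print(fill_linterp(series))
```

The three functions differ only in which observed neighbours they consult: the previous one, the next one, or both.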
- transform(data, key=None, features=None, thread_ratio=None, model=None)
Impute TS data using model info.
- Parameters:
- data : DataFrame
DataFrame containing the time-series data for imputation by model.
- key : str, optional
Specifies the name of the time-stamp column of data that represents data ordering.
Data type of the column could be INTEGER, DATE or SECONDDATE.
Mandatory if data is not indexed by a single column.
Defaults to index column of data if not provided.
- features : str or a list of str, optional
Specifies the names of the columns in data that are to be imputed by model.
Defaults to all non-key columns of data.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside the range are ignored, and the function then determines the number of threads heuristically.
Defaults to 1.
- model : DataFrame, optional
Specifies the model for time-series imputation.
Defaults to self.model_.
- Returns:
- DataFrame
The imputed result of data by model.
- DataFrame
Statistics, storing the imputation types of all selected feature columns in data.
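A fit-then-transform workflow can be sketched as below. The wrapper `fit_then_transform` is a hypothetical helper, not part of hana_ml; `imputer` is assumed to be an ImputeTS instance, and `train_data` and `new_data` are assumed to be HANA DataFrames obtained over a live connection. The two-value unpacking follows the two DataFrames listed under Returns above.

```python
def fit_then_transform(imputer, train_data, new_data, key):
    """Fit an imputation model on one dataset and apply it to another.

    `imputer` is assumed to be an ImputeTS instance; this helper is a
    hypothetical convenience wrapper for illustration only.
    """
    # Learn the column-wise imputation model from the training data.
    imputer.fit(data=train_data, key=key)
    # Apply the stored model; transform is documented to return the imputed
    # data together with a statistics DataFrame.
    result, stats = imputer.transform(data=new_data, key=key)
    return result, stats


# Usage, assuming `imp`, `df_train` and `df_new` already exist:
# result, stats = fit_then_transform(imp, df_train, df_new, key='ID')
# result.collect()
```

Because transform defaults to the model stored on the instance, no explicit model argument is passed here.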
Inherited Methods from PALBase
Besides the methods mentioned above, the ImputeTS class also inherits methods from the PALBase class. Please refer to PAL Base for more details.