MassiveAutomaticTimeSeries¶

class hana_ml.algorithms.pal.massive_auto_ml.MassiveAutomaticTimeSeries(scorings=None, generations=None, population_size=None, offspring_size=None, elite_number=None, min_layer=None, max_layer=None, mutation_rate=None, crossover_rate=None, random_seed=None, config_dict=None, fold_num=None, resampling_method=None, max_eval_time_mins=None, early_stop=None, percentage=None, gap_num=None, connections=None, alpha=None, delta=None, top_k_connections=None, top_k_pipelines=None, search_method=None, fine_tune_pipeline=None, fine_tune_resource=None, with_hyperband=None, reduction_rate=None, min_resource=None, max_resource=None, special_group_id='PAL_MASSIVE_PROCESSING_SPECIAL_GROUP_ID2', progress_indicator_id=None)¶

MassiveAutomaticTimeSeries offers an intelligent search among machine learning pipelines for time series tasks. Each machine learning pipeline contains several operators such as preprocessors, time series models, and transformers that follow the API of hana-ml algorithms.

For MassiveAutomaticTimeSeries parameter mappings between hana_ml and HANA PAL, please refer to the doc page: Parameter Mappings

Parameters

scoringsdict, optional

MassiveAutomaticTimeSeries supports multi-objective optimization with specified weights for each target. The goal is to maximize the target. Therefore, if you want to minimize a target, the weight for that target should be negative.

The available target options are as follows:

EVAR: Explained Variance. Higher values indicate better performance. It is recommended to assign a positive weight to this metric.
MAE: Mean Absolute Error. Lower values indicate better performance. It is recommended to assign a negative weight to this metric.
MAPE: Mean Absolute Percentage Error. Lower values indicate better performance. It is recommended to assign a negative weight to this metric.
MAX_ERROR: The maximum absolute difference between the observed value and the expected value. Lower values indicate better performance. It is recommended to assign a negative weight to this metric.
MSE: Mean Squared Error. Lower values indicate better performance. It is recommended to assign a negative weight to this metric.
R2: R-squared. Higher values indicate better performance. It is recommended to assign a positive weight to this metric.
RMSE: Root Mean Squared Error. Lower values indicate better performance. It is recommended to assign a negative weight to this metric.
WMAPE: Weighted Mean Absolute Percentage Error. Lower values indicate better performance. It is recommended to assign a negative weight to this metric.
LAYERS: The number of operators. Lower values indicate better performance. It is recommended to assign a negative weight to this metric.
SPEC: Stock keeping oriented Prediction Error Costs. Lower values indicate better performance. It is recommended to assign a negative weight to this metric.
TIME: Represents the computational time in seconds used. Lower values indicate better performance. It is recommended to assign a negative weight to this metric.

Defaults to {"MAE": -1.0, "EVAR": 1.0} (minimize MAE and maximize EVAR).
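The sign convention above can be captured in a small helper. This is only a sketch: `make_scorings` is a hypothetical name, not part of hana-ml, and it merely applies the positive/negative recommendations from the metric list before the dict is passed as `scorings`.

```python
# Metrics listed above as "higher is better" get positive weights;
# every other listed metric gets a negative weight.
HIGHER_IS_BETTER = {"EVAR", "R2"}
LOWER_IS_BETTER = {"MAE", "MAPE", "MAX_ERROR", "MSE", "RMSE",
                   "WMAPE", "LAYERS", "SPEC", "TIME"}


def make_scorings(weights):
    """Hypothetical helper: turn {metric: magnitude} into a signed scorings dict."""
    scorings = {}
    for metric, magnitude in weights.items():
        name = metric.upper()
        if name in HIGHER_IS_BETTER:
            scorings[name] = abs(float(magnitude))
        elif name in LOWER_IS_BETTER:
            scorings[name] = -abs(float(magnitude))
        else:
            raise ValueError(f"Unknown scoring metric: {metric}")
    return scorings


scorings = make_scorings({"MAE": 1.0, "EVAR": 1.0})
print(scorings)  # {'MAE': -1.0, 'EVAR': 1.0}
```

The resulting dict matches the documented default and could be passed as `scorings=scorings` to the constructor.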

generationsint, optional

The number of iterations for pipeline optimization.

Defaults to 5.

population_sizeint, optional

When search_method is 'GA', population_size is the number of individuals in each generation in the genetic programming algorithm. Too few individuals can limit the possibilities of crossover and exploration of the search space, while too many can slow down the algorithm.
When search_method is 'random', population_size is the number of pipelines randomly generated and evaluated in random search.

Defaults to 20.

offspring_sizeint, optional

The number of offspring to produce in each generation.

It controls the number of new individuals generated in each iteration by genetic operations from the population.

Defaults to the value of population_size.

elite_numberint, optional

The number of elites to output into the result table.

Defaults to 1/4 of population_size.

min_layerint, optional

The minimum number of operators in a pipeline.

Defaults to 1.

max_layerint, optional

The maximum number of operators in a pipeline.

Defaults to 5.

mutation_ratefloat, optional

The mutation rate for the genetic programming algorithm.

Represents the random search ability. A suitable value can prevent the GA from falling into a local optimum.

The sum of mutation_rate and crossover_rate cannot be greater than 1.0. When the sum is less than 1.0, the remaining probability will be used to regenerate.

Defaults to 0.9.

crossover_ratefloat, optional

The crossover rate for the genetic programming algorithm.

Represents the local search ability. A larger crossover rate will cause GA to converge to a local optimum faster.

The sum of mutation_rate and crossover_rate cannot be greater than 1.0. When the sum is less than 1.0, the remaining probability will be used to regenerate.

Defaults to 0.1.
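The constraint on mutation_rate and crossover_rate can be checked before constructing the estimator. The following is a minimal sketch; `regeneration_probability` is a hypothetical helper, not a hana-ml function, and it only restates the rule that the two rates may not sum to more than 1.0, with the remainder used for regeneration.

```python
def regeneration_probability(mutation_rate=0.9, crossover_rate=0.1):
    """Hypothetical check: validate the two rates and return the leftover probability."""
    for name, value in (("mutation_rate", mutation_rate),
                        ("crossover_rate", crossover_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within [0, 1], got {value}")
    total = mutation_rate + crossover_rate
    # Small tolerance so that floating-point noise does not trigger the error.
    if total > 1.0 + 1e-12:
        raise ValueError("mutation_rate + crossover_rate cannot be greater than 1.0")
    return max(0.0, 1.0 - total)


# With the documented defaults nothing is left over for regeneration.
print(regeneration_probability())
# With 0.6 and 0.3, roughly 0.1 of the probability mass is used to regenerate.
print(regeneration_probability(0.6, 0.3))
```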

random_seedint, optional

Specifies the seed for the random number generator. Uses system time if not provided.

No default value.

config_dictstr or dict, optional

The customized configuration for the search space.

{'light', 'default'}: use provided config_dict templates.
JSON format config_dict. It can be a JSON string or dict.

If it is None, the default config_dict will be used.

Defaults to None.

progress_indicator_idstr, optional

Set the ID used to output monitoring information of the optimization progress.

No default value.

fold_numint, optional

The number of folds in the cross-validation process.

Defaults to 5.

resampling_method{'rocv', 'block'}, optional

Specifies the resampling method for pipeline evaluation.

Defaults to 'rocv'.

max_eval_time_minsfloat, optional

Time limit to evaluate a single pipeline, in minutes.

Defaults to 0.0 (no time limit).

early_stopint, optional

Stop optimization when the best pipeline is not updated for the given consecutive generations. 0 means there is no early stop.

Defaults to 5.
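The early-stop rule can be illustrated in plain Python. This sketch is not the PAL implementation: `generations_run` is a hypothetical function, and it assumes that the best pipeline is summarized by a single score per generation, that a higher score is better, and that "updated" means strictly improved.

```python
def generations_run(best_scores, early_stop=5):
    """Hypothetical illustration: number of generations executed before stopping.

    best_scores holds the best score observed in each generation.
    early_stop=0 disables early stopping.
    """
    if early_stop == 0:
        return len(best_scores)
    best = None
    stale = 0  # consecutive generations without improvement
    for generation, score in enumerate(best_scores, start=1):
        if best is None or score > best:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= early_stop:
                return generation
    return len(best_scores)


history = [1.0, 2.0, 2.0, 2.0, 3.0]
print(generations_run(history, early_stop=2))  # 4
print(generations_run(history, early_stop=0))  # 5
```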

percentagefloat, optional

Percentage between training data and test data. Only applicable when resampling_method is 'block'.

Defaults to 0.7.

gap_numint, optional

Number of samples to exclude from the end of each train set before the test set.

Defaults to 0.
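To make the interplay of percentage and gap_num concrete, here is a rough sketch of a single 'block' split. It rests on assumptions that this page does not confirm: that percentage is the fraction of rows used for training, that gap_num applies to the 'block' method, and that rows are split by position. `block_split` is a hypothetical function, not a hana-ml API.

```python
def block_split(n_rows, percentage=0.7, gap_num=0):
    """Hypothetical illustration of a train/test block split with a gap.

    Returns (train_rows, test_rows) as ranges of row positions.
    """
    if not 0.0 < percentage < 1.0:
        raise ValueError("percentage must be strictly between 0 and 1")
    if gap_num < 0:
        raise ValueError("gap_num must be non-negative")
    split = round(n_rows * percentage)   # first row of the test block
    train_end = max(0, split - gap_num)  # drop gap_num rows from the end of the train block
    return range(0, train_end), range(split, n_rows)


train, test = block_split(10, percentage=0.7, gap_num=2)
print(list(train))  # [0, 1, 2, 3, 4]
print(list(test))   # [7, 8, 9]
```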

connectionsstr or dict, optional

Specifies the connections in the Connection Constrained Optimization. The options are:

'default'
customized connections JSON string or dict.

Defaults to None. If connections is not provided, connection constrained optimization is not applied.

alphafloat, optional

Adjusts rejection probability in connection optimization.

Valid only when connections is set.

Defaults to 0.1.

deltafloat, optional

Controls the increase rate of connection weights.

Valid only when connections is set.

Defaults to 1.0.

top_k_connectionsint, optional

The number of top connections used to generate optimal connections.

Valid only when connections is set.

Defaults to 1/2 of (connection size in connections).

top_k_pipelinesint, optional

The number of pipelines used to update connections in each iteration.

Valid only when connections is set.

Defaults to 1/2 of offspring_size.

search_methodstr, optional

Optimization algorithm used in AutoML.

'GA': Genetic Algorithm
'random': Random Search

Defaults to 'GA'.

fine_tune_pipelinebool, optional

Specifies whether or not to fine-tune the pipelines generated by the genetic algorithm.

Valid only when search_method is 'GA'.

Defaults to False.

fine_tune_resourceint, optional

Specifies the resource limit to use for fine-tuning the pipelines generated by the genetic algorithm.

Valid only when fine_tune_pipeline is set to True.

Defaults to the value of population_size.

with_hyperbandbool, optional

Indicates whether to use Hyperband.

Only valid when search_method is "random".

Defaults to False.

reduction_ratefloat, optional

Specifies the reduction rate in the Hyperband method.

Only valid when with_hyperband is True.

Defaults to 1.

min_resourceint, optional

The minimum number of resources allocated in each iteration of Hyperband.

Only valid when with_hyperband is True.

Defaults to max(5, data.count()/10).

max_resourceint, optional

The maximum number of resources allocated in each iteration of Hyperband.

Only valid when with_hyperband is True.

Defaults to data.count().
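The documented Hyperband defaults depend only on the row count of the training data. The sketch below computes them in plain Python; `hyperband_resource_defaults` is a hypothetical helper, and the use of integer division for data.count()/10 is an assumption.

```python
def hyperband_resource_defaults(row_count):
    """Hypothetical helper: default (min_resource, max_resource) for a given row count."""
    min_resource = max(5, row_count // 10)  # assumes integer division
    max_resource = row_count
    return min_resource, max_resource


print(hyperband_resource_defaults(1000))  # (100, 1000)
print(hyperband_resource_defaults(30))    # (5, 30)
```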

Attributes

best_pipeline_: DataFrame

Best pipelines selected, structured as follows:

1st column: GROUP_ID, type INTEGER or NVARCHAR, pipeline GROUP IDs.
2nd column: ID, type INTEGER, pipeline IDs.
3rd column: PIPELINE, type NVARCHAR, pipeline contents.
4th column: SCORES, type NVARCHAR, scoring metrics for pipeline.

Available only when group_pipelines is not specified during the fitting process.

model_: DataFrame or a list of DataFrames

If group_pipelines is not None, structured as follows:

1st column: GROUP_ID
2nd column: ROW_INDEX
3rd column: MODEL_CONTENT

If auto-ml is enabled, structured as follows:

1st DataFrame:
- 1st column: GROUP_ID
- 2nd column: ROW_INDEX
- 3rd column: MODEL_CONTENT
2nd DataFrame: best_pipeline_

info_: DataFrame

Related info/statistics for MassiveAutomaticTimeSeries pipeline fitting, structured as follows:

1st column: GROUP_ID
2nd column: STAT_NAME
3rd column: STAT_VALUE

error_: DataFrame

Error information for the pipeline fitting process.

Methods

`cleanup_progress_log`(connection_context)	Clean up the progress log.
`delete_config_dict`([operator_name, ...])	Deletes the content of the config dict.
`disable_auto_sql_content`([disable])	Disable auto SQL content logging.
`disable_log_cleanup`([disable])	Disable the log clean up.
`disable_mlflow_autologging`()	Disable the mlflow autologging.
`disable_workload_class_check`()	Disable the workload class check.
`display_config_dict`([operator_name, category])	Displays the config dict.
`display_progress_table`(connection_context)	Return the progress table.
`enable_mlflow_autologging`([schema, meta, ...])	Enable the mlflow autologging.
`evaluate`(data[, pipeline, key, features, ...])	Evaluates a pipeline.
`fit`(data, group_key[, key, endog, exog, ...])	The fit function for MassiveAutomaticTimeSeries.
`get_best_pipeline`()	Return the best pipeline.
`get_best_pipelines`()	Return the best pipeline.
`get_config_dict`()	Return the config_dict.
`get_optimal_config_dict`()	Return the optimal config_dict.
`get_optimal_connections`()	Return the optimal connections.
`get_workload_classes`(connection_context)	Return the available workload classes information.
`make_future_dataframe`([data, key, ...])	Create a new dataframe for time series prediction.
`persist_progress_log`()	Persist the progress log.
`pipeline_plot`([name, iframe_height])	Pipeline plot.
`predict`(data, group_key[, key, exog, model, ...])	Predict function for MassiveAutomaticTimeSeries.
`score`(data, group_key[, key, endog, exog, ...])	Pipeline model score function.
`set_progress_log_level`(log_level)	Set progress log level to output scorings.
`update_category_map`(connection_context)	Updates the list of operators.
`update_config_dict`(operator_name[, ...])	Updates the config dict.

References

Under the given config_dict and scoring, MassiveAutomaticTimeSeries uses genetic programming to search for the best valid pipeline. Please see Genetic Optimization in AutoML for more details.

Examples

Create a MassiveAutomaticTimeSeries instance:

>>> import uuid
>>> from hana_ml.algorithms.pal.massive_auto_ml import MassiveAutomaticTimeSeries
>>> progress_id = "automl_{}".format(uuid.uuid1())
>>> auto_ts = MassiveAutomaticTimeSeries(generations=2,
                                         population_size=5,
                                         offspring_size=5,
                                         progress_indicator_id=progress_id)
>>> auto_ts.enable_workload_class("MY_WORKLOAD_CLASS")
>>> auto_ts.fit(data=df_ts, group_key='GROUP_ID', key='ID', endog="SERIES")
>>> pipeline = auto_ts.get_best_pipelines()
>>> auto_ts.fit(data=df_ts, group_key='GROUP_ID', key='ID', group_pipelines=pipeline)
>>> res = auto_ts.predict(data=df_predict, group_key='GROUP_ID', key='ID')

If you want to set the config_dict parameter for a group or some groups specifically, you can set it with group_params parameter:

>>> auto_ts.fit(data=df_ts, group_key="GROUP_ID", key='ID', endog="SERIES",
                group_params={<GROUP ID>: {'config_dict': <YOUR config_dict for this group>}})
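The placeholders in the call above stand for a plain nested dict keyed by group ID. As a sketch, such a dict could be assembled as follows; `build_group_params` is a hypothetical helper, the group IDs are made up, and using the 'light' template name as a per-group config_dict is an assumption based on the constructor's config_dict options.

```python
def build_group_params(group_ids, config_dict="light"):
    """Hypothetical helper: assign the same config_dict to each listed group."""
    return {group_id: {"config_dict": config_dict} for group_id in group_ids}


# Made-up group IDs, for illustration only.
group_params = build_group_params(["GROUP_A", "GROUP_B"])
print(group_params)
# {'GROUP_A': {'config_dict': 'light'}, 'GROUP_B': {'config_dict': 'light'}}
```

The resulting dict would then be passed as `group_params=group_params` in the fit call.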

cleanup_progress_log(connection_context)¶

Clean up the progress log.

Parameters

connection_contextConnectionContext: The connection object to a SAP HANA database.

delete_config_dict(operator_name=None, category=None, param_name=None)¶

Deletes the content of the config dict.

Parameters

operator_namestr, optional

Deletes the operator based on the given name in the config dict.

Defaults to None.

categorystr, optional

Deletes the whole given category in the config dict.

Defaults to None.

param_namestr, optional

Deletes the parameter based on the given name once the operator name is provided.

Defaults to None.

disable_auto_sql_content(disable=True)¶: Disable auto SQL content logging. Use AFL's default progress logging.

disable_log_cleanup(disable=True)¶: Disable the log clean up.

disable_mlflow_autologging()¶: Disable the mlflow autologging.

disable_workload_class_check()¶: Disable the workload class check. Note that MassiveAutomaticClassification, MassiveAutomaticRegression, and MassiveAutomaticTimeSeries may consume a large amount of resources. If no workload class is set, the training process runs without any resource restriction.

display_config_dict(operator_name=None, category=None)¶

Displays the config dict.

Parameters

operator_namestr, optional

Only displays the information on the given operator name.

Defaults to None.

categorystr, optional

Only displays the information on the given category.

Defaults to None.

display_progress_table(connection_context)¶

Return the progress table.

Parameters

connection_contextConnectionContext: The connection object to a SAP HANA database.

Returns

DataFrame: Progress table.

enable_mlflow_autologging(schema=None, meta=None, is_exported=False, registered_model_name=None)¶

Enable the mlflow autologging.

Parameters

schemastr, optional

Defines the model storage schema for mlflow autologging.

Defaults to the current schema.

metastr, optional

Defines the model storage meta table for mlflow autologging.

Defaults to 'HANAML_MLFLOW_MODEL_STORAGE'.

is_exportedbool, optional

Determines whether to export the HANA model to mlflow.

Defaults to False.

registered_model_namestr, optional

MLFlow registered_model_name.

Defaults to None.

evaluate(data, pipeline=None, key=None, features=None, label=None, categorical_variable=None, text_variable=None, resampling_method=None, fold_num=None, random_state=None)¶

Evaluates a pipeline.

Parameters

dataDataFrame

Data for pipeline evaluation.

pipelinejson str or dict

Pipeline to be evaluated.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featuresa list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable.

Defaults to the name of the last non-ID column.

categorical_variablestr or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

text_variablestr or a list of str, optional

It indicates the text column.

Defaults to None.

resampling_methodstr, optional

The resampling method for pipeline model evaluation. The available options depend on the type of estimator in the pipeline:

regressor: {'cv'}
classifier: {'cv', 'stratified_cv'}
timeseries: {'rocv', 'block'}

Defaults to 'stratified_cv' if the estimator in the pipeline is a classifier, to 'cv' (the only option) if it is a regressor, and to 'rocv' if it is a time series model.

fold_numint, optional

The fold number for cross validation. If the value is 0, the function will automatically determine the fold number.

Defaults to 5.

random_stateint, optional

Specifies the seed for random number generator.

0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.

Returns

DataFrame: Scores.

get_best_pipeline()¶: Return the best pipeline.

get_best_pipelines()¶: Return the best pipeline.

get_config_dict()¶: Return the config_dict.

get_optimal_config_dict()¶: Return the optimal config_dict. Only available when connections is used.

get_optimal_connections()¶: Return the optimal connections. Only available when connections is used.

get_workload_classes(connection_context)¶

Return the available workload classes information.

Parameters

connection_contextConnectionContext: The connection object to a SAP HANA database.

make_future_dataframe(data=None, key=None, group_key=None, periods=1, increment_type='seconds')¶

Create a new dataframe for time series prediction.

Parameters

dataDataFrame, optional

The training data that contains the index.

Defaults to the data used in the fit().

keystr, optional

The index defined in the training data.

Defaults to the specified key in fit() or the value in data.index or the PAL's default key column position.

group_keystr, optional

Specify the group id column.

This parameter is only valid when massive is True.

Defaults to the specified group_key in fit() or the first column of the dataframe.

periodsint, optional

The number of rows created in the predict dataframe.

Defaults to 1.

increment_type{'seconds', 'days', 'months', 'years'}, optional

The increment type of the time series.

Defaults to 'seconds'.

Returns

DataFrame

persist_progress_log()¶: Persist the progress log.

pipeline_plot(name='my_pipeline', iframe_height=450)¶

Pipeline plot.

Parameters

namestr, optional

The name of the pipeline plot.

Defaults to 'my_pipeline'.

iframe_heightint, optional

The display height.

Defaults to 450.

set_progress_log_level(log_level)¶

Set progress log level to output scorings.

Parameters

log_level: {'full', 'full_best', 'specified'}: 'full' prints all scores for every pipeline. 'full_best' prints all scores only for the 'current_best' pipeline of each generation; other pipelines print only the scores specified by SCORINGS. 'specified' prints only the specified scores for all pipelines.

update_category_map(connection_context)¶

Updates the list of operators.

Parameters

connection_contextConnectionContext: The connection object to a SAP HANA database.

update_config_dict(operator_name, param_name=None, param_config=None)¶

Updates the config dict.

Parameters

operator_namestr

The name of the operator.

param_namestr, optional

The name of the parameter to be updated. If the parameter does not exist in the config dict, it is created.

Defaults to None.

param_configany, optional

The parameter config value.

Defaults to None.

fit(data, group_key, key=None, endog=None, exog=None, group_pipelines=None, categorical_variable=None, background_size=None, background_sampling_seed=None, model_table_name=None, use_explain=None, explain_method=None, lag=None, lag_features=None, group_params=None)¶

The fit function for MassiveAutomaticTimeSeries.

Parameters

dataDataFrame

The input time-series data for training.

group_keystr

Name of the group column.

keystr, optional

Specifies the column that represents the ordering of time-series data.

If data is indexed by a single column, then key defaults to that index column; otherwise, key must be specified (i.e., it is mandatory).

endogstr, optional

Specifies the endogenous variable for time-series data.

Defaults to the 1st non-group_key, non-key column of data.

exogstr, optional

Specifies the exogenous variables for time-series data.

Defaults to all non-group_key, non-key, non-endog columns in data.

group_pipelinesstr or dict, optional

Directly use the input pipeline to fit.

Defaults to None.

categorical_variablestr or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

model_table_namestr, optional

Specifies the HANA model table name instead of the generated temporary table.

Defaults to None.

use_explainbool, optional

Specifies whether to store information for pipeline explanation.

Defaults to False.

explain_methodstr, optional

Specifies the explanation method. Only valid when use_explain is True.

lagint, a list of int or dict, optional

The number of previous time stamps whose data is used to generate features for the current time stamp. Only valid when the operator is HGBT_TimeSeries or MLR_TimeSeries.

If lag is an integer or a list of integers, the contents of both operators 'HGBT_TimeSeries' and 'MLR_TimeSeries' will be updated.

If lag is a dictionary, each key is the name of an operator and each value can be an integer, a list of integers, or a dictionary specifying a range (start, step, stop). Examples: {"HGBT_TimeSeries" : 5}, {"HGBT_TimeSeries" : [5, 7, 9]}, or {"HGBT_TimeSeries" : {"range":[1,3,10]}}.

Defaults to minimum of 100 and (data size)/10.
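The accepted shapes of lag can be summarized in a short sketch. `normalize_lag` is a hypothetical helper, not a hana-ml function; it only mirrors the rules stated above (an integer or list of integers applies to both operators, a dict applies per operator) and leaves range dictionaries untouched.

```python
LAG_OPERATORS = ("HGBT_TimeSeries", "MLR_TimeSeries")


def normalize_lag(lag):
    """Hypothetical helper: expand lag into an explicit per-operator dict."""
    if isinstance(lag, dict):
        unknown = set(lag) - set(LAG_OPERATORS)
        if unknown:
            raise ValueError(f"lag is not applicable to: {sorted(unknown)}")
        return dict(lag)
    is_int_list = isinstance(lag, list) and all(isinstance(v, int) for v in lag)
    if isinstance(lag, int) or is_int_list:
        # A bare integer or list of integers is applied to both operators.
        return {operator: lag for operator in LAG_OPERATORS}
    raise TypeError("lag must be an int, a list of int, or a dict")


print(normalize_lag(5))
# {'HGBT_TimeSeries': 5, 'MLR_TimeSeries': 5}
print(normalize_lag({"HGBT_TimeSeries": {"range": [1, 3, 10]}}))
# {'HGBT_TimeSeries': {'range': [1, 3, 10]}}
```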

lag_featuresstr, a list of strings or dict, optional

The names of the features in the time series data that are used to generate new features. The name of the target column must not be included. Only valid when the operator is HGBT_TimeSeries or MLR_TimeSeries.

If lag_features is a string or a list of strings, the contents of both operators 'HGBT_TimeSeries' and 'MLR_TimeSeries' will be updated.

If lag_features is a dictionary, each key is the name of an operator and each value can be a string or a list of strings. Examples: {"MLR_TimeSeries" : "FEATURE_A"} or {"MLR_TimeSeries" : ["FEATURE_A", "FEATURE_B", "FEATURE_C"]}.

Defaults to None.

group_paramsdict, optional

Specifies group-specific parameters for fitting, keyed by group ID (for example, a separate config_dict for a given group).

Defaults to None.

Returns

A fitted object of class "MassiveAutomaticTimeSeries".

predict(data, group_key, key=None, exog=None, model=None, show_explainer=False, predict_args=None, group_params=None, output_prediction_interval=False, confidence_level=None)¶

Predict function for MassiveAutomaticTimeSeries.

Parameters

dataDataFrame

The input time-series data to be predicted.

group_keystr

Name of the group column.

keystr, optional

Specifies the column that represents the ordering of the input time-series data.

If data is indexed by a single column, then key defaults to that index column; otherwise, key must be specified (i.e., it is mandatory).

exogstr or a list of str, optional

Names of the exogenous variables in data.

Defaults to all non-group_key, non-key columns if not provided.

modelDataFrame, optional

The model to be used for prediction.

Defaults to the fitted model (i.e., self.model_).

show_explainerbool, optional

Reserved parameter for a future implementation of the SHAP Explainer.

Currently ineffective.

predict_argsdict, optional

Specifies estimator-specific parameters passed to the predict method.

If not None, it must be specified as a dict in the following format:

key for estimator name, and value for estimator-specific parameter settings in a dict. For example, {'RDT_Classifier': {'block_size': 5}, 'NB_Classifier': {'laplace': 1.0}}.

Defaults to None (i.e., no estimator-specific predict parameter provided).

group_paramsdict, optional

Specifies the group parameters for the prediction.

Defaults to None.

output_prediction_intervalbool, optional

Specifies whether to output the prediction interval.

Defaults to False.

confidence_levelfloat, optional

Specifies the confidence level for the prediction interval.

Defaults to None.

Returns

DataFrame: Predicted result.

score(data, group_key, key=None, endog=None, exog=None, model=None, predict_args=None, group_params=None)¶

Pipeline model score function.

Parameters

dataDataFrame

Data for pipeline model scoring.

group_keystr

Name of the group column.

keystr, optional

Specifies the column that represents the ordering of the input time-series data.

If data is indexed by a single column, then key defaults to that index column; otherwise, key must be specified (i.e., it is mandatory).

endogstr, optional

Specifies the endogenous variable for time-series data.

Defaults to the 1st non-group_key, non-key column of data.

exogstr, optional

Specifies the exogenous variables for time-series data.

Defaults to all non-group_key, non-key, non-endog columns in data.

modelDataFrame, optional

The pipeline model used to make predictions.

Defaults to the fitted pipeline model (i.e., self.model_).

predict_argsdict, optional

Specifies estimator-specific parameters passed to the predict phase of the score method.

If not None, it must be specified as a dict with one of the following formats:

key for estimator name, and value for estimator-specific parameter settings in a dict. For example, {'RDT_Classifier': {'block_size': 5}, 'NB_Classifier': {'laplace': 1.0}}.

Defaults to None (i.e., no estimator-specific predict parameter provided).

group_paramsdict, optional

Specifies the group parameters for the prediction.

Defaults to None.

Returns

DataFrames

DataFrame 1 : Prediction result for the input data.
DataFrame 2 : Statistics.

Inherited Methods from PALBase¶

In addition to the methods described above, the MassiveAutomaticTimeSeries class inherits methods from the PALBase class. Please refer to PAL Base for more details.