Pipeline

class hana_ml.algorithms.pal.pipeline.Pipeline(steps)

Pipeline construction to run transformers and estimators sequentially.

Parameters
steps : list

List of (name, transform) tuples that are chained. The last object should be an estimator.
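The chaining idea can be illustrated with a small conceptual sketch in plain Python. It does not use hana_ml or SAP HANA: the toy classes `AddOne`, `Double` and `MeanEstimator` and the helper `run_chain` are hypothetical names introduced only for illustration, and the sketch mirrors sequential chain execution rather than PAL's own pipeline fit.

```python
class AddOne:
    """Toy transformer (hypothetical): adds 1 to every value."""

    def fit_transform(self, data):
        return [x + 1 for x in data]


class Double:
    """Toy transformer (hypothetical): doubles every value."""

    def fit_transform(self, data):
        return [x * 2 for x in data]


class MeanEstimator:
    """Toy final estimator (hypothetical): stores the mean of its input."""

    def fit(self, data):
        self.mean_ = sum(data) / len(data)
        return self


def run_chain(steps, data):
    """Run every transformer in order, then fit the last object on the result."""
    *transformers, (_, estimator) = steps
    for _, transformer in transformers:
        data = transformer.fit_transform(data)
    return estimator.fit(data)


steps = [('add', AddOne()), ('double', Double()), ('mean', MeanEstimator())]
model = run_chain(steps, [1, 2, 3])
```

Each transformer receives the output of the previous one, and only the last object in `steps` is fitted as an estimator.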

Attributes
fit_hdbprocedure

Returns the generated hdbprocedure for fit.

predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Methods

abap_class_mapping(value)

Map the ABAP class.

add_amdp_item(template_key, value)

Add item.

add_amdp_name(amdp_name)

Add AMDP name.

add_amdp_template(template_name)

Add AMDP template.

build_amdp_class()

After adding items, generate the AMDP file from the template.

create_amdp_class(amdp_name, ...)

Create AMDP class file.

disable_mlflow_autologging()

It will disable mlflow autologging.

enable_mlflow_autologging([schema, meta, ...])

It will enable mlflow autologging.

evaluate(data[, key, features, label, ...])

This function is to evaluate a pipeline.

fit(data[, key, features, label, ...])

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

fit_predict(data[, apply_data, fit_params, ...])

Fit all the transforms one after the other and transform the data, then fit_predict the transformed data using the last estimator.

fit_transform(data[, fit_params])

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

generate_json_pipeline()

Generate the json formatted pipeline for auto-ml's pipeline_fit function.

get_amdp_notfillin_key()

Get the AMDP template keys that have not been filled in.

load_abap_class_mapping()

Load ABAP class mapping.

load_amdp_template(template_name)

Load AMDP template.

plot([name, iframe_height])

Plot pipeline.

predict(data[, key, features, model])

Predict function for AutoML.

write_amdp_file([filepath, version, outdir])

Write template to file.

enable_mlflow_autologging(schema=None, meta=None, is_exported=False, registered_model_name=None)

It will enable mlflow autologging. Only works for fit function.

Parameters
schema : str, optional

Define the model storage schema for mlflow autologging.

Defaults to the current schema.

meta : str, optional

Define the model storage meta table for mlflow autologging.

Defaults to 'HANAML_MLFLOW_MODEL_STORAGE'.

is_exported : bool, optional

Whether to export the HANA model to mlflow.

Defaults to False.

registered_model_name : str, optional

MLFlow registered_model_name.
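Because autologging only applies to fit, one possible usage pattern is to switch it on just for the duration of a fit call. The helper below is a minimal sketch: `fit_with_autologging` is a hypothetical name, not part of hana_ml, and it relies only on the method signatures documented on this page. It assumes `pipeline` is a Pipeline instance and `data` a SAP HANA DataFrame.

```python
def fit_with_autologging(pipeline, data, schema=None, **fit_kwargs):
    """Enable mlflow autologging, fit the pipeline, then disable autologging.

    Hypothetical helper: `pipeline` is expected to expose
    enable_mlflow_autologging, fit and disable_mlflow_autologging
    as documented for Pipeline.
    """
    pipeline.enable_mlflow_autologging(schema=schema)
    try:
        return pipeline.fit(data=data, **fit_kwargs)
    finally:
        # Runs even if fit raises, so autologging is not left enabled.
        pipeline.disable_mlflow_autologging()
```

A call such as `fit_with_autologging(my_pipeline, train_data, key='ID')` would then log that single fit.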

disable_mlflow_autologging()

It will disable mlflow autologging.

fit_transform(data, fit_params=None)

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

Parameters
data : DataFrame

SAP HANA DataFrame to be transformed in the pipeline.

fit_params : dict

The parameters corresponding to the transformers/estimator name where each parameter name is prefixed such that parameter p for step s has key s__p.

Returns
DataFrame

Transformed SAP HANA DataFrame.
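The `s__p` key convention used by fit_params can be illustrated by grouping such a dictionary by step name. This is a sketch of the naming scheme only: `split_step_params` is a hypothetical helper and is not claimed to be how hana_ml parses the dictionary internally.

```python
from collections import defaultdict


def split_step_params(fit_params):
    """Group {'step__param': value} entries into {'step': {'param': value}}."""
    per_step = defaultdict(dict)
    for key, value in (fit_params or {}).items():
        # Split on the first double underscore: 'pca__key' -> ('pca', 'key').
        step, sep, param = key.partition('__')
        if not sep or not step or not param:
            raise ValueError(f"expected a key of the form 'step__param', got {key!r}")
        per_step[step][param] = value
    return dict(per_step)


grouped = split_step_params({'pca__key': 'ID', 'pca__label': 'CLASS'})
# grouped == {'pca': {'key': 'ID', 'label': 'CLASS'}}
```

Splitting on the first `__` keeps parameter names that themselves contain underscores, such as `categorical_variable`, intact.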

Examples

>>> from hana_ml.algorithms.pal.pipeline import Pipeline
>>> from hana_ml.algorithms.pal.decomposition import PCA
>>> from hana_ml.algorithms.pal.preprocessing import Imputer
>>> my_pipeline = Pipeline([
        ('pca', PCA(scaling=True, scores=True)),
        ('imputer', Imputer(strategy='mean'))
        ])
>>> fit_params = {'pca__key': 'ID', 'pca__label': 'CLASS'}
>>> my_pipeline.fit_transform(data=train_data, fit_params=fit_params)

fit(data, key=None, features=None, label=None, fit_params=None, categorical_variable=None, generate_json_pipeline=False, use_pal_pipeline_fit=True, endog=None, exog=None, model_table_name=None)

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

Parameters
data : DataFrame

SAP HANA DataFrame to be transformed in the pipeline.

key : str, optional

Name of the ID column.

If key is not provided, then:

  • if data is indexed by a single column, then key defaults to that index column;

  • otherwise, it is assumed that data contains no ID column.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the name of the last non-ID column.

fit_params : dict, optional

Parameters corresponding to the transformers/estimator name where each parameter name is prefixed such that parameter p for step s has key s__p.

categorical_variable : str or list of str, optional

Specify INTEGER column(s) that should be treated as categorical data. Other INTEGER columns will be treated as continuous.

generate_json_pipeline : bool, optional

Whether to generate the JSON-formatted pipeline.

Defaults to False.

use_pal_pipeline_fit : bool, optional

Use PAL's pipeline fit function instead of the original chain execution.

Defaults to True.

endog : str, optional

Specifies the endogenous variable in time-series data. Use endog instead of label if data is time-series data.

Defaults to the name of the first non-key column in data.

exog : list of str or str, optional

Specifies the exogenous variables in time-series data. Please use exog instead of features if data is time-series data.

Defaults to

  • the list of names of all non-key, non-endog columns in data if final estimator is not ExponentialSmoothing based

  • [] otherwise.
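The endog and exog defaults can be summarized in a short plain-Python sketch. `default_endog_exog` and its `exponential_smoothing` flag are hypothetical, introduced only to restate the rules above over a list of column names. They are not hana_ml functions or parameters.

```python
def default_endog_exog(columns, key, endog=None, exog=None, exponential_smoothing=False):
    """Resolve endog/exog for time-series data following the documented defaults."""
    non_key = [c for c in columns if c != key]
    if endog is None:
        # First non-key column.
        endog = non_key[0]
    if exog is None:
        # ExponentialSmoothing-based estimators get no exogenous variables.
        exog = [] if exponential_smoothing else [c for c in non_key if c != endog]
    elif isinstance(exog, str):
        exog = [exog]
    return endog, exog


endog, exog = default_endog_exog(['ID', 'Y', 'X1', 'X2'], key='ID')
# endog == 'Y', exog == ['X1', 'X2']
```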

model_table_name : str, optional

Specifies the HANA model table name instead of the generated temporary table.

Defaults to None.

Returns
DataFrame

Transformed SAP HANA DataFrame.
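The defaulting rules for key, features and label described in the parameters of fit can be restated as a small plain-Python sketch. `resolve_columns` is a hypothetical helper that works on a list of column names. It is not part of hana_ml, and the `index` argument stands in for the index of the DataFrame (a single column name, a list of names, or None).

```python
def resolve_columns(columns, index=None, key=None, features=None, label=None):
    """Apply the documented defaults for key, features and label."""
    if key is None and isinstance(index, str):
        # Data indexed by a single column: use it as the ID column.
        key = index
    non_id = [c for c in columns if c != key]
    if label is None:
        # Last non-ID column.
        label = non_id[-1]
    if features is None:
        # All non-ID, non-label columns.
        features = [c for c in non_id if c != label]
    return key, features, label


key, features, label = resolve_columns(['ID', 'X1', 'X2', 'CLASS'], index='ID')
# key == 'ID', features == ['X1', 'X2'], label == 'CLASS'
```

When there is no single index column and key is not given, the sketch treats the data as having no ID column, so every column except the label becomes a feature.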

Examples

>>> my_pipeline = Pipeline([
        ('pca', PCA(scaling=True, scores=True)),
        ('imputer', Imputer(strategy='mean')),
        ('hgbt', HybridGradientBoostingClassifier(
            n_estimators=4, split_threshold=0, learning_rate=0.5, fold_num=5,
            max_depth=6, cross_validation_range=cv_range))
        ])
>>> fit_params = {'pca__key': 'ID',
                  'pca__label': 'CLASS',
                  'hgbt__key': 'ID',
                  'hgbt__label': 'CLASS',
                  'hgbt__categorical_variable': 'CLASS'}
>>> hgbt_model = my_pipeline.fit(data=train_data, fit_params=fit_params)

predict(data, key=None, features=None, model=None)

Predict function for AutoML.

Parameters
data : DataFrame

Data to be predicted.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or is indexed by multiple columns.

Defaults to the index of data if data is indexed by a single column.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

model : DataFrame, optional

The model to be used for prediction.

Defaults to the fitted model (model_).

Returns
DataFrame

Predicted result, structured as follows:

  • 1st column: Data type and name same as the 1st column of data.

  • 2nd column: SCORE, predicted values (for regression) or class labels (for classification).

fit_predict(data, apply_data=None, fit_params=None, predict_params=None)

Fit all the transforms one after the other and transform the data, then fit_predict the transformed data using the last estimator.

Parameters
data : DataFrame

SAP HANA DataFrame to be transformed in the pipeline.

apply_data : DataFrame

SAP HANA DataFrame to be predicted in the pipeline.

fit_params : dict, optional

Parameters corresponding to the transformers/estimator name where each parameter name is prefixed such that parameter p for step s has key s__p.

predict_params : dict, optional

Parameters corresponding to the predictor name where each parameter name is prefixed such that parameter p for step s has key s__p.

Returns
DataFrame

Transformed SAP HANA DataFrame.

Examples

>>> my_pipeline = Pipeline([
        ('pca', PCA(scaling=True, scores=True)),
        ('imputer', Imputer(strategy='mean')),
        ('hgbt', HybridGradientBoostingClassifier(
            n_estimators=4, split_threshold=0, learning_rate=0.5, fold_num=5,
            max_depth=6, cross_validation_range=cv_range))
        ])
>>> fit_params = {'pca__key': 'ID',
                  'pca__label': 'CLASS',
                  'hgbt__key': 'ID',
                  'hgbt__label': 'CLASS',
                  'hgbt__categorical_variable': 'CLASS'}
>>> result = my_pipeline.fit_predict(data=train_data, apply_data=test_data, fit_params=fit_params)

plot(name='my_pipeline', iframe_height=450)

Plot pipeline.

generate_json_pipeline()

Generate the json formatted pipeline for auto-ml's pipeline_fit function.

create_amdp_class(amdp_name, training_dataset, apply_dataset)

Create the AMDP class file. build_amdp_class can then be called to generate the AMDP class.

Parameters
training_dataset : str

Name of training dataset.

apply_dataset : str

Name of apply dataset.

evaluate(data, key=None, features=None, label=None, categorical_variable=None, resampling_method=None, fold_num=None, random_state=None)

This function is to evaluate a pipeline.

Parameters
data : DataFrame

Data for pipeline evaluation.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the name of the last non-ID column.

categorical_variable : str or list of str, optional

Specify INTEGER column(s) that should be treated as categorical data. Other INTEGER columns will be treated as continuous.

Defaults to None.

resampling_method : character, optional

The resampling method for pipeline model evaluation. The available options depend on the type of the final estimator:

  • classifier: {'cv', 'stratified_cv'}

  • regressor: {'cv'}

  • timeseries: {'rocv', 'block'}

Defaults to 'stratified_cv' if the estimator in the pipeline is a classifier, to 'cv' (the only option) if it is a regressor, and to 'rocv' if it is a time-series estimator.

fold_num : int, optional

The fold number for cross validation.

Defaults to 5.

random_state : int, optional

Specifies the seed for random number generator.

  • 0: Uses the current time (in seconds) as the seed.

  • Others: Uses the specified value as the seed.

Returns
DataFrame

DataFrame of scores:

  • Score Name.

  • Score Value.
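The resampling_method defaults stated for evaluate can be written out as a lookup. This is a plain-Python sketch under the assumption that 'stratified_cv' applies to classifiers only and that regressors accept only 'cv', as the default rule states. `RESAMPLING_OPTIONS` and `resolve_resampling` are hypothetical names, not hana_ml objects.

```python
# The first entry of each tuple is the documented default for that estimator type.
RESAMPLING_OPTIONS = {
    'classifier': ('stratified_cv', 'cv'),
    'regressor': ('cv',),
    'timeseries': ('rocv', 'block'),
}


def resolve_resampling(estimator_kind, resampling_method=None):
    """Return the resampling method to use, validating an explicit choice."""
    options = RESAMPLING_OPTIONS[estimator_kind]
    if resampling_method is None:
        return options[0]
    if resampling_method not in options:
        raise ValueError(
            f"{resampling_method!r} is not valid for a {estimator_kind}; "
            f"choose one of {options}"
        )
    return resampling_method


default_for_classifier = resolve_resampling('classifier')
```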

abap_class_mapping(value)

Map the ABAP class.

add_amdp_item(template_key, value)

Add item.

add_amdp_name(amdp_name)

Add AMDP name.

add_amdp_template(template_name)

Add AMDP template.

build_amdp_class()

After adding items, generate the AMDP file from the template.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

get_amdp_notfillin_key()

Get the AMDP template keys that have not been filled in.

load_abap_class_mapping()

Load ABAP class mapping.

load_amdp_template(template_name)

Load AMDP template.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

write_amdp_file(filepath=None, version=1, outdir='out')

Write template to file.
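The AMDP methods appear intended to be called in sequence: create_amdp_class, then build_amdp_class, then write_amdp_file. The helper below sketches that order using only the signatures documented on this page. `export_amdp` is a hypothetical name, and the assumption that the three calls are sufficient on their own has not been verified against a live system.

```python
def export_amdp(pipeline, amdp_name, training_dataset, apply_dataset, outdir='out'):
    """Create, build and write an AMDP class for a pipeline (assumed call order)."""
    # Step 1: create the AMDP class file from the dataset names.
    pipeline.create_amdp_class(amdp_name, training_dataset, apply_dataset)
    # Step 2: generate the AMDP class from the template.
    pipeline.build_amdp_class()
    # Step 3: write the result to the output directory.
    pipeline.write_amdp_file(outdir=outdir)
```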

Inherited Methods from PALBase

In addition to the methods described above, the Pipeline class inherits methods from the PALBase class. Refer to PAL Base for more details.