Pipeline

class hana_ml.algorithms.pal.pipeline.Pipeline(steps=None, pipeline=None)

Pipeline construction to run transformers and estimators sequentially.

Parameters:

stepslist: List of (name, transform) tuples that are chained. The last object should be an estimator.
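As a rough illustration of what chaining means, the sketch below runs toy steps in plain Python. `MiniPipeline`, `Scale`, and `MeanModel` are hypothetical stand-ins and are not part of hana_ml, whose Pipeline executes its steps in SAP HANA.

```python
class Scale:
    """Toy transformer: multiplies every value by a constant factor."""
    def __init__(self, factor):
        self.factor = factor

    def fit_transform(self, data):
        return [x * self.factor for x in data]


class MeanModel:
    """Toy estimator: remembers the mean of the data it was fitted on."""
    def fit(self, data):
        self.mean_ = sum(data) / len(data)
        return self


class MiniPipeline:
    """Chains (name, step) tuples; the last step is treated as the estimator."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, data):
        # Every step except the last transforms the data for the next one.
        for _name, transformer in self.steps[:-1]:
            data = transformer.fit_transform(data)
        # The final step is fitted on the fully transformed data.
        _name, estimator = self.steps[-1]
        return estimator.fit(data)


toy = MiniPipeline([('scale', Scale(10)), ('mean', MeanModel())])
model = toy.fit([1, 2, 3])
print(model.mean_)  # 20.0
```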

Methods

`abap_class_mapping`(value)	Mapping the abap class.
`add_amdp_item`(template_key, value)	Add item.
`add_amdp_name`(amdp_name)	Add AMDP name.
`add_amdp_template`(template_name)	Add AMDP template.
`build_amdp_class`()	Generate the AMDP file from the template after items have been added.
`create_amdp_class`(amdp_name, ...)	Create AMDP class file.
`disable_mlflow_autologging`()	Disables mlflow autologging.
`enable_mlflow_autologging`([schema, meta, ...])	Enables mlflow autologging.
`evaluate`(data[, key, features, label, ...])	Evaluation function for a pipeline.
`fit`(data[, key, features, label, ...])	Fit function for a pipeline.
`fit_predict`(data[, apply_data, fit_params, ...])	Fit all the transformers one after another and transform the data, then fit_predict the transformed data using the final estimator.
`fit_transform`(data[, fit_params])	Fit all the transforms one after the other and transform the data.
`generate_json_pipeline`([pivot])	Generate the json formatted pipeline for pipeline fit function.
`get_amdp_notfillin_key`()	Get the AMDP template keys that have not been filled in.
`get_model_metrics`()	Get the model metrics.
`get_score_metrics`()	Get the score metrics.
`load_abap_class_mapping`()	Load ABAP class mapping.
`load_amdp_template`(template_name)	Load AMDP template.
`plot`([name, iframe_height])	Plot a pipeline.
`predict`(data[, key, features, model, exog, ...])	Predict function for a pipeline.
`score`(data[, key, features, label, model, ...])	Score function for a fitted pipeline model.
`transform`([data, key, features, label])	Transform function for a pipeline.
`write_amdp_file`([filepath, version, outdir])	Write template to file.

enable_mlflow_autologging(schema=None, meta=None, is_exported=False, registered_model_name=None)

Enables mlflow autologging. Only works for the fit function.

Parameters:

schemastr, optional

Defines the model storage schema for mlflow autologging.

Defaults to the current schema.

metastr, optional

Defines the model storage meta table for mlflow autologging.

Defaults to 'HANAML_MLFLOW_MODEL_STORAGE'.

is_exportedbool, optional

Determines whether to export the HANA model to mlflow.

Defaults to False.

registered_model_namestr, optional

mlflow registered_model_name.

Defaults to None.

disable_mlflow_autologging(): Disables mlflow autologging.
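Because autologging only applies to fit, one possible pattern is to switch it on for a single fit call and switch it off again afterwards. The helper below is a hypothetical sketch, not part of hana_ml. It uses only the methods and parameters documented here and assumes mlflow is installed and configured in the calling environment.

```python
def fit_with_autologging(pipeline, data, registered_model_name=None, **fit_kwargs):
    """Enable mlflow autologging around a single fit call, then turn it off again."""
    pipeline.enable_mlflow_autologging(is_exported=True,
                                       registered_model_name=registered_model_name)
    try:
        return pipeline.fit(data=data, **fit_kwargs)
    finally:
        # Always restore the default state, even if fit raises.
        pipeline.disable_mlflow_autologging()
```

A call might look like `fit_with_autologging(my_pipeline, train_data, key='ID', label='CLASS')`, where `my_pipeline` and `train_data` are a Pipeline object and a SAP HANA DataFrame created elsewhere.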

fit_transform(data, fit_params=None)

Fit all the transforms one after the other and transform the data.

Parameters:

dataDataFrame

SAP HANA DataFrame to be transformed in the pipeline.

fit_paramsdict, optional

The parameters corresponding to the transformer's name where each parameter name is prefixed such that parameter p for step s has key s__p.

Defaults to None.

Returns:

DataFrame: The transformed SAP HANA DataFrame.

Examples

>>> my_pipeline = Pipeline([
        ('PCA', PCA(scaling=True, scores=True)),
        ('imputer', Imputer(strategy='mean'))
        ])
>>> fit_params = {'PCA__key': 'ID', 'PCA__label': 'CLASS'}
>>> my_pipeline.fit_transform(data=train_df, fit_params=fit_params)
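To make the `s__p` naming convention concrete, the following sketch groups a flat fit_params dictionary by step name. `group_fit_params` is a hypothetical helper written for illustration; it is not how hana_ml parses the dictionary internally.

```python
def group_fit_params(fit_params):
    """Group {'step__param': value} entries into {'step': {'param': value}}."""
    grouped = {}
    for key, value in (fit_params or {}).items():
        # Split on the first double underscore only, so parameter names
        # that themselves contain '__' stay intact.
        step, sep, param = key.partition('__')
        if not sep or not step or not param:
            raise ValueError(f"expected '<step>__<param>', got {key!r}")
        grouped.setdefault(step, {})[param] = value
    return grouped


print(group_fit_params({'PCA__key': 'ID', 'PCA__label': 'CLASS'}))
# {'PCA': {'key': 'ID', 'label': 'CLASS'}}
```

The step part of each key must match the name given in the corresponding (name, transform) tuple, including case.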

fit(data, key=None, features=None, label=None, fit_params=None, categorical_variable=None, generate_json_pipeline=False, use_pal_pipeline_fit=True, endog=None, exog=None, model_table_name=None, use_explain=None, explain_method=None, background_size=None, background_sampling_seed=None)

Fit function for a pipeline.

Parameters:

dataDataFrame

SAP HANA DataFrame.

keystr, optional

Name of the ID column.

If key is not provided, then:

if data is indexed by a single column, then key defaults to that index column;

otherwise, it is assumed that data contains no ID column.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable.

Defaults to the name of the last non-ID column.

fit_paramsdict, optional

Parameters corresponding to the transformers/estimator name where each parameter name is prefixed such that parameter p for step s has key s__p.

Defaults to None.

categorical_variablestr or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

generate_json_pipelinebool, optional

Whether to generate the JSON-formatted pipeline.

Defaults to False.

use_pal_pipeline_fitbool, optional

Use PAL's pipeline fit function instead of the original chain execution.

Defaults to True.

endogstr, optional

Specifies the endogenous variable in time-series data. Please use endog instead of label if data is time-series data.

Defaults to the name of 1st non-key column in data.

exogstr or a list of str, optional

Specifies the exogenous variables in time-series data. Please use exog instead of features if data is time-series data.

Defaults to the list of names of all non-key, non-endog columns in data if the final estimator is not ExponentialSmoothing based, and to [] otherwise.

model_table_namestr, optional

Specifies the HANA model table name instead of the generated temporary table.

Defaults to None.

use_explainbool, optional

Specifies whether to store information for pipeline explanation. Please note that this option is applicable only when the estimator in the pipeline is either a Classifier or a Regressor.

Defaults to False.

explain_methodstr, optional

Specifies the explanation method. Only valid when use_explain is set to True.

Options are:

'kernelshap' : Makes explanations using Kernel SHAP. For this option to work, the background_size parameter must be greater than 0.
'globalsurrogate' : Makes explanations using the Global Surrogate method.

Defaults to 'globalsurrogate'.

background_sizeint, optional

The number of background data rows used in Kernel SHAP. Only valid when explain_method is 'kernelshap'. It should not be larger than the number of rows in the training data.

Defaults to None.

background_sampling_seedint, optional

Specifies the seed for random number generator in the background sampling. Only valid when explain_method is 'kernelshap'.

0: Uses the current time (in seconds) as seed
Others: Uses the specified value as seed

Defaults to 0.
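The constraints between use_explain, explain_method, and background_size described above can be checked on the client before calling fit. The function below is a hypothetical sketch of such a check and is not part of hana_ml; `n_train_rows` is an assumed argument holding the row count of the training data.

```python
def check_explain_args(n_train_rows, use_explain=False,
                       explain_method='globalsurrogate', background_size=None):
    """Return the explanation settings as a dict, or raise ValueError if inconsistent."""
    if not use_explain:
        # explain_method is only valid when use_explain is True.
        return {'use_explain': False}
    if explain_method not in ('kernelshap', 'globalsurrogate'):
        raise ValueError(f"unknown explain_method: {explain_method!r}")
    settings = {'use_explain': True, 'explain_method': explain_method}
    if explain_method == 'kernelshap':
        # Kernel SHAP needs a positive background sample no larger than the training set.
        if background_size is None or not 0 < background_size <= n_train_rows:
            raise ValueError("background_size must be in 1..n_train_rows for 'kernelshap'")
        settings['background_size'] = background_size
    return settings


print(check_explain_args(1000, use_explain=True,
                         explain_method='kernelshap', background_size=100))
# {'use_explain': True, 'explain_method': 'kernelshap', 'background_size': 100}
```

Since the keys of the returned dictionary match the parameter names of fit, it could be passed on as `my_pipeline.fit(data=train_data, **settings)`.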

Examples

>>> my_pipeline = Pipeline([
    ('pca', PCA(scaling=True, scores=True)),
    ('imputer', Imputer(strategy='mean')),
    ('hgbt', HybridGradientBoostingClassifier(
    n_estimators=4, split_threshold=0, learning_rate=0.5, fold_num=5,
    max_depth=6, cross_validation_range=cv_range))
    ])
>>> fit_params = {'pca__key': 'ID',
                  'pca__label': 'CLASS',
                  'hgbt__key': 'ID',
                  'hgbt__label': 'CLASS',
                  'hgbt__categorical_variable': 'CLASS'}
>>> hgbt_model = my_pipeline.fit(data=train_data, fit_params=fit_params)
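The default rules for label, features, endog, and exog described in the parameter list can be spelled out in plain Python. Both functions below are hypothetical illustrations that operate on a list of column names. They are not part of hana_ml, and `exp_smoothing` is an assumed flag standing for "the final estimator is ExponentialSmoothing based".

```python
def resolve_columns(columns, key=None, features=None, label=None):
    """Apply the documented defaults for label and features to a list of column names."""
    non_id = [c for c in columns if c != key]
    if label is None:
        # label defaults to the last non-ID column.
        label = non_id[-1]
    if features is None:
        # features default to every non-ID, non-label column.
        features = [c for c in non_id if c != label]
    return features, label


def resolve_endog_exog(columns, key, endog=None, exog=None, exp_smoothing=False):
    """Apply the documented time-series defaults for endog and exog."""
    non_key = [c for c in columns if c != key]
    if endog is None:
        # endog defaults to the first non-key column.
        endog = non_key[0]
    if exog is None:
        # ExponentialSmoothing-based estimators get an empty exog list by default.
        exog = [] if exp_smoothing else [c for c in non_key if c != endog]
    elif isinstance(exog, str):
        exog = [exog]
    return endog, exog


cols = ['ID', 'X1', 'X2', 'CLASS']
print(resolve_columns(cols, key='ID'))      # (['X1', 'X2'], 'CLASS')
print(resolve_endog_exog(cols, key='ID'))   # ('X1', ['X2', 'CLASS'])
```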

predict(data, key=None, features=None, model=None, exog=None, predict_args=None, show_explainer=False, top_k_attributions=None, random_state=None, sample_size=None, verbose_output=None)

Predict function for a pipeline.

Parameters:

dataDataFrame

SAP HANA DataFrame.

keystr, optional

Name of the ID column. Mandatory if data is not indexed, or is indexed by multiple columns.

Defaults to the index of data if data is indexed by a single column.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

modelDataFrame, optional

The model to be used for prediction.

Defaults to the fitted model (model_).

exoglist of str, optional

Names of exogenous variables.

Please use exog instead of features when the final estimator of the Pipeline object is for TimeSeries.

Defaults to all non-key columns if not provided.

predict_argsdict, optional

Specifies the parameters for the predict method of the estimator of the target pipeline, with keys being parameter names and values being parameter values.

For example, suppose the input pipeline is [('PCA', PCA()), ('RDT', RDTClassifier(algorithms='cart'))], and the estimator RDTClassifier can take the following parameters when making predictions: block_size, missing_replacement. Then, we can specify the values of these two parameters as follows:

predict_args = {'block_size':5, 'missing_replacement':'instance_marginalized'}

Defaults to None.

show_explainerbool, optional

If True, the reason code of the pipeline will be returned. Please note that this option is applicable only when the estimator in the pipeline is either a Classifier or a Regressor.

Defaults to False.

top_k_attributionsint, optional

Displays the top k attributions in reason code. Only valid when show_explainer is set to True.

Effective only when model contains background data from the training phase.

Defaults to PAL's default value.

random_stateint, optional

Specifies the random seed. Only valid when show_explainer is set to True.

Defaults to 0 (system time).

sample_sizeint, optional

Specifies the number of sampled combinations of features. Only valid when show_explainer is set to True.

It is better to use a number that is greater than the number of features in data.

If set as 0, it is determined by algorithm heuristically.

Defaults to 0.

verbose_outputbool, optional

True: Outputs the probability of all label categories.
False: Outputs the category of the highest probability only.

Only valid when show_explainer is set to True.

Defaults to True.

Returns:

DataFrame

Predicted result, structured as follows:

1st column: Data type and name same as the 1st column of data.
2nd column: SCORE, predicted values (for regression) or class labels (for classification).
3rd column: CONFIDENCE, confidence of a class (available only if show_explainer is True).
4th column: REASON CODE, attributions of features (available only if show_explainer is True).
5th & 6th columns: placeholder columns for future implementations (available only if show_explainer is True).

Attributes:

predict_info_DataFrame

Structured as follows:

1st column: STAT_NAME.
2nd column: STAT_VALUE.
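As a sketch of how predict_args is passed, the hypothetical wrapper below forwards the two RDTClassifier options from the example above to predict. It assumes a fitted pipeline whose final estimator accepts these options and a data set whose ID column is named 'ID'.

```python
def predict_with_rdt_args(pipeline, data, key='ID'):
    """Call predict on a fitted pipeline, forwarding options to the final estimator."""
    # Unlike fit_params, these keys are plain parameter names with no 'step__' prefix.
    predict_args = {'block_size': 5,
                    'missing_replacement': 'instance_marginalized'}
    return pipeline.predict(data=data, key=key, predict_args=predict_args)
```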

score(data, key=None, features=None, label=None, model=None, random_state=None, top_k_attributions=None, sample_size=None, verbose_output=None, predict_args=None, endog=None, exog=None)

Score function for a fitted pipeline model.

Parameters:

dataDataFrame

SAP HANA DataFrame.

keystr, optional

Name of the ID column.

If key is not provided, then:

if data is indexed by a single column, then key defaults to that index column;

otherwise, it is assumed that data contains no ID column.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable.

Defaults to the name of the last non-ID column.

modelstr, optional

The trained model.

Defaults to self.model_.

random_stateint, optional

Specifies the seed for random number generator.

0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.

Valid only when model table has background data information.

Defaults to 0.

top_k_attributionsint, optional

Outputs the attributions of top k features which contribute the most. Valid only when model table has background data information.

Defaults to 10.

sample_sizeint, optional

Specifies the number of sampled combinations of features.

It is better to use a number that is greater than the number of features in data.

If set as 0, it is determined by algorithm heuristically.

Defaults to 0.

verbose_outputbool, optional

Specifies whether to output all classes and the corresponding confidences for each data row.

True: Outputs the probability of all label categories.
False: Outputs the category of the highest probability only.

Valid only for classification.

Defaults to False.

predict_argsdict, optional

Specifies the parameters for the predict method of the estimator of the target pipeline, with keys being parameter names and values being parameter values.

For example, suppose the input pipeline is [('PCA', PCA()), ('RDT', RDTClassifier(algorithms='cart'))], and the estimator RDTClassifier can take the following parameters when making predictions: block_size, missing_replacement. Then, we can specify the values of these two parameters as follows:

predict_args = {'block_size':5, 'missing_replacement':'instance_marginalized'}

Defaults to None.

endogstr, optional

Specifies the endogenous variable in time-series data.

Please use endog instead of label if data is time-series.

Defaults to the name of 1st non-key column in data.

exogstr or a list of str, optional

Specifies the exogenous variables in time-series data.

Please use exog instead of features if data is time-series.

Defaults to the list of names of all non-key, non-endog columns in data if the final estimator is not ExponentialSmoothing based, and to [] otherwise.

Returns:

DataFrame 1

Prediction result, structured as follows:

1st column, ID of input data.
2nd column, SCORE, class assignment.
3rd column, REASON CODE, attribution of features.
4th & 5th column, placeholder columns for future implementations.

DataFrame 2

Statistics, structured as follows:

1st column, STAT_NAME
2nd column, STAT_VALUE
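Once the statistics table has been fetched to the client, its two-column layout maps naturally onto a dictionary. The helper below is a hypothetical sketch that takes plain (STAT_NAME, STAT_VALUE) pairs. It assumes the values may arrive as strings and converts them to floats where possible.

```python
def stats_to_dict(rows):
    """Convert (STAT_NAME, STAT_VALUE) rows into a dict, parsing numeric values."""
    stats = {}
    for name, value in rows:
        try:
            stats[name] = float(value)
        except (TypeError, ValueError):
            # Keep non-numeric values (for example free-text entries) unchanged.
            stats[name] = value
    return stats


# Made-up rows for illustration only; real statistic names depend on the estimator.
rows = [('ACCURACY', '0.95'), ('NOTE', 'n/a')]
print(stats_to_dict(rows))
# {'ACCURACY': 0.95, 'NOTE': 'n/a'}
```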

fit_predict(data, apply_data=None, fit_params=None, predict_params=None)

Fit all the transformers one after another and transform the data, then fit_predict the transformed data using the final estimator.

Parameters:

dataDataFrame

SAP HANA DataFrame to be transformed in the pipeline.

apply_dataDataFrame, optional

SAP HANA DataFrame to be predicted in the pipeline.

Defaults to None.

fit_paramsdict, optional

Parameters corresponding to the transformers/estimator name where each parameter name is prefixed such that parameter p for step s has key s__p.

Defaults to None.

predict_paramsdict, optional

Parameters corresponding to the predictor name where each parameter name is prefixed such that parameter p for step s has key s__p.

Defaults to None.

Returns:

DataFrame: A SAP HANA DataFrame.

Examples

>>> my_pipeline = Pipeline([
    ('pca', PCA(scaling=True, scores=True)),
    ('imputer', Imputer(strategy='mean')),
    ('hgbt', HybridGradientBoostingClassifier(
    n_estimators=4, split_threshold=0, learning_rate=0.5, fold_num=5,
    max_depth=6, cross_validation_range=cv_range))
    ])
>>> fit_params = {'pca__key': 'ID',
                  'pca__label': 'CLASS',
                  'hgbt__key': 'ID',
                  'hgbt__label': 'CLASS',
                  'hgbt__categorical_variable': 'CLASS'}
>>> result = my_pipeline.fit_predict(data=train_data, apply_data=test_data, fit_params=fit_params)

abap_class_mapping(value): Map the ABAP class.

add_amdp_item(template_key, value): Add item.

add_amdp_name(amdp_name): Add AMDP name.

add_amdp_template(template_name): Add AMDP template.

build_amdp_class(): Generate the AMDP file from the template after items have been added.

get_amdp_notfillin_key(): Get the AMDP template keys that have not been filled in.

get_model_metrics()

Get the model metrics.

Returns:

DataFrame: The model metrics.

get_score_metrics()

Get the score metrics.

Returns:

DataFrame: The score metrics.

load_abap_class_mapping(): Load ABAP class mapping.

load_amdp_template(template_name): Load AMDP template.

plot(name='my_pipeline', iframe_height=450)

Plot a pipeline.

Parameters:

namestr, optional

Pipeline Name.

Defaults to "my_pipeline".

iframe_heightint, optional

Height of iframe.

Defaults to 450.

write_amdp_file(filepath=None, version=1, outdir='out'): Write template to file.

generate_json_pipeline(pivot=False): Generate the json formatted pipeline for pipeline fit function.

create_amdp_class(amdp_name, training_dataset, apply_dataset)

Create the AMDP class file. build_amdp_class can then be called to generate the AMDP class.

Parameters:

amdp_namestr: Name of amdp.
training_datasetstr: Name of training dataset.
apply_datasetstr: Name of apply dataset.
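The AMDP methods are meant to be called in sequence. The hypothetical helper below sketches one plausible order using only the methods listed on this page. It assumes `model` is a fitted object that supports AMDP generation and that `filepath` is a writable location.

```python
def generate_amdp_file(model, amdp_name, training_dataset, apply_dataset, filepath):
    """Run the three AMDP generation steps in order and return the output path."""
    # 1. Register the class name and the two dataset names.
    model.create_amdp_class(amdp_name, training_dataset, apply_dataset)
    # 2. Fill the template to produce the class source.
    model.build_amdp_class()
    # 3. Write the generated source to disk.
    model.write_amdp_file(filepath=filepath)
    return filepath
```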

evaluate(data, key=None, features=None, label=None, categorical_variable=None, resampling_method=None, fold_num=None, random_state=None, endog=None, exog=None, gap_num=None, percentage=None)

Evaluation function for a pipeline.

Parameters:

dataDataFrame

SAP HANA DataFrame.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featuresa list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable.

Defaults to the name of the last non-ID column.

categorical_variablestr or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

resampling_methodstr, optional

The resampling method for pipeline model evaluation. The available options depend on the type of the final estimator:

classifier: {'cv', 'stratified_cv'}
regressor: {'cv'}
timeseries: {'rocv', 'block', 'simple_split'}

Defaults to 'stratified_cv' if the estimator in the pipeline is a classifier, to 'cv' (the only valid option) if it is a regressor, and to 'rocv' if it is a time-series estimator.

fold_numint, optional

The fold number for cross validation.

Defaults to 5.

random_stateint, optional

Specifies the seed for random number generator.

0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.

Defaults to 0.

endogstr, optional

Specifies the endogenous variable in time-series data. Please use endog instead of label if data is time-series.

Defaults to the name of 1st non-key column in data.

exogstr or a list of str, optional

Specifies the exogenous variables in time-series data. Please use exog instead of features if data is time-series.

Defaults to the list of names of all non-key, non-endog columns in data if the final estimator is not ExponentialSmoothing based, and to [] otherwise.

gap_numint, optional

Number of samples to exclude from the end of each train set before the test set. Valid only if the final estimator of the target pipeline is for time-series.

Defaults to 0.

percentagefloat, optional

Percentage between training data and test data. Only applicable when the final estimator of the target pipeline is for time-series, and resampling_method is set as 'block'.

Defaults to 0.7.

Returns:

DataFrame

1st column, NAME, Score name
2nd column, VALUE, Score value
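The choice of resampling_method can be summarized as a small lookup table. The sketch below follows the defaults stated for each estimator type ('stratified_cv' for classifiers, 'cv' as the only option for regressors, 'rocv' for time series). `RESAMPLING_METHODS` and `pick_resampling_method` are hypothetical names and not part of hana_ml.

```python
# Allowed resampling methods per estimator type; the first entry is the default.
RESAMPLING_METHODS = {
    'classifier': ('stratified_cv', 'cv'),
    'regressor': ('cv',),
    'timeseries': ('rocv', 'block', 'simple_split'),
}


def pick_resampling_method(estimator_type, resampling_method=None):
    """Return a valid resampling method, falling back to the documented default."""
    allowed = RESAMPLING_METHODS[estimator_type]
    if resampling_method is None:
        return allowed[0]
    if resampling_method not in allowed:
        raise ValueError(
            f"{resampling_method!r} is not valid for a {estimator_type}; "
            f"choose one of {allowed}")
    return resampling_method


print(pick_resampling_method('classifier'))           # stratified_cv
print(pick_resampling_method('timeseries', 'block'))  # block
```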

transform(data=None, key=None, features=None, label=None)

Transform function for a pipeline.

Parameters:

dataDataFrame, optional

SAP HANA DataFrame.

Defaults to None.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featuresa list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable.

Defaults to the name of the last non-ID column.

Returns:

The transformed DataFrame.

Inherited Methods from PALBase

Besides the methods mentioned above, the Pipeline class also inherits methods from the PALBase class. Please refer to PAL Base for more details.