Pipeline

class hana_ml.algorithms.pal.pipeline.Pipeline(steps)

Pipeline construction to run transformers and estimators sequentially.

Parameters:

steps (list): List of (name, transform) tuples that are chained. The last object should be an estimator.

Attributes:

fit_hdbprocedure: Returns the generated hdbprocedure for fit.
predict_hdbprocedure: Returns the generated hdbprocedure for predict.

Methods

`abap_class_mapping`(value)	Map the ABAP class.
`add_amdp_item`(template_key, value)	Add item.
`add_amdp_name`(amdp_name)	Add AMDP name.
`add_amdp_template`(template_name)	Add AMDP template.
`build_amdp_class`()	After add_amdp_item, generate the AMDP file from the template.
`create_amdp_class`(amdp_name, ...)	Create AMDP class file.
`disable_mlflow_autologging`()	Disables mlflow autologging.
`enable_mlflow_autologging`([schema, meta, ...])	Enables mlflow autologging.
`evaluate`(data[, key, features, label, ...])	Evaluation function for a pipeline.
`fit`(data[, key, features, label, ...])	Fit function for a pipeline.
`fit_predict`(data[, apply_data, fit_params, ...])	Fit all the transforms one after the other and transform the data, then fit_predict the transformed data using the last estimator.
`fit_transform`(data[, fit_params])	Fit all the transforms one after the other and transform the data.
`generate_json_pipeline`()	Generate the json formatted pipeline for auto-ml's pipeline_fit function.
`get_amdp_notfillin_key`()	Get the AMDP template keys that have not been filled in.
`load_abap_class_mapping`()	Load ABAP class mapping.
`load_amdp_template`(template_name)	Load AMDP template.
`plot`([name, iframe_height])	Plot a pipeline.
`predict`(data[, key, features, model])	Predict function for a pipeline.
`score`(data[, key, features, label, model, ...])	Score function for a fitted pipeline model.
`write_amdp_file`([filepath, version, outdir])	Write template to file.

enable_mlflow_autologging(schema=None, meta=None, is_exported=False, registered_model_name=None)

Enables mlflow autologging. Only works for the fit function.

Parameters:

schema (str, optional)

Defines the model storage schema for mlflow autologging.

Defaults to the current schema.

meta (str, optional)

Defines the model storage meta table for mlflow autologging.

Defaults to 'HANAML_MLFLOW_MODEL_STORAGE'.

is_exported (bool, optional)

Determines whether to export the HANA model to mlflow.

Defaults to False.

registered_model_name (str, optional)

The registered_model_name to use in mlflow.

Defaults to None.

disable_mlflow_autologging(): It will disable mlflow autologging.

fit_transform(data, fit_params=None)

Fit all the transforms one after the other and transform the data.

Parameters:

data (DataFrame)

SAP HANA DataFrame to be transformed in the pipeline.

fit_params (dict, optional)

Parameters for the transformers, keyed by step name: parameter p for step s has key s__p.

Defaults to None.

Returns:

DataFrame: The transformed SAP HANA DataFrame.

Examples

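>>> from hana_ml.algorithms.pal.pipeline import Pipeline
>>> from hana_ml.algorithms.pal.decomposition import PCA
>>> from hana_ml.algorithms.pal.preprocessing import Imputer
>>> my_pipeline = Pipeline([
...     ('PCA', PCA(scaling=True, scores=True)),
...     ('imputer', Imputer(strategy='mean'))
... ])
>>> fit_params = {'PCA__key': 'ID', 'PCA__label': 'CLASS'}
>>> my_pipeline.fit_transform(data=train_df, fit_params=fit_params)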
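The s__p key convention can be illustrated without a HANA connection. The following is a small sketch of how such a dictionary could be grouped by step name; `split_step_params` is a hypothetical helper written for illustration and is not the routine hana_ml uses internally.

```python
from collections import defaultdict


def split_step_params(fit_params):
    """Group {'step__param': value} entries into {'step': {'param': value}}."""
    grouped = defaultdict(dict)
    for key, value in (fit_params or {}).items():
        # Split on the first double underscore only, so a parameter name
        # that itself contains '__' stays intact.
        step, sep, param = key.partition("__")
        if not sep or not step or not param:
            raise ValueError(f"expected '<step>__<param>', got {key!r}")
        grouped[step][param] = value
    return dict(grouped)


fit_params = {"PCA__key": "ID", "PCA__label": "CLASS"}
print(split_step_params(fit_params))
# {'PCA': {'key': 'ID', 'label': 'CLASS'}}
```

Note that the prefix must match the step name exactly as given in the steps list, including case ('PCA' in the example above).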

fit(data, key=None, features=None, label=None, fit_params=None, categorical_variable=None, generate_json_pipeline=False, use_pal_pipeline_fit=True, endog=None, exog=None, model_table_name=None)

Fit function for a pipeline.

Parameters:

data (DataFrame)

SAP HANA DataFrame.

key (str, optional)

Name of the ID column.

If key is not provided, then:

if data is indexed by a single column, then key defaults to that index column;

otherwise, it is assumed that data contains no ID column.

features (list of str, optional)

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label (str, optional)

Name of the dependent variable.

Defaults to the name of the last non-ID column.

fit_params (dict, optional)

Parameters for the transformers and the estimator, keyed by step name: parameter p for step s has key s__p.

Defaults to None.

categorical_variable (str or list of str, optional)

Specifies INTEGER column(s) that should be treated as categorical data. Other INTEGER columns will be treated as continuous.

generate_json_pipeline (bool, optional)

Whether to generate the JSON-formatted pipeline.

Defaults to False.

use_pal_pipeline_fit (bool, optional)

Whether to use PAL's pipeline fit function instead of the original chain execution.

Defaults to True.

endog (str, optional)

Specifies the endogenous variable in time-series data. Please use endog instead of label if data is time-series data.

Defaults to the name of the first non-key column in data.

exog (str or list of str, optional)

Specifies the exogenous variables in time-series data. Please use exog instead of features if data is time-series data.

Defaults to:

the list of names of all non-key, non-endog columns in data, if the final estimator is not based on ExponentialSmoothing;

[] otherwise.

model_table_name (str, optional)

Specifies the name of the HANA table that stores the model, used instead of the generated temporary table.

Defaults to None.
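The endog and exog defaults described above can be written down as a short function. This is only a sketch of the documented rule on plain column names; `resolve_endog_exog` and its `exponential_smoothing` flag are hypothetical names used for illustration, not hana_ml functions.

```python
def resolve_endog_exog(columns, key, endog=None, exog=None,
                       exponential_smoothing=False):
    """Apply the documented time-series defaults to a list of column names.

    columns: ordered column names of the input data.
    key: name of the ID (time) column.
    exponential_smoothing: True if the final estimator is
        ExponentialSmoothing based (hypothetical flag for this sketch).
    """
    non_key = [c for c in columns if c != key]
    if endog is None:
        endog = non_key[0]  # first non-key column
    if exog is None:
        # ExponentialSmoothing-based estimators get no exogenous columns.
        exog = [] if exponential_smoothing else [c for c in non_key if c != endog]
    elif isinstance(exog, str):
        exog = [exog]  # a single name is accepted as well as a list
    return endog, exog


print(resolve_endog_exog(["TS", "Y", "X1", "X2"], key="TS"))
# ('Y', ['X1', 'X2'])
```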

Examples

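>>> from hana_ml.algorithms.pal.pipeline import Pipeline
>>> from hana_ml.algorithms.pal.decomposition import PCA
>>> from hana_ml.algorithms.pal.preprocessing import Imputer
>>> from hana_ml.algorithms.pal.trees import HybridGradientBoostingClassifier
>>> # cv_range (the cross-validation search range) is assumed to be defined earlier.
>>> my_pipeline = Pipeline([
...     ('pca', PCA(scaling=True, scores=True)),
...     ('imputer', Imputer(strategy='mean')),
...     ('hgbt', HybridGradientBoostingClassifier(
...         n_estimators=4, split_threshold=0, learning_rate=0.5, fold_num=5,
...         max_depth=6, cross_validation_range=cv_range))
... ])
>>> fit_params = {'pca__key': 'ID',
...               'pca__label': 'CLASS',
...               'hgbt__key': 'ID',
...               'hgbt__label': 'CLASS',
...               'hgbt__categorical_variable': 'CLASS'}
>>> hgbt_model = my_pipeline.fit(data=train_data, fit_params=fit_params)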
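When key, features, and label are omitted, fit falls back to the defaults listed in the parameter descriptions. The function below is a sketch of those rules on plain column names; `resolve_columns` is a hypothetical helper for illustration, and representing the index as a string or list of strings is an assumption of this sketch.

```python
def resolve_columns(columns, index=None, key=None, features=None, label=None):
    """Apply the documented defaults for key, label, and features.

    columns: ordered column names of the input data.
    index: column name (or list of names) the data is indexed by, or None.
    """
    if key is None:
        if isinstance(index, str):
            key = index  # data indexed by a single column
        elif index is not None and len(index) == 1:
            key = index[0]
        # Otherwise the data is assumed to contain no ID column.
    non_id = [c for c in columns if c != key]
    if label is None:
        label = non_id[-1]  # last non-ID column
    if features is None:
        features = [c for c in non_id if c != label]  # non-ID, non-label
    return key, features, label


print(resolve_columns(["ID", "X1", "X2", "CLASS"], index="ID"))
# ('ID', ['X1', 'X2'], 'CLASS')
```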

predict(data, key=None, features=None, model=None)

Predict function for a pipeline.

Parameters:

data (DataFrame)

SAP HANA DataFrame.

key (str, optional)

Name of the ID column.

Mandatory if data is not indexed, or is indexed by multiple columns.

Defaults to the index of data if data is indexed by a single column.

features (list of str, optional)

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

model (DataFrame, optional)

The model to be used for prediction.

Defaults to the fitted model (model_).

Returns:

DataFrame

Predicted result, structured as follows:

1st column: Data type and name same as the 1st column of data.
2nd column: SCORE, predicted values (for regression) or class labels (for classification).

Attributes:

predict_info_ (DataFrame)

Structured as follows:

1st column: STAT_NAME.
2nd column: STAT_VALUE.

score(data, key=None, features=None, label=None, model=None, random_state=None, top_k_attributions=None, sample_size=None, verbose_output=None)

Score function for a fitted pipeline model.

Parameters:

data (DataFrame)

SAP HANA DataFrame.

key (str, optional)

Name of the ID column.

If key is not provided, then:

if data is indexed by a single column, then key defaults to that index column;

otherwise, it is assumed that data contains no ID column.

features (list of str, optional)

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label (str, optional)

Name of the dependent variable.

Defaults to the name of the last non-ID column.

model (str, optional)

The trained model.

Defaults to self.model_.

random_state (int, optional)

Specifies the seed for the random number generator.

0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.

Valid only when model table has background data information.

Defaults to 0.

top_k_attributions (int, optional)

Outputs the attributions of the top k features that contribute the most. Valid only when the model table has background data information.

Defaults to 10.

sample_size (int, optional)

Specifies the number of sampled combinations of features.

It is better to use a number that is greater than the number of features in data.

If set to 0, the number is determined heuristically by the algorithm.

Defaults to 0.

verbose_output (bool, optional)

Specifies whether to output all classes and the corresponding confidences for each data point.

True: Outputs the probability of all label categories.
False: Outputs the category of the highest probability only.

Valid only for classification.

Defaults to False.
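The effect of verbose_output can be pictured with a toy example. The sketch below works on an in-memory dictionary of class probabilities for a single input row; the function name `format_prediction` and the (ID, label, confidence) row layout are hypothetical and only illustrate the difference between the two settings, not the exact table PAL returns.

```python
def format_prediction(row_id, class_probs, verbose_output=False):
    """Return illustrative output rows (ID, label, confidence) for one input row."""
    if verbose_output:
        # One row per label category, each with its probability.
        return [(row_id, label, prob) for label, prob in class_probs.items()]
    # Only the category with the highest probability.
    best = max(class_probs, key=class_probs.get)
    return [(row_id, best, class_probs[best])]


probs = {"A": 0.7, "B": 0.3}
print(format_prediction(1, probs))
# [(1, 'A', 0.7)]
print(format_prediction(1, probs, verbose_output=True))
# [(1, 'A', 0.7), (1, 'B', 0.3)]
```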

Returns:

DataFrame 1

Prediction result, structured as follows:

1st column, ID of input data.
2nd column, SCORE, class assignment.
3rd column, REASON CODE, attribution of features.
4th & 5th column, placeholder columns for future implementations.

DataFrame 2

Statistics, structured as follows:

1st column, STAT_NAME
2nd column, STAT_VALUE
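To make the top_k_attributions setting concrete, the sketch below selects the k largest feature attributions from a plain dictionary. Ranking by absolute contribution is an assumption of this sketch, and `top_k_attributions` here is a hypothetical stand-alone function, not the PAL routine that produces the REASON CODE column.

```python
def top_k_attributions(attributions, k=10):
    """Return the k (feature, attribution) pairs with the largest magnitude."""
    # Assumption: "contribute the most" means largest absolute attribution.
    ranked = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]


attributions = {"age": 0.1, "income": -0.5, "tenure": 0.3}
print(top_k_attributions(attributions, k=2))
# [('income', -0.5), ('tenure', 0.3)]
```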

fit_predict(data, apply_data=None, fit_params=None, predict_params=None)

Fit all the transforms one after the other and transform the data, then fit_predict the transformed data using the last estimator.

Parameters:

data (DataFrame)

SAP HANA DataFrame to be transformed in the pipeline.

apply_data (DataFrame, optional)

SAP HANA DataFrame to be predicted in the pipeline.

Defaults to None.

fit_params (dict, optional)

Parameters for the transformers and the estimator, keyed by step name: parameter p for step s has key s__p.

Defaults to None.

predict_params (dict, optional)

Parameters for the predictor, keyed by step name: parameter p for step s has key s__p.

Defaults to None.

Returns:

DataFrame: A SAP HANA DataFrame.

Examples

>>> my_pipeline = Pipeline([
...     ('pca', PCA(scaling=True, scores=True)),
...     ('imputer', Imputer(strategy='mean')),
...     ('hgbt', HybridGradientBoostingClassifier(
...         n_estimators=4, split_threshold=0, learning_rate=0.5, fold_num=5,
...         max_depth=6, cross_validation_range=cv_range))
... ])
>>> fit_params = {'pca__key': 'ID',
...               'pca__label': 'CLASS',
...               'hgbt__key': 'ID',
...               'hgbt__label': 'CLASS',
...               'hgbt__categorical_variable': 'CLASS'}
>>> result = my_pipeline.fit_predict(data=train_data, apply_data=test_data, fit_params=fit_params)
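The chaining behind fit_predict can be sketched with toy in-memory steps. Everything in this block (the `Center` and `SignClassifier` classes and the `fit_predict` function) is hypothetical and operates on Python lists rather than SAP HANA DataFrames; in particular, applying the already fitted transformers to apply_data is an assumption about how the chain treats the second dataset.

```python
class Center:
    """Toy transformer: subtracts the mean learned from the training data."""

    def fit_transform(self, data):
        self.mean_ = sum(data) / len(data)
        return self.transform(data)

    def transform(self, data):
        return [x - self.mean_ for x in data]


class SignClassifier:
    """Toy estimator: predicts 1 for non-negative inputs, else 0."""

    def fit(self, data):
        return self

    def predict(self, data):
        return [1 if x >= 0 else 0 for x in data]


def fit_predict(steps, data, apply_data=None):
    """Fit each transformer in order, then fit and predict with the last step."""
    *transformers, (_, estimator) = steps
    fitted = data
    applied = data if apply_data is None else apply_data
    for _, transformer in transformers:
        fitted = transformer.fit_transform(fitted)
        # Without apply_data, predictions are made on the transformed training data.
        applied = fitted if apply_data is None else transformer.transform(applied)
    estimator.fit(fitted)
    return estimator.predict(applied)


steps = [("center", Center()), ("clf", SignClassifier())]
print(fit_predict(steps, [1, 2, 3]))                     # [0, 1, 1]
print(fit_predict(steps, [1, 2, 3], apply_data=[0, 5]))  # [0, 1]
```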

plot(name='my_pipeline', iframe_height=450)

Plot a pipeline.

Parameters:

name (str, optional)

Pipeline name.

Defaults to "my_pipeline".

iframe_height (int, optional)

Height of iframe.

Defaults to 450.

generate_json_pipeline(): Generate the json formatted pipeline for auto-ml's pipeline_fit function.

create_amdp_class(amdp_name, training_dataset, apply_dataset)

Creates the AMDP class file. build_amdp_class can then be called to generate the AMDP class.

Parameters:

amdp_name (str): Name of the AMDP class.
training_dataset (str): Name of the training dataset.
apply_dataset (str): Name of the apply dataset.
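The AMDP methods are meant to be called in sequence. The wrapper below is a sketch of one plausible order (create, build, write), inferred from the method descriptions on this page rather than from a verified workflow; `export_amdp_class` is a hypothetical name, and `pipeline` is assumed to be a fitted Pipeline instance.

```python
def export_amdp_class(pipeline, amdp_name, training_dataset, apply_dataset,
                      outdir="out"):
    """Hypothetical helper: generate an AMDP class and write it to outdir."""
    # 1. Set up the AMDP class from the pipeline and the dataset names.
    pipeline.create_amdp_class(amdp_name, training_dataset, apply_dataset)
    # 2. Generate the AMDP class from the template.
    pipeline.build_amdp_class()
    # 3. Write the result to a file below outdir.
    pipeline.write_amdp_file(outdir=outdir)
```

If build_amdp_class reports missing entries, get_amdp_notfillin_key and add_amdp_item may help locate and fill them before building again.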

evaluate(data, key=None, features=None, label=None, categorical_variable=None, resampling_method=None, fold_num=None, random_state=None)

Evaluation function for a pipeline.

Parameters:

data (DataFrame)

SAP HANA DataFrame.

key (str, optional)

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features (list of str, optional)

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label (str, optional)

Name of the dependent variable.

Defaults to the name of the last non-ID column.

categorical_variable (str or list of str, optional)

Specifies INTEGER column(s) that should be treated as categorical data. Other INTEGER columns will be treated as continuous.

Defaults to None.

resampling_method (str, optional)

The resampling method for pipeline model evaluation. The valid options depend on the type of the final estimator:

classifier: {'cv', 'stratified_cv'}
regressor: {'cv'}
timeseries: {'rocv', 'block', 'simple_split'}

Defaults to 'stratified_cv' if the estimator in the pipeline is a classifier, to 'cv' (the only valid option) if it is a regressor, and to 'rocv' if it is a time-series estimator.

fold_num (int, optional)

The fold number for cross validation.

Defaults to 5.

random_state (int, optional)

Specifies the seed for the random number generator.

0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.

Defaults to 0.

Returns:

DataFrame

1st column, NAME, Score name
2nd column, VALUE, Score value
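The resampling_method defaults can be summarized in a small lookup. This is a sketch that follows the default rule stated for resampling_method, namely that 'stratified_cv' belongs to classifiers and a regressor can only use 'cv'; `RESAMPLING_OPTIONS` and `resolve_resampling_method` are hypothetical names, not part of hana_ml.

```python
# Valid options per estimator type; the first entry is the default.
RESAMPLING_OPTIONS = {
    "classifier": ("stratified_cv", "cv"),
    "regressor": ("cv",),
    "timeseries": ("rocv", "block", "simple_split"),
}


def resolve_resampling_method(estimator_kind, resampling_method=None):
    """Return the resampling method to use, applying the documented default."""
    options = RESAMPLING_OPTIONS[estimator_kind]
    if resampling_method is None:
        return options[0]
    if resampling_method not in options:
        raise ValueError(
            f"{resampling_method!r} is not valid for a {estimator_kind}; "
            f"choose one of {options}"
        )
    return resampling_method


print(resolve_resampling_method("classifier"))  # stratified_cv
print(resolve_resampling_method("regressor"))   # cv
print(resolve_resampling_method("timeseries"))  # rocv
```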

abap_class_mapping(value): Mapping the abap class.

add_amdp_item(template_key, value): Add item.

add_amdp_name(amdp_name): Add AMDP name.

add_amdp_template(template_name): Add AMDP template

build_amdp_class(): After add_item, generate amdp file from template.

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

get_amdp_notfillin_key(): Get AMDP not fillin keys.

load_abap_class_mapping(): Load ABAP class mapping.

load_amdp_template(template_name): Load AMDP template

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

write_amdp_file(filepath=None, version=1, outdir='out'): Write template to file.

Inherited Methods from PALBase

In addition to the methods described above, the Pipeline class inherits methods from the PALBase class. Please refer to PAL Base for more details.