Pipeline
- class hana_ml.algorithms.pal.pipeline.Pipeline(steps=None, pipeline=None)
Pipeline construction to run transformers and estimators sequentially.
- Parameters:
- steps : list
List of (name, transform) tuples that are chained. The last object should be an estimator.
Methods
abap_class_mapping(value): Mapping the abap class.
add_amdp_item(template_key, value): Add item.
add_amdp_name(amdp_name): Add AMDP name.
add_amdp_template(template_name): Add AMDP template.
build_amdp_class(): After add_item, generate amdp file from template.
create_amdp_class(amdp_name, ...): Create AMDP class file.
disable_mlflow_autologging(): It will disable mlflow autologging.
enable_mlflow_autologging([schema, meta, ...]): Enables mlflow autologging.
evaluate(data[, key, features, label, ...]): Evaluation function for a pipeline.
fit(data[, key, features, label, ...]): Fit function for a pipeline.
fit_predict(data[, apply_data, fit_params, ...]): Fit all the transformers one after another and transform the data, then fit_predict the transformed data using the final estimator.
fit_transform(data[, fit_params]): Fit all the transforms one after the other and transform the data.
generate_json_pipeline([pivot]): Generate the json formatted pipeline for pipeline fit function.
get_amdp_notfillin_key(): Get AMDP not fillin keys.
load_abap_class_mapping(): Load ABAP class mapping.
load_amdp_template(template_name): Load AMDP template.
plot([name, iframe_height]): Plot a pipeline.
predict(data[, key, features, model, exog, ...]): Predict function for a pipeline.
score(data[, key, features, label, model, ...]): Score function for a fitted pipeline model.
transform([data, key, features, label]): Transform function for a pipeline.
write_amdp_file([filepath, version, outdir]): Write template to file.
- enable_mlflow_autologging(schema=None, meta=None, is_exported=False, registered_model_name=None)
Enables mlflow autologging. Only works for fit function.
- Parameters:
- schema : str, optional
Defines the model storage schema for mlflow autologging.
Defaults to the current schema.
- meta : str, optional
Defines the model storage meta table for mlflow autologging.
Defaults to 'HANAML_MLFLOW_MODEL_STORAGE'.
- is_exported : bool, optional
Determines whether to export the HANA model to mlflow.
Defaults to False.
- registered_model_name : str, optional
The mlflow registered_model_name.
Defaults to None.
- disable_mlflow_autologging()
Disables mlflow autologging.
- fit_transform(data, fit_params=None)
Fit all the transforms one after the other and transform the data.
- Parameters:
- data : DataFrame
SAP HANA DataFrame to be transformed in the pipeline.
- fit_params : dict, optional
Parameters for the transformers, keyed by step name: parameter p for step s has key s__p.
Defaults to None.
- Returns:
- DataFrame
The transformed SAP HANA DataFrame.
Examples
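>>> from hana_ml.algorithms.pal.pipeline import Pipeline
>>> from hana_ml.algorithms.pal.decomposition import PCA
>>> from hana_ml.algorithms.pal.preprocessing import Imputer
>>> my_pipeline = Pipeline([
...     ('PCA', PCA(scaling=True, scores=True)),
...     ('imputer', Imputer(strategy='mean'))
... ])
>>> fit_params = {'PCA__key': 'ID', 'PCA__label': 'CLASS'}
>>> my_pipeline.fit_transform(data=train_df, fit_params=fit_params)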
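The s__p key convention used by fit_params can be illustrated with a minimal sketch in plain Python. It is not hana_ml's implementation: the helper name split_step_params is hypothetical, and it only shows how prefixed keys could be routed to the steps they name.

```python
def split_step_params(step_names, fit_params):
    """Group 's__p' keys by step name s, returning {step: {param: value}}.

    Hypothetical helper for illustration; not part of hana_ml.
    """
    grouped = {name: {} for name in step_names}
    for key, value in (fit_params or {}).items():
        # Split on the first double underscore: 'PCA__key' -> ('PCA', 'key').
        step, sep, param = key.partition('__')
        if not sep or step not in grouped:
            raise KeyError(f"cannot route parameter {key!r} to a step")
        grouped[step][param] = value
    return grouped


steps = ['PCA', 'imputer']
fit_params = {'PCA__key': 'ID', 'PCA__label': 'CLASS'}
routed = split_step_params(steps, fit_params)
print(routed)  # {'PCA': {'key': 'ID', 'label': 'CLASS'}, 'imputer': {}}
```

A key whose prefix does not match any step name raises KeyError in this sketch, which is one way to surface a misspelled step name early.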
- fit(data, key=None, features=None, label=None, fit_params=None, categorical_variable=None, generate_json_pipeline=False, use_pal_pipeline_fit=True, endog=None, exog=None, model_table_name=None, use_explain=None, explain_method=None, background_size=None, background_sampling_seed=None)
Fit function for a pipeline.
- Parameters:
- data : DataFrame
SAP HANA DataFrame.
- key : str, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column.
- fit_params : dict, optional
Parameters for the transformers and the estimator, keyed by step name: parameter p for step s has key s__p.
Defaults to None.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- generate_json_pipeline : bool, optional
Whether to generate the JSON-formatted pipeline.
Defaults to False.
- use_pal_pipeline_fit : bool, optional
Use PAL's pipeline fit function instead of the original chain execution.
Defaults to True.
- endog : str, optional
Specifies the endogenous variable in time-series data. Please use endog instead of label if data is time-series data.
Defaults to the name of the first non-key column in data.
- exog : str or a list of str, optional
Specifies the exogenous variables in time-series data. Please use exog instead of features if data is time-series data.
Defaults to:
the list of names of all non-key, non-endog columns in data if the final estimator is not ExponentialSmoothing based;
[] otherwise.
- model_table_name : str, optional
Specifies the HANA model table name instead of the generated temporary table.
Defaults to None.
- use_explain : bool, optional
Specifies whether to store information for pipeline explanation. Please note that this option is applicable only when the estimator in the pipeline is a Classifier, Regressor or Timeseries.
Defaults to False.
- explain_method : str, optional
Specifies the explanation method. Only valid when use_explain is set to True and the estimator in the pipeline is a Classifier or Regressor.
Options are:
'kernelshap' : This method makes explanations by utilizing the Kernel SHAP. For this option to be functional, the background_size parameter should be greater than 0.
'globalsurrogate' : This method makes explanations by utilizing the Global Surrogate method.
Defaults to 'globalsurrogate'.
- background_size : int, optional
The number of background data used in Kernel SHAP. Only valid when explain_method is 'kernelshap'. It should not be larger than the row size of train data.
Dependencies:
Classifier/Regressor: This option is only valid when use_explain is set to True, and explain_method is 'kernelshap'.
Timeseries: This option is only valid when use_explain is True.
Defaults to None.
- background_sampling_seed : int, optional
Specifies the seed for random number generator in the background sampling.
0: Uses the current time (in seconds) as seed
Others: Uses the specified value as seed
Dependencies:
Classifier/Regressor: This option is only valid when use_explain is set to True, and explain_method is 'kernelshap'.
Timeseries: This option is only valid when use_explain is True.
Defaults to 0.
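How background_size and background_sampling_seed interact can be pictured with a small sketch in plain Python. This is an illustration under assumptions, not PAL's sampler: sample_background_rows is a hypothetical helper, and Python's random module stands in for whatever generator PAL uses internally.

```python
import random
import time


def sample_background_rows(n_rows, background_size, seed=0):
    """Pick background_size distinct row positions out of n_rows.

    Hypothetical helper for illustration; not part of hana_ml or PAL.
    """
    # background_size must be positive and no larger than the training row count.
    if not 0 < background_size <= n_rows:
        raise ValueError("background_size must be between 1 and n_rows")
    # Seed 0 means "use the current time (in seconds)"; other values are used as-is.
    rng = random.Random(int(time.time()) if seed == 0 else seed)
    return sorted(rng.sample(range(n_rows), background_size))


rows_a = sample_background_rows(n_rows=100, background_size=10, seed=42)
rows_b = sample_background_rows(n_rows=100, background_size=10, seed=42)
print(rows_a == rows_b)  # True: the same non-zero seed reproduces the sample
```

With the default seed of 0 the sample generally differs between runs, so a non-zero seed is the choice to make when reproducible explanations matter.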
Examples
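>>> from hana_ml.algorithms.pal.pipeline import Pipeline
>>> from hana_ml.algorithms.pal.decomposition import PCA
>>> from hana_ml.algorithms.pal.preprocessing import Imputer
>>> from hana_ml.algorithms.pal.trees import HybridGradientBoostingClassifier
>>> # cv_range is assumed to be defined beforehand (its value is not shown here)
>>> my_pipeline = Pipeline([
...     ('pca', PCA(scaling=True, scores=True)),
...     ('imputer', Imputer(strategy='mean')),
...     ('hgbt', HybridGradientBoostingClassifier(
...         n_estimators=4, split_threshold=0, learning_rate=0.5,
...         fold_num=5, max_depth=6, cross_validation_range=cv_range))
... ])
>>> fit_params = {'pca__key': 'ID', 'pca__label': 'CLASS',
...               'hgbt__key': 'ID', 'hgbt__label': 'CLASS',
...               'hgbt__categorical_variable': 'CLASS'}
>>> hgbt_model = my_pipeline.fit(data=train_data, fit_params=fit_params)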
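The defaulting rules for key, features and label described in the parameter list of fit can be written out as a minimal sketch in plain Python. The function resolve_columns and its index argument are hypothetical and only mirror the documented rules; hana_ml resolves these roles against a SAP HANA DataFrame internally.

```python
def resolve_columns(columns, index=None, key=None, features=None, label=None):
    """Apply the documented defaults for key, features and label.

    Hypothetical helper for illustration; not part of hana_ml.
    """
    # key falls back to the index only when the index is a single column.
    if key is None:
        if isinstance(index, str):
            key = index
        elif index is not None and len(index) == 1:
            key = index[0]
    non_id = [c for c in columns if c != key]
    # label defaults to the last non-ID column.
    if label is None:
        label = non_id[-1]
    # features default to all non-ID, non-label columns.
    if features is None:
        features = [c for c in non_id if c != label]
    return key, features, label


roles = resolve_columns(['ID', 'X1', 'X2', 'CLASS'], index='ID')
print(roles)  # ('ID', ['X1', 'X2'], 'CLASS')
```

When no single-column index exists, the sketch treats the data as having no ID column, so the first column would be counted as a feature.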
- predict(data, key=None, features=None, model=None, exog=None, predict_args=None, show_explainer=False, top_k_attributions=None, random_state=None, sample_size=None, verbose_output=None)
Predict function for a pipeline.
- Parameters:
- data : DataFrame
SAP HANA DataFrame.
- key : str, optional
Name of the ID column. Mandatory if data is not indexed, or is indexed by multiple columns.
Defaults to the index of data if data is indexed by a single column.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- model : DataFrame, optional
The model to be used for prediction.
Defaults to the fitted model (model_).
- exog : list of str, optional
Names of exogenous variables.
Please use exog instead of features when the final estimator of the Pipeline object is for TimeSeries.
Defaults to all non-key columns if not provided.
- predict_args : dict, optional
Specifies the parameters for the predict method of the estimator of the target pipeline, with keys being parameter names and values being parameter values.
For example, suppose the input pipeline is [('PCA', PCA()), ('RDT', RDTClassifier(algorithms='cart'))], and the estimator RDTClassifier can take the following parameters when making predictions: block_size, missing_replacement. Then, we can specify the values of these two parameters as follows:
predict_args = {'block_size': 5, 'missing_replacement': 'instance_marginalized'}
Defaults to None.
- show_explainer : bool, optional
If True, the reason code of the pipeline will be returned. Please note that this option is applicable only when the estimator in the pipeline is either a Classifier or a Regressor.
Defaults to False.
- top_k_attributions : int, optional
Displays the top k attributions in reason code. Only valid when show_explainer is set to True. Effective only when model contains background data from the training phase.
Defaults to PAL's default value.
- random_state : int, optional
Specifies the random seed. Only valid when show_explainer is set to True.
Defaults to 0 (system time).
- sample_size : int, optional
Specifies the number of sampled combinations of features. Only valid when show_explainer is set to True.
It is better to use a number that is greater than the number of features in data.
If set as 0, it is determined by algorithm heuristically.
Defaults to 0.
- verbose_output : bool, optional
True: Outputs the probability of all label categories.
False: Outputs the category of the highest probability only.
Only valid when show_explainer is set to True.
Defaults to True.
- Returns:
- DataFrame
Predicted result, structured as follows:
1st column: Data type and name same as the 1st column of data.
2nd column: SCORE, predicted values (for regression) or class labels (for classification).
3rd column: CONFIDENCE, confidence of a class (available only if show_explainer is True).
4th column: REASON CODE, attributions of features (available only if show_explainer is True).
5th & 6th columns: placeholder columns for future implementations (available only if show_explainer is True).
- Attributes:
- predict_info_ : DataFrame
Structured as follows:
1st column: STAT_NAME.
2nd column: STAT_VALUE.
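The effect of top_k_attributions on the reason code can be pictured with a small sketch in plain Python. The attribution values below are made up for illustration, the helper top_k_attributions is hypothetical, and ranking by absolute contribution is an assumption rather than a documented PAL rule.

```python
def top_k_attributions(attributions, k):
    """Keep the k features with the largest absolute attribution.

    Hypothetical helper for illustration; not part of hana_ml or PAL.
    """
    ranked = sorted(attributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


# Made-up attribution values, for illustration only.
attributions = {'X1': 0.4, 'X2': -0.55, 'X3': 0.05}
top_two = top_k_attributions(attributions, k=2)
print(top_two)  # [('X2', -0.55), ('X1', 0.4)]
```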
- score(data, key=None, features=None, label=None, model=None, random_state=None, top_k_attributions=None, sample_size=None, verbose_output=None, predict_args=None, endog=None, exog=None)
Score function for a fitted pipeline model.
- Parameters:
- data : DataFrame
SAP HANA DataFrame.
- key : str, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column.
- model : str, optional
The trained model.
Defaults to self.model_.
- random_state : int, optional
Specifies the seed for random number generator.
0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.
Valid only when model table has background data information.
Defaults to 0.
- top_k_attributions : int, optional
Outputs the attributions of top k features which contribute the most. Valid only when model table has background data information.
Defaults to 10.
- sample_size : int, optional
Specifies the number of sampled combinations of features.
It is better to use a number that is greater than the number of features in data.
If set as 0, it is determined by algorithm heuristically.
Defaults to 0.
- verbose_output : bool, optional
Specifies whether to output all classes and the corresponding confidences for each data.
True: Outputs the probability of all label categories.
False: Outputs the category of the highest probability only.
Valid only for classification.
Defaults to False.
- predict_args : dict, optional
Specifies the parameters for the predict method of the estimator of the target pipeline, with keys being parameter names and values being parameter values.
For example, suppose the input pipeline is [('PCA', PCA()), ('RDT', RDTClassifier(algorithms='cart'))], and the estimator RDTClassifier can take the following parameters when making predictions: block_size, missing_replacement. Then, we can specify the values of these two parameters as follows:
predict_args = {'block_size': 5, 'missing_replacement': 'instance_marginalized'}
Defaults to None.
- endog : str, optional
Specifies the endogenous variable in time-series data.
Please use endog instead of label if data is time-series.
Defaults to the name of the first non-key column in data.
- exog : str or a list of str, optional
Specifies the exogenous variables in time-series data.
Please use exog instead of features if data is time-series.
Defaults to:
the list of names of all non-key, non-endog columns in data if the final estimator is not ExponentialSmoothing based;
[] otherwise.
- Returns:
- DataFrame 1
Prediction result, structured as follows:
1st column, ID of input data.
2nd column, SCORE, class assignment.
3rd column, REASON CODE, attribution of features.
4th & 5th columns, placeholder columns for future implementations.
- DataFrame 2
Statistics, structured as follows:
1st column, STAT_NAME
2nd column, STAT_VALUE
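For time-series data, the documented defaults of endog and exog can be summarized in a minimal sketch in plain Python. The function resolve_endog_exog and its exp_smoothing_based flag are hypothetical names that restate the rules above; they are not hana_ml code.

```python
def resolve_endog_exog(columns, key, endog=None, exog=None, exp_smoothing_based=False):
    """Apply the documented defaults for endog and exog.

    Hypothetical helper for illustration; not part of hana_ml.
    """
    non_key = [c for c in columns if c != key]
    # endog defaults to the first non-key column.
    if endog is None:
        endog = non_key[0]
    if exog is None:
        # ExponentialSmoothing-based estimators take no exogenous columns by default.
        exog = [] if exp_smoothing_based else [c for c in non_key if c != endog]
    elif isinstance(exog, str):
        exog = [exog]
    return endog, exog


columns = ['TS', 'Y', 'TEMP', 'PROMO']
print(resolve_endog_exog(columns, key='TS'))
# ('Y', ['TEMP', 'PROMO'])
print(resolve_endog_exog(columns, key='TS', exp_smoothing_based=True))
# ('Y', [])
```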
- fit_predict(data, apply_data=None, fit_params=None, predict_params=None)
Fit all the transformers one after another and transform the data, then fit_predict the transformed data using the final estimator.
- Parameters:
- data : DataFrame
SAP HANA DataFrame to be transformed in the pipeline.
- apply_data : DataFrame, optional
SAP HANA DataFrame to be predicted in the pipeline.
Defaults to None.
- fit_params : dict, optional
Parameters for the transformers and the estimator, keyed by step name: parameter p for step s has key s__p.
Defaults to None.
- predict_params : dict, optional
Parameters for the predictor, keyed by step name: parameter p for step s has key s__p.
Defaults to None.
- Returns:
- DataFrame
A SAP HANA DataFrame.
Examples
>>> my_pipeline = Pipeline([
...     ('pca', PCA(scaling=True, scores=True)),
...     ('imputer', Imputer(strategy='mean')),
...     ('hgbt', HybridGradientBoostingClassifier(
...         n_estimators=4, split_threshold=0, learning_rate=0.5,
...         fold_num=5, max_depth=6, cross_validation_range=cv_range))
... ])
>>> fit_params = {'pca__key': 'ID', 'pca__label': 'CLASS',
...               'hgbt__key': 'ID', 'hgbt__label': 'CLASS',
...               'hgbt__categorical_variable': 'CLASS'}
>>> hgbt_model = my_pipeline.fit_predict(data=train_data, apply_data=test_data, fit_params=fit_params)
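The control flow behind fit_predict (fit and apply each transformer in order, then call fit_predict on the final estimator) can be sketched with toy classes in plain Python. Center, ThresholdClassifier and fit_predict_chain are hypothetical stand-ins operating on lists, and passing apply_data through the fitted transformers is an assumption of this sketch, not a statement about hana_ml internals.

```python
class Center:
    """Toy transformer: subtracts the mean learned during fit."""

    def fit_transform(self, data):
        self.mean_ = sum(data) / len(data)
        return self.transform(data)

    def transform(self, data):
        return [x - self.mean_ for x in data]


class ThresholdClassifier:
    """Toy estimator: learns the training mean and flags values above it."""

    def fit_predict(self, data, apply_data):
        self.threshold_ = sum(data) / len(data)
        return [int(x > self.threshold_) for x in apply_data]


def fit_predict_chain(steps, data, apply_data):
    """Run all transformers in order, then fit_predict with the last step."""
    *transformers, (_, estimator) = steps
    for _, transformer in transformers:
        data = transformer.fit_transform(data)
        # Assumption: apply_data goes through the same fitted transformers.
        apply_data = transformer.transform(apply_data)
    return estimator.fit_predict(data, apply_data)


steps = [('center', Center()), ('clf', ThresholdClassifier())]
predictions = fit_predict_chain(steps, data=[1.0, 2.0, 3.0], apply_data=[0.0, 2.5, 4.0])
print(predictions)  # [0, 1, 1]
```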
- abap_class_mapping(value)
Mapping the abap class.
- add_amdp_item(template_key, value)
Add item.
- add_amdp_name(amdp_name)
Add AMDP name.
- add_amdp_template(template_name)
Add AMDP template
- build_amdp_class()
After add_item, generate amdp file from template.
- get_amdp_notfillin_key()
Get AMDP not fillin keys.
- load_abap_class_mapping()
Load ABAP class mapping.
- load_amdp_template(template_name)
Load AMDP template
- plot(name='my_pipeline', iframe_height=450)
Plot a pipeline.
- Parameters:
- name : str, optional
Pipeline Name.
Defaults to "my_pipeline".
- iframe_height : int, optional
Height of iframe.
Defaults to 450.
- write_amdp_file(filepath=None, version=1, outdir='out')
Write template to file.
- generate_json_pipeline(pivot=False)
Generate the json formatted pipeline for pipeline fit function.
- create_amdp_class(amdp_name, training_dataset, apply_dataset)
Create AMDP class file. Then build_amdp_class can be called to generate amdp class.
- Parameters:
- amdp_name : str
Name of amdp.
- training_dataset : str
Name of training dataset.
- apply_dataset : str
Name of apply dataset.
- evaluate(data, key=None, features=None, label=None, categorical_variable=None, resampling_method=None, fold_num=None, random_state=None, endog=None, exog=None, gap_num=None, percentage=None)
Evaluation function for a pipeline.
- Parameters:
- data : DataFrame
SAP HANA DataFrame.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : a list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- resampling_method : str, optional
The resampling method for pipeline model evaluation. The available options depend on the type of the final estimator:
classifier: {'cv', 'stratified_cv'}
regressor: {'cv'}
timeseries: {'rocv', 'block', 'simple_split'}
Defaults to 'stratified_cv' if the estimator in pipeline is a classifier, defaults to (and can only be) 'cv' if the estimator in pipeline is a regressor, and defaults to 'rocv' if the estimator in pipeline is a timeseries.
- fold_num : int, optional
The fold number for cross validation.
Defaults to 5.
- random_state : int, optional
Specifies the seed for random number generator.
0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.
Defaults to 0.
- endog : str, optional
Specifies the endogenous variable in time-series data. Please use endog instead of label if data is time-series.
Defaults to the name of the first non-key column in data.
- exog : str or a list of str, optional
Specifies the exogenous variables in time-series data. Please use exog instead of features if data is time-series.
Defaults to:
the list of names of all non-key, non-endog columns in data if the final estimator is not ExponentialSmoothing based;
[] otherwise.
- gap_num : int, optional
Number of samples to exclude from the end of each train set before the test set. Valid only if the final estimator of the target pipeline is for time-series.
Defaults to 0.
- percentage : float, optional
Percentage between training data and test data. Only applicable when the final estimator of the target pipeline is for time-series, and resampling_method is set as 'block'.
Defaults to 0.7.
- Returns:
- DataFrame
1st column, NAME, Score name
2nd column, VALUE, Score value
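One possible reading of how percentage and gap_num shape a 'block' split for time-series evaluation is sketched below in plain Python. The function block_split is hypothetical, and treating percentage as the training share of the rows is an assumption; PAL's actual splitting logic may differ.

```python
def block_split(n_samples, percentage=0.7, gap_num=0):
    """Split row positions into one training block and one test block.

    Hypothetical helper for illustration; not part of hana_ml or PAL.
    """
    # Assumption: percentage is the share of rows that precede the test block.
    n_train = round(n_samples * percentage)
    # gap_num rows are dropped from the end of the training block.
    train = list(range(0, max(n_train - gap_num, 0)))
    test = list(range(n_train, n_samples))
    return train, test


train_rows, test_rows = block_split(n_samples=10, percentage=0.7, gap_num=1)
print(train_rows)  # [0, 1, 2, 3, 4, 5]
print(test_rows)   # [7, 8, 9]
```

The gap (row 6 in this example) belongs to neither block, which keeps the rows right before the test period out of training.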
- transform(data=None, key=None, features=None, label=None)
Transform function for a pipeline.
- Parameters:
- data : DataFrame, optional
SAP HANA DataFrame.
Defaults to None.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : a list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column.
- Returns:
- DataFrame
The transformed DataFrame.
Inherited Methods from PALBase
Besides the methods described above, the Pipeline class also inherits methods from the PALBase class. Please refer to PAL Base for more details.