Pipeline
- class hana_ml.algorithms.pal.pipeline.Pipeline(steps)
Pipeline construction to run transformers and estimators sequentially.
- Parameters:
- steps : list
List of (name, transform) tuples that are chained. The last object should be an estimator.
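For example, a pipeline chaining two preprocessing transformers and a final estimator might be constructed as follows (an illustrative sketch; the import paths and parameter values are assumptions based on the examples shown further below):
>>> from hana_ml.algorithms.pal.pipeline import Pipeline
>>> from hana_ml.algorithms.pal.decomposition import PCA
>>> from hana_ml.algorithms.pal.preprocessing import Imputer
>>> from hana_ml.algorithms.pal.trees import HybridGradientBoostingClassifier
>>> my_pipeline = Pipeline([
...     ('pca', PCA(scaling=True, scores=True)),          # transformer step
...     ('imputer', Imputer(strategy='mean')),            # transformer step
...     ('hgbt', HybridGradientBoostingClassifier(n_estimators=4, max_depth=6))  # final estimator
... ])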
- Attributes:
fit_hdbprocedure
Returns the generated hdbprocedure for fit.
predict_hdbprocedure
Returns the generated hdbprocedure for predict.
Methods
- abap_class_mapping(value): Mapping the abap class.
- add_amdp_item(template_key, value): Add item.
- add_amdp_name(amdp_name): Add AMDP name.
- add_amdp_template(template_name): Add AMDP template.
- build_amdp_class(): After add_item, generate amdp file from template.
- create_amdp_class(amdp_name, ...): Create AMDP class file.
- disable_mlflow_autologging(): It will disable mlflow autologging.
- enable_mlflow_autologging([schema, meta, ...]): Enables mlflow autologging.
- evaluate(data[, key, features, label, ...]): Evaluation function for a pipeline.
- fit(data[, key, features, label, ...]): Fit function for a pipeline.
- fit_predict(data[, apply_data, fit_params, ...]): Fit all the transforms one after the other and transform the data, then fit_predict the transformed data using the last estimator.
- fit_transform(data[, fit_params]): Fit all the transforms one after the other and transform the data.
- generate_json_pipeline(): Generate the json formatted pipeline for auto-ml's pipeline_fit function.
- get_amdp_notfillin_key(): Get AMDP not fillin keys.
- load_abap_class_mapping(): Load ABAP class mapping.
- load_amdp_template(template_name): Load AMDP template.
- plot([name, iframe_height]): Plot a pipeline.
- predict(data[, key, features, model]): Predict function for a pipeline.
- score(data[, key, features, label, model, ...]): Score function for a fitted pipeline model.
- write_amdp_file([filepath, version, outdir]): Write template to file.
- enable_mlflow_autologging(schema=None, meta=None, is_exported=False, registered_model_name=None)
Enables mlflow autologging; only effective for the fit function.
- Parameters:
- schema : str, optional
Defines the model storage schema for mlflow autologging.
Defaults to the current schema.
- meta : str, optional
Defines the model storage meta table for mlflow autologging.
Defaults to 'HANAML_MLFLOW_MODEL_STORAGE'.
- is_exported : bool, optional
Determines whether to export the HANA model to mlflow.
Defaults to False.
- registered_model_name : str, optional
The name under which the model is registered in mlflow.
Defaults to None.
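A minimal usage sketch (illustrative only; the tracking URI, schema name, and registered model name are placeholders, and a reachable mlflow tracking server plus a constructed pipeline my_pipeline are assumed):
>>> import mlflow
>>> mlflow.set_tracking_uri('http://localhost:5000')   # placeholder tracking server
>>> my_pipeline.enable_mlflow_autologging(schema='ML_SCHEMA',
...                                       registered_model_name='my_pipeline_model')
>>> my_pipeline.fit(data=train_data, key='ID', label='CLASS')   # this fit run is autologged to mlflow
>>> my_pipeline.disable_mlflow_autologging()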
- disable_mlflow_autologging()
Disables mlflow autologging.
- fit_transform(data, fit_params=None)
Fit all the transforms one after the other and transform the data.
- Parameters:
- data : DataFrame
SAP HANA DataFrame to be transformed in the pipeline.
- fit_params : dict, optional
The parameters corresponding to the transformer's name where each parameter name is prefixed such that parameter p for step s has key s__p.
Defaults to None.
- Returns:
- DataFrame
The transformed SAP HANA DataFrame.
Examples
>>> my_pipeline = Pipeline([
...     ('PCA', PCA(scaling=True, scores=True)),
...     ('imputer', Imputer(strategy='mean'))
... ])
>>> fit_params = {'PCA__key': 'ID', 'PCA__label': 'CLASS'}
>>> my_pipeline.fit_transform(data=train_df, fit_params=fit_params)
- fit(data, key=None, features=None, label=None, fit_params=None, categorical_variable=None, generate_json_pipeline=False, use_pal_pipeline_fit=True, endog=None, exog=None, model_table_name=None)
Fit function for a pipeline.
- Parameters:
- data : DataFrame
SAP HANA DataFrame.
- key : str, optional
Name of the ID column.
If key is not provided, then: if data is indexed by a single column, then key defaults to that index column; otherwise, it is assumed that data contains no ID column.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column.
- fit_params : dict, optional
Parameters corresponding to the transformers/estimator name where each parameter name is prefixed such that parameter p for step s has key s__p.
Defaults to None.
- categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical data. Other INTEGER columns will be treated as continuous.
- generate_json_pipeline : bool, optional
Whether to generate the JSON-formatted pipeline.
Defaults to False.
- use_pal_pipeline_fit : bool, optional
Use PAL's pipeline fit function instead of the original chain execution.
Defaults to True.
- endog : str, optional
Specifies the endogenous variable in time-series data. Please use endog instead of label if data is time-series data.
Defaults to the name of the 1st non-key column in data.
- exog : str or list of str, optional
Specifies the exogenous variables in time-series data. Please use exog instead of features if data is time-series data.
Defaults to the list of names of all non-key, non-endog columns in data if the final estimator is not ExponentialSmoothing based, and to [] otherwise.
- model_table_name : str, optional
Specifies the HANA model table name instead of the generated temporary table.
Defaults to None.
Examples
>>> my_pipeline = Pipeline([
...     ('pca', PCA(scaling=True, scores=True)),
...     ('imputer', Imputer(strategy='mean')),
...     ('hgbt', HybridGradientBoostingClassifier(n_estimators=4, split_threshold=0,
...                                               learning_rate=0.5, fold_num=5, max_depth=6,
...                                               cross_validation_range=cv_range))
... ])
>>> fit_params = {'pca__key': 'ID',
...               'pca__label': 'CLASS',
...               'hgbt__key': 'ID',
...               'hgbt__label': 'CLASS',
...               'hgbt__categorical_variable': 'CLASS'}
>>> hgbt_model = my_pipeline.fit(data=train_data, fit_params=fit_params)
- predict(data, key=None, features=None, model=None)
Predict function for a pipeline.
- Parameters:
- data : DataFrame
SAP HANA DataFrame.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or is indexed by multiple columns.
Defaults to the index of data if data is indexed by a single column.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- model : DataFrame, optional
The model to be used for prediction.
Defaults to the fitted model (model_).
- Returns:
- DataFrame
Predicted result, structured as follows:
1st column: data type and name same as the 1st column of data.
2nd column: SCORE, predicted values (for regression) or class labels (for classification).
- Attributes:
- predict_info_ : DataFrame
Structured as follows:
1st column: STAT_NAME.
2nd column: STAT_VALUE.
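A minimal sketch (assuming my_pipeline has been fitted as in the fit example above and test_data is a SAP HANA DataFrame whose ID column is named 'ID'):
>>> result = my_pipeline.predict(data=test_data, key='ID')
>>> result.collect().head()               # fetch the predictions into a pandas DataFrame
>>> my_pipeline.predict_info_.collect()   # statistics of the prediction run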
- score(data, key=None, features=None, label=None, model=None, random_state=None, top_k_attributions=None, sample_size=None, verbose_output=None)
Score function for a fitted pipeline model.
- Parameters:
- data : DataFrame
SAP HANA DataFrame.
- key : str, optional
Name of the ID column.
If key is not provided, then: if data is indexed by a single column, then key defaults to that index column; otherwise, it is assumed that data contains no ID column.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column.
- model : str, optional
The trained model.
Defaults to self.model_.
- random_state : int, optional
Specifies the seed for random number generator.
0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.
Valid only when model table has background data information.
Defaults to 0.
- top_k_attributions : int, optional
Outputs the attributions of top k features which contribute the most. Valid only when model table has background data information.
Defaults to 10.
- sample_size : int, optional
Specifies the number of sampled combinations of features.
It is better to use a number that is greater than the number of features in data.
If set to 0, it is determined heuristically by the algorithm.
Defaults to 0.
- verbose_output : bool, optional
Specifies whether to output all classes and the corresponding confidences for each data point.
True: Outputs the probability of all label categories.
False: Outputs the category of the highest probability only.
Valid only for classification.
Defaults to False.
- Returns:
- DataFrame 1
Prediction result, structured as follows:
1st column, ID of input data.
2nd column, SCORE, class assignment.
3rd column, REASON CODE, attribution of features.
4th & 5th column, placeholder columns for future implementations.
- DataFrame 2
Statistics, structured as follows:
1st column, STAT_NAME
2nd column, STAT_VALUE
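A minimal sketch (assuming a fitted classification pipeline and a labeled SAP HANA DataFrame test_data; unpacking into two DataFrames follows the Returns description above):
>>> result, stats = my_pipeline.score(data=test_data, key='ID', label='CLASS',
...                                   top_k_attributions=5, verbose_output=False)
>>> result.collect().head()   # prediction result with reason codes
>>> stats.collect()           # statistics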
- fit_predict(data, apply_data=None, fit_params=None, predict_params=None)
Fit all the transforms one after the other and transform the data, then fit_predict the transformed data using the last estimator.
- Parameters:
- data : DataFrame
SAP HANA DataFrame to be transformed in the pipeline.
- apply_data : DataFrame, optional
SAP HANA DataFrame to be predicted in the pipeline.
Defaults to None.
- fit_params : dict, optional
Parameters corresponding to the transformers/estimator name where each parameter name is prefixed such that parameter p for step s has key s__p.
Defaults to None.
- predict_params : dict, optional
Parameters corresponding to the predictor name where each parameter name is prefixed such that parameter p for step s has key s__p.
Defaults to None.
- Returns:
- DataFrame
A SAP HANA DataFrame.
Examples
>>> my_pipeline = Pipeline([
...     ('pca', PCA(scaling=True, scores=True)),
...     ('imputer', Imputer(strategy='mean')),
...     ('hgbt', HybridGradientBoostingClassifier(n_estimators=4, split_threshold=0,
...                                               learning_rate=0.5, fold_num=5, max_depth=6,
...                                               cross_validation_range=cv_range))
... ])
>>> fit_params = {'pca__key': 'ID',
...               'pca__label': 'CLASS',
...               'hgbt__key': 'ID',
...               'hgbt__label': 'CLASS',
...               'hgbt__categorical_variable': 'CLASS'}
>>> hgbt_model = my_pipeline.fit_predict(data=train_data, apply_data=test_data, fit_params=fit_params)
- plot(name='my_pipeline', iframe_height=450)
Plot a pipeline.
- Parameters:
- name : str, optional
Pipeline name.
Defaults to "my_pipeline".
- iframe_height : int, optional
Height of the iframe.
Defaults to 450.
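A minimal sketch (assuming a Jupyter-style environment in which the rendered pipeline graph can be displayed inline):
>>> my_pipeline.plot(name='my_pipeline', iframe_height=400)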
- generate_json_pipeline()
Generate the json formatted pipeline for auto-ml's pipeline_fit function.
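A minimal sketch (it is an assumption here that the method returns the generated JSON string):
>>> json_pipeline = my_pipeline.generate_json_pipeline()
>>> print(json_pipeline)   # assumption: the JSON-formatted pipeline is returned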
- create_amdp_class(amdp_name, training_dataset, apply_dataset)
Create the AMDP class file. Afterwards, build_amdp_class can be called to generate the AMDP class.
- Parameters:
- amdp_name : str
Name of the AMDP.
- training_dataset : str
Name of the training dataset.
- apply_dataset : str
Name of the apply dataset.
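An illustrative sketch of the AMDP generation workflow using the methods documented on this page (the AMDP name and dataset names are placeholders, and the exact call sequence is an assumption):
>>> my_pipeline.create_amdp_class(amdp_name='MY_PIPELINE_AMDP',
...                               training_dataset='TRAIN_TABLE',
...                               apply_dataset='APPLY_TABLE')
>>> my_pipeline.build_amdp_class()             # generate the AMDP class from the template
>>> my_pipeline.write_amdp_file(outdir='out')  # write the generated file to the 'out' directory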
- evaluate(data, key=None, features=None, label=None, categorical_variable=None, resampling_method=None, fold_num=None, random_state=None)
Evaluation function for a pipeline.
- Parameters:
- data : DataFrame
SAP HANA DataFrame.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column.
- categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical data. Other INTEGER columns will be treated as continuous.
Defaults to None.
- resampling_method : str, optional
The resampling method for pipeline model evaluation. The available options depend on the type of the pipeline's final estimator:
classifier: {'cv', 'stratified_cv'}
regressor: {'cv'}
timeseries: {'rocv', 'block', 'simple_split'}
Defaults to 'stratified_cv' if the estimator in pipeline is a classifier, to 'cv' (the only option) if the estimator in pipeline is a regressor, and to 'rocv' if the estimator in pipeline is a time-series model.
- fold_num : int, optional
The fold number for cross validation.
Defaults to 5.
- random_state : int, optional
Specifies the seed for random number generator.
0: Uses the current time (in seconds) as the seed.
Others: Uses the specified value as the seed.
Defaults to 0.
- Returns:
- DataFrame
1st column, NAME, Score name
2nd column, VALUE, Score value
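A minimal sketch for a classification pipeline (assuming train_data is a labeled SAP HANA DataFrame with an 'ID' key column):
>>> scores = my_pipeline.evaluate(data=train_data, key='ID', label='CLASS',
...                               resampling_method='stratified_cv', fold_num=5)
>>> scores.collect()   # score names and values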
- abap_class_mapping(value)
Maps the ABAP class.
- add_amdp_item(template_key, value)
Add item.
- add_amdp_name(amdp_name)
Add AMDP name.
- add_amdp_template(template_name)
Add AMDP template.
- build_amdp_class()
After items have been added via add_amdp_item, generate the AMDP file from the template.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- get_amdp_notfillin_key()
Get the AMDP template keys that have not yet been filled in.
- load_abap_class_mapping()
Load ABAP class mapping.
- load_amdp_template(template_name)
Load AMDP template.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
- write_amdp_file(filepath=None, version=1, outdir='out')
Write template to file.
Inherited Methods from PALBase
Besides the methods mentioned above, the Pipeline class also inherits methods from the PALBase class; please refer to PAL Base for more details.