hana_ml.algorithms.apl package

APL Package consists of the following sections:

hana_ml.algorithms.apl.gradient_boosting_classification

This module provides the SAP HANA APL gradient boosting classification algorithm.

The following classes are available:

class hana_ml.algorithms.apl.gradient_boosting_classification.GradientBoostingClassifier(conn_context=None, early_stopping_patience=None, eval_metric=None, learning_rate=None, max_depth=None, max_iterations=None, number_of_jobs=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: _GradientBoostingClassifierBase

SAP HANA APL Gradient Boosting Multiclass Classifier algorithm.

Parameters:
conn_context: ConnectionContext, optional

The connection object to an SAP HANA database. This parameter is no longer needed: it is set automatically when a dataset is used in fit() or predict().

early_stopping_patience: int, optional

If the performance does not improve after early_stopping_patience iterations, the model training will stop before reaching max_iterations. Please refer to the APL documentation for the default value.

eval_metric: str, optional

The name of the metric used to evaluate the model performance on the validation dataset along the boosting iterations. The possible values are 'MultiClassClassificationError' and 'MultiClassLogLoss'. Please refer to the APL documentation for the default value.

learning_rate: float, optional

The weight parameter controlling the model regularization to avoid overfitting risk. A small value improves the model's generalization to unseen data at the expense of computational cost. Please refer to the APL documentation for the default value.

max_depth: int, optional

The maximum depth of the decision trees added as a base learner to the model at each boosting iteration. Please refer to the APL documentation for the default value.

max_iterations: int, optional

The maximum number of boosting iterations to fit the model. The default value is 1000.

number_of_jobs: int, optional

Deprecated.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value type (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be treated as missing. For example, {'VAR1': '???'} means that whenever the variable value equals '???', it is treated as missing.

extra_applyout_settings: dict, optional

Determines the output of the predict() method. The possible values are:

  • By default (None value): the default output, which contains:

    • <KEY>: the key column if provided in the dataset

    • TRUE_LABEL: the class label if provided in the dataset

    • PREDICTED: the predicted label

    • PROBABILITY: the probability of the prediction (confidence)

  • {'APL/ApplyExtraMode': 'AllProbabilities'}: the probabilities for each class.

    • <KEY>: the key column if provided in the dataset

    • TRUE_LABEL: the class label if given in the dataset

    • PREDICTED: the predicted label

    • PROBA_<label_value1>: the probability for the class <label_value1>

    • ...

    • PROBA_<label_valueN>: the probability for the class <label_valueN>

  • {'APL/ApplyExtraMode': 'Individual Contributions'}: the feature importance for every sample

    • <KEY>: the key column if provided in the dataset

    • TRUE_LABEL: the class label if provided in the dataset

    • PREDICTED: the predicted label

    • gb_contrib_<VAR1>: the contribution of the variable VAR1 to the score

    • ...

    • gb_contrib_<VARN>: the contribution of the variable VARN to the score

    • gb_contrib_constant_bias: the constant bias contribution to the score

other_params: dict, optional

Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • 'max_tasks'

  • 'segment_column_name'

  • 'cutting_strategy'

  • 'interactions'

  • 'interactions_max_kept'

  • 'variable_auto_selection'

  • 'variable_selection_max_nb_of_final_variables'

  • 'variable_selection_max_iterations'

  • 'variable_selection_percentage_of_contribution_kept_by_step'

  • 'variable_selection_quality_bar'

See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.

For 'max_tasks', see FUNC_HEADER.

other_train_apl_aliases: dict, optional

Users can provide APL aliases as advanced settings for model training. The list of possible aliases depends on the APL version.

See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.
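
As a quick illustration (the values below are arbitrary; see the APL documentation for defaults), these hyperparameters can be set at construction time or later through set_params():

>>> model = GradientBoostingClassifier(eval_metric='MultiClassLogLoss',
...                                    max_iterations=500,
...                                    early_stopping_patience=20)
>>> model.set_params(learning_rate=0.05, max_depth=6)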

Notes

It is highly recommended to specify a key column in the training dataset. Otherwise, once the model is trained, it will no longer be possible to define a key in any input dataset. The key is particularly useful for joining the prediction output to the input dataset.

By default, if not provided, SAP HANA APL guesses the variable description by reading the first 100 rows. However, the results may be incorrect. The user can overwrite the guessed description by explicitly setting the variable_storages, variable_value_types and variable_missing_strings parameters. For example:

model.set_params(variable_storages={
    'ID': 'integer',
    'sepal length (cm)': 'number'
})
model.set_params(variable_value_types={
    'sepal length (cm)': 'continuous'
})
model.set_params(variable_missing_strings={
    'sepal length (cm)': '-1'
})

Examples

>>> from hana_ml.algorithms.apl.gradient_boosting_classification \
...     import GradientBoostingClassifier
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN,
...                     'SELECT "id", "class", "capital-gain", '
...                     '"native-country" from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = GradientBoostingClassifier()
>>> model.fit(hana_df, label='native-country', key='id')

Getting variable interactions

>>> model.set_params(other_train_apl_aliases={
...     'APL/Interactions': 'true',
...     'APL/InteractionsMaxKept': '3'
... })
>>> model.fit(data=hana_df, key='id', label='native-country')
>>> # Checks interaction info in INDICATORS table
>>> output = model.get_indicators().filter("KEY LIKE 'Interaction%'").collect()

Debriefing

>>> # Global performance metrics of the model
>>> model.get_performance_metrics()
{'BalancedErrorRate': 0.9761904761904762, 'BalancedClassificationRate': 0.023809523809523808,
...
>>> # Performance metrics of the model for each class
>>> model.get_metrics_per_class()
{'Precision': {'Cambodia': 0.0, 'Canada': 0.0, 'China': 0.0, 'Columbia': 0.0...
>>> model.get_feature_importances()
{'Gain': OrderedDict([('class', 0.7713800668716431), ('capital-gain', 0.22861991822719574)])}

Generating the model report

>>> from hana_ml.visualizers.unified_report import UnifiedReport
>>> UnifiedReport(model).build().display()

Making predictions

>>> # Default output
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
    id     TRUE_LABEL      PREDICTED  PROBABILITY
0   30  United-States  United-States     0.89051
1   63  United-States  United-States     0.89051
2   66  United-States  United-States     0.89051
>>> # All probabilities
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'AllProbabilities'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
          id     TRUE_LABEL      PREDICTED      PROBA_?     PROBA_Cambodia  ...
35194  19272  United-States  United-States    0.016803            0.000595  ...
20186  39624  United-States  United-States    0.017564            0.001063  ...
43892  38759  United-States  United-States    0.019812            0.000353  ...
>>> # Individual Contributions
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
   id     TRUE_LABEL      PREDICTED  gb_contrib_class  gb_contrib_capital-gain  ...
0  30  United-States  United-States         -0.025366                -0.014416  ...
1  63  United-States  United-States         -0.025366                -0.014416  ...
2  66  United-States  United-States         -0.025366                -0.014416  ...

Saving the model in the schema named 'MODEL_STORAGE'

>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model, if_exists='replace')

Reloading the model for new predictions

>>> model2 = model_storage.load_model(name='My model name')
>>> out2 = model2.predict(data=hana_df)

Please see the model_storage class for further features of model storage.

Exporting the model in JSON format

>>> json_export = model.export_apply_code('JSON')

APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.

Attributes:
label: str

The target column name. This attribute is set when the fit() method is called.

model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the "SUMMARY" table generated by the model training. This table contains the content of the model training summary.

indicators_: APLArtifactTable

The reference to the "INDICATORS" table generated by the model training. This table contains various metrics related to the model and model variables.

fit_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table generated by the model training

var_desc_: APLArtifactTable

The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table when a prediction was made

Methods

build_report([max_local_explanations])

Build model report.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

enable_hana_execution()

HANA execution will be enabled.

export_apply_code(code_type[, key, label, ...])

Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL.

fit(data[, key, features, label, weight, ...])

Fits the model.

generate_html_report(filename)

Save model report as a html file.

generate_notebook_iframe_report()

Render model report as a notebook iframe.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

get_artifacts_recorder()

Return the object recorder (for Design-time artifacts generation)

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

get_debrief_report(report_name)

Retrieves a standard statistical report.

get_evalmetrics()

Returns the values of the evaluation metric at each iteration.

get_feature_importances()

Returns the feature importances.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_metrics_per_class()

Returns the performance for each class.

get_model_info()

Get information about an existing model.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns the performance metrics of the last trained model.

get_predict_operation_log()

Retrieves the operation log table after a prediction.

get_summary()

Retrieves the summary table after model training.

is_fitted()

Checks if the model can be saved.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data[, prediction_type])

Makes predictions with the fitted model.

save_artifact(artifact_df, schema_name, ...)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, ...])

Saves the model into a table.

schedule_fit(output_table_name_model, ...)

Creates a HANA scheduler job for the model fitting.

schedule_predict(output_table_name_applyout, ...)

Creates a HANA scheduler job for the model prediction.

score(data)

Returns the accuracy score on the provided test dataset.

set_framework_version(framework_version)

Switch v1/v2 version of report.

set_metric_samplings([roc_sampling, ...])

Set metric samplings to report builder.

set_params(**parameters)

Sets attributes of the current model.

set_scale_out([route_to, no_route_to, ...])

Specifies hints for scaling-out environment.

set_shapley_explainer_of_predict_phase(...)

Use the reason code generated during the prediction phase to build a ShapleyExplainer instance.

set_shapley_explainer_of_score_phase(...[, ...])

Use the reason code generated during the scoring phase to build a ShapleyExplainer instance.

set_params(**parameters)

Sets attributes of the current model.

Parameters:
parameters: dict

The names and values of the attributes to change

fit(data, key=None, features=None, label=None, weight=None, build_report=False)

Fits the model.

Parameters:
data: DataFrame

The training dataset.

key: str, optional

The name of the ID column. This column will not be used as a feature in the model; it will be output as the row ID when predictions are made with the model. If key is not provided, an internal key is created, but this is not the recommended usage. See notes below.

features: list of str, optional

The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.

label: str, optional

The name of the label column. Default is the last column.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each observation.

build_report: bool, optional

Whether to build a report. Defaults to False.

Returns:
self: object

Notes

It is highly recommended to specify a key column in the training dataset. Otherwise, once the model is trained, it will no longer be possible to define a key in any input dataset, which makes it particularly inconvenient to join the prediction output to the input dataset.
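
Examples

A minimal sketch, reusing hana_df and the column names from the class-level examples above:

>>> model.fit(data=hana_df, key='id', label='native-country')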

score(data)

Returns the accuracy score on the provided test dataset.

Parameters:
data: hana_ml DataFrame

The test dataset used to compute the score. The labels must be provided in the dataset.

Returns:
Float or pandas DataFrame

If no segment column is given, the accuracy score.

If a segment column is given, a pandas DataFrame which contains the accuracy score for each segment.
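
Examples

For example, computing the accuracy on a labeled dataset (hana_df as defined in the class-level examples; without a segment column, a float is returned):

>>> accuracy = model.score(data=hana_df)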

get_metrics_per_class()

Returns the performance for each class.

Returns:
Dictionary or pandas DataFrame

If no segment column is given, a dictionary.

If a segment column is given, a pandas DataFrame.

Examples

>>> data = DataFrame(conn, 'SELECT * from IRIS_MULTICLASSES')
>>> model = GradientBoostingClassifier(conn)
>>> model.fit(data=data, key='ID', label='LABEL')
>>> model.get_metrics_per_class()
{
'Precision': {
    'setosa': 1.0,
    'versicolor': 1.0,
    'virginica': 0.9743589743589743
},
'Recall': {
    'setosa': 1.0,
    'versicolor': 0.9714285714285714,
    'virginica': 1.0
},
'F1Score': {
    'setosa': 1.0,
    'versicolor': 0.9855072463768115,
    'virginica': 0.9870129870129869
}
}

build_report(max_local_explanations=100)

Build model report.

Parameters:
max_local_explanations: int, optional

The maximum number of local explanations displayed in the report.
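
Examples

A typical sequence, sketched here, is to build the report and then render it in a notebook:

>>> model.build_report()
>>> model.generate_notebook_iframe_report()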

set_metric_samplings(roc_sampling=None, other_samplings: dict = None)

Set metric samplings to report builder.

Parameters:
roc_sampling: Sampling, optional

ROC sampling.

other_samplings: dict, optional

The keys are column names of the metric table:

  • CUMGAINS

  • RANDOM_CUMGAINS

  • PERF_CUMGAINS

  • LIFT

  • RANDOM_LIFT

  • PERF_LIFT

  • CUMLIFT

  • RANDOM_CUMLIFT

  • PERF_CUMLIFT

The values are the corresponding Sampling objects.

Examples

Creating the metric samplings:

>>> roc_sampling = Sampling(method='every_nth', interval=2)
>>> other_samplings = dict(CUMGAINS=Sampling(method='every_nth', interval=2),
...                        LIFT=Sampling(method='every_nth', interval=2),
...                        CUMLIFT=Sampling(method='every_nth', interval=2))
>>> model.set_metric_samplings(roc_sampling, other_samplings)

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

enable_hana_execution()

HANA execution will be enabled.

export_apply_code(code_type, key=None, label=None, schema_name=None, table_name=None, other_params=None)

Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL. See Function Reference > Predictive Model Services > EXPORT_APPLY_CODE in the SAP HANA APL Developer Guide.

Parameters:
code_type: str

The type of code exported. The supported code types for each type of model are given in the developer guide (see APL/CodeType).

key: str, optional

The name of the primary key column. Required for some code types.

label: str, optional

The name of the label (target) column. Used only when the model supports multiple targets. When left empty (""), all targets are generated.

schema_name: str, optional

The schema name of the apply-in table. Required for some code types.

table_name: str, optional

The apply-in table name. Required for some code types.

other_params: dict, optional

The additional parameters to be included in the configuration. The available parameters are given in the developer guide.

Returns:
The exported code: str

Examples

Exporting a model in JSON format (available for Gradient Boosting and Robust Regression)

>>> json_export = model.export_apply_code('JSON')

APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.

Exporting SQL apply code (available for Robust Regression and Clustering)

>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS')

Exporting SQL apply code (probability generated in the output)

>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS',
...     other_params={'APL/ApplyExtraMode': 'Advanced Apply Settings',
...                   'APL/ApplyProba': 'true'})

generate_html_report(filename)

Saves the model report as an HTML file.

Parameters:
filename: str

HTML file name.
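
Examples

For example (the file name is illustrative):

>>> model.generate_html_report('census_gb_model_report')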

generate_notebook_iframe_report()

Render model report as a notebook iframe.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

Returns:
A pandas DataFrame with detailed information about the current version.

Notes

An error is raised when the call fails. The cause can be that SAP HANA APL is not installed or that the current user does not have the appropriate rights.

get_artifacts_recorder()

Return the object recorder (for Design-time artifacts generation)

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

Returns:
int or pandas DataFrame

If no segment column is given, the best iteration.

If a segment column is given, a pandas DataFrame which contains the best iteration for each segment.
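
Examples

For example, assuming no segment column was used during training:

>>> best_iteration = model.get_best_iteration()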

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Developer Guide.

Parameters:
report_name: str

The name of the report to retrieve.

Returns:
Statistical report: hana_ml DataFrame
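
Examples

For instance (the report name below is illustrative only; the available names are listed under Statistical Reports in the SAP HANA APL Developer Guide):

>>> report = model.get_debrief_report('ClassificationRegression_VariablesContribution')
>>> report.collect()
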
get_evalmetrics()

Returns the values of the evaluation metric at each iteration. These values are based on the validation dataset.

Returns:
Dictionary or pandas DataFrame

If no segment column is given, a dictionary: {'<MetricName>': <List of values>}.

If a segment column is given, a pandas DataFrame which contains the evaluation metrics for each segment.
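
Examples

A sketch of inspecting the metric curve after training (the metric name and values below are illustrative):

>>> evalmetrics = model.get_evalmetrics()
>>> # e.g. {'MultiClassLogLoss': [2.31, 1.87, 1.52, ...]}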

get_feature_importances()

Returns the feature importances.

Returns:
Dictionary or pandas DataFrame

If no segment column is given, a dictionary: { <importance_metric> : OrderedDict({ <feature_name> : <value> }) }.

If a segment column is given, a pandas DataFrame which contains the feature importances for each segment.

get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns:
The reference to the OPERATION_LOG table: hana_ml DataFrame

This table provides detailed logs of the last model training

get_indicators()

Retrieves the Indicator table after model training.

Returns:
The reference to the INDICATORS table: hana_ml DataFrame

This table provides the performance metrics of the last model training

get_model_info()

Get information about an existing model. This method is especially useful when a trained model was saved and reloaded. After calling this method, the model can provide the summary and metrics again, as they were after the last fit.

Returns:
list

List of HANA DataFrames respectively corresponding to the following tables:

  • Summary table

  • Variable roles table

  • Variable description table

  • Indicators table

  • Profit curves table
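
Examples

A usage sketch after reloading a saved model (the unpacked variable names are arbitrary):

>>> model2 = model_storage.load_model(name='My model name')
>>> summary, var_roles, var_desc, indicators, profit_curves = model2.get_model_info()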

get_params()

Retrieves the attributes of the current object. This method is implemented for compatibility with scikit-learn.

Returns:
The attribute values of the model: dictionary

get_performance_metrics()

Returns the performance metrics of the last trained model.

Returns:
Dictionary or pandas DataFrame

If no segment column is given, a dictionary with metric name as key and metric value as value.

If a segment column is given, a pandas DataFrame which contains the performance metrics for each segment.

Examples

>>> data = DataFrame(conn, 'SELECT * from APL_SAMPLES.CENSUS')
>>> model = GradientBoostingBinaryClassifier(conn)
>>> model.fit(data=data, key='id', label='class')
>>> model.get_performance_metrics()
{'AUC': 0.9385, 'PredictivePower': 0.8529, 'PredictionConfidence': 0.9759,...}

get_predict_operation_log()

Retrieves the operation log table after a prediction.

Returns:
The reference to the OPERATION_LOG table: hana_ml DataFrame

This table provides detailed logs about the last prediction

get_summary()

Retrieves the summary table after model training.

Returns:
The reference to the SUMMARY table: hana_ml DataFrame

This contains the execution summary of the last model training

is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.

Returns:
bool

True if the model is ready to be saved.

load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Parameters:
schema_name: str

The schema name

table_name: str

The table name

oid: str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.

predict(data, prediction_type=None)

Makes predictions with the fitted model. It is possible to add special outputs, such as variable individual contributions, through the 'prediction_type' parameter.

Parameters:
data: hana_ml DataFrame

The input dataset used for prediction

prediction_type: string, optional

Possible values:

  • 'BestProbabilityAndDecision': return the probability value associated with the classification decision (default)

  • 'Decision': return the classification decision

  • 'Probability': return the probability that the row is a positive target (binary classification) or the probabilities of all classes (multiclass classification)

  • 'Score': return raw prediction scores

  • 'Individual Contributions': return SHAP values

  • 'Explanations': return strength indicators based on SHAP values

Returns:
Prediction output: hana_ml DataFrame
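
Examples

For example, requesting the per-variable contributions described above (hana_df as defined in the class-level examples):

>>> applyout_df = model.predict(data=hana_df, prediction_type='Individual Contributions')
>>> applyout_df.collect().head(3)
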
save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters:
schema_name: str

The schema name

artifact_df: hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one would like to save data into the same table.

Examples

>>> myModel.save_artifact(artifact_df=myModel.indicators_,
...                       schema_name='MySchema',
...                       table_name='MyModel_Indicators',
...                       if_exists='replace')

save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Parameters:
schema_name: str

The schema name

table_name: str

Table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns:
None

The model is saved into a table with the following columns:

  • "OID" NVARCHAR(50), -- Serve as ID

  • "FORMAT" NVARCHAR(50), -- APL technical info

  • "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model

schedule_fit(output_table_name_model, output_table_name_log, output_table_name_summary, output_table_name_indicators, **schedule_kwargs)

Creates a HANA scheduler job for the model fitting. It is a wrapper around the HANAScheduler.create_training_schedule() method that lets users specify arguments, such as the output table names, more conveniently.

Parameters:
output_table_name_model: str

The output table name for the model binary.

output_table_name_log: str

The output table name for the log data.

output_table_name_summary: str

The output table name for the model summary.

output_table_name_indicators: str

The output table name for the model indicators.

schedule_kwargs: kwargs dictionary

Arguments forwarded to the HANAScheduler.create_training_schedule method; please refer to its documentation.

Examples

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.schedule_fit(
...     output_table_name_model='OUTPUT_MODEL_BINARY',
...     output_table_name_log='OUTPUT_FIT_LOG',
...     output_table_name_summary='OUTPUT_SUMMARY_LOG',
...     output_table_name_indicators='OUTPUT_FIT_INDICATORS',
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)

schedule_predict(output_table_name_applyout, output_table_name_log, output_table_name_summary, **schedule_kwargs)

Creates a HANA scheduler job for the model prediction. It is a wrapper around the HANAScheduler.create_applying_schedule() method that lets users specify arguments, such as the output table names, more conveniently.

Parameters:
input_table_name_model: str

The input table name for the model binary.

output_table_name_applyout: str

The output table name for the prediction data.

output_table_name_log: str

The output table name for the log data.

output_table_name_summary: str

The output table name for the model summary.

schedule_kwargs: kwargs dictionary

Arguments forwarded to the HANAScheduler.create_applying_schedule method; please refer to its documentation.

Examples

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.predict(data)
>>> model.schedule_predict(
...     output_table_name_applyout="OUTPUT_PREDICT_APPLYOUT",
...     output_table_name_log="OUTPUT_PREDICT_LOG",
...     output_table_name_summary="OUTPUT_PREDICT_SUMMARY",
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)

set_framework_version(framework_version)

Switch v1/v2 version of report.

Parameters:
framework_version: {'v2', 'v1'}, optional

'v2': uses the report builder framework. 'v1': uses a pure HTML template.

Defaults to 'v2'.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

Specifies hints for a scaling-out environment. The execution of APL functions can then be routed to a specific computation node. If all the parameters are None (i.e. not given), all existing hints are cleared.

Parameters:
route_to: str, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_to: str or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_by: str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinality: str or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_cost: int, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level: {'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_class: str, optional

Routes the query via workload class. route_to statement hint has higher precedence than workload_class statement hint.

Defaults to None.

Examples

>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> model = GradientBoostingBinaryClassifier()
>>> # Routes the execution to a specific volume ID.
>>> model.set_scale_out(route_to=1025)
>>> # Routes the execution to a specific service type.
>>> # model.set_scale_out(route_to='computeserver')
>>> # Maps the execution to a specific workload class.
>>> # model.set_scale_out(workload_class="WC4")
>>> # Activates the sql trace
>>> # connection_context.sql_tracer.enable_sql_trace(True)
>>> model.fit(data=hdb_df, key='KEY', label='Y')
>>> # You can check whether the queries were effectively routed by querying:
>>> # select HOST, VOLUME_ID, APPLICATION_NAME, STATEMENT_STRING from M_SQL_PLAN_CACHE.

set_shapley_explainer_of_predict_phase(shapley_explainer, display_force_plot=True)

Use the reason code generated during the prediction phase to build a ShapleyExplainer instance.

When this instance is passed in, its execution results will be included in the v2 version of the report.

Parameters:
shapley_explainer: ShapleyExplainer

A ShapleyExplainer instance.

display_force_plot: bool, optional

Whether to display the force plot.

Defaults to True.

set_shapley_explainer_of_score_phase(shapley_explainer, display_force_plot=True)

Use the reason code generated during the scoring phase to build a ShapleyExplainer instance.

When this instance is passed in, its execution results will be included in the v2 version of the report.

Parameters:
shapley_explainer: ShapleyExplainer

A ShapleyExplainer instance.

display_force_plot: bool, optional

Whether to display the force plot.

Defaults to True.

class hana_ml.algorithms.apl.gradient_boosting_classification.GradientBoostingBinaryClassifier(conn_context=None, early_stopping_patience=None, eval_metric=None, learning_rate=None, max_depth=None, max_iterations=None, number_of_jobs=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: _GradientBoostingClassifierBase

SAP HANA APL Gradient Boosting Binary Classifier algorithm. It is very similar to GradientBoostingClassifier, the multiclass classifier; it differs mainly in the metrics it provides, which are specific to binary classification.

Parameters:
conn_context: ConnectionContext, optional

The connection object to an SAP HANA database. This parameter is no longer needed: it is set automatically when a dataset is used in fit() or predict().

early_stopping_patience: int, optional

If the performance does not improve after early_stopping_patience iterations, the model training will stop before reaching max_iterations. Please refer to the APL documentation for the default value.

eval_metric: str, optional

The name of the metric used to evaluate the model performance on the validation dataset along the boosting iterations. The possible values are 'LogLoss', 'AUC' and 'ClassificationError'. Please refer to the APL documentation for the default value.

learning_rate: float, optional

The weight parameter controlling the model regularization to avoid overfitting risk. A small value improves the model's generalization to unseen data at the expense of computational cost. Please refer to the APL documentation for the default value.

max_depth: int, optional

The maximum depth of the decision trees added as a base learner to the model at each boosting iteration. The default value is 4.

max_iterations: int, optional

The maximum number of boosting iterations to fit the model. Please refer to the APL documentation for the default value.

number_of_jobs: int, optional

Deprecated.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value type (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be treated as missing. For example, {'VAR1': '???'} means that whenever the variable value equals '???', it is treated as missing.

extra_applyout_settings: dict, optional

Determines the output of the predict() method. The possible values are:

  • By default (None value): the default output, which contains:

    • <KEY>: the key column if provided in the dataset

    • TRUE_LABEL: the class label if provided in the dataset

    • PREDICTED: the predicted label

    • PROBABILITY: the probability of the prediction (confidence)

  • {'APL/ApplyExtraMode': 'Individual Contributions'}: the individual contributions of each variable to the score. The output is:

    • <KEY>: the key column if provided in the dataset

    • TRUE_LABEL: the class label if provided in the dataset

    • gb_contrib_<VAR1>: the contribution of the variable VAR1 to the score

    • ...

    • gb_contrib_<VARN>: the contribution of the variable VARN to the score

    • gb_contrib_constant_bias: the constant bias contribution to the score

other_params: dict, optional

Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • 'max_tasks'

  • 'segment_column_name'

  • 'correlations_lower_bound'

  • 'correlations_max_kept'

  • 'cutting_strategy'

  • 'target_key'

  • 'interactions'

  • 'interactions_max_kept'

  • 'variable_auto_selection'

  • 'variable_selection_max_nb_of_final_variables'

  • 'variable_selection_max_iterations'

  • 'variable_selection_percentage_of_contribution_kept_by_step'

  • 'variable_selection_quality_bar'

See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.

For 'max_tasks', see FUNC_HEADER.

other_train_apl_aliases: dict, optional

Contains the APL alias for model training. The list of possible aliases depends on the APL version.

See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.
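
As a quick illustration (the values below are arbitrary; see the APL documentation for defaults):

>>> model = GradientBoostingBinaryClassifier(eval_metric='AUC',
...                                          early_stopping_patience=10,
...                                          learning_rate=0.05)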

Examples

>>> from hana_ml.algorithms.apl.gradient_boosting_classification \
...     import GradientBoostingBinaryClassifier
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext(HDB_HOST, HDB_PORT, HDB_USER, HDB_PASS)
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN, 'SELECT * from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(hana_df, label='class', key='id')

Getting variable interactions

>>> model.set_params(other_train_apl_aliases={
...     'APL/Interactions': 'true',
...     'APL/InteractionsMaxKept': '3'
... })
>>> model.fit(data=hana_df, key='id', label='class')
>>> # Checks interaction info in INDICATORS table
>>> output = model.get_indicators().filter("KEY LIKE 'Interaction%'").collect()

Debriefing

>>> # Global performance metrics of the model
>>> model.get_performance_metrics()
{'LogLoss': 0.2567069689038737, 'PredictivePower': 0.8529, 'PredictionConfidence': 0.9759, ...}
>>> model.get_feature_importances()
{'Gain': OrderedDict([('relationship', 0.3866586685180664),
                      ('education-num', 0.1502334326505661)...

Generating the model report

>>> from hana_ml.visualizers.unified_report import UnifiedReport
>>> UnifiedReport(model).build().display()

Making predictions

>>> # Default output
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().sample(3) # returns the output as a pandas DataFrame
          id  TRUE_LABEL  PREDICTED  PROBABILITY
44903  41211           0          0    0.871326
47878  36020           1          1    0.993455
17549   6601           0          1    0.673872
>>> # Individual Contributions
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().sample(3) # returns the output as a pandas DataFrame
      id  TRUE_LABEL  gb_contrib_age  gb_contrib_workclass  gb_contrib_fnlwgt  ...
0  18448           0       -1.098452             -0.001238           0.060850  ...
1  18457           0       -0.731512             -0.000448           0.020060  ...
2  18540           0       -0.024523              0.027065           0.158083  ...

Saving the model in the schema named 'MODEL_STORAGE'. Please see model_storage class for further features of model storage.

>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model, if_exists='replace')

Reloading the model for new predictions

>>> model2 = model_storage.load_model(name='My model name')
>>> out2 = model2.predict(data=hana_df)

Exporting the model in JSON format

>>> json_export = model.export_apply_code('JSON')

APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.

Attributes:
label: str

The target column name. This attribute is set when the fit() method is called.

model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the "SUMMARY" table generated by the model training. This table contains the content of the model training summary.

indicators_: APLArtifactTable

The reference to the "INDICATORS" table generated by the model training. This table contains various metrics related to the model and model variables.

fit_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table generated by the model training

var_desc_: APLArtifactTable

The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table when a prediction was made

Methods

build_report([max_local_explanations])

Build model report.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

enable_hana_execution()

HANA execution will be enabled.

export_apply_code(code_type[, key, label, ...])

Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL.

fit(data[, key, features, label, weight, ...])

Fits the model.

generate_html_report(filename)

Save model report as a html file.

generate_notebook_iframe_report()

Render model report as a notebook iframe.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

get_artifacts_recorder()

Return the object recorder (for Design-time artifacts generation)

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

get_debrief_report(report_name)

Retrieves a standard statistical report.

get_evalmetrics()

Returns the values of the evaluation metric at each iteration.

get_feature_importances()

Returns the feature importances.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_model_info()

Get information about an existing model.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns the performance metrics of the last trained model.

get_predict_operation_log()

Retrieves the operation log table after a prediction.

get_summary()

Retrieves the summary table after model training.

is_fitted()

Checks if the model can be saved.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data[, prediction_type])

Makes predictions with the fitted model.

save_artifact(artifact_df, schema_name, ...)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, ...])

Saves the model into a table.

schedule_fit(output_table_name_model, ...)

Creates a HANA scheduler job for the model fitting.

schedule_predict(output_table_name_applyout, ...)

Creates a HANA scheduler job for the model prediction.

score(data)

Returns the accuracy score on the provided test dataset.

set_framework_version(framework_version)

Switch v1/v2 version of report.

set_metric_samplings([roc_sampling, ...])

Set metric samplings to report builder.

set_params(**parameters)

Sets attributes of the current model.

set_scale_out([route_to, no_route_to, ...])

Specifies hints for scaling-out environment.

set_shapley_explainer_of_predict_phase(...)

Use the reason code generated during the prediction phase to build a ShapleyExplainer instance.

set_shapley_explainer_of_score_phase(...[, ...])

Use the reason code generated during the scoring phase to build a ShapleyExplainer instance.

set_params(**parameters)

Sets attributes of the current model.

Parameters:
parameters: dict

The attribute names and values

score(data)

Returns the accuracy score on the provided test dataset.

Parameters:
data: hana_ml DataFrame

The test dataset used to compute the score. The labels must be provided in the dataset.

Returns:
Float or pandas DataFrame

If no segment column is given, the accuracy score.

If a segment column is given, a pandas DataFrame which contains the accuracy score for each segment.

build_report(max_local_explanations=100)

Build model report.

Parameters:
max_local_explanations: int, optional

The maximum number of local explanations displayed in the report.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

enable_hana_execution()

HANA execution will be enabled.

export_apply_code(code_type, key=None, label=None, schema_name=None, table_name=None, other_params=None)

Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL. See Function Reference > Predictive Model Services > EXPORT_APPLY_CODE in the SAP HANA APL Developer Guide.

Parameters:
code_type: str

The type of code exported. The supported code types for each type of model are given in the developer guide (see APL/CodeType).

key: str, optional

The name of the primary key column. Required for some code types.

label: str, optional

The name of the label (target) column. Used only when the model supports multiple targets. When left empty (""), all targets are generated.

schema_name: str, optional

The schema name of the apply-in table. Required for some code types.

table_name: str, optional

The apply-in table name. Required for some code types.

other_params: dict, optional

The additional parameters to be included in the configuration. The available parameters are given in the developer guide.

Returns:
The exported code: str

Examples

Exporting a model in JSON format (available for Gradient Boosting and Robust Regression)

>>> json_export = model.export_apply_code('JSON')

APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.

Exporting SQL apply code (available for Robust Regression and Clustering)

>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS')

Exporting SQL apply code (probability generated in the output)

>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS',
...     other_params={'APL/ApplyExtraMode': 'Advanced Apply Settings',
...                   'APL/ApplyProba': 'true'})

fit(data, key=None, features=None, label=None, weight=None, build_report=False)

Fits the model.

Parameters:
data: DataFrame

The training dataset.

key: str, optional

The name of the ID column. This column will not be used as a feature in the model; it will be output as the row ID when predictions are made with the model. If key is not provided, an internal key is created, but this is not the recommended usage. See notes below.

features: list of str, optional

The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.

label: str, optional

The name of the label column. Default is the last column.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each observation.

build_report: bool, optional

Whether to build a report. Defaults to False.

Returns:
self: object

Notes

It is highly recommended to specify a key column in the training dataset. Otherwise, once the model is trained, it will no longer be possible to define a key in any input dataset, which makes it particularly inconvenient to join the prediction output to the input dataset.

generate_html_report(filename)

Saves the model report as an HTML file.

Parameters:
filename: str

HTML file name.

generate_notebook_iframe_report()

Render model report as a notebook iframe.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

Returns:
A pandas DataFrame with detailed information about the current version.

Notes

An error is raised when the call fails. The cause can be that SAP HANA APL is not installed or that the current user does not have the appropriate rights.

get_artifacts_recorder()

Return the object recorder (for Design-time artifacts generation)

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

Returns:
int or pandas DataFrame

If no segment column is given, the best iteration.

If a segment column is given, a pandas DataFrame which contains the best iteration for each segment.

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Developer Guide.

Parameters:
report_name: str

The name of the report to retrieve.

Returns:
Statistical report: hana_ml DataFrame

get_evalmetrics()

Returns the values of the evaluation metric at each iteration. These values are based on the validation dataset.

Returns:
Dictionary or pandas DataFrame

If no segment column is given, a dictionary: {'<MetricName>': <List of values>}.

If a segment column is given, a pandas DataFrame which contains the evaluation metrics for each segment.

get_feature_importances()

Returns the feature importances.

Returns:
Dictionary or pandas DataFrame

If no segment column is given, a dictionary: { <importance_metric> : OrderedDict({ <feature_name> : <value> }) }.

If a segment column is given, a pandas DataFrame which contains the feature importances for each segment.

get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns:
The reference to the OPERATION_LOG table: hana_ml DataFrame

This table provides detailed logs of the last model training

get_indicators()

Retrieves the Indicator table after model training.

Returns:
The reference to the INDICATORS table: hana_ml DataFrame

This table provides the performance metrics of the last model training

get_model_info()

Get information about an existing model. This method is especially useful when a trained model was saved and reloaded. After calling this method, the model can provide the summary and metrics again, as they were after the last fit.

Returns:
list

List of HANA DataFrames respectively corresponding to the following tables:

  • Summary table

  • Variable roles table

  • Variable description table

  • Indicators table

  • Profit curves table

get_params()

Retrieves the attributes of the current object. This method is implemented for compatibility with scikit-learn.

Returns:
The attribute values of the model: dictionary

get_performance_metrics()

Returns the performance metrics of the last trained model.

Returns:
Dictionary or pandas DataFrame

If no segment column is given, a dictionary with metric name as key and metric value as value.

If a segment column is given, a pandas DataFrame which contains the performance metrics for each segment.

Examples

>>> data = DataFrame(conn, 'SELECT * from APL_SAMPLES.CENSUS')
>>> model = GradientBoostingBinaryClassifier(conn)
>>> model.fit(data=data, key='id', label='class')
>>> model.get_performance_metrics()
{'AUC': 0.9385, 'PredictivePower': 0.8529, 'PredictionConfidence': 0.9759,...}

get_predict_operation_log()

Retrieves the operation log table after a prediction.

Returns:
The reference to the OPERATION_LOG table: hana_ml DataFrame

This table provides detailed logs about the last prediction

get_summary()

Retrieves the summary table after model training.

Returns:
The reference to the SUMMARY table: hana_ml DataFrame

This contains the execution summary of the last model training

is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.

Returns:
bool

True if the model is ready to be saved.

load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Parameters:
schema_name: str

The schema name

table_name: str

The table name

oid: str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.

predict(data, prediction_type=None)

Makes predictions with the fitted model. It is possible to add special outputs, such as variable individual contributions, through the 'prediction_type' parameter.

Parameters:
data: hana_ml DataFrame

The input dataset used for prediction

prediction_type: string, optional

Possible values:

  • 'BestProbabilityAndDecision': return the probability value associated with the classification decision (default)

  • 'Decision': return the classification decision

  • 'Probability': return the probability that the row is a positive target (binary classification) or the probabilities of all classes (multiclass classification)

  • 'Score': return raw prediction scores

  • 'Individual Contributions': return SHAP values

  • 'Explanations': return strength indicators based on SHAP values

Returns:
Prediction output: hana_ml DataFrame

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters:
schema_name: str

The schema name

artifact_df: hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one would like to save data into the same table.

Examples

>>> myModel.save_artifact(artifact_df=myModel.indicators_,
...                       schema_name='MySchema',
...                       table_name='MyModel_Indicators',
...                       if_exists='replace')

save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Parameters:
schema_name: str

The schema name

table_name: str

Table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns:
None

The model is saved into a table with the following columns:

  • "OID" NVARCHAR(50), -- Serve as ID

  • "FORMAT" NVARCHAR(50), -- APL technical info

  • "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model

schedule_fit(output_table_name_model, output_table_name_log, output_table_name_summary, output_table_name_indicators, **schedule_kwargs)

Creates a HANA scheduler job for the model fitting. It is a wrapper around the HANAScheduler.create_training_schedule() method that lets users specify arguments, such as the output table names, more conveniently.

Parameters:
output_table_name_model: str

The output table name for the model binary.

output_table_name_log: str

The output table name for the log data.

output_table_name_summary: str

The output table name for the model summary.

output_table_name_indicators: str

The output table name for the model indicators.

schedule_kwargs: kwargs dictionary

Arguments forwarded to the HANAScheduler.create_training_schedule method; please refer to its documentation.

Examples

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.schedule_fit(
...     output_table_name_model='OUTPUT_MODEL_BINARY',
...     output_table_name_log='OUTPUT_FIT_LOG',
...     output_table_name_summary='OUTPUT_SUMMARY_LOG',
...     output_table_name_indicators='OUTPUT_FIT_INDICATORS',
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)

schedule_predict(output_table_name_applyout, output_table_name_log, output_table_name_summary, **schedule_kwargs)

Creates a HANA scheduler job for the model prediction. It is a wrapper around the HANAScheduler.create_applying_schedule() method that lets users specify arguments, such as the output table names, more explicitly.

Parameters:
input_table_name_model: str

The input table name for the model binary.

output_table_name_applyout: str

The output table name for the prediction data.

output_table_name_log: str

The output table name for the log data.

output_table_name_summary: str

The output table name for the model summary.

schedule_kwargs: kwargs dictionary

Arguments forwarded to the HANAScheduler.create_applying_schedule method; please refer to its documentation.

Examples

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.predict(data)
>>> model.schedule_predict(
...     output_table_name_applyout="OUTPUT_PREDICT_APPLYOUT",
...     output_table_name_log="OUTPUT_PREDICT_LOG",
...     output_table_name_summary="OUTPUT_PREDICT_SUMMARY",
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)

set_framework_version(framework_version)

Switches between the v1 and v2 versions of the model report.

Parameters:
framework_version: {'v2', 'v1'}, optional

v2: uses the report builder framework. v1: uses a pure HTML template.

Defaults to 'v2'.
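
Examples

A minimal usage sketch; this switches the report rendering back to the v1 HTML template:

>>> model.set_framework_version('v1')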

set_metric_samplings(roc_sampling: Sampling = None, other_samplings: dict = None)

Set metric samplings to report builder.

Parameters:
roc_sampling: Sampling, optional

ROC sampling.

other_samplings: dict, optional

Each key is a column name of the metric table:

  • CUMGAINS

  • RANDOM_CUMGAINS

  • PERF_CUMGAINS

  • LIFT

  • RANDOM_LIFT

  • PERF_LIFT

  • CUMLIFT

  • RANDOM_CUMLIFT

  • PERF_CUMLIFT

Each value is a Sampling instance.

Examples

Creating the metric samplings:

>>> roc_sampling = Sampling(method='every_nth', interval=2)
>>> other_samplings = dict(CUMGAINS=Sampling(method='every_nth', interval=2),
...                        LIFT=Sampling(method='every_nth', interval=2),
...                        CUMLIFT=Sampling(method='every_nth', interval=2))
>>> model.set_metric_samplings(roc_sampling, other_samplings)
set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

Specifies hints for a scale-out environment. The execution of APL functions can then be routed to a specific computation node. If all the parameters are None (i.e., not given), all existing hints are cleared.

Parameters:
route_to: str, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_to: str or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_by: str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinality: str or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_cost: int, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level: {'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_class: str, optional

Routes the query via workload class. route_to statement hint has higher precedence than workload_class statement hint.

Defaults to None.

Examples

>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> model = GradientBoostingBinaryClassifier()
>>> # Routes the execution to a specific volume ID.
>>> model.set_scale_out(route_to=1025)
>>> # Routes the execution to a specific service type.
>>> # model.set_scale_out(route_to='computeserver')
>>> # Maps the execution to a specific workload class.
>>> # model.set_scale_out(workload_class="WC4")
>>> # Activates the sql trace
>>> # connection_context.sql_tracer.enable_sql_trace(True)
>>> model.fit(data=hdb_df, key='KEY', label='Y')
>>> # You can check whether the queries were effectively routed by querying:
>>> # select HOST, VOLUME_ID, APPLICATION_NAME, STATEMENT_STRING from M_SQL_PLAN_CACHE.
set_shapley_explainer_of_predict_phase(shapley_explainer, display_force_plot=True)

Use the reason code generated during the prediction phase to build a ShapleyExplainer instance.

When this instance is passed in, its execution results will be included in the v2 report.

Parameters:
shapley_explainer: ShapleyExplainer

A ShapleyExplainer instance.

display_force_plot: bool, optional

Whether to display the force plot.

Defaults to True.
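
Examples

A minimal sketch, assuming shap_explainer is a ShapleyExplainer instance already built from the reason codes returned during the prediction phase:

>>> model.set_shapley_explainer_of_predict_phase(shap_explainer,
...                                              display_force_plot=False)
>>> model.build_report()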

set_shapley_explainer_of_score_phase(shapley_explainer, display_force_plot=True)

Use the reason code generated during the scoring phase to build a ShapleyExplainer instance.

When this instance is passed in, its execution results will be included in the v2 report.

Parameters:
shapley_explainer: ShapleyExplainer

A ShapleyExplainer instance.

display_force_plot: bool, optional

Whether to display the force plot.

Defaults to True.

hana_ml.algorithms.apl.gradient_boosting_regression

This module provides the SAP HANA APL gradient boosting regression algorithm.

The following classes are available:

class hana_ml.algorithms.apl.gradient_boosting_regression.GradientBoostingRegressor(conn_context=None, early_stopping_patience=None, eval_metric=None, learning_rate=None, max_depth=None, max_iterations=None, number_of_jobs=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: GradientBoostingBase, _UnifiedRegressionReportBuilder

SAP HANA APL Gradient Boosting Regression algorithm.

Parameters:
conn_context: ConnectionContext, optional

The connection object to an SAP HANA database. This parameter is not needed anymore. It will be set automatically when a dataset is used in fit() or predict().

early_stopping_patience: int, optional

If the performance does not improve after early_stopping_patience iterations, the model training will stop before reaching max_iterations. Please refer to APL documentation for default value.

eval_metric: str, optional

The name of the metric used to evaluate the model performance on validation dataset along the boosting iterations. The possible values are 'MAE' and 'RMSE'. Please refer to APL documentation for default value.

learning_rate: float, optional

The weight parameter controlling the model regularization to avoid overfitting risk. A small value improves the model generalization to unseen dataset at the expense of the computational cost. Please refer to APL documentation for default value.

max_depth: int, optional

The maximum depth of the decision trees added as a base learner to the model at each boosting iteration. Please refer to APL documentation for default value.

max_iterations: int, optional

The maximum number of boosting iterations to fit the model. Please refer to APL documentation for default value.

number_of_jobs: int, optional

Deprecated.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value types (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals to '???', it will be taken as missing.

extra_applyout_settings: dict, optional

Determines the output of the predict() method. The possible values are:

  • By default (None value): the default output.

    • <KEY>: the key column if provided in the dataset

    • TRUE_LABEL: the actual value if provided

    • PREDICTED: the predicted value

  • {'APL/ApplyExtraMode': 'Individual Contributions'}: the feature importance for every sample

    • <KEY>: the key column if provided

    • TRUE_LABEL: the actual value if provided

    • PREDICTED: the predicted value

    • gb_contrib_<VAR1>: the contribution of the VAR1 variable to the score

    • ...

    • gb_contrib_<VARN>: the contribution of the VARN variable to the score

    • gb_contrib_constant_bias: the constant bias contribution

other_params: dict, optional

Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • 'max_tasks'

  • 'segment_column_name'

  • 'correlations_lower_bound'

  • 'correlations_max_kept'

  • 'cutting_strategy'

  • 'interactions'

  • 'interactions_max_kept'

  • 'variable_auto_selection'

  • 'variable_selection_max_nb_of_final_variables'

  • 'variable_selection_max_iterations'

  • 'variable_selection_percentage_of_contribution_kept_by_step'

  • 'variable_selection_quality_bar'

See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.

For 'max_tasks', see FUNC_HEADER.

other_train_apl_aliases: dict, optional

Users can provide APL aliases as advanced settings to the model. Users are free to input any possible value.

See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.

Notes

It is highly recommended to specify a key column in the training dataset. If not, once the model is trained, it won't be possible anymore to have a key defined in any input dataset. The key is particularly useful to join the predictions output to the input dataset.

By default, if not provided, SAP HANA APL guesses the variable description by reading the first 100 rows. However, the results may be incorrect. The user can overwrite the guessed description by explicitly setting the variable_storages, variable_value_types and variable_missing_strings parameters. For example:

model.set_params(variable_storages={
    'ID': 'integer',
    'sepal length (cm)': 'number'
})
model.set_params(variable_value_types={
    'sepal length (cm)': 'continuous'
})
model.set_params(variable_missing_strings={
    'sepal length (cm)': '-1'
})

Examples

>>> from hana_ml.algorithms.apl.gradient_boosting_regression import GradientBoostingRegressor
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN,
...                     'SELECT "id", "class", "capital-gain", '
...                     '"native-country", "age" from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = GradientBoostingRegressor()
>>> model.fit(hana_df, label='age', key='id')

Getting variable interactions

>>> model.set_params(other_train_apl_aliases={
...     'APL/Interactions': 'true',
...     'APL/InteractionsMaxKept': '3'
... })
>>> model.fit(data=hana_df, key='id', label='age')
>>> # Checks interaction info in INDICATORS table
>>> output = model.get_indicators().filter("KEY LIKE 'Interaction%'").collect()

Debriefing

>>> # Global performance metrics of the model
>>> model.get_performance_metrics()
{'L1': 7.31774, 'MeanAbsoluteError': 7.31774, 'L2': 9.42497, 'RootMeanSquareError': 9.42497, ...
>>> model.get_feature_importances()
{'Gain': OrderedDict([('class', 0.8728259801864624), ('capital-gain', 0.10493823140859604), ...

Generating the model report

>>> from hana_ml.visualizers.unified_report import UnifiedReport
>>> UnifiedReport(model).build().display()

Making predictions

>>> # Default output
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
          id  TRUE_LABEL  PREDICTED
39184  21772          27         25
16537   7331          33         43
7908   35226          65         42
>>> # Individual Contributions
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
     id  TRUE_LABEL  gb_contrib_workclass  gb_contrib_fnlwgt  gb_contrib_education  ...
0  6241          21             -1.330736          -0.385088              0.373539  ...
1  6248          18             -0.784536          -2.191791             -1.788672  ...
2  6253          26             -0.773891           0.358133             -0.185864  ...

Saving the model in the schema named 'MODEL_STORAGE'

>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model, if_exists='replace')

Reloading the model for new predictions

>>> model2 = model_storage.load_model(name='My model name')
>>> out2 = model2.predict(data=hana_df)

Please see the model_storage class for further features of model storage.

Exporting the model in JSON format

>>> json_export = model.export_apply_code('JSON')

APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.

Attributes:
label: str

The target column name. This attribute is set when the fit() method is called. Users don't need to set it explicitly, except if the model is loaded from a table. In this case, this attribute must be set before calling predict().

model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the "SUMMARY" table generated by the model training. This table contains the content of the model training summary.

indicators_: APLArtifactTable

The reference to the "INDICATORS" table generated by the model training. This table contains various metrics related to the model and model variables.

fit_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table generated by the model training

var_desc_: APLArtifactTable

The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table when a prediction was made

Methods

build_report([max_local_explanations])

Build model report.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

enable_hana_execution()

HANA execution will be enabled.

export_apply_code(code_type[, key, label, ...])

Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL.

fit(data[, key, features, label, weight, ...])

Fits the model.

generate_html_report(filename)

Saves the model report as an HTML file.

generate_notebook_iframe_report()

Render model report as a notebook iframe.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

get_artifacts_recorder()

Returns the object recorder (for design-time artifacts generation).

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

get_debrief_report(report_name)

Retrieves a standard statistical report.

get_evalmetrics()

Returns the values of the evaluation metric at each iteration.

get_feature_importances()

Returns the feature importances.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_model_info()

Get information about an existing model.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns the performance metrics of the last trained model.

get_predict_operation_log()

Retrieves the operation log table after making predictions.

get_summary()

Retrieves the summary table after model training.

is_fitted()

Checks if the model can be saved.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data[, prediction_type])

Generates predictions with the fitted model.

save_artifact(artifact_df, schema_name, ...)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, ...])

Saves the model into a table.

schedule_fit(output_table_name_model, ...)

Creates a HANA scheduler job for the model fitting.

schedule_predict(output_table_name_applyout, ...)

Creates a HANA scheduler job for the model prediction.

score(data)

Returns the R2 score (coefficient of determination) on the provided test dataset.

set_framework_version(framework_version)

Switches between the v1 and v2 versions of the model report.

set_params(**parameters)

Sets attributes of the current model.

set_scale_out([route_to, no_route_to, ...])

Specifies hints for scaling-out environment.

set_shapley_explainer_of_predict_phase(...)

Use the reason code generated during the prediction phase to build a ShapleyExplainer instance.

set_shapley_explainer_of_score_phase(...[, ...])

Use the reason code generated during the scoring phase to build a ShapleyExplainer instance.

set_params(**parameters)

Sets attributes of the current model.

Parameters:
parameters: dict

The attribute names and values
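
Examples

For instance, tuning documented hyperparameters in place (the values below are purely illustrative):

>>> model.set_params(max_depth=6, learning_rate=0.05, max_iterations=500)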

predict(data, prediction_type=None)

Generates predictions with the fitted model. It is possible to add special outputs, such as variable individual contributions, through the 'prediction_type' parameter.

Parameters:
data: hana_ml DataFrame

The input dataset used for prediction

prediction_type: string, optional

Can be:

  • 'Score': returns the predicted value (default)

  • 'Individual Contributions': returns SHAP values

  • 'Explanations': returns strength indicators based on SHAP values

Returns:
Prediction output: hana_ml DataFrame
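
Examples

A minimal sketch, assuming a fitted model and an input hana_ml DataFrame named hana_df:

>>> applyout_df = model.predict(data=hana_df, prediction_type='Individual Contributions')
>>> applyout_df.collect().head(3)
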
score(data)

Returns the R2 score (coefficient of determination) on the provided test dataset.

Parameters:
data: hana_ml DataFrame

The test dataset used to compute the score. The labels must be provided in the dataset.

Returns:
Float or pandas DataFrame

If no segment column is given, the R2 score.

If a segment column is given, a pandas DataFrame which contains the R2 score for each segment.
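
Examples

A minimal sketch, assuming test_df is a labeled hana_ml DataFrame and no segment column is used:

>>> r2 = model.score(data=test_df)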

build_report(max_local_explanations=100)

Build model report.

Parameters:
max_local_explanations: int, optional

The maximum number of local explanations displayed in the report.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

enable_hana_execution()

HANA execution will be enabled.

export_apply_code(code_type, key=None, label=None, schema_name=None, table_name=None, other_params=None)

Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL. See Function Reference > Predictive Model Services > EXPORT_APPLY_CODE in the SAP HANA APL Developer Guide.

Parameters:
code_type: str

The type of code exported. The supported code types for each type of model are given in the developer guide (see APL/CodeType).

key: str, optional

The name of the primary key column. Required for some code types.

label: str, optional

The name of the label (target) column. Used only when the model supports multiple targets. When left empty (''), code is generated for all targets.

schema_name: str, optional

The schema name of the apply-in table. Required for some code types.

table_name: str, optional

The apply-in table name. Required for some code types.

other_params: dict, optional

The additional parameters to be included in the configuration. The available parameters are given in the developer guide.

Returns:
The exported code: str

Examples

Exporting a model in JSON format (available for Gradient Boosting and Robust Regression)

>>> json_export = model.export_apply_code('JSON')

APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.

Exporting SQL apply code (available for Robust Regression and Clustering)

>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS')

Exporting SQL apply code (probability generated in the output)

>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS',
...     other_params={'APL/ApplyExtraMode': 'Advanced Apply Settings',
...                   'APL/ApplyProba': 'true'})
fit(data, key=None, features=None, label=None, weight=None, build_report=False)

Fits the model.

Parameters:
data: DataFrame

The training dataset

key: str, optional

The name of the ID column. This column will not be used as feature in the model. It will be output as row-id when prediction is made with the model. If key is not provided, an internal key is created. But this is not the recommended usage. See notes below.

features: list of str, optional

The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.

label: str, optional

The name of the label column. Default is the last column.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.

build_report: bool, optional

Whether to build report or not. Defaults to False.

Returns:
self: object

Notes

It is highly recommended to specify a key column in the training dataset. If not, once the model is trained, it won't be possible anymore to have a key defined in any input dataset. That makes it particularly inconvenient to join the predictions output to the input dataset.

generate_html_report(filename)

Saves the model report as an HTML file.

Parameters:
filename: str

HTML file name.

generate_notebook_iframe_report()

Render model report as a notebook iframe.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

Returns:
A pandas DataFrame with detailed information about the current version.

Notes

An error is raised when the call fails. The cause can be that SAP HANA APL is not installed, or that the current user does not have the appropriate rights.

get_artifacts_recorder()

Returns the object recorder (for design-time artifacts generation).

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

Returns:
int or pandas DataFrame

If no segment column is given, the best iteration.

If a segment column is given, a pandas DataFrame which contains the best iteration for each segment.

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Developer Guide.

Parameters:
report_name: str

The name of the statistical report.

Returns:
Statistical report: hana_ml DataFrame
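
Examples

A minimal sketch; the report name below is only illustrative, see Statistical Reports in the SAP HANA APL Developer Guide for the names available in your APL version:

>>> report = model.get_debrief_report('ClassificationRegression_Performance')
>>> report.collect()
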
get_evalmetrics()

Returns the values of the evaluation metric at each iteration. These values are based on the validation dataset.

Returns:
Dictionary or pandas DataFrame

If no segment column is given, a dictionary: {'<MetricName>': <List of values>}.

If a segment column is given, a pandas DataFrame which contains the evaluation metrics for each segment.
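
Examples

A minimal sketch, assuming no segment column was used, so a dictionary mapping each metric name to its per-iteration values is returned:

>>> evalmetrics = model.get_evalmetrics()
>>> for name, values in evalmetrics.items():
...     print(name, values[:5])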

get_feature_importances()

Returns the feature importances.

Returns:
Dictionary or pandas DataFrame

If no segment column is given, a dictionary: { <importance_metric> : OrderedDictionary({ <feature_name> : <value> }) }.

If a segment column is given, a pandas DataFrame which contains the feature importances for each segment.

get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns:
The reference to the OPERATION_LOG table: hana_ml DataFrame

This table provides detailed logs of the last model training

get_indicators()

Retrieves the Indicator table after model training.

Returns:
The reference to the INDICATORS table: hana_ml DataFrame

This table provides the performance metrics of the last model training

get_model_info()

Get information about an existing model. This method is especially useful when a trained model was saved and reloaded. After this method is called, the model can provide the summary and metrics again, as they were after the last fit.

Returns:
list

List of HANA DataFrames respectively corresponding to the following tables:

  • Summary table

  • Variable roles table

  • Variable description table

  • Indicators table

  • Profit curves table
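
Examples

For instance, collecting the summary table of a reloaded model (the list order is as given above):

>>> info = model.get_model_info()
>>> summary_df = info[0]  # Summary table
>>> summary_df.collect()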

get_params()

Retrieves the attributes of the current object. This method is implemented for compatibility with scikit-learn.

Returns:
The attribute-values of the model: dictionary
get_performance_metrics()

Returns the performance metrics of the last trained model.

Returns:
Dictionary or pandas DataFrame

If no segment column is given, a dictionary with metric name as key and metric value as value.

If a segment column is given, a pandas DataFrame which contains the performance metrics for each segment.

Examples

>>> data = DataFrame(conn, 'SELECT * from APL_SAMPLES.CENSUS')
>>> model = GradientBoostingBinaryClassifier(conn)
>>> model.fit(data=data, key='id', label='class')
>>> model.get_performance_metrics()
{'AUC': 0.9385, 'PredictivePower': 0.8529, 'PredictionConfidence': 0.9759,...}
get_predict_operation_log()

Retrieves the operation log table after making predictions.

Returns:
The reference to the OPERATION_LOG table: hana_ml DataFrame

This table provides detailed logs about the last prediction

get_summary()

Retrieves the summary table after model training.

Returns:
The reference to the SUMMARY table: hana_ml DataFrame

This contains the execution summary of the last model training

is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.

Returns:
bool

True if the model is ready to be saved.

load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Parameters:
schema_name: str

The schema name

table_name: str

The table name

oid: str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.
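
Examples

A minimal sketch (this method is deprecated; prefer hana_ml.model_storage.ModelStorage). The schema, table, and OID names are the same placeholders used in save_model:

>>> model = GradientBoostingRegressor()
>>> model.load_model(schema_name='MySchema',
...                  table_name='MyModel_Binaries',
...                  oid='MyModel_v2')
>>> model.label = 'age'  # when loading from a table, set the label before calling predict()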

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters:
artifact_df: hana_ml DataFrame

The artifact created after the fit or predict methods are called

schema_name: str

The schema name

table_name: str

The table name

if_exists: str, {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(artifact_df=myModel.indicators_,
...                       schema_name='MySchema',
...                       table_name='MyModel_Indicators',
...                       if_exists='replace')
save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Parameters:
schema_name: str

The schema name

table_name: str

Table name

if_exists: str, {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns:
None

The model is saved into a table with the following columns:

  • "OID" NVARCHAR(50), -- Serve as ID

  • "FORMAT" NVARCHAR(50), -- APL technical info

  • "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model

schedule_fit(output_table_name_model, output_table_name_log, output_table_name_summary, output_table_name_indicators, **schedule_kwargs)

Creates a HANA scheduler job for the model fitting. It is a wrapper around the HANAScheduler.create_training_schedule() method that lets users specify arguments, such as the output table names, more explicitly.

Parameters:
output_table_name_model: str

The output table name for the model binary.

output_table_name_log: str

The output table name for the log data.

output_table_name_summary: str

The output table name for the model summary.

output_table_name_indicators: str

The output table name for the model indicators.

schedule_kwargs: kwargs dictionary

Arguments forwarded to the HANAScheduler.create_training_schedule method; please refer to its documentation.

Examples

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.schedule_fit(
...     output_table_name_model='OUTPUT_MODEL_BINARY',
...     output_table_name_log='OUTPUT_FIT_LOG',
...     output_table_name_summary='OUTPUT_SUMMARY_LOG',
...     output_table_name_indicators='OUTPUT_FIT_INDICATORS',
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)

schedule_predict(output_table_name_applyout, output_table_name_log, output_table_name_summary, **schedule_kwargs)

Creates a HANA scheduler job for the model prediction. It is a wrapper around the HANAScheduler.create_applying_schedule() method that lets users specify arguments, such as the output table names, more explicitly.

Parameters:
input_table_name_model: str

The input table name for the model binary.

output_table_name_applyout: str

The output table name for the prediction data.

output_table_name_log: str

The output table name for the log data.

output_table_name_summary: str

The output table name for the model summary.

schedule_kwargs: kwargs dictionary

Arguments forwarded to the HANAScheduler.create_applying_schedule method; please refer to its documentation.

Examples

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.predict(data)
>>> model.schedule_predict(
...     output_table_name_applyout="OUTPUT_PREDICT_APPLYOUT",
...     output_table_name_log="OUTPUT_PREDICT_LOG",
...     output_table_name_summary="OUTPUT_PREDICT_SUMMARY",
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)

set_framework_version(framework_version)

Switches between the v1 and v2 versions of the model report.

Parameters:
framework_version: {'v2', 'v1'}, optional

v2: uses the report builder framework. v1: uses a pure HTML template.

Defaults to 'v2'.

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

Specifies hints for a scale-out environment. The execution of APL functions can then be routed to a specific computation node. If all the parameters are None (i.e., not given), all existing hints are cleared.

Parameters:
route_to: str, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_to: str or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_by: str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinality: str or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_cost: int, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level: {'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_class: str, optional

Routes the query via workload class. route_to statement hint has higher precedence than workload_class statement hint.

Defaults to None.

Examples

>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> model = GradientBoostingBinaryClassifier()
>>> # Routes the execution to a specific volume ID.
>>> model.set_scale_out(route_to=1025)
>>> # Routes the execution to a specific service type.
>>> # model.set_scale_out(route_to='computeserver')
>>> # Maps the execution to a specific workload class.
>>> # model.set_scale_out(workload_class="WC4")
>>> # Activates the sql trace
>>> # connection_context.sql_tracer.enable_sql_trace(True)
>>> model.fit(data=hdb_df, key='KEY', label='Y')
>>> # You can check whether the queries were effectively routed by querying:
>>> # select HOST, VOLUME_ID, APPLICATION_NAME, STATEMENT_STRING from M_SQL_PLAN_CACHE.
set_shapley_explainer_of_predict_phase(shapley_explainer, display_force_plot=True)

Use the reason code generated during the prediction phase to build a ShapleyExplainer instance.

When this instance is passed in, its execution results will be included in the v2 report.

Parameters:
shapley_explainer: ShapleyExplainer

A ShapleyExplainer instance.

display_force_plot: bool, optional

Whether to display the force plot.

Defaults to True.

set_shapley_explainer_of_score_phase(shapley_explainer, display_force_plot=True)

Use the reason code generated during the scoring phase to build a ShapleyExplainer instance.

When this instance is passed in, its execution results will be included in the v2 report.

Parameters:
shapley_explainer: ShapleyExplainer

A ShapleyExplainer instance.

display_force_plot: bool, optional

Whether to display the force plot.

Defaults to True.

hana_ml.algorithms.apl.time_series

This module contains the SAP HANA APL Time Series algorithm.

The following class is available:

class hana_ml.algorithms.apl.time_series.AutoTimeSeries(conn_context=None, time_column_name=None, target=None, horizon=1, with_extra_predictable=True, last_training_time_point=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, train_data_=None, sort_data=True, **other_params)

Bases: APLBase

SAP HANA APL Time Series algorithm.

Parameters:
target: str

The name of the column containing the time series data points.

time_column_name: str

The name of the column containing the time series time points. The time column is used as table key. It can be overridden by setting the 'key' parameter through the fit() method.

last_training_time_point: str, optional

The last time point used for model training. The training dataset will contain all data points up to this date. By default, this parameter will be set as the last time point until which the target is not null.

horizon: int, optional

The number of forecasts to be generated by the model upon apply. The time series model will be trained to optimize accuracy on the requested horizon only. The default value is 1.

with_extra_predictable: bool, optional

If set to true, all input variables will be used by the model to generate forecasts. If set to false, only the time and target columns will be used. All other variables will be ignored. This parameter is set to true by default.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value types (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals '???', it will be taken as missing.

extra_applyout_settings: dict, optional

Specifies the prediction outputs. See documentation on predict() method for more details.

sort_data: bool

If True, a temporary view is created on the dataset to sort the data by time. Alternatively, users can directly provide a view with sorted dates; in that case, they must set sort_data to False to avoid creating a new view. The default value is True. WARNING: it is recommended to leave this parameter at its default value so the data is guaranteed to be read in sorted order. If the data is not sorted, the model will fail.

other_params: dict, optional

Corresponds to the advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • 'max_tasks'

  • 'segment_column_name'

  • 'force_negative_forecast'

  • 'force_positive_forecast'

  • 'forecast_fallback_method'

  • 'forecast_max_cyclics'

  • 'forecast_max_lags'

  • 'forecast_method'

  • 'smoothing_cycle_length'

See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.

For 'max_tasks', see FUNC_HEADER.

other_train_apl_aliases: dict, optional

Users can provide APL aliases as advanced settings to the model. Unlike 'other_params' described above, users are free to input any possible value; no check is performed in Python.

Notes

The input dataset, given as a hana_ml DataFrame, must not be a temporary table, because the API tries to create a view sorted by the time column. SAP HANA does not allow users to create a view on a temporary table. However, even though it is not recommended, users can set the parameter sort_data to False to avoid creating the view.

When calling the fit_predict() method, the time series model is generated on the fly and not returned. If a model must be saved, please consider using the fit() method instead.

When extra-predictable variables are involved, it is usual to have a single dataset used both for the model training and the forecasting. In this case, the dataset should contain two successive periods:

  • The first one is used for the model training, ranging from the beginning to the last date where the target value is not null.

  • The second one is used for the forecasting, starting from the first date where the target value is null.

The content of the output of the get_performance_metrics() method may change depending on the version of SAP HANA APL used with this API. Please refer to the SAP HANA APL documentation to know which metrics will be provided.

Examples

>>> from hana_ml.algorithms.apl.time_series import AutoTimeSeries
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext(HDB_HOST, HDB_PORT, HDB_USER, HDB_PASS)
>>> # -- Creates Hana DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CASHFLOWS_FULL')

Creating and fitting the model

>>> model = AutoTimeSeries(time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(data=hana_df)

Debriefing

>>> model.get_model_components()
{'Trend': 'Polynom( Date)',
 'Cycles': 'PeriodicExtrasPred_MondayMonthInd',
 'Fluctuations': 'AR(46)'}
>>> model.get_performance_metrics()
{'MAPE': [0.12853715702893018, 0.12789963348617622, 0.12969031859857874], ...}

Generating forecasts using the forecast() method

This method is used to generate forecasts using a signature similar to the one used in PAL. There are two variants of usage as described below:

1) If the model does not use extra-predictable variables (no exogenous variable), users must simply specify the number of forecasts.

>>> train_df = DataFrame(CONN,
...                      'SELECT "Date" , "Cash" '
...                      'from APL_SAMPLES.CASHFLOWS_FULL ORDER BY 1 LIMIT 100')
>>> model = AutoTimeSeries(time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(forecast_length=3)
>>> out.collect().tail(5)
           Date                            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
98   2001-05-23  3057.812544999999772699132909775  4593.966530              NaN              NaN
99   2001-05-25  3037.539714999999887176132440567  4307.893346              NaN              NaN
100  2001-05-26                              None  4206.023158     -3609.599872     12021.646187
101  2001-05-27                              None  4575.162651     -3392.283802     12542.609104
102  2001-05-28                              None  4830.352462     -3239.507360     12900.212284

2) If the model uses extra-predictable variables, users must provide the values of all extra-predictable variables for each time point of the forecast period. These values must be provided as a hana_ml dataframe with the same structure as the training dataset.

>>> # Trains the dataset with extra-predictable variables
>>> train_df = DataFrame(CONN,
...                     'SELECT * '
...                     'from APL_SAMPLES.CASHFLOWS_FULL '
...                     'WHERE "Cash" is not null')
>>> # Extra-predictable variables' values on the forecast period
>>> forecast_df = DataFrame(CONN,
...                        'SELECT * '
...                        'from APL_SAMPLES.CASHFLOWS_FULL '
...                        'WHERE "Cash" is null LIMIT 5')
>>> model = AutoTimeSeries(time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(data=forecast_df)
>>> out.collect().tail(5)
          Date ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
251  2001-12-29   None  6864.371407      -224.079492     13952.822306
252  2001-12-30   None  6889.515324      -211.264912     13990.295559
253  2001-12-31   None  6914.766513      -187.180923     14016.713949
254  2002-01-01   None  6940.124974              NaN              NaN
255  2002-01-02   None  6965.590706              NaN              NaN

Generating forecasts with the predict() method.

The predict() method allows users to apply a fitted model on a dataset different from the training dataset. For example, users can train a model on the first quarter (January to March) and apply it on a dataset of a different period (March to May).

>>> # Trains the model on the first quarter, from January to March
>>> train_df = DataFrame(CONN,
...                      'SELECT "Date", "Cash" '
...                      'from APL_SAMPLES.CASHFLOWS_FULL '
...                      'where "Date" between \'2001-01-01\' and \'2001-03-31\' '
...                      'ORDER BY 1')
>>> model.fit(train_df)
>>> # Forecasts on a shifted period, from March to May
>>> test_df = DataFrame(CONN,
...                     'SELECT "Date", "Cash" '
...                     'from APL_SAMPLES.CASHFLOWS_FULL '
...                     'where "Date" between \'2001-03-01\' and \'2001-05-31\' '
...                     'ORDER BY 1')
>>> out = model.predict(test_df)
>>> out.collect().tail(5)
          Date                            ACTUAL     PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
60  2001-05-30  3837.196734000000105879735597214   4630.223083              NaN              NaN
61  2001-05-31  2911.884261000000151398126928726   4635.265982              NaN              NaN
62  2001-06-01                              None   4538.516542     -1087.461104     10164.494188
63  2001-06-02                              None   4848.815364     -5090.167255     14787.797983
64  2001-06-03                              None   4853.858263     -5138.553275     14846.269801

Using the fit_predict() method

This method enables the user to fit a model and generate forecasts in a single call, and thus get results faster. However, the model is created on the fly and deleted after use, so the user will not be able to save the resulting model.

>>> out = model.fit_predict(hana_df)
>>> out.collect().tail(5)
           Date            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999  6055.761105              NaN              NaN
250  2001-12-28  7111.41669699999  6314.336098              NaN              NaN
251  2002-01-03              None  7033.880804      4529.462710      9538.298899
252  2002-01-04              None  6464.557223      3965.343397      8963.771049
253  2002-01-07              None  6469.141663      3961.414900      8976.868427

Breaking down the time series into trend, cycles, fluctuations and residuals components.

If the parameter extra_applyout_settings is set to {'ExtraMode': True}, anytime a forecast method is called, predict(), forecast() or fit_predict(), the output will contain time series components and their corresponding residuals. The prediction columns are suffixed by the horizon number. For instance, 'Cycles_RESIDUALS_3' means the residual of the cycle component in the third horizon.

>>> model.fit(train_df)
>>> model.set_params(extra_applyout_settings={'ExtraMode': True})
>>> out = model.predict(hana_df)
>>> out.collect().tail(5)
               Date              ACTUAL        ...  Cycles_RESIDUALS_3  Fluctuations_RESIDUALS_3
249  2001-12-27  5995.42329499392507553        ...               32.51                  4.48e-13
250  2001-12-28  7111.41669699455205917        ...             -644.77                  1.14e-13
251  2002-01-03                    None        ...                 NaN                       NaN
252  2002-01-04                    None        ...                 NaN                       NaN
253  2002-01-07                    None        ...                 NaN                       NaN

Users can change the fields that are included in the output by using the APL/ApplyExtraMode alias in extra_applyout_settings, for instance: {'APL/ApplyExtraMode': 'First Forecast with Stable Components and Residues and Error Bars'}. Please check the SAP HANA APL documentation to know which values are available for APL/ApplyExtraMode. See Function Reference > Predictive Model Services > APPLY_MODEL > Advanced Apply Settings in the SAP HANA APL Developer Guide.

Attributes:
model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the "SUMMARY" table generated by the model training. This table contains the summary about the model training.

indicators_: APLArtifactTable

The reference to the "INDICATORS" table generated by the model training. This table contains the various metrics related to the model and its variables.

fit_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table generated by the model training

var_desc_: APLArtifactTable

The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table that is produced when making predictions.

train_data_: hana_ml DataFrame

The train dataset

Methods

build_report([segment_name])

Build model report.

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

enable_hana_execution()

HANA execution will be enabled.

export_apply_code(code_type[, key, label, ...])

Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL.

fit(data[, key, features, build_report])

Fits the model.

fit_predict(data[, key, features, horizon, ...])

Fits a model and generates forecasts in a single call to the FORECAST APL function.

forecast([forecast_length, data, build_report])

Uses the fitted model to generate out-of-sample forecasts.

generate_html_report(filename)

Saves the model report as an HTML file.

generate_notebook_iframe_report()

Render model report as a notebook iframe.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

get_artifacts_recorder()

Returns the object recorder (for design-time artifacts generation).

get_debrief_report(report_name)

Retrieves a standard statistical report.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_horizon_wide_metric([metric_name])

Returns the value of a performance metric (MAPE, sMAPE, ...) averaged over the forecast horizon.

get_indicators()

Retrieves the Indicator table after model training.

get_model_components()

Returns the description of the model components, that is trend, cycles and fluctuations, used by the model to generate forecasts.

get_model_info()

Get information about an existing model.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns the performance metrics of the model.

get_predict_operation_log()

Retrieves the operation log table after making predictions.

get_summary()

Retrieves the summary table after model training.

is_fitted()

Checks if the model can be saved.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data[, apply_horizon, ...])

Uses the fitted model to generate forecasts.

save_artifact(artifact_df, schema_name, ...)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, ...])

Saves the model into a table.

schedule_fit(output_table_name_model, ...)

Creates a HANA scheduler job for the model fitting.

schedule_predict(output_table_name_applyout, ...)

Creates a HANA scheduler job for the model prediction.

set_params(**parameters)

Sets attributes of the current model.

set_scale_out([route_to, no_route_to, ...])

Specifies hints for scaling-out environment.

set_params(**parameters)

Sets attributes of the current model.

Parameters:
parameters: dict

Contains attribute names and values in the form of keyword arguments

fit(data, key=None, features=None, build_report=False)

Fits the model.

Parameters:
data: hana_ml DataFrame

The training dataset

key: str, optional

The column used as row identifier of the dataset. This column corresponds to the time column name. As a result, setting this parameter will overwrite the time_column_name model setting.

features: list of str, optional

The names of the feature columns, meaning the date column and the extra-predictive variables. If features is not provided, it defaults to all columns except the target column.

build_report: bool, optional

Whether to build report or not. Defaults to False.

Returns:
self: object
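
Examples

A minimal sketch, assuming hana_df is the CASHFLOWS_FULL hana_ml DataFrame used in the examples above:

>>> model = AutoTimeSeries(time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(data=hana_df, key='Date')
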
predict(data, apply_horizon=None, apply_last_time_point=None, build_report=False)

Uses the fitted model to generate forecasts.

Parameters:
data: hana_ml DataFrame

The input dataset used for predictions

apply_horizon: int, optional

The number of forecasts to generate. By default, the number of forecasts is the horizon on which the model was trained.

apply_last_time_point: str, optional

The time point corresponding to the start of the forecast period. Forecasts will be generated starting from the next time point after the 'apply_last_time_point'. By default, this parameter is set to the value of 'last_training_time_point' known from the model training.

build_report: bool, optional

Whether to build report or not. Defaults to False.

Returns:
hana_ml DataFrame

By default the output contains the following columns:

  • <the name of the time column>

  • ACTUAL: the actual value of time series

  • PREDICTED: the forecast value

  • LOWER_INT_95PCT: the lower limit of 95% confidence interval

  • UPPER_INT_95PCT: the upper limit of 95% confidence interval

If ExtraMode is set to true, the output dataframe will also contain the breakdown of the time series into trend, cycles, fluctuations, and residuals components.

Examples

Default output

>>> out = model.predict(hana_df)
>>> out.collect().tail(5)
           Date            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999  6055.761105              NaN              NaN
250  2001-12-28  7111.41669699999  6314.336098              NaN              NaN
251  2002-01-03              None  7033.88080       4529.46271       9538.29889
252  2002-01-04              None  6464.55722       3965.34339       8963.77104
253  2002-01-07              None  6469.14166       3961.41490       8976.86842

Retrieving forecasts and components (predicted, trend, cycles and fluctuations).

The output columns are suffixed with the horizon index. For example, Trend_1 means the trend component of the first horizon.

>>> model.set_params(extra_applyout_settings={'ExtraMode': True})
>>> out = model.predict(hana_df)
>>> out.collect().tail(5)
           Date                               ACTUAL  PREDICTED_1      Trend_1  ...
249  2001-12-27  5995.423294999999598076101392507553  6055.761105  6814.405390  ...
250  2001-12-28  7111.416696999999658146407455205917  6314.336098  6839.334762  ...
251  2002-01-03                                 None  7033.880804  6991.163710  ...
252  2002-01-04                                 None  6464.557223  7016.843985  ...
253  2002-01-07                                 None  6469.141663  7094.528433  ...

Users can change the fields that are included in the output by using the APL/ApplyExtraMode alias in extra_applyout_settings, for instance: {'APL/ApplyExtraMode': 'First Forecast with Stable Components and Residues and Error Bars'}. Please check the SAP HANA APL documentation to know which values are available for APL/ApplyExtraMode. See Function Reference > Predictive Model Services > APPLY_MODEL > Advanced Apply Settings in the SAP HANA APL Developer Guide.

fit_predict(data, key=None, features=None, horizon=None, build_report=False)

Fits a model and generates forecasts in a single call to the FORECAST APL function. This method offers a faster way to perform the model training and forecasting.

However, the user will not have access to the model used internally since it is deleted after the computation of the forecasts.

Parameters:
data: hana_ml DataFrame

The input time series dataset

key: str, optional

The date column name. By default, it is equal to the model parameter time_column_name. If it is given, the model parameter time_column_name will be overwritten.

features: list of str, optional

The column names corresponding to the extra-predictable variables (exogenous variables). If features is not provided, it is equal to all columns except the target column.

horizon: int, optional

The number of forecasts to generate. The default value is the horizon parameter of the model.

build_report: bool, optional

Whether to build report or not. Defaults to False.

Returns:
hana_ml DataFrame

The output is the same as the predict() method.
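
Examples

A minimal sketch, assuming hana_df as in the examples above; note that the internal model is discarded after the call:

>>> out = model.fit_predict(data=hana_df, key='Date', horizon=5)
>>> out.collect().tail(5)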

forecast(forecast_length=None, data=None, build_report=False)

Uses the fitted model to generate out-of-sample forecasts. The model must already be fitted with a given training dataset. This method forecasts over a number of steps after the end of the training dataset. When there are extra-predictive variables (exogenous variables), the input parameter data is required: it must contain the values of the extra-predictable variables for the forecast period. If there is no extra-predictive variable, only the forecast_length parameter is needed.

Parameters:
forecast_length: int, optional

The number of forecasts to generate from the end of the training dataset. By default, it equals the horizon specified in the model parameters.

data: hana_ml DataFrame, optional

The time series with extra-predictable variables used for forecasting. This parameter is required if extra-predictable variables are used in the model. When this parameter is given, the parameter forecast_length is ignored.

build_report: bool, optional

Whether to build a report or not. Defaults to False.

Returns:
hana_ml DataFrame

The output is the same as the predict() method.

Examples

Case where there is no extra-predictable variable:

>>> train_df = DataFrame(CONN,
                         'SELECT "Date" , "Cash" '
                         'from APL_SAMPLES.CASHFLOWS_FULL '
                         'where "Cash" is not null '
                         'ORDER BY 1')
>>> print(train_df.collect().tail(5))
            Date         Cash
246  2001-12-20  6382.441052
247  2001-12-21  5652.882539
248  2001-12-26  5081.372996
249  2001-12-27  5995.423295
250  2001-12-28  7111.416697
>>> model = AutoTimeSeries(CONN, time_column_name='Date',
                           target='Cash',
                           horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(forecast_length=3)
>>> out.collect().tail(5)
           Date       ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.423295  6814.405390              NaN              NaN
250  2001-12-28  7111.416697  6839.334762              NaN              NaN
251  2001-12-29         None  6864.371407      -224.079492     13952.822306
252  2001-12-30         None  6889.515324      -211.264912     13990.295559
253  2001-12-31         None  6914.766513      -187.180923     14016.713949

Case where there are extra-predictable variables:

>>> train_df = DataFrame(CONN,
                        'SELECT * '
                        'from APL_SAMPLES.CASHFLOWS_FULL '
                        'WHERE "Cash" is not null '
                        'ORDER BY 1')
>>> print(train_df.collect().tail(5))
           Date  WorkingDaysIndices     ...       BeforeLastWMonth         Cash
246  2001-12-20                  13     ...                      1  6382.441052
247  2001-12-21                  14     ...                      1  5652.882539
248  2001-12-26                  15     ...                      0  5081.372996
249  2001-12-27                  16     ...                      0  5995.423295
250  2001-12-28                  17     ...                      0  7111.416697
>>> # Extra-predictable variables to be provided as the forecast period
>>> forecast_df = DataFrame(CONN,
                           'SELECT * '
                           'from APL_SAMPLES.CASHFLOWS_FULL '
                           'WHERE "Cash" is null '
                           'ORDER BY 1 '
                           'LIMIT 3')
>>> print(forecast_df.collect())
         Date  WorkingDaysIndices  ...   BeforeLastWMonth  Cash
0  2002-01-03                   0  ...                  0  None
1  2002-01-04                   1  ...                  0  None
2  2002-01-07                   2  ...                  0  None
>>> model = AutoTimeSeries(CONN,
                           time_column_name='Date',
                           target='Cash',
                           horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(data=forecast_df)
>>> out.collect().tail(5)
           Date       ACTUAL  PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.423295    6814.41              NaN              NaN
250  2001-12-28  7111.416697    6839.33              NaN              NaN
251  2001-12-29         None    6864.37          -224.08         13952.82
252  2001-12-30         None    6889.52          -211.26         13990.30
253  2001-12-31         None    6914.77          -187.18         14016.71

get_model_components()

Returns the description of the model components, that is trend, cycles and fluctuations, used by the model to generate forecasts.

Returns:
Dictionary or pandas DataFrame

If no segment column is given, a dictionary with 3 possible keys: 'Trend', 'Cycles', 'Fluctuations'.

If a segment column is given, a pandas DataFrame which contains the model components for each segment.

Examples

>>> model.get_model_components()
{
    "Trend": "Linear(TIME)",
    "Cycles": None,
    "Fluctuations": "AR(36)"
}

get_performance_metrics()

Returns the performance metrics of the model. The metrics are provided for each forecast horizon.

Returns:
Dictionary or pandas DataFrame

If no segment column is given, a dictionary in which each metric is associated with a list containing <horizon> elements.

If a segment column is given, a pandas DataFrame which contains the metric values for each segment.

Examples

A model trained with 4 horizons will, for example, return:

>>> model.get_performance_metrics()
{'MAPE': [
      0.1529961017445385,
      0.1538823292343699,
      0.1564376267423695,
      0.15170398377407046]}

get_horizon_wide_metric(metric_name='MAPE')

Returns the value of a performance metric (MAPE, sMAPE, ...) averaged over the forecast horizon.

Parameters:
metric_name: str

Default value equals 'MAPE'. Possible values: 'MAPE', 'MPE', 'MeanAbsoluteError', 'RootMeanSquareError', 'SMAPE', 'L1', 'L2', 'P2', 'R2', 'U2'

Returns:
Float or pandas DataFrame

If no segment column is given, the metric value averaged over the forecast horizon. It is based on the validation partition.

If a segment column is given, a pandas DataFrame which contains the average metric value on the forecast horizon for each segment.
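
Examples

A minimal sketch, assuming the 4-horizon model trained above; the call returns a single float averaged over the horizons:

>>> model.get_horizon_wide_metric(metric_name='MAPE')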

load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Parameters:
schema_name: str

The schema name

table_name: str

The table name

oid: str, optional

If the table contains several models, the OID must be given as an identifier. If it is not provided, the whole table is read.

Notes

Before using a reloaded model for a new prediction, set the following parameters again: 'time_column_name', 'target'. The SAP HANA ML library needs these parameters to prepare the dataset view. Otherwise, methods such as forecast() and predict() will fail.

Examples

>>> # Sets time_column_name and target again
>>> model = AutoTimeSeries(conn_context=CONN, time_column_name='Date', target='Cash')
>>> model.load_model(schema_name='MY_SCHEMA', table_name='MY_MODEL_TABLE')
>>> model.predict(hana_df,
...               apply_horizon=(NB_HORIZON_TRAIN + 5),
...               apply_last_time_point=LAST_TRAIN_DATE)
export_apply_code(code_type, key=None, label=None, schema_name=None, table_name=None, other_params=None)

Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL. See Function Reference > Predictive Model Services > EXPORT_APPLY_CODE in the SAP HANA APL Developer Guide.

Parameters:
code_type: str

The type of code exported. The supported code types for each type of model are given in the developer guide (see APL/CodeType).

key: str, optional

The name of the primary key column. Required for some code types.

label: str, optional

The name of the label (target) column. Used only when the model supports multiple targets. When left empty (""), all targets are generated.

schema_name: str, optional

The schema name of the apply-in table. Required for some code types.

table_name: str, optional

The apply-in table name. Required for some code types.

other_params: dict, optional

The additional parameters to be included in the configuration. The available parameters are given in the developer guide.

Returns:
The exported code: str

Examples

Exporting a model in JSON format (available for Gradient Boosting and Robust Regression)

>>> json_export = model.export_apply_code('JSON')

APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.

Exporting SQL apply code (available for Robust Regression and Clustering)

>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS')

Exporting SQL apply code (probability generated in the output)

>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS',
...     other_params={'APL/ApplyExtraMode': 'Advanced Apply Settings',
...                   'APL/ApplyProba': 'true'})
build_report(segment_name=None)

Builds the model report.

Parameters:
segment_name: str, optional

If the model is segmented, the segment name for which the report will be built.

generate_html_report(filename)

Saves the model report as an HTML file.

Parameters:
filename: str

The HTML file name.

generate_notebook_iframe_report()

Renders the model report as a notebook iframe.
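
A minimal sketch of the report workflow, assuming a fitted model; the file name is illustrative:

>>> model.build_report()
>>> model.generate_html_report(filename='my_model_report')
>>> # In a Jupyter notebook, the report can be rendered inline instead:
>>> model.generate_notebook_iframe_report()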

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

enable_hana_execution()

HANA execution will be enabled.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

Returns:
A pandas Dataframe with detailed information about the current version.

Notes

An error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.

get_artifacts_recorder()

Returns the recorder object (for design-time artifacts generation)

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Developer Guide.

Parameters:
report_name: str
Returns:
Statistical report: hana_ml DataFrame
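
Examples

A minimal sketch; the report name below is illustrative, since the valid names are listed under Statistical Reports in the SAP HANA APL Developer Guide:

>>> report = model.get_debrief_report('TimeSeries_Components')  # illustrative report name
>>> report.collect()
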
get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns:
The reference to the OPERATION_LOG table: hana_ml DataFrame

This table provides detailed logs of the last model training

get_indicators()

Retrieves the Indicator table after model training.

Returns:
The reference to the INDICATORS table: hana_ml DataFrame

This table provides the performance metrics of the last model training
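
Examples

A minimal sketch, assuming a fitted model; both methods return references to tables that can be collected locally:

>>> fit_log = model.get_fit_operation_log()
>>> print(fit_log.collect())
>>> indicators = model.get_indicators()
>>> print(indicators.collect())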

get_model_info()

Get information about an existing model. This method is especially useful when a trained model was saved and reloaded. After this method has been called, the model can provide the summary and metrics again, as they were after the last fit.

Returns:
list

List of HANA DataFrames respectively corresponding to the following tables:

  • Summary table

  • Variable roles table

  • Variable description table

  • Indicators table

  • Profit curves table
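
As a minimal sketch, the returned list can be unpacked in the table order above (assuming a fitted or reloaded model):

>>> summary, var_roles, var_desc, indicators, profit_curves = model.get_model_info()
>>> summary.collect()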

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.

Returns:
The attribute-values of the model: dictionary

get_predict_operation_log()

Retrieves the operation log table after a model prediction.

Returns:
The reference to the OPERATION_LOG table: hana_ml DataFrame

This table provides detailed logs about the last prediction

get_summary()

Retrieves the summary table after model training.

Returns:
The reference to the SUMMARY table: hana_ml DataFrame

This contains execution summary of the last model training

is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.

Returns:
bool

True if the model is ready to be saved.

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters:
schema_name: str

The schema name

artifact_df: hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one would like to save data into the same table.

Examples

>>> myModel.save_artifact(artifact_df=myModel.indicators_,
...                       schema_name='MySchema',
...                       table_name='MyModel_Indicators',
...                       if_exists='replace')
save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Parameters:
schema_name: str

The schema name

table_name: str

Table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns:
None

The model is saved into a table with the following columns:

  • "OID" NVARCHAR(50), -- Serve as ID

  • "FORMAT" NVARCHAR(50), -- APL technical info

  • "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model

schedule_fit(output_table_name_model, output_table_name_log, output_table_name_summary, output_table_name_indicators, **schedule_kwargs)

Creates a HANA scheduler job for the model fitting. It is a wrapper function of the HANAScheduler.create_training_schedule() method. It lets users explicitly specify arguments such as the output table names.

Parameters:
output_table_name_model: str

The output table name for the model binary.

output_table_name_log: str

The output table name for the log data.

output_table_name_summary: str

The output table name for the model summary.

output_table_name_indicators: str

The output table name for the model indicators.

schedule_kwargs: kwargs dictionary

Arguments forwarded to the HANAScheduler.create_training_schedule() method. Please refer to the documentation of that method.

Examples

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.schedule_fit(
...     output_table_name_model='OUTPUT_MODEL_BINARY',
...     output_table_name_log='OUTPUT_FIT_LOG',
...     output_table_name_summary='OUTPUT_SUMMARY_LOG',
...     output_table_name_indicators='OUTPUT_FIT_INDICATORS',
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)

schedule_predict(output_table_name_applyout, output_table_name_log, output_table_name_summary, **schedule_kwargs)

Creates a HANA scheduler job for the model prediction. It is a wrapper function of the HANAScheduler.create_applying_schedule() method. It lets users explicitly specify arguments such as the output table names.

Parameters:
input_table_name_model: str

The input table name for the model binary.

output_table_name_applyout: str

The output table name for the prediction data.

output_table_name_log: str

The output table name for the log data.

output_table_name_summary: str

The output table name for the model summary.

schedule_kwargs: kwargs dictionary

Arguments forwarded to the HANAScheduler.create_applying_schedule() method. Please refer to the documentation of that method.

Examples

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.predict(data)
>>> model.schedule_predict(
...     output_table_name_applyout="OUTPUT_PREDICT_APPLYOUT",
...     output_table_name_log="OUTPUT_PREDICT_LOG",
...     output_table_name_summary="OUTPUT_PREDICT_SUMMARY",
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

Specifies hints for a scaling-out environment. The execution of APL functions can then be routed to a specific computation node. If all the parameters are None, i.e. not given, all the existing hints are cleared.

Parameters:
route_to: str, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_to: str or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_by: str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinality: str or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_cost: int, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level: {'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_class: str, optional

Routes the query via workload class. The route_to statement hint has higher precedence than the workload_class statement hint.

Defaults to None.

Examples

>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> model = GradientBoostingBinaryClassifier()
>>> # Routes the execution to a specific volume ID.
>>> model.set_scale_out(route_to=1025)
>>> # Routes the execution to a specific service type.
>>> # model.set_scale_out(route_to='computeserver')
>>> # Maps the execution to a specific workload class.
>>> # model.set_scale_out(workload_class="WC4")
>>> # Activates the sql trace
>>> # connection_context.sql_tracer.enable_sql_trace(True)
>>> model.fit(data=hdb_df, key='KEY', label='Y')
>>> # You can check whether the queries were effectively routed by querying:
>>> # select HOST, VOLUME_ID, APPLICATION_NAME, STATEMENT_STRING from M_SQL_PLAN_CACHE.

hana_ml.algorithms.apl.classification

Deprecated, use hana_ml.algorithms.apl.gradient_boosting_classification instead.

This module provides the SAP HANA APL binary classification algorithm.

The following classes are available:

class hana_ml.algorithms.apl.classification.AutoClassifier(conn_context=None, variable_auto_selection=True, polynomial_degree=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: RobustRegressionBase

Deprecated, use GradientBoostingBinaryClassifier instead.

Legacy SAP HANA APL Binary Classifier algorithm.

Parameters:
conn_context: ConnectionContext, optional

The connection object to an SAP HANA database. This parameter is not needed anymore. It will be set automatically when a dataset is used in fit() or predict().

variable_auto_selection: bool, optional

When set to True, variable auto-selection is activated. Variable auto-selection makes it possible to maintain the performance of a model while keeping the lowest number of variables.

polynomial_degree: int, optional

The polynomial degree of the model. Default is 1.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value type (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals to '???', it will be taken as missing.

extra_applyout_settings: dict, optional

Defines other outputs the model should generate in addition to the predicted values. For example: {'APL/ApplyReasonCode':'3;Mean;Below;False'} will add reason codes in the output when the model is applied. These reason codes provide explanation about the prediction. See OPERATION_CONFIG parameters in APPLY_MODEL function, SAP HANA APL Developer Guide.

other_params: dict, optional

Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • 'max_tasks'

  • 'segment_column_name'

  • 'correlations_lower_bound'

  • 'correlations_max_kept'

  • 'cutting_strategy'

  • 'exclude_low_predictive_confidence'

  • 'risk_fitting'

  • 'risk_fitting_min_cumulated_frequency'

  • 'risk_fitting_nb_pdo'

  • 'risk_fitting_use_weights'

  • 'risk_gdo'

  • 'risk_mode'

  • 'risk_pdo'

  • 'risk_score'

  • 'score_bins_count'

  • 'target_key'

  • 'variable_selection_best_iteration'

  • 'variable_selection_min_nb_of_final_variables'

  • 'variable_selection_max_nb_of_final_variables'

  • 'variable_selection_mode'

  • 'variable_selection_nb_variables_removed_by_step'

  • 'variable_selection_percentage_of_contribution_kept_by_step'

  • 'variable_selection_quality_bar'

  • 'variable_selection_quality_criteria'

See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.

For 'max_tasks', see FUNC_HEADER.

other_train_apl_aliases: dict, optional

Users can provide APL aliases as advanced settings to the model. Unlike 'other_params' described above, users are free to input any possible value; there is no control in Python.

Notes

It is highly recommended to use a dataset with a key provided in the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a key, because the model will not expect it.

By default, when it is not given, SAP HANA APL guesses the variable description by reading the first 100 rows. But, sometimes, it does not provide the correct result. By specifically providing values in these parameters, the user can overwrite the default guess. For example:

model.set_params(variable_storages={
    'ID': 'integer',
    'sepal length (cm)': 'number'
})
model.set_params(variable_value_types={
    'sepal length (cm)': 'continuous'
})
model.set_params(variable_missing_strings={
    'sepal length (cm)': '-1'
})
model.set_params(extra_applyout_settings={
    'APL/ApplyReasonCode': '3;Mean;Below;False'
})

Examples

>>> from hana_ml.algorithms.apl.classification import AutoClassifier
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext(HDB_HOST, HDB_PORT, HDB_USER, HDB_PASS)
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = AutoClassifier(variable_auto_selection=True)
>>> model.fit(hana_df, label='class', key='id')

Making the predictions

>>> apply_out = model.predict(hana_df)
>>> print(apply_out.head(3).collect())
    id  TRUE_LABEL  PREDICTED  PROBABILITY
0   30           0          0     0.688153
1   63           0          0     0.677693
2   66           0          0     0.700221

Adding individual contributions to the output of predictions

>>> model.set_params(extra_applyout_settings={'APL/ApplyContribution': 'all'})
>>> apply_out = model.predict(hana_df)
>>> print(apply_out.head(3).collect())
    id  TRUE_LABEL  PREDICTED  PROBABILITY  contrib_age_rr_class ...
0   30           0          0     0.688153              0.043387 ...
1   63           0          0     0.677693              0.042608 ...
2   66           0          0     0.700221              0.020784 ...

Adding reason codes to the output of predictions

>>> model.set_params(extra_applyout_settings={'APL/ApplyReasonCode': '3;Mean;Below;False'})
>>> apply_out = model.predict(hana_df)
>>> print(apply_out.head(3).collect())
   id  TRUE_LABEL  PREDICTED  PROBABILITY RCN_B_Mean_1_rr_class ...
0  30           0          0     0.688153         education-num ...
1  63           0          0     0.677693         education-num ...
2  66           0          0     0.700221         education-num ...

Debriefing

>>> model.get_performance_metrics()
OrderedDict([('L1', 0.2522171212463023), ('L2', 0.32254434028379236), ...
>>> model.get_feature_importances()
OrderedDict([('marital-status', 0.2172766583204266), ('capital-gain', 0.19521247617062215),...

Saving the model in the schema named 'MODEL_STORAGE'. Please see model_storage class for further features of model storage.

>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My classification model name'
>>> model_storage.save_model(model=model, if_exists='replace')

Exporting the SQL apply code

>>> sql = model.export_apply_code(code_type='HANA',
...                               key='id',
...                               schema_name='APL_SAMPLES',
...                               table_name='CENSUS')

Attributes:
model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the "SUMMARY" table generated by the model training. This table contains the content of the model training summary.

indicators_: APLArtifactTable

The reference to the "INDICATORS" table generated by the model training. This table contains various metrics related to model and model variables.

fit_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table generated by the model training

var_desc_: APLArtifactTable

The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table when a prediction was made

Methods

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

enable_hana_execution()

HANA execution will be enabled.

export_apply_code(code_type[, key, label, ...])

Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL.

fit(data[, key, features, label, weight])

Fits the model.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

get_artifacts_recorder()

Returns the recorder object (for design-time artifacts generation)

get_debrief_report(report_name)

Retrieves a standard statistical report.

get_feature_importances()

Returns the feature importances (MaximumSmartVariableContribution).

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_model_info()

Get information about an existing model.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns the performance metrics of the last trained model.

get_predict_operation_log()

Retrieves the operation log table after a model prediction.

get_summary()

Retrieves the summary table after model training.

is_fitted()

Checks if the model can be saved.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data)

Makes predictions with the fitted model.

save_artifact(artifact_df, schema_name, ...)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, ...])

Saves the model into a table.

schedule_fit(output_table_name_model, ...)

Creates a HANA scheduler job for the model fitting.

schedule_predict(output_table_name_applyout, ...)

Creates a HANA scheduler job for the model prediction.

score(data)

Returns the accuracy score on the provided test dataset.

set_params(**parameters)

Sets attributes of the current model.

set_scale_out([route_to, no_route_to, ...])

Specifies hints for scaling-out environment.

fit(data, key=None, features=None, label=None, weight=None)

Fits the model.

Parameters:
data: DataFrame

The training dataset

key: str, optional

The name of the ID column. This column will not be used as a feature in the model. It will be output as the row-id when a prediction is made with the model. If key is not provided, an internal key is created, but this is not the recommended usage. See notes below.

features: list of str, optional

The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.

label: str, optional

The name of the label column. Default is the last column.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.

Returns:
self: object

Notes

It is highly recommended to use a dataset with a key in the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a dataset with a key, because the model will not expect it.
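
A minimal fit sketch, assuming the CENSUS dataset from the class examples; the weight column 'row_weight' is illustrative:

>>> model = AutoClassifier(variable_auto_selection=True)
>>> model.fit(hana_df, key='id', label='class', weight='row_weight')  # illustrative weight column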

predict(data)

Makes predictions with the fitted model. It is possible to add special outputs, such as reason codes, by specifying the extra_applyout_settings parameter in the model. This parameter is explained above in the model class section.

Parameters:
data: hana_ml DataFrame

The dataset used for prediction

Returns:
Prediction output: hana_ml DataFrame

The dataframe contains the following columns:

  • KEY : the key column if it was provided in the dataset

  • TRUE_LABEL : the class label when it was given in the dataset

  • PREDICTED : the predicted label

  • PROBABILITY : the probability of the predicted label to be correct (confidence)

  • SCORING_VALUE : the unnormalized scoring value

score(data)

Returns the accuracy score on the provided test dataset.

Parameters:
data: hana_ml DataFrame

The test dataset used to compute the score. The labels must be provided in the dataset.

Returns:
Float or pandas DataFrame

If no segment column is given, the accuracy score.

If a segment column is given, a pandas DataFrame which contains the accuracy score for each segment.
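
Examples

A minimal sketch, assuming the fitted model and the labeled hana_df dataset from the class examples:

>>> accuracy = model.score(hana_df)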

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

enable_hana_execution()

HANA execution will be enabled.

export_apply_code(code_type, key=None, label=None, schema_name=None, table_name=None, other_params=None)

Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL. See Function Reference > Predictive Model Services > EXPORT_APPLY_CODE in the SAP HANA APL Developer Guide.

Parameters:
code_type: str

The type of code exported. The supported code types for each type of model are given in the developer guide (see APL/CodeType).

key: str, optional

The name of the primary key column. Required for some code types.

label: str, optional

The name of the label (target) column. Used only when the model supports multiple targets. When left empty (""), all targets are generated.

schema_name: str, optional

The schema name of the apply-in table. Required for some code types.

table_name: str, optional

The apply-in table name. Required for some code types.

other_params: dict, optional

The additional parameters to be included in the configuration. The available parameters are given in the developer guide.

Returns:
The exported code: str

Examples

Exporting a model in JSON format (available for Gradient Boosting and Robust Regression)

>>> json_export = model.export_apply_code('JSON')

APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.

Exporting SQL apply code (available for Robust Regression and Clustering)

>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS')

Exporting SQL apply code (probability generated in the output)

>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS',
...     other_params={'APL/ApplyExtraMode': 'Advanced Apply Settings',
...                   'APL/ApplyProba': 'true'})
get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

Returns:
A pandas Dataframe with detailed information about the current version.

Notes

An error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.

get_artifacts_recorder()

Returns the recorder object (for design-time artifacts generation)

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Developer Guide.

Parameters:
report_name: str
Returns:
Statistical report: hana_ml DataFrame
get_feature_importances()

Returns the feature importances (MaximumSmartVariableContribution).

Returns:
OrderedDict or pandas DataFrame

If no segment column is given, an OrderedDict: { feature_name : value }.

If a segment column is given, a pandas DataFrame which contains the feature importances for each segment.

get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns:
The reference to the OPERATION_LOG table: hana_ml DataFrame

This table provides detailed logs of the last model training

get_indicators()

Retrieves the Indicator table after model training.

Returns:
The reference to the INDICATORS table: hana_ml DataFrame

This table provides the performance metrics of the last model training

get_model_info()

Get information about an existing model. This method is especially useful when a trained model was saved and reloaded. After this method has been called, the model can provide the summary and metrics again, as they were after the last fit.

Returns:
list

List of HANA DataFrames respectively corresponding to the following tables:

  • Summary table

  • Variable roles table

  • Variable description table

  • Indicators table

  • Profit curves table

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.

Returns:
The attribute-values of the model: dictionary

get_performance_metrics()

Returns the performance metrics of the last trained model.

Returns:
OrderedDict or pandas DataFrame

If no segment column is given, an OrderedDict with metric name as key and metric value as value.

If a segment column is given, a pandas DataFrame which contains the performance metrics for each segment.

get_predict_operation_log()

Retrieves the operation log table after a model prediction.

Returns:
The reference to the OPERATION_LOG table: hana_ml DataFrame

This table provides detailed logs about the last prediction

get_summary()

Retrieves the summary table after model training.

Returns:
The reference to the SUMMARY table: hana_ml DataFrame

This contains execution summary of the last model training

is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.

Returns:
bool

True if the model is ready to be saved.

load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Parameters:
schema_name: str

The schema name

table_name: str

The table name

oid: str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters:
schema_name: str

The schema name

artifact_df: hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one would like to save data into the same table.

Examples

>>> myModel.save_artifact(artifact_df=myModel.indicators_,
...                       schema_name='MySchema',
...                       table_name='MyModel_Indicators',
...                       if_exists='replace')
save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Parameters:
schema_name: str

The schema name

table_name: str

Table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns:
None

The model is saved into a table with the following columns:

  • "OID" NVARCHAR(50), -- Serve as ID

  • "FORMAT" NVARCHAR(50), -- APL technical info

  • "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model

schedule_fit(output_table_name_model, output_table_name_log, output_table_name_summary, output_table_name_indicators, **schedule_kwargs)

Creates a HANA scheduler job for the model fitting. It is a wrapper function of the HANAScheduler.create_training_schedule() method. It lets users explicitly specify arguments such as the output table names.

Parameters:
output_table_name_model: str

The output table name for the model binary.

output_table_name_log: str

The output table name for the log data.

output_table_name_summary: str

The output table name for the model summary.

output_table_name_indicators: str

The output table name for the model indicators.

schedule_kwargs: kwargs dictionary

Arguments forwarded to the HANAScheduler.create_training_schedule() method. Please refer to the documentation of that method.

Examples

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.schedule_fit(
...     output_table_name_model='OUTPUT_MODEL_BINARY',
...     output_table_name_log='OUTPUT_FIT_LOG',
...     output_table_name_summary='OUTPUT_SUMMARY_LOG',
...     output_table_name_indicators='OUTPUT_FIT_INDICATORS',
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)

schedule_predict(output_table_name_applyout, output_table_name_log, output_table_name_summary, **schedule_kwargs)

Creates a HANA scheduler job for the model prediction. It is a wrapper function of the HANAScheduler.create_applying_schedule() method. It lets users explicitly specify arguments such as the output table names.

Parameters:
input_table_name_model: str

The input table name for the model binary.

output_table_name_applyout: str

The output table name for the prediction data.

output_table_name_log: str

The output table name for the log data.

output_table_name_summary: str

The output table name for the model summary.

schedule_kwargs: kwargs dictionary

Arguments forwarded to the HANAScheduler.create_applying_schedule() method. Please refer to the documentation of that method.

Examples

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.predict(data)
>>> model.schedule_predict(
...     output_table_name_applyout="OUTPUT_PREDICT_APPLYOUT",
...     output_table_name_log="OUTPUT_PREDICT_LOG",
...     output_table_name_summary="OUTPUT_PREDICT_SUMMARY",
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)

set_params(**parameters)

Sets attributes of the current model. This method is implemented for compatibility with Scikit Learn.

Parameters:
parameters: dictionary

The attribute names and values

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

Specifies hints for a scaling-out environment. The execution of APL functions can then be routed to a specific computation node. If all the parameters are None, i.e. not given, all the existing hints are cleared.

Parameters:
route_to: str, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_to: str or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_by: str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinality: str or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_cost: int, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level: {'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_class: str, optional

Routes the query via workload class. The route_to statement hint has higher precedence than the workload_class statement hint.

Defaults to None.

Examples

>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> model = GradientBoostingBinaryClassifier()
>>> # Routes the execution to a specific volume ID.
>>> model.set_scale_out(route_to=1025)
>>> # Routes the execution to a specific service type.
>>> # model.set_scale_out(route_to='computeserver')
>>> # Maps the execution to a specific workload class.
>>> # model.set_scale_out(workload_class="WC4")
>>> # Activates the sql trace
>>> # connection_context.sql_tracer.enable_sql_trace(True)
>>> model.fit(data=hdb_df, key='KEY', label='Y')
>>> # You can check whether the queries were effectively routed by querying:
>>> # select HOST, VOLUME_ID, APPLICATION_NAME, STATEMENT_STRING from M_SQL_PLAN_CACHE.

hana_ml.algorithms.apl.regression

Deprecated, use hana_ml.algorithms.apl.gradient_boosting_regression instead.

This module contains SAP HANA APL regression algorithm.

The following classes are available:

class hana_ml.algorithms.apl.regression.AutoRegressor(conn_context=None, variable_auto_selection=True, polynomial_degree=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: RobustRegressionBase

Deprecated, use GradientBoostingRegressor instead.

Legacy SAP HANA APL regression algorithm.

Parameters:
conn_context: ConnectionContext, optional

The connection object to an SAP HANA database. This parameter is not needed anymore. It will be set automatically when a dataset is used in fit() or predict().

variable_auto_selection: bool, optional

When set to True, variable auto-selection is activated. Variable auto-selection makes it possible to maintain the performance of a model while keeping the lowest number of variables.

polynomial_degree: int, optional

The polynomial degree of the model. Default is 1.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value type (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals to '???', it will be taken as missing.

extra_applyout_settings: dict, optional

Defines other outputs the model should generate in addition to the predicted values. For example: {'APL/ApplyReasonCode':'3;Mean;Below;False'} will add reason codes in the output when the model is applied. These reason codes provide explanation about the prediction. See OPERATION_CONFIG parameters in APPLY_MODEL function, SAP HANA APL Developer Guide.

other_params: dict, optional

Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • 'max_tasks'

  • 'segment_column_name'

  • 'correlations_lower_bound'

  • 'correlations_max_kept'

  • 'cutting_strategy'

  • 'exclude_low_predictive_confidence'

  • 'risk_fitting'

  • 'risk_fitting_min_cumulated_frequency'

  • 'risk_fitting_nb_pdo'

  • 'risk_fitting_use_weights'

  • 'risk_gdo'

  • 'risk_mode'

  • 'risk_pdo'

  • 'risk_score'

  • 'score_bins_count'

  • 'variable_auto_selection'

  • 'variable_selection_best_iteration'

  • 'variable_selection_min_nb_of_final_variables'

  • 'variable_selection_max_nb_of_final_variables'

  • 'variable_selection_mode'

  • 'variable_selection_nb_variables_removed_by_step'

  • 'variable_selection_percentage_of_contribution_kept_by_step'

  • 'variable_selection_quality_bar'

  • 'variable_selection_quality_criteria'

See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.

For 'max_tasks', see FUNC_HEADER.

other_train_apl_aliases: dict, optional

Users can provide APL aliases as advanced settings to the model. Unlike 'other_params' described above, users are free to input any possible value; there is no control in Python.

Notes

It is highly recommended to use a dataset with a key provided in the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a key, because the model will not expect it.

Examples

>>> from hana_ml.algorithms.apl.regression import AutoRegressor
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA Database

>>> CONN = ConnectionContext(HDB_HOST, HDB_PORT, HDB_USER, HDB_PASS)
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = AutoRegressor(variable_auto_selection=True)
>>> model.fit(hana_df, label='age', key='id', features=['workclass',
...                                                    'fnlwgt',
...                                                    'education',
...                                                    'education-num',
...                                                    'marital-status'])

Making a prediction

>>> applyout_df = model.predict(hana_df)
>>> print(applyout_df.head(5).collect())
          id  TRUE_LABEL  PREDICTED
0         30          49         42
1         63          48         42
2         66          36         42
3        110          42         42
4        335          53         42

Debriefing

>>> model.get_performance_metrics()
OrderedDict([('L1', 8.59885654599923), ('L2', 11.012352163260505)...
>>> model.get_feature_importances()
OrderedDict([('marital-status', 0.7916100739306074), ('education-num', 0.13524836400650087)

Saving the model in the schema named 'MODEL_STORAGE'. Please see the model_storage class for further features of model storage.

>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My regression model name'
>>> model_storage.save_model(model=model, if_exists='replace')

Reloading the model and making another prediction

>>> model2 = AutoRegressor(conn_context=CONN)
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.head(5).collect()
          id  TRUE_LABEL  PREDICTED
0         30          49         42
1         63          48         42
2         66          36         42
3        110          42         42
4        335          53         42

Exporting the SQL apply code

>>> sql = model.export_apply_code(code_type='HANA',
...                               key='id',
...                               schema_name='APL_SAMPLES',
...                               table_name='CENSUS')

Methods

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

enable_hana_execution()

HANA execution will be enabled.

export_apply_code(code_type[, key, label, ...])

Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL.

fit(data[, key, features, label, weight])

Fits the model.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

get_artifacts_recorder()

Returns the recorder object (for design-time artifacts generation)

get_debrief_report(report_name)

Retrieves a standard statistical report.

get_feature_importances()

Returns the feature importances (MaximumSmartVariableContribution).

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_model_info()

Get information about an existing model.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns the performance metrics of the last trained model.

get_predict_operation_log()

Retrieves the operation log table after a model prediction.

get_summary()

Retrieves the summary table after model training.

is_fitted()

Checks if the model can be saved.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data)

Makes predictions with a fitted model.

save_artifact(artifact_df, schema_name, ...)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, ...])

Saves the model into a table.

schedule_fit(output_table_name_model, ...)

Creates a HANA scheduler job for the model fitting.

schedule_predict(output_table_name_applyout, ...)

Creates a HANA scheduler job for the model prediction.

score(data)

Returns the R2 score (coefficient of determination) on the provided test dataset.

set_params(**parameters)

Sets attributes of the current model.

set_scale_out([route_to, no_route_to, ...])

Specifies hints for scaling-out environment.

fit(data, key=None, features=None, label=None, weight=None)

Fits the model.

Parameters:
data: DataFrame

The training dataset

key: str, optional

The name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features: list of str, optional

The names of the feature columns. If features is not provided, all non-ID and non-label columns are used by default.

label: str, optional

The name of the label column. Default is the last column.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.

Returns:
self: object

Notes

It is highly recommended to use a dataset with a key provided in the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a dataset with a key, because the model will not expect it.

predict(data)

Makes predictions with a fitted model. It is possible to add special outputs, such as reason codes, by specifying the extra_applyout_settings parameter in the model. This parameter is explained above in the model class section.

Parameters:
data: hana_ml DataFrame

The dataset used for prediction

Returns:
Prediction output: a hana_ml DataFrame.

The dataframe contains the following columns:

  • KEY : the key column if it was provided in the dataset

  • TRUE_LABEL : the true value if it was provided in the dataset

  • PREDICTED : the predicted value

score(data)

Returns the R2 score (coefficient of determination) on the provided test dataset.

Parameters:
data: hana_ml DataFrame

The test dataset used to compute the score. The labels must be provided in the dataset.

Returns:
Float or pandas DataFrame

If no segment column is given, the R2 score.

If a segment column is given, a pandas DataFrame which contains the R2 score for each segment.
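
Examples

A minimal sketch, assuming the fitted model and the labeled hana_df dataset from the class examples:

>>> r2 = model.score(hana_df)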

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

enable_hana_execution()

HANA execution will be enabled.

export_apply_code(code_type, key=None, label=None, schema_name=None, table_name=None, other_params=None)

Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL. See Function Reference > Predictive Model Services > EXPORT_APPLY_CODE in the SAP HANA APL Developer Guide.

Parameters:
code_type: str

The type of code exported. The supported code types for each type of model are given in the developer guide (see APL/CodeType).

key: str, optional

The name of the primary key column. Required for some code types.

label: str, optional

The name of the label (target) column. Used only when the model supports multiple targets. When left to "", this means that all targets must be generated.

schema_name: str, optional

The schema name of the apply-in table. Required for some code types.

table_name: str, optional

The apply-in table name. Required for some code types.

other_params: dict, optional

The additional parameters to be included in the configuration. The available parameters are given in the developer guide.

Returns:
The exported code: str

Examples

Exporting a model in JSON format (available for Gradient Boosting and Robust Regression)

>>> json_export = model.export_apply_code('JSON')

APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.

Exporting SQL apply code (available for Robust Regression and Clustering)

>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS')

Exporting SQL apply code (probability generated in the output)

>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS',
...     other_params={'APL/ApplyExtraMode': 'Advanced Apply Settings',
...                   'APL/ApplyProba': 'true'})
get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

Returns:
A pandas Dataframe with detailed information about the current version.

Notes

An error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.

get_artifacts_recorder()

Returns the recorder object (for design-time artifacts generation)

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Developer Guide.

Parameters:
report_name: str
Returns:
Statistical report: hana_ml DataFrame
get_feature_importances()

Returns the feature importances (MaximumSmartVariableContribution).

Returns:
OrderedDict or pandas DataFrame

If no segment column is given, an OrderedDict: { feature_name : value }.

If a segment column is given, a pandas DataFrame which contains the feature importances for each segment.

get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns:
The reference to the OPERATION_LOG table: hana_ml DataFrame

This table provides detailed logs of the last model training

get_indicators()

Retrieves the Indicator table after model training.

Returns:
The reference to the INDICATORS table: hana_ml DataFrame

This table provides the performance metrics of the last model training

get_model_info()

Gets information about an existing model. This method is especially useful when a trained model has been saved and reloaded. Once it has been called, the model can provide its summary and metrics again, as they were after the last fit. A usage sketch follows the list of returned tables below.

Returns:
list

List of HANA DataFrames respectively corresponding to the following tables:

  • Summary table

  • Variable roles table

  • Variable description table

  • Indicators table

  • Profit curves table
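
A minimal sketch of consuming the returned tables; the unpacking order follows the list above and collect() is used only for display:

>>> summary, var_roles, var_desc, indicators, profit_curves = model.get_model_info()
>>> summary.collect()     # execution summary of the last fit
>>> indicators.collect()  # model and variable metrics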

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.

Returns:
The attribute-values of the model: dictionary
get_performance_metrics()

Returns the performance metrics of the last trained model.

Returns:
OrderedDict or pandas DataFrame

If no segment column is given, an OrderedDict with metric name as key and metric value as value.

If a segment column is given, a pandas DataFrame which contains the performance metrics for each segment.
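
For example, assuming no segment column was used (the available metric names depend on the model type):

>>> metrics = model.get_performance_metrics()
>>> for name, value in metrics.items():
...     print(name, value)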

get_predict_operation_log()

Retrieves the operation log table after a prediction.

Returns:
The reference to the OPERATION_LOG table: hana_ml DataFrame

This table provides detailed logs about the last prediction

get_summary()

Retrieves the summary table after model training.

Returns:
The reference to the SUMMARY table: hana_ml DataFrame

This contains the execution summary of the last model training

is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.

Returns:
bool

True if the model is ready to be saved.
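
A typical guard before persisting, assuming a ModelStorage instance named model_storage (as created in the examples further down this page):

>>> if model.is_fitted():
...     model_storage.save_model(model=model)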

load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Parameters:
schema_name: str

The schema name

table_name: str

The table name

oid: str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.
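
As the warning above suggests, hana_ml.model_storage is the recommended path; a minimal sketch (the storage schema and model name are assumptions):

>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model = model_storage.load_model(name='My model name')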

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters:
schema_name: str

The schema name

artifact_df: hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(artifact_df=myModel.indicators_,
...                       schema_name='MySchema',
...                       table_name='MyModel_Indicators',
...                       if_exists='replace')
save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Parameters:
schema_name: str

The schema name

table_name: str

Table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns:
None

The model is saved into a table with the following columns:

  • "OID" NVARCHAR(50), -- Serve as ID

  • "FORMAT" NVARCHAR(50), -- APL technical info

  • "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model

schedule_fit(output_table_name_model, output_table_name_log, output_table_name_summary, output_table_name_indicators, **schedule_kwargs)

Creates a HANA scheduler job for the model fitting. It is a wrapper around the HANAScheduler.create_training_schedule() method that lets users specify arguments, such as the output table names, explicitly.

Parameters:
output_table_name_model: str

The output table name for the model binary.

output_table_name_log: str

The output table name for the log data.

output_table_name_summary: str

The output table name for the model summary.

output_table_name_indicators: str

The output table name for the model indicators.

schedule_kwargs: kwargs dictionary

Arguments forwarded to the HANAScheduler.create_training_schedule method. Please refer to the documentation of that method.

Examples

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.schedule_fit(
...     output_table_name_model='OUTPUT_MODEL_BINARY',
...     output_table_name_log='OUTPUT_FIT_LOG',
...     output_table_name_summary='OUTPUT_SUMMARY_LOG',
...     output_table_name_indicators='OUTPUT_FIT_INDICATORS',
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)

schedule_predict(output_table_name_applyout, output_table_name_log, output_table_name_summary, **schedule_kwargs)

Creates a HANA scheduler job for the model prediction. It is a wrapper around the HANAScheduler.create_applying_schedule() method that lets users specify arguments, such as the output table names, explicitly.

Parameters:
input_table_name_model: str

The input table name for the model binary.

output_table_name_applyout: str

The output table name for the prediction data.

output_table_name_log: str

The output table name for the log data.

output_table_name_summary: str

The output table name for the model summary.

schedule_kwargs: kwargs dictionary

Arguments forwarded to the HANAScheduler.create_applying_schedule method. Please refer to the documentation of that method.

Examples

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.predict(data)
>>> model.schedule_predict(
...     output_table_name_applyout="OUTPUT_PREDICT_APPLYOUT",
...     output_table_name_log="OUTPUT_PREDICT_LOG",
...     output_table_name_summary="OUTPUT_PREDICT_SUMMARY",
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)

set_params(**parameters)

Sets attributes of the current model. This method is implemented for compatibility with Scikit Learn.

Parameters:
params: dictionary

The attribute names and values
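
For example, a sketch using gradient boosting hyperparameters from this module:

>>> model.set_params(max_iterations=500, learning_rate=0.05)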

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

Specifies hints for scaling-out environment. The execution of APL functions can then be routed to a specific computation node. If all the parameters are None, i.e. not given, all the existing hints are cleared.

Parameters:
route_to: str, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_to: str or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_by: str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinality: str or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_cost: int, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level: {'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal' or to use the default route_optimization_level. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_class: str, optional

Routes the query via workload class. route_to statement hint has higher precedence than workload_class statement hint.

Defaults to None.

Examples

>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> model = GradientBoostingBinaryClassifier()
>>> # Routes the execution to a specific volume ID.
>>> model.set_scale_out(route_to=1025)
>>> # Routes the execution to a specific service type.
>>> # model.set_scale_out(route_to='computeserver')
>>> # Maps the execution to a specific workload class.
>>> # model.set_scale_out(workload_class="WC4")
>>> # Activates the sql trace
>>> # connection_context.sql_tracer.enable_sql_trace(True)
>>> model.fit(data=hdb_df, key='KEY', label='Y')
>>> # You can check whether the queries were effectively routed by querying:
>>> # select HOST, VOLUME_ID, APPLICATION_NAME, STATEMENT_STRING from M_SQL_PLAN_CACHE.

hana_ml.algorithms.apl.clustering

This module provides the SAP HANA APL clustering algorithms.

The following classes are available:

class hana_ml.algorithms.apl.clustering.AutoUnsupervisedClustering(conn_context=None, nb_clusters=None, nb_clusters_min=None, nb_clusters_max=None, distance=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: _AutoClusteringBase

SAP HANA APL unsupervised clustering algorithm.

Parameters:
nb_clusters: int, optional, default = 10

The number of clusters to create

nb_clusters_min: int, optional

The minimum number of clusters to create. If the nb_clusters parameter is set, it will be ignored.

nb_clusters_max: int, optional

The maximum number of clusters to create. If the nb_clusters parameter is set, it will be ignored.

distance: str, optional, default = 'SystemDetermined'

The metric used to measure the distance between data points. The possible values are: 'L1', 'L2', 'LInf', 'SystemDetermined'.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value types (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals '???', it will be taken as missing.

extra_applyout_settings: dict, optional

Defines the output to generate when applying the model. See documentation on predict() method for more information.

other_params: dict, optional

Corresponds to the advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • 'max_tasks'

  • 'segment_column_name'

  • 'calculate_cross_statistics'

  • 'calculate_sql_expressions'

  • 'cutting_strategy'

  • 'encoding_strategy'

See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.

For 'max_tasks', see FUNC_HEADER.

other_train_apl_aliases: dict, optional

Users can provide APL aliases as advanced settings to the model. Unlike 'other_params' described above, users are free to input any possible value in 'other_train_apl_aliases'. There is no control in Python.

Notes

  • The algorithm may detect fewer clusters than requested. This happens when a cluster detected on the estimation dataset is not found on the validation dataset. In that case, the cluster is considered unstable and is removed from the model. Users can get the number of clusters actually found in the "INDICATORS" table. For example,

    # The actual number of clusters found
    d = model_u.get_indicators().collect()
    d[d.KEY=='FinalClusterCount'][['KEY','VALUE']]
    
  • It is highly recommended to provide a key in the dataset passed to the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a key, because the model will not expect it.

  • By default, when the variable description is not given, SAP HANA APL guesses it by reading the first 100 rows. Sometimes this does not give the correct result. By explicitly providing values for these parameters, the user can override the default guess. For example:

    model.set_params(variable_storages={
        'ID': 'integer',
        'sepal length (cm)': 'number'
    })
    model.set_params(variable_value_types={
        'sepal length (cm)': 'continuous'
    })
    model.set_params(variable_missing_strings={
        'sepal length (cm)': '-1'
    })
    

Examples

>>> from hana_ml.algorithms.apl.clustering import AutoUnsupervisedClustering
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates Hana DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = AutoUnsupervisedClustering(CONN, nb_clusters=5)
>>> model.fit(data=hana_df, key='id')

Debriefing

>>> model.get_metrics()
OrderedDict([('SimplifiedSilhouette', 0.3448029020802121), ('RSS', 4675.706587754118),...
>>> model.get_metrics_by_cluster()
{'Frequency': {1: 0.23053242076908276,
      2: 0.27434649954646656,
      3: 0.09628652318517908,
      4: 0.29919463456199663,
      5: 0.09963992193727494},
     'IntraInertia': {1: 0.6734978174937322,
      2: 0.7202839995396123,
      3: 0.5516800856975772,
      4: 0.6969632183111357,
      5: 0.5809322138167139},
     'RSS': {1: 5648.626195319932,
      2: 7189.15459940487,
      3: 1932.5353401986129,
      4: 7586.444631316713,
      5: 2105.879275085588},
     'SimplifiedSilhouette': {1: 0.1383827622819234,
      2: 0.14716862328457128,
      3: 0.18753797605134545,
      4: 0.13679980173383793,
      5: 0.15481377834381388},
     'KL': {1: OrderedDict([('relationship', 0.4951910610641741),
                   ('marital-status', 0.2776259711735807),
                   ('hours-per-week', 0.20990189265572687),
                   ('education-num', 0.1996353893520096),
                   ('education', 0.19963538935200956),
                   ...

Predicting which cluster a data point belongs to

>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0   30                  3                        0.640378
1   63                  4                        0.611050
2   66                  3                        0.640378
3  110                  4                        0.611050
4  335                  1                        0.851054

Determining the 2 closest clusters

>>> model.set_params(extra_applyout_settings={'mode':'closest_distances', 'nb_distances': 2})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  ...  CLOSEST_CLUSTER_2  DISTANCE_TO_CLOSEST_CENTROID_2
0   30                  3  ...                  4                        0.730330
1   63                  4  ...                  1                        0.851054
2   66                  3  ...                  4                        0.730330
3  110                  4  ...                  1                        0.851054
4  335                  1  ...                  4                        0.906003

Retrieving the distances to all clusters

>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  DISTANCE_TO_CENTROID_1               ... DISTANCE_TO_CENTROID_5
0   30                0.994595               ...      1.160697
1   63                0.994595               ...      1.160697
2   66                0.994595               ...      1.160697

Saving the model in the schema named 'MODEL_STORAGE'. Please see the model_storage class for further features of model storage.

>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model)

Reloading the model for further use

>>> model2 = AutoUnsupervisedClustering(conn_context=CONN)
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.head(3).collect()
    id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0   30                  3                        0.640378
1   63                  4                        0.611050
2   66                  3                        0.640378

Exporting the SQL apply code

>>> model = AutoUnsupervisedClustering(CONN, nb_clusters=5,
...                                    calculate_sql_expressions='enabled')
>>> model.fit(data=hana_df, key='id')
>>> sql = model.export_apply_code(code_type='HANA',
...                               key='id',
...                               schema_name='APL_SAMPLES',
...                               table_name='CENSUS')
Attributes:
model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the "SUMMARY" table generated by the model training. This table contains the summary about the model training.

indicators_: APLArtifactTable

The reference to the "INDICATORS" table generated by the model training. This table contains the various metrics related to the model and its variables.

fit_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table generated by the model training

var_desc_: APLArtifactTable

The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table when a prediction was made

Methods

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

enable_hana_execution()

HANA execution will be enabled.

export_apply_code(code_type[, key, label, ...])

Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL.

fit(data[, key, features, weight])

Fits the model.

fit_predict(data[, key, features, weight])

Fits a clustering model and uses it to generate prediction output on the training dataset.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

get_artifacts_recorder()

Returns the recorder object (for design-time artifacts generation).

get_debrief_report(report_name)

Retrieves a standard statistical report.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_metrics()

Returns metrics about the model.

get_model_info()

Get information about an existing model.

get_params()

Retrieves attributes of the current object.

get_predict_operation_log()

Retrieves the operation log table after a prediction.

get_summary()

Retrieves the summary table after model training.

is_fitted()

Checks if the model can be saved.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data)

Predicts which cluster each specified row belongs to.

save_artifact(artifact_df, schema_name, ...)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, ...])

Saves the model into a table.

schedule_fit(output_table_name_model, ...)

Creates a HANA scheduler job for the model fitting.

schedule_predict(output_table_name_applyout, ...)

Creates a HANA scheduler job for the model prediction.

set_params(**parameters)

Sets attributes of the current model.

set_scale_out([route_to, no_route_to, ...])

Specifies hints for scaling-out environment.

fit(data, key=None, features=None, weight=None)

Fits the model.

Parameters:
data: hana_ml DataFrame

The training dataset

key: str, optional

The name of the ID column. This column will not be used as a feature in the model. It will be output as the row ID when a prediction is made with the model. If key is not provided, an internal key is created, but this is not recommended.

features: list of str, optional

The names of the features to be used in the model. If features is not provided, all columns will be taken except the ID column.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.

Returns:
self: object
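
For example, a fit with an explicit feature list; the column names below are illustrative for the CENSUS dataset used in this page's examples:

>>> model = AutoUnsupervisedClustering(nb_clusters=5)
>>> model.fit(data=hana_df, key='id', features=['education', 'hours-per-week', 'occupation'])
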
fit_predict(data, key=None, features=None, weight=None)

Fits a clustering model and uses it to generate prediction output on the training dataset.

Parameters:
data: hana_ml DataFrame

The input dataset

key: str, optional

The name of the ID column.

features: list of str, optional

The names of the feature columns. If features is not provided, all non-ID columns will be taken.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.

Returns:
hana_ml DataFrame

The output is the same as the predict() method.

Notes

Please see the predict() method for how to get different outputs with the 'extra_applyout_settings' parameter.
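
A minimal sketch, reusing the hana_df DataFrame from the class-level examples above:

>>> out = model.fit_predict(data=hana_df, key='id')
>>> out.head(5).collect()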

get_metrics()

Returns metrics about the model.

Returns:
Dictionary or pandas DataFrame

If no segment column is given, a dictionary object containing a set of clustering metrics and their values.

If a segment column is given, a pandas DataFrame which contains the metrics for each segment.

Examples

>>> model.get_metrics()
{'SimplifiedSilhouette': 0.14668968897882997,
 'RSS': 24462.640041325714,
 'IntraInertia': 3.2233573348587714,
 'KL': {1: OrderedDict([('hours-per-week', 0.2971627592049324),
             ('occupation', 0.11944355994892383),
             ('relationship', 0.06772624975990414),
             ('education-num', 0.06377345492340795),
             ('education', 0.06377345492340793),
             ...}
disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

enable_hana_execution()

HANA execution will be enabled.

export_apply_code(code_type, key=None, label=None, schema_name=None, table_name=None, other_params=None)

Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL. See Function Reference > Predictive Model Services > EXPORT_APPLY_CODE in the SAP HANA APL Developer Guide.

Parameters:
code_type: str

The type of code exported. The supported code types for each type of model are given in the developer guide (see APL/CodeType).

key: str, optional

The name of the primary key column. Required for some code types.

label: str, optional

The name of the label (target) column. Used only when the model supports multiple targets. When left empty (""), all targets are generated.

schema_name: str, optional

The schema name of the apply-in table. Required for some code types.

table_name: str, optional

The apply-in table name. Required for some code types.

other_params: dict, optional

The additional parameters to be included in the configuration. The available parameters are given in the developer guide.

Returns:
The exported code: str

Examples

Exporting a model in JSON format (available for Gradient Boosting and Robust Regression)

>>> json_export = model.export_apply_code('JSON')

APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.

Exporting SQL apply code (available for Robust Regression and Clustering)

>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS')

Exporting SQL apply code (probability generated in the output)

>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS',
...     other_params={'APL/ApplyExtraMode': 'Advanced Apply Settings',
...                   'APL/ApplyProba': 'true'})
get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

Returns:
A pandas DataFrame with detailed information about the current version.

Notes

An error is raised when the call fails. The cause can be either that SAP HANA APL is not installed or that the current user does not have the appropriate rights.

get_artifacts_recorder()

Returns the recorder object (for design-time artifacts generation).

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Developer Guide.

Parameters:
report_name: str
Returns:
Statistical report: hana_ml DataFrame
get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns:
The reference to the OPERATION_LOG table: hana_ml DataFrame

This table provides detailed logs of the last model training

get_indicators()

Retrieves the Indicator table after model training.

Returns:
The reference to the INDICATORS table: hana_ml DataFrame

This table provides the performance metrics of the last model training

get_model_info()

Gets information about an existing model. This method is especially useful when a trained model has been saved and reloaded. Once it has been called, the model can provide its summary and metrics again, as they were after the last fit.

Returns:
list

List of HANA DataFrames respectively corresponding to the following tables:

  • Summary table

  • Variable roles table

  • Variable description table

  • Indicators table

  • Profit curves table

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.

Returns:
The attribute-values of the model: dictionary
get_predict_operation_log()

Retrieves the operation log table after a prediction.

Returns:
The reference to the OPERATION_LOG table: hana_ml DataFrame

This table provides detailed logs about the last prediction

get_summary()

Retrieves the summary table after model training.

Returns:
The reference to the SUMMARY table: hana_ml DataFrame

This contains the execution summary of the last model training

is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.

Returns:
bool

True if the model is ready to be saved.

load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Parameters:
schema_name: str

The schema name

table_name: str

The table name

oid: str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.

predict(data)

Predicts which cluster each specified row belongs to.

Parameters:
datahana_ml DataFrame

The set of rows for which to generate cluster predictions. This dataset must have the same structure as the one used in the fit() method.

Returns:
hana_ml DataFrame

By default, the ID of the closest cluster and the distance to its center are provided. Users can request different outputs by setting the extra_applyout_settings parameter in the model. The extra_applyout_settings parameter is a dictionary with 'mode' and 'nb_distances' as keys. If mode is set to 'closest_distances', cluster IDs and distances to centroids will be provided from the closest to the furthest cluster. The output columns will be:

  • <The key column name>,

  • CLOSEST_CLUSTER_1,

  • DISTANCE_TO_CLOSEST_CENTROID_1,

  • CLOSEST_CLUSTER_2,

  • DISTANCE_TO_CLOSEST_CENTROID_2,

  • ...

If mode is set to 'all_distances', the distances to the centroids of all clusters will be provided in cluster ID order. The output columns will be:

  • ID,

  • DISTANCE_TO_CENTROID_1,

  • DISTANCE_TO_CENTROID_2,

  • ...

nb_distances limits the output to the closest clusters. It is only valid when mode is 'closest_distances' (it will be ignored if mode is 'all_distances'). It can be set to 'all' or a positive integer.

Examples

Retrieves the IDs of the 3 closest clusters and the distances to their centroids:

>>> extra_applyout_settings = {'mode': 'closest_distances', 'nb_distances': 3}
>>> model.set_params(extra_applyout_settings=extra_applyout_settings)
>>> out = model.predict(hana_df)
>>> out.head(3).collect()
    id  CLOSEST_CLUSTER_1  ...  CLOSEST_CLUSTER_3  DISTANCE_TO_CLOSEST_CENTROID_3
0   30                  3  ...                  4                        0.730330
1   63                  4  ...                  1                        0.851054
2   66                  3  ...                  4                        0.730330

Retrieves the distances to all clusters:

>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'})
>>> out = model.predict(hana_df)
>>> out.head(3).collect()
   id  DISTANCE_TO_CENTROID_1  DISTANCE_TO_CENTROID_2  ... DISTANCE_TO_CENTROID_5
0  30                0.994595                0.877414  ...              0.782949
1  63                0.994595                0.985202  ...              0.782949
2  66                0.994595                0.877414  ...              0.782949
save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters:
schema_name: str

The schema name

artifact_df: hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(artifact_df=myModel.indicators_,
...                       schema_name='MySchema',
...                       table_name='MyModel_Indicators',
...                       if_exists='replace')
save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Parameters:
schema_name: str

The schema name

table_name: str

Table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns:
None

The model is saved into a table with the following columns:

  • "OID" NVARCHAR(50), -- Serve as ID

  • "FORMAT" NVARCHAR(50), -- APL technical info

  • "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model

schedule_fit(output_table_name_model, output_table_name_log, output_table_name_summary, output_table_name_indicators, **schedule_kwargs)

Creates a HANA scheduler job for the model fitting. It is a wrapper around the HANAScheduler.create_training_schedule() method that lets users specify arguments, such as the output table names, explicitly.

Parameters:
output_table_name_model: str

The output table name for the model binary.

output_table_name_log: str

The output table name for the log data.

output_table_name_summary: str

The output table name for the model summary.

output_table_name_indicators: str

The output table name for the model indicators.

schedule_kwargs: kwargs dictionary

Arguments forwarded to the HANAScheduler.create_training_schedule method. Please refer to the documentation of that method.

Examples

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.schedule_fit(
...     output_table_name_model='OUTPUT_MODEL_BINARY',
...     output_table_name_log='OUTPUT_FIT_LOG',
...     output_table_name_summary='OUTPUT_SUMMARY_LOG',
...     output_table_name_indicators='OUTPUT_FIT_INDICATORS',
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)

schedule_predict(output_table_name_applyout, output_table_name_log, output_table_name_summary, **schedule_kwargs)

Creates a HANA scheduler job for the model prediction. It is a wrapper around the HANAScheduler.create_applying_schedule() method that lets users specify arguments, such as the output table names, explicitly.

Parameters:
input_table_name_model: str

The input table name for the model binary.

output_table_name_applyout: str

The output table name for the prediction data.

output_table_name_log: str

The output table name for the log data.

output_table_name_summary: str

The output table name for the model summary.

schedule_kwargs: kwargs dictionary

Arguments forwarded to the HANAScheduler.create_applying_schedule method. Please refer to the documentation of that method.

Examples

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.predict(data)
>>> model.schedule_predict(
...     output_table_name_applyout="OUTPUT_PREDICT_APPLYOUT",
...     output_table_name_log="OUTPUT_PREDICT_LOG",
...     output_table_name_summary="OUTPUT_PREDICT_SUMMARY",
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)

set_params(**parameters)

Sets attributes of the current model.

Parameters:
params: dictionary

The set of parameters with their new values

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

Specifies hints for scaling-out environment. The execution of APL functions can then be routed to a specific computation node. If all the parameters are None, i.e. not given, all the existing hints are cleared.

Parameters:
route_to: str, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_to: str or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_by: str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinality: str or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_cost: int, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level: {'minimal', 'all'}, optional

Guides the optimizer to compile with route_optimization_level 'minimal' or to use the default route_optimization_level. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_class: str, optional

Routes the query via workload class. route_to statement hint has higher precedence than workload_class statement hint.

Defaults to None.

Examples

>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> model = GradientBoostingBinaryClassifier()
>>> # Routes the execution to a specific volume ID.
>>> model.set_scale_out(route_to=1025)
>>> # Routes the execution to a specific service type.
>>> # model.set_scale_out(route_to='computeserver')
>>> # Maps the execution to a specific workload class.
>>> # model.set_scale_out(workload_class="WC4")
>>> # Activates the sql trace
>>> # connection_context.sql_tracer.enable_sql_trace(True)
>>> model.fit(data=hdb_df, key='KEY', label='Y')
>>> # You can check whether the queries were effectively routed by querying:
>>> # select HOST, VOLUME_ID, APPLICATION_NAME, STATEMENT_STRING from M_SQL_PLAN_CACHE.
class hana_ml.algorithms.apl.clustering.AutoSupervisedClustering(conn_context=None, label=None, nb_clusters=None, nb_clusters_min=None, nb_clusters_max=None, distance=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: _AutoClusteringBase

SAP HANA APL Supervised Clustering algorithm. Clusters are determined with respect to a label variable.

Parameters:
label: str

The name of the label column

nb_clusters: int, optional, default = 10

The number of clusters to create

nb_clusters_min: int, optional

The minimum number of clusters to create. If the nb_clusters parameter is set, it will be ignored.

nb_clusters_max: int, optional

The maximum number of clusters to create. If the nb_clusters parameter is set, it will be ignored.

distance: str, optional, default = 'SystemDetermined'

The metric used to measure the distance between data points. The possible values are: 'L1', 'L2', 'LInf', 'SystemDetermined'.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value types (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals '???', it will be taken as missing.

extra_applyout_settings: dict, optional

Defines the output to generate when applying the model. See documentation on predict() method for more information.

other_params: dict, optional

Corresponds to the advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • 'max_tasks'

  • 'segment_column_name'

  • 'calculate_cross_statistics'

  • 'calculate_sql_expressions'

  • 'cutting_strategy'

  • 'encoding_strategy'

See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.

For 'max_tasks', see FUNC_HEADER.

other_train_apl_aliases: dict, optional

Users can provide APL aliases as advanced settings to the model. Unlike 'other_params' described above, users are free to input any possible value. There is no control in Python.

Notes

  • The algorithm may detect fewer clusters than requested. This happens when a cluster detected on the estimation dataset is not found on the validation dataset. In that case, the cluster is considered unstable and is removed from the model. Users can get the number of clusters actually found in the "INDICATORS" table. For example,

    # The actual number of clusters found
    d = model_u.get_indicators().collect()
    d[d.KEY=='FinalClusterCount'][['KEY','VALUE']]
    
  • It is highly recommended to provide a key in the dataset passed to the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a key, because the model will not expect it.

  • By default, when the variable description is not given, SAP HANA APL guesses it by reading the first 100 rows. Sometimes this does not give the correct result. By explicitly providing values for these parameters, the user can override the default guess. For example:

    model.set_params(variable_storages={
        'ID': 'integer',
        'sepal length (cm)': 'number'
    })
    model.set_params(variable_value_types={
        'sepal length (cm)': 'continuous'
    })
    model.set_params(variable_missing_strings={
        'sepal length (cm)': '-1'
    })
    

Examples

>>> from hana_ml.algorithms.apl.clustering import AutoSupervisedClustering
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates Hana DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = AutoSupervisedClustering(nb_clusters=5)
>>> model.fit(data=hana_df, key='id', label='class')

Debriefing

>>> model.get_metrics()
OrderedDict([('SimplifiedSilhouette', 0.3448029020802121), ('RSS', 4675.706587754118),...

Predicting which cluster a data point belongs to

>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0   30                  3                        0.640378
1   63                  4                        0.611050
2   66                  3                        0.640378
3  110                  4                        0.611050
4  335                  1                        0.851054

Determining the 2 closest clusters

>>> model.set_params(extra_applyout_settings={'mode':'closest_distances', 'nb_distances': 2})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  ...  CLOSEST_CLUSTER_2  DISTANCE_TO_CLOSEST_CENTROID_2
0   30                  3  ...                  4                        0.730330
1   63                  4  ...                  1                        0.851054
2   66                  3  ...                  4                        0.730330
3  110                  4  ...                  1                        0.851054
4  335                  1  ...                  4                        0.906003

Retrieving the distances to all clusters

>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  DISTANCE_TO_CENTROID_1               ... DISTANCE_TO_CENTROID_5
0   30                0.851054               ...      1.160697
1   63                0.751054               ...      1.160697
2   66                0.906003               ...      1.160697

Saving the model in the schema named 'MODEL_STORAGE'. Please see the model_storage class for further features of model storage.

>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model, if_exists='replace')

Reloading the model for further use. Please note that the label has to be specified again prior to calling predict().

>>> model2 = AutoSupervisedClustering()
>>> model2.set_params(label='class')
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.head(3).collect()
    id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0   30                  3                        0.640378
1   63                  4                        0.611050
2   66                  3                        0.640378

Exporting the SQL apply code

>>> model = AutoSupervisedClustering(CONN, nb_clusters=5,
...                                  calculate_sql_expressions='enabled')
>>> model.fit(data=hana_df, key='id', label='class')
>>> sql = model.export_apply_code(code_type='HANA',
...                               key='id',
...                               schema_name='APL_SAMPLES',
...                               table_name='CENSUS')
Attributes:
model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the "SUMMARY" table generated by the model training. This table contains the summary about the model training.

indicators_: APLArtifactTable

The reference to the "INDICATORS" table generated by the model training. This table contains the various metrics related to the model and its variables.

fit_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table generated by the model training

var_desc_: APLArtifactTable

The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table when a prediction was made

Methods

disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

enable_hana_execution()

HANA execution will be enabled.

export_apply_code(code_type[, key, label, ...])

Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL.

fit(data[, key, label, features, weight])

Fits the model.

fit_predict(data[, key, label, features, weight])

Fits a clustering model and uses it to generate prediction output on the training dataset.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

get_artifacts_recorder()

Returns the recorder object (for design-time artifacts generation).

get_debrief_report(report_name)

Retrieves a standard statistical report.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_metrics()

Returns metrics about the model.

get_model_info()

Get information about an existing model.

get_params()

Retrieves attributes of the current object.

get_predict_operation_log()

Retrieves the operation log table after a prediction.

get_summary()

Retrieves the summary table after model training.

is_fitted()

Checks if the model can be saved.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data)

Predicts which cluster each specified row belongs to.

save_artifact(artifact_df, schema_name, ...)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, ...])

Saves the model into a table.

schedule_fit(output_table_name_model, ...)

Creates a HANA scheduler job for the model fitting.

schedule_predict(output_table_name_applyout, ...)

Creates a HANA scheduler job for the model prediction.

set_params(**parameters)

Sets attributes of the current model.

set_scale_out([route_to, no_route_to, ...])

Specifies hints for scaling-out environment.

set_params(**parameters)

Sets attributes of the current model.

Parameters:
params: dictionary

The attribute names and values

fit(data, key=None, label=None, features=None, weight=None)

Fits the model.

Parameters:
data: hana_ml DataFrame

The training dataset

key: str, optional

The name of the ID column. This column will not be used as a feature in the model. It will be output as the row ID when a prediction is made with the model. If key is not provided, an internal key is created, but this is not recommended.

label: str, optional

The name of the label column. If it is not given, the model 'label' attribute will be taken. If the latter is not defined, an error will be raised.

features: list of str, optional

The names of the features to be used in the model. If features is not provided, all columns will be taken except the ID and the label columns.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.

Returns:
self: object
predict(data)

Predicts which cluster each specified row belongs to.

Parameters:
datahana_ml DataFrame

The set of rows for which to generate cluster predictions. This dataset must have the same structure as the one used in the fit() method.

Returns:
hana_ml DataFrame

By default, the ID of the closest cluster and the distance to its center are provided. Users can request different outputs by setting the extra_applyout_settings parameter in the model. The extra_applyout_settings parameter is a dictionary with 'mode' and 'nb_distances' as keys. If mode is set to 'closest_distances', cluster IDs and distances to centroids will be provided from the closest to the furthest cluster. The output columns will be:

  • <The key column name>,

  • CLOSEST_CLUSTER_1,

  • DISTANCE_TO_CLOSEST_CENTROID_1,

  • CLOSEST_CLUSTER_2,

  • DISTANCE_TO_CLOSEST_CENTROID_2,

  • ...

If mode is set to 'all_distances', the distances to the centroids of all clusters will be provided in cluster ID order. The output columns will be:

  • ID,

  • DISTANCE_TO_CENTROID_1,

  • DISTANCE_TO_CENTROID_2,

  • ...

nb_distances limits the output to the closest clusters. It is only valid when mode is 'closest_distances' (it will be ignored if mode is 'all_distances'). It can be set to 'all' or a positive integer.

Examples

Retrieves the IDs of the 3 closest clusters and the distances to their centroids:

>>> extra_applyout_settings = {'mode': 'closest_distances', 'nb_distances': 3}
>>> model.set_params(extra_applyout_settings=extra_applyout_settings)
>>> out = model.predict(hana_df)
>>> out.head(3).collect()
    id  CLOSEST_CLUSTER_1  ...  CLOSEST_CLUSTER_3  DISTANCE_TO_CLOSEST_CENTROID_3
0   30                  3  ...                  4                        0.730330
1   63                  4  ...                  1                        0.851054
2   66                  3  ...                  4                        0.730330

Retrieves the distances to all clusters:

>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'})
>>> out = model.predict(hana_df)
>>> out.head(3).collect()
   id  DISTANCE_TO_CENTROID_1  DISTANCE_TO_CENTROID_2  ... DISTANCE_TO_CENTROID_5
0  30                0.994595                0.877414  ...              0.782949
1  63                0.994595                0.985202  ...              0.782949
2  66                0.994595                0.877414  ...              0.782949
fit_predict(data, key=None, label=None, features=None, weight=None)

Fits a clustering model and uses it to generate prediction output on the training dataset.

Parameters:
data: hana_ml DataFrame

The input dataset

key: str, optional

The name of the ID column

label: str

The name of the label column

features: list of str, optional

The names of the feature columns. If features is not provided, all non-ID and non-label columns will be taken.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.

Returns:
hana_ml DataFrame

The output is the same as the predict() method.

Notes

Please see the predict() method for how to get different outputs with the 'extra_applyout_settings' parameter.

get_metrics()

Returns metrics about the model.

Returns:
Dictionary or pandas DataFrame

If no segment column is given, a dictionary object containing a set of clustering metrics and their values.

If a segment column is given, a pandas DataFrame which contains the metrics for each segment.

Examples

>>> model.get_metrics()
{'Frequency': {
    1: 0.3167862345729914,
    2: 0.35590005772243755,
    3: 0.3273137077045711},
 'IntraInertia': {1: 0.7450335510518645,
     2: 0.708350629565789,
     3: 0.7006679558645009},
 'RSS': {1: 8586.511675872738,
     2: 9171.723951617836,
     3: 8343.554018434477},
 'SimplifiedSilhouette': {1: 0.13324659043317924,
     2: 0.14182734764281074,
     3: 0.1311620470933516},
 'TargetMean': {1: 0.1744734931009441,
      2: 0.022912917070469333,
      3: 0.3895408163265306},
 'TargetStandardDeviation': {1: 0.37951613049526484,
      2: 0.14962591788119842,
      3: 0.48764615116105525},
 'KL': {1: OrderedDict([('hours-per-week', 0.2971627592049324),
             ('occupation', 0.11944355994892383),
             ('relationship', 0.06772624975990414),
             ('education-num', 0.06377345492340795),
             ('education', 0.06377345492340793),
             ...
load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Parameters:
schema_name: str

The schema name

table_name: str

The table name

oid: str, optional

If the table contains several models, the OID must be given as an identifier. If it is not provided, the whole table is read.

Notes

Prior to using a reloaded model for a new prediction, it is necessary to re-specify the 'label' parameter. Otherwise, the predict() method will fail.

Examples

>>> # needs to re-specify the label prior to prediction
>>> model = AutoSupervisedClustering(label='class')
>>> model.load_model(schema_name='MY_SCHEMA', table_name='MY_MODEL_TABLE')
>>> model.predict(hana_df)
disable_hana_execution()

HANA execution will be disabled and only SQL script will be generated.

enable_hana_execution()

HANA execution will be enabled.

export_apply_code(code_type, key=None, label=None, schema_name=None, table_name=None, other_params=None)

Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL. See Function Reference > Predictive Model Services > EXPORT_APPLY_CODE in the SAP HANA APL Developer Guide.

Parameters:
code_type: str

The type of code exported. The supported code types for each type of model are given in the developer guide (see APL/CodeType).

key: str, optional

The name of the primary key column. Required for some code types.

label: str, optional

The name of the label (target) column. Used only when the model supports multiple targets. When left empty (""), all targets are generated.

schema_name: str, optional

The schema name of the apply-in table. Required for some code types.

table_name: str, optional

The apply-in table name. Required for some code types.

other_params: dict, optional

The additional parameters to be included in the configuration. The available parameters are given in the developer guide.

Returns:
The exported code: str

Examples

Exporting a model in JSON format (available for Gradient Boosting and Robust Regression)

>>> json_export = model.export_apply_code('JSON')

APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.

Exporting SQL apply code (available for Robust Regression and Clustering)

>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS')

Exporting SQL apply code (probability generated in the output)

>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS',
...     other_params={'APL/ApplyExtraMode': 'Advanced Apply Settings',
...                   'APL/ApplyProba': 'true'})
get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

Returns:
A pandas DataFrame with detailed information about the current version.

Notes

An error is raised when the call fails. The cause can be either that SAP HANA APL is not installed or that the current user does not have the appropriate rights.

get_artifacts_recorder()

Returns the recorder object (for design-time artifacts generation).

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Developer Guide.

Parameters:
report_name: str

The name of the report to retrieve.

Returns:
Statistical report: hana_ml DataFrame
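
Examples

A brief sketch; the report name below is illustrative, see the developer guide for the available names:

>>> report = model.get_debrief_report('ClassificationRegression_VariablesContribution')
>>> report.collect()
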
get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns:
The reference to the OPERATION_LOG table: hana_ml DataFrame

This table provides detailed logs of the last model training.
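
Examples

A brief sketch; collect() materializes the hana_ml DataFrame as a pandas DataFrame:

>>> model.get_fit_operation_log().collect()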

get_indicators()

Retrieves the Indicator table after model training.

Returns:
The reference to the INDICATORS table: hana_ml DataFrame

This table provides the performance metrics of the last model training.
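
Examples

A brief usage sketch:

>>> indicators = model.get_indicators()
>>> indicators.collect()  # performance metrics as a pandas DataFrame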

get_model_info()

Gets information about an existing model. This method is especially useful when a trained model has been saved and reloaded. After calling it, the model can provide the summary and metrics again, as they were after the last fit.

Returns:
list

List of HANA DataFrames respectively corresponding to the following tables:

  • Summary table

  • Variable roles table

  • Variable description table

  • Indicators table

  • Profit curves table
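
Examples

A sketch of unpacking the returned list, in the order documented above:

>>> summary, var_roles, var_desc, indicators, profit_curves = model.get_model_info()
>>> summary.collect()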

get_params()

Retrieves the attributes of the current object. This method is implemented for compatibility with scikit-learn.

Returns:
The attribute values of the model: dictionary
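
Examples

A minimal call returning the attribute dictionary:

>>> model.get_params()
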
get_predict_operation_log()

Retrieves the operation log table after the model prediction.

Returns:
The reference to the OPERATION_LOG table: hana_ml DataFrame

This table provides detailed logs about the last prediction.
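
Examples

A brief sketch, mirroring the fit operation log:

>>> model.get_predict_operation_log().collect()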

get_summary()

Retrieves the summary table after model training.

Returns:
The reference to the SUMMARY table: hana_ml DataFrame

This contains the execution summary of the last model training.
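
Examples

A brief usage sketch:

>>> model.get_summary().collect()  # execution summary as a pandas DataFrame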

is_fitted()

Checks whether the model can be saved. To be overridden if the model is not stored in the model_ attribute.

Returns:
bool

True if the model is ready to be saved.
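
Examples

A minimal check before saving, assuming the model has been trained:

>>> model.is_fitted()
True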

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters:
artifact_df: hana_ml DataFrame

The artifact created after the fit or predict methods are called

schema_name: str

The schema name

table_name: str

The table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(artifact_df=myModel.indicators_,
...                       schema_name='MySchema',
...                       table_name='MyModel_Indicators',
...                       if_exists='replace')
save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Parameters:
schema_name: str

The schema name

table_name: str

Table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns:
None

The model is saved into a table with the following columns:

  • "OID" NVARCHAR(50), -- Serve as ID

  • "FORMAT" NVARCHAR(50), -- APL technical info

  • "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model

schedule_fit(output_table_name_model, output_table_name_log, output_table_name_summary, output_table_name_indicators, **schedule_kwargs)

Creates a HANA scheduler job for the model fitting. It is a wrapper around the HANAScheduler.create_training_schedule() method, allowing users to explicitly specify arguments such as the output table names.

Parameters:
output_table_name_model: str

The output table name for the model binary.

output_table_name_log: str

The output table name for the log data.

output_table_name_summary: str

The output table name for the model summary.

output_table_name_indicators: str

The output table name for the model indicators.

schedule_kwargs: kwargs dictionary

Arguments forwarded to the HANAScheduler.create_training_schedule method; please refer to its documentation for details.

Examples

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.schedule_fit(
...     output_table_name_model='OUTPUT_MODEL_BINARY',
...     output_table_name_log='OUTPUT_FIT_LOG',
...     output_table_name_summary='OUTPUT_SUMMARY_LOG',
...     output_table_name_indicators='OUTPUT_FIT_INDICATORS',
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)

schedule_predict(output_table_name_applyout, output_table_name_log, output_table_name_summary, **schedule_kwargs)

Creates a HANA scheduler job for the model prediction. It is a wrapper around the HANAScheduler.create_applying_schedule() method, allowing users to explicitly specify arguments such as the output table names.

Parameters:
input_table_name_model: str

The input table name for the model binary.

output_table_name_applyout: str

The output table name for the prediction data.

output_table_name_log: str

The output table name for the log data.

output_table_name_summary: str

The output table name for the model summary.

schedule_kwargs: kwargs dictionary

Arguments forwarded to the HANAScheduler.create_applying_schedule method; please refer to its documentation for details.

Examples

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.predict(data)
>>> model.schedule_predict(
...     output_table_name_applyout="OUTPUT_PREDICT_APPLYOUT",
...     output_table_name_log="OUTPUT_PREDICT_LOG",
...     output_table_name_summary="OUTPUT_PREDICT_SUMMARY",
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)

set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)

Specifies hints for a scale-out environment. The execution of APL functions can then be routed to a specific computation node. If all the parameters are None, i.e. not given, all existing hints are cleared.

Parameters:
route_to: str, optional

Routes the query to the specified volume ID or service type.

Defaults to None.

no_route_to: str or list of str, optional

Avoids query routing to a specified volume ID or service type.

Defaults to None.

route_by: str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s).

Defaults to None.

route_by_cardinality: str or list of str, optional

Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.

data_transfer_cost: int, optional

Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.

Defaults to None.

route_optimization_level: {'minimal', 'all'}, optional

Guides the optimizer to compile with the 'minimal' route optimization level or to default to the standard level. If the 'minimal' compiled plan is cached, it is compiled once more with the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.

workload_class: str, optional

Routes the query via a workload class. The route_to statement hint has higher precedence than the workload_class statement hint.

Defaults to None.

Examples

>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> model = GradientBoostingBinaryClassifier()
>>> # Routes the execution to a specific volume ID.
>>> model.set_scale_out(route_to=1025)
>>> # Routes the execution to a specific service type.
>>> # model.set_scale_out(route_to='computeserver')
>>> # Maps the execution to a specific workload class.
>>> # model.set_scale_out(workload_class="WC4")
>>> # Activates the sql trace
>>> # connection_context.sql_tracer.enable_sql_trace(True)
>>> model.fit(data=hdb_df, key='KEY', label='Y')
>>> # You can check whether the queries were effectively routed by querying:
>>> # select HOST, VOLUME_ID, APPLICATION_NAME, STATEMENT_STRING from M_SQL_PLAN_CACHE.