hana_ml.algorithms.apl package

The APL package consists of the following sections:

hana_ml.algorithms.apl.classification

This module provides the SAP HANA APL binary classification algorithm.

The following classes are available:

class hana_ml.algorithms.apl.classification.AutoClassifier(conn_context=None, variable_auto_selection=True, polynomial_degree=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: hana_ml.algorithms.apl.robust_regression_base.RobustRegressionBase

SAP HANA APL Binary Classifier algorithm.

Parameters
conn_context: ConnectionContext, optional

The connection object to an SAP HANA database. This parameter is no longer needed. It is set automatically when a dataset is used in fit() or predict().

variable_auto_selection: bool, optional

When set to True, variable auto-selection is activated. Variable auto-selection maintains the performance of a model while keeping the number of variables as low as possible.

polynomial_degree: int, optional

The polynomial degree of the model. Default is 1.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value type (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means that whenever the variable value equals '???', it will be taken as missing.

extra_applyout_settings: dict, optional

Defines other outputs the model should generate in addition to the predicted values. For example: {'APL/ApplyReasonCode':'3;Mean;Below;False'} will add reason codes in the output when the model is applied. These reason codes explain the prediction. See the OPERATION_CONFIG parameters of the APPLY_MODEL function in the SAP HANA APL Reference Guide.

other_params: dict, optional

Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • 'correlations_lower_bound'

  • 'correlations_max_kept'

  • 'cutting_strategy'

  • 'exclude_low_predictive_confidence'

  • 'risk_fitting'

  • 'risk_fitting_min_cumulated_frequency'

  • 'risk_fitting_nb_pdo'

  • 'risk_fitting_use_weights'

  • 'risk_gdo'

  • 'risk_mode'

  • 'risk_pdo'

  • 'risk_score'

  • 'score_bins_count'

  • 'target_key'

  • 'variable_selection_best_iteration'

  • 'variable_selection_min_nb_of_final_variables'

  • 'variable_selection_max_nb_of_final_variables'

  • 'variable_selection_mode'

  • 'variable_selection_nb_variables_removed_by_step'

  • 'variable_selection_percentage_of_contribution_kept_by_step'

  • 'variable_selection_quality_bar'

  • 'variable_selection_quality_criteria'

See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.

other_train_apl_aliases: dict, optional

Users can provide APL aliases as advanced settings to the model. Unlike 'other_params' described above, users are free to input any possible value. No validation is performed in Python.
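
The advanced setting names listed for other_params can be checked on the client side before a model is created. The following is a minimal sketch and not part of hana_ml: the helper check_other_params and the constant KNOWN_OTHER_PARAMS are hypothetical names introduced for illustration, and the set simply copies the names listed in this section.

```python
# Hypothetical helper, not part of hana_ml. The set below only mirrors the
# advanced setting names listed for AutoClassifier in this section.
KNOWN_OTHER_PARAMS = frozenset([
    'correlations_lower_bound',
    'correlations_max_kept',
    'cutting_strategy',
    'exclude_low_predictive_confidence',
    'risk_fitting',
    'risk_fitting_min_cumulated_frequency',
    'risk_fitting_nb_pdo',
    'risk_fitting_use_weights',
    'risk_gdo',
    'risk_mode',
    'risk_pdo',
    'risk_score',
    'score_bins_count',
    'target_key',
    'variable_selection_best_iteration',
    'variable_selection_min_nb_of_final_variables',
    'variable_selection_max_nb_of_final_variables',
    'variable_selection_mode',
    'variable_selection_nb_variables_removed_by_step',
    'variable_selection_percentage_of_contribution_kept_by_step',
    'variable_selection_quality_bar',
    'variable_selection_quality_criteria',
])


def check_other_params(params):
    """Return a copy of params, or raise ValueError on unknown names."""
    unknown = sorted(set(params) - KNOWN_OTHER_PARAMS)
    if unknown:
        raise ValueError('Unknown advanced setting(s): ' + ', '.join(unknown))
    return dict(params)


# The value 10 is illustrative only.
settings = check_other_params({'score_bins_count': 10})
print(settings)
```

The checked dictionary could then be unpacked into the constructor as keyword arguments, which matches the **other_params part of the class signature.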

Notes

It is highly recommended to use a dataset with a key provided in the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a key, because the model will not expect it.

By default, when no variable description is given, SAP HANA APL guesses it by reading the first 100 rows. Sometimes this guess is not correct. By explicitly providing values for these parameters, the user can override the default guess. For example:

model.set_params(
    variable_storages={
        'ID': 'integer',
        'sepal length (cm)': 'number'
    })
model.set_params(
    variable_value_types={
        'sepal length (cm)': 'continuous'
    })
model.set_params(
    variable_missing_strings={
        'sepal length (cm)': '-1'
    })
model.set_params(
    extra_applyout_settings={
        'APL/ApplyReasonCode': '3;Mean;Below;False'
    })
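
Because these dictionaries are plain Python objects, a typo in a storage or value type can be caught before the values are passed to set_params(). The sketch below is a hypothetical client-side check, not a hana_ml feature; the helper name invalid_entries is an assumption, and the allowed values are the ones named in the parameter descriptions above.

```python
# Hypothetical client-side check, not part of hana_ml. The allowed values
# are the ones named in the parameter descriptions of this section.
ALLOWED_STORAGES = {'string', 'integer', 'number'}
ALLOWED_VALUE_TYPES = {'continuous', 'nominal', 'ordinal'}


def invalid_entries(mapping, allowed):
    """Return {variable: value} for every value outside the allowed set."""
    return {var: val for var, val in mapping.items() if val not in allowed}


variable_storages = {'ID': 'integer', 'sepal length (cm)': 'number'}
variable_value_types = {'sepal length (cm)': 'continuous'}

print(invalid_entries(variable_storages, ALLOWED_STORAGES))        # {}
print(invalid_entries(variable_value_types, ALLOWED_VALUE_TYPES))  # {}
print(invalid_entries({'ID': 'float'}, ALLOWED_STORAGES))          # {'ID': 'float'}
```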

Examples

>>> from hana_ml.algorithms.apl.classification import AutoClassifier
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext(HDB_HOST, HDB_PORT, HDB_USER, HDB_PASS)
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = AutoClassifier(variable_auto_selection=True)
>>> model.fit(hana_df, label='class', key='id')

Making the predictions

>>> apply_out = model.predict(hana_df)
>>> print(apply_out.head(3).collect())
    id  TRUE_LABEL  PREDICTED  PROBABILITY
0   30           0          0     0.688153
1   63           0          0     0.677693
2   66           0          0     0.700221

Adding individual contributions to the output of predictions

>>> model.set_params(
...    extra_applyout_settings={
...        'APL/ApplyContribution': 'all'
...        })
>>> apply_out = model.predict(hana_df)
>>> print(apply_out.head(3).collect())
    id  TRUE_LABEL  PREDICTED  PROBABILITY  contrib_age_rr_class ...
0   30           0          0     0.688153              0.043387 ...
1   63           0          0     0.677693              0.042608 ...
2   66           0          0     0.700221              0.020784 ...

Adding reason codes to the output of predictions

>>> model.set_params(
...    extra_applyout_settings={
...        'APL/ApplyReasonCode':'3;Mean;Below;False'
...        })
>>> apply_out = model.predict(hana_df)
>>> print(apply_out.head(3).collect())
   id  TRUE_LABEL  PREDICTED  PROBABILITY RCN_B_Mean_1_rr_class ...
0  30           0          0     0.688153         education-num ...
1  63           0          0     0.677693         education-num ...
2  66           0          0     0.700221         education-num ...

Debriefing

>>> model.get_performance_metrics()
OrderedDict([('L1', 0.2522171212463023), ('L2', 0.32254434028379236), ...
>>> model.get_feature_importances()
OrderedDict([('marital-status', 0.2172766583204266), ('capital-gain', 0.19521247617062215),...
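
The mapping returned by get_feature_importances() can be post-processed with plain Python. The sketch below ranks features by importance; the helper top_features is a hypothetical name, and the data is limited to the two entries visible in the truncated output above.

```python
from collections import OrderedDict

# Only the two entries visible in the truncated output above are used here.
importances = OrderedDict([
    ('marital-status', 0.2172766583204266),
    ('capital-gain', 0.19521247617062215),
])


def top_features(importances, n):
    """Return the n feature names with the largest importance values."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


print(top_features(importances, 1))  # ['marital-status']
```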

Saving the model in the schema named 'MODEL_STORAGE'. Please see model_storage class for further features of model storage.

>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My classification model name'
>>> model_storage.save_model(model=model, if_exists='replace')

Attributes
model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the "SUMMARY" table generated by the model training. This table contains the content of the model training summary.

indicators_: APLArtifactTable

The reference to the "INDICATORS" table generated by the model training. This table contains various metrics related to model and model variables.

fit_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table generated by the model training

var_desc_: APLArtifactTable

The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table when a prediction was made

Methods

fit(data[, key, features, label, weight])

Fits the model.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Reference Guide.

get_feature_importances()

Returns the feature importances (MaximumSmartVariableContribution).

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_model_info()

Get information about an existing model.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns the performance metrics of the last trained model.

get_predict_operation_log()

Retrieves the operation log table after the last prediction.

get_summary()

Retrieves the summary table after model training.

is_fitted()

Checks if the model can be saved.

load_model(schema_name, table_name[, oid])

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

predict(data)

Makes predictions with the fitted model.

save_artifact(artifact_df, schema_name, ...)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, ...])

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

score(data)

Returns the mean accuracy on the provided test dataset.

set_params(**parameters)

Sets attributes of the current model.

fit(data, key=None, features=None, label=None, weight=None)

Fits the model.

Parameters
data: DataFrame

The training dataset

key: str, optional

The name of the ID column. This column will not be used as a feature in the model. It will be output as the row ID when a prediction is made with the model. If key is not provided, an internal key is created, but this usage is not recommended. See notes below.

features: list of str, optional

The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.

label: str, optional

The name of the label column. Default is the last column.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.

Returns
self: object

Notes

It is highly recommended to use a dataset with a key in the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a dataset with a key, because the model will not expect it.

predict(data)

Makes predictions with the fitted model. It is possible to add special outputs, such as reason codes, by specifying the extra_applyout_settings parameter in the model. This parameter is explained above in the model class section.

Parameters
data: hana_ml DataFrame

The dataset used for prediction

Returns
Prediction output: hana_ml DataFrame
The dataframe contains the following columns:
- KEY: the key column if it was provided in the dataset
- TRUE_LABEL: the class label when it was given in the dataset
- PREDICTED: the predicted label
- PROBABILITY: the probability of the predicted label to be correct (confidence)
- SCORING_VALUE: the unnormalized scoring value

score(data)

Returns the mean accuracy on the provided test dataset.

Parameters
data: hana_ml DataFrame

The test dataset used to compute the score. The labels must be provided in the dataset.

Returns
mean accuracy: float
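
As a reading aid, mean accuracy in its usual sense is the fraction of rows whose predicted label equals the true label. The sketch below computes it in plain Python under that assumption; the function mean_accuracy is a hypothetical name, the data is illustrative, and how score() computes the value internally is not shown in this document.

```python
def mean_accuracy(true_labels, predicted_labels):
    """Fraction of rows whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError('Both sequences must have the same length')
    if not true_labels:
        raise ValueError('At least one row is required')
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return hits / len(true_labels)


# Illustrative values, not taken from a real model run.
true_label = [0, 0, 1, 1]
predicted = [0, 0, 0, 1]
print(mean_accuracy(true_label, predicted))  # 0.75
```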

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

Returns a pandas DataFrame with detailed information about the current version.

An error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Reference Guide.

Parameters
report_name: str

Returns
Statistical report: hana_ml DataFrame

get_feature_importances()

Returns the feature importances (MaximumSmartVariableContribution).

Returns
feature importances: An OrderedDict { feature_name

get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns
The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs of the last model training

get_indicators()

Retrieves the Indicator table after model training.

Returns
The reference to the INDICATORS table: hana_ml DataFrame
This table provides the performance metrics of the last model training

get_model_info()

Gets information about an existing model. This method is especially useful when a trained model was saved and reloaded. After this method has been called, the model can provide the summary and metrics again as they were after the last fit.

Returns: list. List of HANA DataFrames corresponding respectively to the following tables:

Summary table, Variable roles table, Variable description table, Indicators table, Profit Curves table

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with scikit-learn.

Returns
The attribute values of the model: dictionary

get_performance_metrics()

Returns the performance metrics of the last trained model.

Returns
An OrderedDict with metric name as key and metric value as value.
For example:
OrderedDict([('L1', 8.59885654599923), ('L2', 11.012352163260505), ('LInf', 67.0), ('ErrorMean', 0.33833594458645944), ...

get_predict_operation_log()

Retrieves the operation log table after the last prediction.

Returns
The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs about the last prediction

get_summary()

Retrieves the summary table after model training.

Returns
The reference to the SUMMARY table: hana_ml DataFrame
This table contains the execution summary of the last model training

is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(schema_name, table_name, oid=None)

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Loads the model from a table.

Parameters
schema_name: str

The schema name

table_name: str

The table name

oid: str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters
schema_name: str

The schema name

artifact_df: hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
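
The three if_exists behaviors can be pictured with a toy in-memory model. The sketch below does not touch a database and is not how hana_ml is implemented; the function save_rows and the tables dictionary are hypothetical stand-ins that only mirror the fail, replace and append rules described above.

```python
def save_rows(tables, table_name, rows, if_exists='fail'):
    """Toy in-memory model of the three if_exists behaviors."""
    if if_exists not in ('fail', 'replace', 'append'):
        raise ValueError("if_exists must be 'fail', 'replace' or 'append'")
    if table_name in tables:
        if if_exists == 'fail':
            raise ValueError('Table already exists: ' + table_name)
        if if_exists == 'replace':
            del tables[table_name]  # drop the table before inserting
    # Create the table if needed, then insert the new rows.
    tables.setdefault(table_name, []).extend(rows)
    return tables


tables = {}
save_rows(tables, 'MyModel_Indicators', ['row1'])
save_rows(tables, 'MyModel_Indicators', ['row2'], if_exists='append')
print(tables)  # {'MyModel_Indicators': ['row1', 'row2']}
save_rows(tables, 'MyModel_Indicators', ['row3'], if_exists='replace')
print(tables)  # {'MyModel_Indicators': ['row3']}
```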

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )

save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Parameters
schema_name: str

The schema name

table_name: str

Table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns
None
The model is saved into a table with the following columns:
  • "OID" NVARCHAR(50), -- Serve as ID

  • "FORMAT" NVARCHAR(50), -- APL technical info

  • "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model

set_params(**parameters)

Sets attributes of the current model. This method is implemented for compatibility with scikit-learn.

Parameters
params: dictionary

The attribute names and values

hana_ml.algorithms.apl.regression

This module provides the SAP HANA APL regression algorithm.

The following classes are available:

class hana_ml.algorithms.apl.regression.AutoRegressor(conn_context=None, variable_auto_selection=True, polynomial_degree=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: hana_ml.algorithms.apl.robust_regression_base.RobustRegressionBase

SAP HANA APL regression algorithm.

Parameters
conn_context: ConnectionContext, optional

The connection object to an SAP HANA database. This parameter is no longer needed. It is set automatically when a dataset is used in fit() or predict().

variable_auto_selection: bool, optional

When set to True, variable auto-selection is activated. Variable auto-selection maintains the performance of a model while keeping the number of variables as low as possible.

polynomial_degree: int, optional

The polynomial degree of the model. Default is 1.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value type (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means that whenever the variable value equals '???', it will be taken as missing.

extra_applyout_settings: dict, optional

Defines other outputs the model should generate in addition to the predicted values. For example: {'APL/ApplyReasonCode':'3;Mean;Below;False'} will add reason codes in the output when the model is applied. These reason codes explain the prediction. See the OPERATION_CONFIG parameters of the APPLY_MODEL function in the SAP HANA APL Reference Guide.

other_params: dict, optional

Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • 'correlations_lower_bound'

  • 'correlations_max_kept'

  • 'cutting_strategy'

  • 'exclude_low_predictive_confidence'

  • 'risk_fitting'

  • 'risk_fitting_min_cumulated_frequency'

  • 'risk_fitting_nb_pdo'

  • 'risk_fitting_use_weights'

  • 'risk_gdo'

  • 'risk_mode'

  • 'risk_pdo'

  • 'risk_score'

  • 'score_bins_count'

  • 'variable_auto_selection'

  • 'variable_selection_best_iteration'

  • 'variable_selection_min_nb_of_final_variables'

  • 'variable_selection_max_nb_of_final_variables'

  • 'variable_selection_mode'

  • 'variable_selection_nb_variables_removed_by_step'

  • 'variable_selection_percentage_of_contribution_kept_by_step'

  • 'variable_selection_quality_bar'

  • 'variable_selection_quality_criteria'

See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.

other_train_apl_aliases: dict, optional

Users can provide APL aliases as advanced settings to the model. Unlike 'other_params' described above, users are free to input any possible value. No validation is performed in Python.

Notes

It is highly recommended to use a dataset with a key provided in the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a key, because the model will not expect it.

Examples

>>> from hana_ml.algorithms.apl.regression import AutoRegressor
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA Database

>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates Hana DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = AutoRegressor(variable_auto_selection=True)
>>> model.fit(hana_df, label='age',
...      features=['workclass', 'fnlwgt', 'education', 'education-num', 'marital-status'],
...      key='id')

Making a prediction

>>> applyout_df = model.predict(hana_df)
>>> print(applyout_df.head(5).collect())
          id  TRUE_LABEL  PREDICTED
0         30          49         42
1         63          48         42
2         66          36         42
3        110          42         42
4        335          53         42

Debriefing

>>> model.get_performance_metrics()
OrderedDict([('L1', 8.59885654599923), ('L2', 11.012352163260505)...
>>> model.get_feature_importances()
OrderedDict([('marital-status', 0.7916100739306074), ('education-num', 0.13524836400650087)

Saving the model in the schema named 'MODEL_STORAGE'. Please see model_storage class for further features of model storage.

>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My regression model name'
>>> model_storage.save_model(model=model, if_exists='replace')

Reloading the model and making another prediction

>>> model2 = AutoRegressor(conn_context=CONN)
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.head(5).collect()
          id  TRUE_LABEL  PREDICTED
0         30          49         42
1         63          48         42
2         66          36         42
3        110          42         42
4        335          53         42

Methods

fit(data[, key, features, label, weight])

Fits the model.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Reference Guide.

get_feature_importances()

Returns the feature importances (MaximumSmartVariableContribution).

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_model_info()

Get information about an existing model.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns the performance metrics of the last trained model.

get_predict_operation_log()

Retrieves the operation log table after the last prediction.

get_summary()

Retrieves the summary table after model training.

is_fitted()

Checks if the model can be saved.

load_model(schema_name, table_name[, oid])

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

predict(data)

Makes prediction with a fitted model.

save_artifact(artifact_df, schema_name, ...)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, ...])

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

score(data)

Returns the coefficient of determination R^2 of the prediction.

set_params(**parameters)

Sets attributes of the current model.

fit(data, key=None, features=None, label=None, weight=None)

Fits the model.

Parameters
data: DataFrame

The training dataset

key: str, optional

The name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features: list of str, optional

Names of the feature columns. If features is not provided, all non-ID and non-label columns are used.

label: str, optional

The name of the label column. Default is the last column.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.

Returns
self: object

Notes

It is highly recommended to use a dataset with a key provided in the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a dataset with a key, because the model will not expect it.

predict(data)

Makes predictions with a fitted model. It is possible to add special outputs, such as reason codes, by specifying the extra_applyout_settings parameter in the model. This parameter is explained above in the model class section.

Parameters
data: hana_ml DataFrame

The dataset used for prediction

Returns
Prediction output: a hana_ml DataFrame.
The dataframe contains the following columns:
- KEY: the key column if it was provided in the dataset
- TRUE_LABEL: the true value if it was provided in the dataset
- PREDICTED: the predicted value

score(data)

Returns the coefficient of determination R^2 of the prediction.

Parameters
data: hana_ml DataFrame

The dataset used for prediction. It must contain the true value so that the score can be computed.

Returns
R^2 coefficient of determination: float
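
For reference, the coefficient of determination is commonly defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below implements that textbook definition in plain Python; the function r_squared is a hypothetical name, the data is illustrative, and the exact computation performed by score() is not shown in this document.

```python
def r_squared(true_values, predicted_values):
    """Textbook R^2: 1 - (residual sum of squares / total sum of squares)."""
    if len(true_values) != len(predicted_values):
        raise ValueError('Both sequences must have the same length')
    if not true_values:
        raise ValueError('At least one row is required')
    mean_true = sum(true_values) / len(true_values)
    ss_res = sum((t - p) ** 2 for t, p in zip(true_values, predicted_values))
    ss_tot = sum((t - mean_true) ** 2 for t in true_values)
    if ss_tot == 0:
        raise ValueError('R^2 is undefined when all true values are equal')
    return 1.0 - ss_res / ss_tot


# Illustrative values, not taken from a real model run.
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0, perfect prediction
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0, always predicts the mean
```

A model that always predicts the mean of the true values scores 0, and a model that does worse than that scores below 0.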

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

Returns a pandas DataFrame with detailed information about the current version.

An error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Reference Guide.

Parameters
report_name: str

Returns
Statistical report: hana_ml DataFrame

get_feature_importances()

Returns the feature importances (MaximumSmartVariableContribution).

Returns
feature importances: An OrderedDict { feature_name

get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns
The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs of the last model training

get_indicators()

Retrieves the Indicator table after model training.

Returns
The reference to the INDICATORS table: hana_ml DataFrame
This table provides the performance metrics of the last model training

get_model_info()

Gets information about an existing model. This method is especially useful when a trained model was saved and reloaded. After this method has been called, the model can provide the summary and metrics again as they were after the last fit.

Returns: list. List of HANA DataFrames corresponding respectively to the following tables:

Summary table, Variable roles table, Variable description table, Indicators table, Profit Curves table

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with scikit-learn.

Returns
The attribute values of the model: dictionary

get_performance_metrics()

Returns the performance metrics of the last trained model.

Returns
An OrderedDict with metric name as key and metric value as value.
For example:
OrderedDict([('L1', 8.59885654599923), ('L2', 11.012352163260505), ('LInf', 67.0), ('ErrorMean', 0.33833594458645944), ...

get_predict_operation_log()

Retrieves the operation log table after the last prediction.

Returns
The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs about the last prediction

get_summary()

Retrieves the summary table after model training.

Returns
The reference to the SUMMARY table: hana_ml DataFrame
This table contains the execution summary of the last model training

is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(schema_name, table_name, oid=None)

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Loads the model from a table.

Parameters
schema_name: str

The schema name

table_name: str

The table name

oid: str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters
schema_name: str

The schema name

artifact_df: hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )

save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Parameters
schema_name: str

The schema name

table_name: str

Table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns
None
The model is saved into a table with the following columns:
  • "OID" NVARCHAR(50), -- Serve as ID

  • "FORMAT" NVARCHAR(50), -- APL technical info

  • "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model

set_params(**parameters)

Sets attributes of the current model. This method is implemented for compatibility with scikit-learn.

Parameters
params: dictionary

The attribute names and values

hana_ml.algorithms.apl.clustering

This module provides the SAP HANA APL clustering algorithms.

The following classes are available:

class hana_ml.algorithms.apl.clustering.AutoUnsupervisedClustering(conn_context=None, nb_clusters=None, nb_clusters_min=None, nb_clusters_max=None, distance=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: hana_ml.algorithms.apl.clustering._AutoClusteringBase

SAP HANA APL unsupervised clustering algorithm.

Parameters
nb_clusters: int, optional, default = 10

The number of clusters to create

nb_clusters_min: int, optional

The minimum number of clusters to create. This parameter is ignored if nb_clusters is set.

nb_clusters_max: int, optional

The maximum number of clusters to create. This parameter is ignored if nb_clusters is set.

distance: str, optional, default = 'SystemDetermined'

The metric used to measure the distance between data points. The possible values are: 'L1', 'L2', 'LInf', 'SystemDetermined'.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value types (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals '???', it will be taken as missing.

extra_applyout_settings: dict, optional

Defines the output to generate when applying the model. See documentation on predict() method for more information.

other_params: dict, optional

Corresponds to the advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • calculate_cross_statistics

  • calculate_sql_expressions

  • cutting_strategy

  • encoding_strategy

See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.

other_train_apl_aliases: dict, optional

Users can provide APL aliases as advanced settings to the model. Unlike 'other_params' described above, users are free to input any possible value in 'other_train_apl_aliases'. These values are not validated in Python.
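
As an illustration of the distance parameter described above, the following sketch computes three standard metrics in plain Python. It assumes that 'L1', 'L2' and 'LInf' denote the usual Manhattan, Euclidean and Chebyshev distances; this section does not define them, and how SAP HANA APL encodes variables before measuring distances is not covered here. The function names are hypothetical.

```python
import math

def l1_distance(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def linf_distance(a, b):
    """Chebyshev distance: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(l1_distance(p, q), l2_distance(p, q), linf_distance(p, q))  # 7.0 5.0 4.0
```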

Notes

  • The algorithm may detect fewer clusters than requested.

This happens when a cluster detected on the estimation dataset was not found on the validation dataset. In that case, this cluster will be considered unstable and will then be removed from the model. Users can get the number of clusters actually found in the "INDICATORS" table. For example,

# The actual number of clusters found
d = model_u.get_indicators().collect()
d[d.KEY=='FinalClusterCount'][['KEY','VALUE']]
  • It is highly recommended to use a dataset with a key provided in the fit() method.

Otherwise, once the model is trained, the predict() method can no longer be used with a key, because the model will not expect one.

  • By default, when it is not given, SAP HANA APL guesses the variable description by reading the first 100 rows.

This guess is sometimes incorrect. By explicitly providing values for these parameters, the user can override the default guess. For example:

model.set_params(
        variable_storages = {
            'ID': 'integer',
            'sepal length (cm)': 'number'
            })
model.set_params(
        variable_value_types = {
            'sepal length (cm)': 'continuous'
            })
model.set_params(
        variable_missing_strings = {
            'sepal length (cm)': '-1'
            })
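
Before passing such dictionaries to set_params(), it can help to check them locally. The sketch below is a hypothetical helper, not part of hana_ml. It assumes that the only accepted values are the ones listed in the parameter descriptions above (string, integer, number; continuous, nominal, ordinal); SAP HANA APL may accept others.

```python
# Allowed values as listed in this section (assumption: the list is exhaustive).
ALLOWED_STORAGES = {"string", "integer", "number"}
ALLOWED_VALUE_TYPES = {"continuous", "nominal", "ordinal"}

def check_variable_description(variable_storages=None, variable_value_types=None):
    """Return a list of problems found; an empty list means both dicts look valid."""
    problems = []
    for name, storage in (variable_storages or {}).items():
        if storage not in ALLOWED_STORAGES:
            problems.append(f"{name}: unknown storage {storage!r}")
    for name, value_type in (variable_value_types or {}).items():
        if value_type not in ALLOWED_VALUE_TYPES:
            problems.append(f"{name}: unknown value type {value_type!r}")
    return problems

print(check_variable_description(
    {"ID": "integer", "sepal length (cm)": "number"},
    {"sepal length (cm)": "continuous"},
))  # []
```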

Examples

>>> from hana_ml.algorithms.apl.clustering import AutoUnsupervisedClustering
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates Hana DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = AutoUnsupervisedClustering(CONN, nb_clusters=5)
>>> model.fit(data=hana_df, key='id')

Debriefing

>>> model.get_metrics()
OrderedDict([('SimplifiedSilhouette', 0.3448029020802121), ('RSS', 4675.706587754118),...
>>> model.get_metrics_by_cluster()
{'Frequency': {1: 0.23053242076908276,
      2: 0.27434649954646656,
      3: 0.09628652318517908,
      4: 0.29919463456199663,
      5: 0.09963992193727494},
     'IntraInertia': {1: 0.6734978174937322,
      2: 0.7202839995396123,
      3: 0.5516800856975772,
      4: 0.6969632183111357,
      5: 0.5809322138167139},
     'RSS': {1: 5648.626195319932,
      2: 7189.15459940487,
      3: 1932.5353401986129,
      4: 7586.444631316713,
      5: 2105.879275085588},
     'SimplifiedSilhouette': {1: 0.1383827622819234,
      2: 0.14716862328457128,
      3: 0.18753797605134545,
      4: 0.13679980173383793,
      5: 0.15481377834381388},
     'KL': {1: OrderedDict([('relationship', 0.4951910610641741),
                   ('marital-status', 0.2776259711735807),
                   ('hours-per-week', 0.20990189265572687),
                   ('education-num', 0.1996353893520096),
                   ('education', 0.19963538935200956),
                   ...
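
The dictionary returned by get_metrics_by_cluster() can be processed with ordinary Python. As a small sketch, the block below copies the 'Frequency' values from the sample output above and finds the largest cluster. Reading 'Frequency' as the share of rows per cluster is an assumption, although the sample values do sum to 1.

```python
# Values copied from the 'Frequency' entry of the sample output above.
frequency = {
    1: 0.23053242076908276,
    2: 0.27434649954646656,
    3: 0.09628652318517908,
    4: 0.29919463456199663,
    5: 0.09963992193727494,
}

# Cluster ID with the highest frequency.
largest = max(frequency, key=frequency.get)
total = sum(frequency.values())
print(largest)  # 4
```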

Predicting which cluster a data point belongs to

>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0   30                  3                        0.640378
1   63                  4                        0.611050
2   66                  3                        0.640378
3  110                  4                        0.611050
4  335                  1                        0.851054

Determining the 2 closest clusters

>>> model.set_params(extra_applyout_settings={'mode':'closest_distances', 'nb_distances': 2})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  ...  CLOSEST_CLUSTER_2  DISTANCE_TO_CLOSEST_CENTROID_2
0   30                  3  ...                  4                        0.730330
1   63                  4  ...                  1                        0.851054
2   66                  3  ...                  4                        0.730330
3  110                  4  ...                  1                        0.851054
4  335                  1  ...                  4                        0.906003

Retrieving the distances to all clusters

>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  DISTANCE_TO_CENTROID_1               ... DISTANCE_TO_CENTROID_5
0   30                  3               ...      1.160697
1   63                  4               ...      1.160697
2   66                  3               ...      1.160697

Saving the model in the schema named 'MODEL_STORAGE'. Please see the model_storage class for further features of model storage.

>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model)

Reloading the model for further use

>>> model2 = AutoUnsupervisedClustering(conn_context=CONN)
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.head(3).collect()
    id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0   30                  3                        0.640378
1   63                  4                        0.611050
2   66                  3                        0.640378
Attributes
model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the "SUMMARY" table generated by the model training. This table contains the summary about the model training.

indicators_: APLArtifactTable

The reference to the "INDICATORS" table generated by the model training. This table contains the various metrics related to the model and its variables.

fit_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table generated by the model training

var_desc_: APLArtifactTable

The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table when a prediction was made

Methods

fit(data[, key, features, weight])

Fits the model.

fit_predict(data[, key, features, weight])

Fits a clustering model and uses it to generate prediction output on the training dataset.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Reference Guide.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_metrics()

Returns a dictionary containing the metrics about the model.

get_model_info()

Gets information about an existing model.

get_params()

Retrieves attributes of the current object.

get_predict_operation_log()

Retrieves the operation log table after the last prediction.

get_summary()

Retrieves the summary table after model training.

is_fitted()

Checks if the model can be saved.

load_model(schema_name, table_name[, oid])

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

predict(data)

Predicts which cluster each specified row belongs to.

save_artifact(artifact_df, schema_name, ...)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, ...])

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

set_params(**parameters)

Sets attributes of the current model.

fit(data, key=None, features=None, weight=None)

Fits the model.

Parameters
data: hana_ml DataFrame

The training dataset

key: str, optional

The name of the ID column. This column will not be used as a feature in the model. It will be output as the row ID when a prediction is made with the model. If key is not provided, an internal key is created, but this is not recommended.

features: list of str, optional

The names of the features to be used in the model. If features is not provided, all columns will be taken except the ID column.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.

Returns
self: object
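
To give an intuition for what a relative weight per observation means, the following sketch computes a weighted centroid in plain Python. It is only a generic illustration; how SAP HANA APL uses the weight variable internally is not described in this section, and the function name is hypothetical.

```python
def weighted_centroid(points, weights):
    """Average the points coordinate by coordinate, each scaled by its weight."""
    total = sum(weights)
    dims = len(points[0])
    return tuple(
        sum(w * p[d] for p, w in zip(points, weights)) / total
        for d in range(dims)
    )

# The second observation counts three times as much as the first one,
# so the centroid is pulled towards it.
print(weighted_centroid([(0.0, 0.0), (2.0, 2.0)], [1.0, 3.0]))  # (1.5, 1.5)
```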
fit_predict(data, key=None, features=None, weight=None)

Fits a clustering model and uses it to generate prediction output on the training dataset.

Parameters
data: hana_ml DataFrame

The input dataset

key: str, optional

The name of the ID column.

features: list of str, optional

The names of the feature columns. If features is not provided, all non-ID columns will be taken.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.

Returns
hana_ml DataFrame
The output is the same as the predict() method.

Notes

See the predict() method for how to obtain different outputs with the 'extra_applyout_settings' parameter.

get_metrics()

Returns a dictionary containing the metrics about the model.

Returns
A dictionary object containing a set of clustering metrics and their values

Examples

>>> model.get_metrics()
{'SimplifiedSilhouette': 0.14668968897882997,
 'RSS': 24462.640041325714,
 'IntraInertia': 3.2233573348587714,
 'KL': {1: OrderedDict([('hours-per-week', 0.2971627592049324),
             ('occupation', 0.11944355994892383),
             ('relationship', 0.06772624975990414),
             ('education-num', 0.06377345492340795),
             ('education', 0.06377345492340793),
             ...}
get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

Returns
A pandas DataFrame with detailed information about the current version.

An error is raised when the call fails. The cause can be either that SAP HANA APL is not installed or that the current user does not have the appropriate rights.

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Reference Guide.

Parameters
report_name: str
Returns
Statistical report: hana_ml DataFrame
get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns
The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs of the last model training
get_indicators()

Retrieves the Indicator table after model training.

Returns
The reference to the INDICATORS table: hana_ml DataFrame
This table provides the performance metrics of the last model training
get_model_info()

Gets information about an existing model. This method is especially useful when a trained model was saved and reloaded. After this method has been called, the model can provide its summary and metrics again as they were after the last fit.

Returns: list. List of HANA DataFrames corresponding respectively to the following tables:

Summary table, Variable roles table, Variable description table, Indicators table, Profit Curves table
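
Because the result is a positional list, it can be convenient to pair each entry with a name. The helper below is a hypothetical sketch that assumes the list has exactly five entries in the order given above. The strings in the usage line stand in for the real DataFrames.

```python
# Names chosen for illustration; the order follows the list documented above.
TABLE_NAMES = ("summary", "variable_roles", "variable_description",
               "indicators", "profit_curves")

def label_model_info(tables):
    """Pair each returned table with a readable name, assuming the documented order."""
    if len(tables) != len(TABLE_NAMES):
        raise ValueError(f"expected {len(TABLE_NAMES)} tables, got {len(tables)}")
    return dict(zip(TABLE_NAMES, tables))

info = label_model_info(["t_summary", "t_roles", "t_desc", "t_indicators", "t_profit"])
print(info["indicators"])  # t_indicators
```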

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with scikit-learn.

Returns
The attribute-values of the model: dictionary
get_predict_operation_log()

Retrieves the operation log table after the last prediction.

Returns
The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs about the last prediction
get_summary()

Retrieves the summary table after model training.

Returns
The reference to the SUMMARY table: hana_ml DataFrame
This contains execution summary of the last model training
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(schema_name, table_name, oid=None)

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Loads the model from a table.

Parameters
schema_name: str

The schema name

table_name: str

The table name

oid: str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.

predict(data)

Predicts which cluster each specified row belongs to.

Parameters
data: hana_ml DataFrame

The set of rows for which to generate cluster predictions. This dataset must have the same structure as the one used in the fit() method.

Returns
hana_ml DataFrame

By default, the ID of the closest cluster and the distance to its center are provided. Users can request different outputs by setting the extra_applyout_settings parameter in the model. The extra_applyout_settings parameter is a dictionary with 'mode' and 'nb_distances' as keys. If mode is set to 'closest_distances', cluster IDs and distances to centroids will be provided from the closest to the furthest cluster. The output columns will be:

  • <The key column name>,

  • CLOSEST_CLUSTER_1,

  • DISTANCE_TO_CLOSEST_CENTROID_1,

  • CLOSEST_CLUSTER_2,

  • DISTANCE_TO_CLOSEST_CENTROID_2,

...

If mode is set to 'all_distances', the distances to the centroids of all clusters will be provided in cluster ID order. The output columns will be:

  • <The key column name>,

  • DISTANCE_TO_CENTROID_1,

  • DISTANCE_TO_CENTROID_2,

...

nb_distances limits the output to the closest clusters. It is only valid when mode is 'closest_distances' (it will be ignored if mode is 'all_distances'). It can be set to 'all' or a positive integer.
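
The column layouts described above can be summarized by a small helper that lists the expected output columns for a given setting. This is a hypothetical sketch, not part of hana_ml. It assumes that a missing nb_distances means 'all', that nb_distances is capped at the number of clusters, and that the first column carries the key column name in every mode, as the sample outputs suggest.

```python
def expected_predict_columns(key, nb_clusters, extra_applyout_settings=None):
    """List the output column names described in the predict() documentation."""
    settings = extra_applyout_settings or {}
    mode = settings.get("mode")
    if mode == "all_distances":
        # One distance column per cluster, in cluster ID order.
        return [key] + [f"DISTANCE_TO_CENTROID_{i}" for i in range(1, nb_clusters + 1)]
    if mode == "closest_distances":
        nb = settings.get("nb_distances", "all")  # assumption: default is 'all'
        count = nb_clusters if nb == "all" else min(int(nb), nb_clusters)
    else:
        count = 1  # default output: closest cluster only
    columns = [key]
    for i in range(1, count + 1):
        columns += [f"CLOSEST_CLUSTER_{i}", f"DISTANCE_TO_CLOSEST_CENTROID_{i}"]
    return columns

print(expected_predict_columns("id", 5))
# ['id', 'CLOSEST_CLUSTER_1', 'DISTANCE_TO_CLOSEST_CENTROID_1']
```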

Examples

Retrieves the IDs of the 3 closest clusters and the distances to their centroids:

>>> extra_applyout_settings = {'mode': 'closest_distances', 'nb_distances': 3}
>>> model.set_params(extra_applyout_settings=extra_applyout_settings)
>>> out = model.predict(hana_df)
>>> out.head(3).collect()
    id  CLOSEST_CLUSTER_1  ...  CLOSEST_CLUSTER_3  DISTANCE_TO_CLOSEST_CENTROID_3
0   30                  3  ...                  4                        0.730330
1   63                  4  ...                  1                        0.851054
2   66                  3  ...                  4                        0.730330

Retrieves the distances to all clusters:

>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'})
>>> out = model.predict(hana_df)
>>> out.head(3).collect()
   id  DISTANCE_TO_CENTROID_1  DISTANCE_TO_CENTROID_2  ... DISTANCE_TO_CENTROID_5
0  30                0.994595                0.877414  ...              0.782949
1  63                0.994595                0.985202  ...              0.782949
2  66                0.994595                0.877414  ...              0.782949
save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters
schema_name: str

The schema name

artifact_df: hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str, {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
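
The three if_exists behaviors described above can be pictured with an in-memory stand-in for the database. The sketch below is purely illustrative: save_rows and the dictionary of tables are hypothetical and do not reflect how hana_ml writes to SAP HANA.

```python
def save_rows(tables, table_name, rows, if_exists="fail"):
    """Mimic the documented if_exists behavior on a dict of in-memory tables."""
    if if_exists not in ("fail", "replace", "append"):
        raise ValueError(f"unsupported if_exists value: {if_exists!r}")
    if table_name in tables:
        if if_exists == "fail":
            raise ValueError(f"table {table_name!r} already exists")
        if if_exists == "replace":
            del tables[table_name]  # drop the table before inserting new values
    # 'append' (or a new table): add the rows to whatever is there.
    tables.setdefault(table_name, []).extend(rows)
    return tables

tables = {}
save_rows(tables, "MyModel_Indicators", ["row1"])
save_rows(tables, "MyModel_Indicators", ["row2"], if_exists="append")
print(tables["MyModel_Indicators"])  # ['row1', 'row2']
save_rows(tables, "MyModel_Indicators", ["row3"], if_exists="replace")
print(tables["MyModel_Indicators"])  # ['row3']
```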

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )
save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Parameters
schema_name: str

The schema name

table_name: str

Table name

if_exists: str, {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns
None
The model is saved into a table with the following columns:
  • "OID" NVARCHAR(50), -- Serve as ID

  • "FORMAT" NVARCHAR(50), -- APL technical info

  • "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model
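
To illustrate how several models can share one table and be told apart by their OID, the sketch below uses SQLite from the Python standard library as a stand-in for SAP HANA. The column names follow the list above. The table name, the helper functions and the PRIMARY KEY constraint (used here to express the "OID must be unique" rule) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same three columns as documented above, with SQLite types as stand-ins.
conn.execute('CREATE TABLE models ("OID" TEXT PRIMARY KEY, "FORMAT" TEXT, "LOB" TEXT)')

def append_model(conn, oid, fmt, lob):
    """Insert one model row; a duplicate OID raises sqlite3.IntegrityError."""
    conn.execute('INSERT INTO models ("OID", "FORMAT", "LOB") VALUES (?, ?, ?)',
                 (oid, fmt, lob))

def read_model(conn, oid=None):
    """Read one model by OID, or the whole table when no OID is given."""
    if oid is None:
        return conn.execute('SELECT "OID", "LOB" FROM models ORDER BY "OID"').fetchall()
    return conn.execute('SELECT "OID", "LOB" FROM models WHERE "OID" = ?',
                        (oid,)).fetchall()

append_model(conn, "model-a", "fmt", "content A")
append_model(conn, "model-b", "fmt", "content B")
print(read_model(conn, "model-b"))  # [('model-b', 'content B')]
```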

set_params(**parameters)

Sets attributes of the current model.

Parameters
params: dictionary

The set of parameters with their new values

class hana_ml.algorithms.apl.clustering.AutoSupervisedClustering(conn_context=None, label=None, nb_clusters=None, nb_clusters_min=None, nb_clusters_max=None, distance=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: hana_ml.algorithms.apl.clustering._AutoClusteringBase

SAP HANA APL Supervised Clustering algorithm. Clusters are determined with respect to a label variable.

Parameters
label: str

The name of the label column

nb_clusters: int, optional, default = 10

The number of clusters to create

nb_clusters_min: int, optional

The minimum number of clusters to create. If the nb_clusters parameter is set, it will be ignored.

nb_clusters_max: int, optional

The maximum number of clusters to create. If the nb_clusters parameter is set, it will be ignored.

distance: str, optional, default = 'SystemDetermined'

The metric used to measure the distance between data points. The possible values are: 'L1', 'L2', 'LInf', 'SystemDetermined'.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value types (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals '???', it will be taken as missing.

extra_applyout_settings: dict, optional

Defines the output to generate when applying the model. See documentation on predict() method for more information.

other_params: dict, optional

Corresponds to the advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • calculate_cross_statistics

  • calculate_sql_expressions

  • cutting_strategy

  • encoding_strategy

See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.

other_train_apl_aliases: dict, optional

Users can provide APL aliases as advanced settings to the model. Unlike 'other_params' described above, users are free to input any possible value in 'other_train_apl_aliases'. These values are not validated in Python.
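
The precedence between nb_clusters and the nb_clusters_min / nb_clusters_max pair described above can be written down as a small rule. The helper below is hypothetical and only restates that rule. It does not model what happens when none of the three parameters is set.

```python
def requested_cluster_range(nb_clusters=None, nb_clusters_min=None, nb_clusters_max=None):
    """Return the (min, max) number of clusters that the settings ask for."""
    if nb_clusters is not None:
        # nb_clusters takes precedence: the min/max settings are ignored.
        return (nb_clusters, nb_clusters)
    return (nb_clusters_min, nb_clusters_max)

print(requested_cluster_range(nb_clusters=5, nb_clusters_min=2, nb_clusters_max=8))  # (5, 5)
print(requested_cluster_range(nb_clusters_min=2, nb_clusters_max=8))                 # (2, 8)
```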

Notes

  • The algorithm may detect fewer clusters than requested.

This happens when a cluster detected on the estimation dataset was not found on the validation dataset. In that case, this cluster will be considered unstable and will then be removed from the model. Users can get the number of clusters actually found in the "INDICATORS" table. For example,

# The actual number of clusters found
d = model_u.get_indicators().collect()
d[d.KEY=='FinalClusterCount'][['KEY','VALUE']]
  • It is highly recommended to use a dataset with a key provided in the fit() method.

Otherwise, once the model is trained, the predict() method can no longer be used with a key, because the model will not expect one.

  • By default, when it is not given, SAP HANA APL guesses the variable description by reading the first 100 rows.

This guess is sometimes incorrect. By explicitly providing values for these parameters, the user can override the default guess. For example:

model.set_params(
        variable_storages = {
            'ID': 'integer',
            'sepal length (cm)': 'number'
            })
model.set_params(
        variable_value_types = {
            'sepal length (cm)': 'continuous'
            })
model.set_params(
        variable_missing_strings = {
            'sepal length (cm)': '-1'
            })
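
Why a guess based on the first 100 rows can go wrong is easy to reproduce with a toy heuristic. The function below is a hypothetical illustration and is not the logic SAP HANA APL uses. It infers a storage type from a sample of string values and gives a different answer once a later row is taken into account.

```python
def guess_storage(values, sample_size=100):
    """Guess 'integer', 'number' or 'string' from the first sample_size values."""
    sample = values[:sample_size]  # sample_size=None means "look at every value"

    def all_parse(cast):
        try:
            for v in sample:
                cast(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "number"
    return "string"

# The first 100 rows look like integers, but row 101 does not.
column = ["1"] * 100 + ["n/a"]
print(guess_storage(column))                    # integer
print(guess_storage(column, sample_size=None))  # string
```

In such a case, setting variable_storages explicitly, as shown above, avoids relying on the sample.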

Examples

>>> from hana_ml.algorithms.apl.clustering import AutoSupervisedClustering
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates Hana DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = AutoSupervisedClustering(nb_clusters=5)
>>> model.fit(data=hana_df, key='id', label='class')

Debriefing

>>> model.get_metrics()
OrderedDict([('SimplifiedSilhouette', 0.3448029020802121), ('RSS', 4675.706587754118),...
>>> model.get_metrics_by_cluster()
{'Frequency': {1: 0.15139770759462357,
  2: 0.39707539649817214,
  3: 0.21549710013468568,
  4: 0.12949066820593166,
  5: 0.10653912756658696},
 'IntraInertia': {1: 0.1604412809425719,
  2: 0.10561882166246073,
  3: 0.12004212490063185,
  4: 0.21030892961293207,
  5: 0.08625667904000194},
 'RSS': {1: 883.710575431686,
  2: 1525.7694977359076,
  3: 941.1302592209537,
  4: 990.765367406523,
  5: 334.3308879590475},
 'SimplifiedSilhouette': {1: 0.3355726073943343,
  2: 0.4231738907945281,
  3: 0.2448648428415369,
  4: 0.38136325589137554,
  5: 0.22353657540054947},
 'TargetMean': {1: 0.1744734931009441,
  2: 0.022912917070469333,
  3: 0.3895408163265306,
  4: 0.7537677775419231,
  5: 0.21207430340557276},
 'TargetStandardDeviation': {1: 0.37951613049526484,
  2: 0.14962591788119842,
  3: 0.48764615116105525,
  4: 0.4308154072006165,
  5: 0.40877719266198526},
 'KL': {1: OrderedDict([('relationship', 0.6840012706191696),
               ('education', 0.675109873839992),
               ('education-num', 0.6751098738399919),
               ('marital-status', 0.5806503390741476),
               ('occupation', 0.46891689485806354),
               ('sex', 0.08802303491483551),
               ('capital-gain', 0.08794254258565125),
               ...

Predicting which cluster a data point belongs to

>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0   30                  3                        0.640378
1   63                  4                        0.611050
2   66                  3                        0.640378
3  110                  4                        0.611050
4  335                  1                        0.851054

Determining the 2 closest clusters

>>> model.set_params(extra_applyout_settings={'mode':'closest_distances', 'nb_distances': 2})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  ...  CLOSEST_CLUSTER_2  DISTANCE_TO_CLOSEST_CENTROID_2
0   30                  3  ...                  4                        0.730330
1   63                  4  ...                  1                        0.851054
2   66                  3  ...                  4                        0.730330
3  110                  4  ...                  1                        0.851054
4  335                  1  ...                  4                        0.906003

Retrieving the distances to all clusters

>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  DISTANCE_TO_CENTROID_1               ... DISTANCE_TO_CENTROID_5
0   30                0.851054               ...      1.160697
1   63                0.751054               ...      1.160697
2   66                0.906003               ...      1.160697

Saving the model in the schema named 'MODEL_STORAGE'. Please see the model_storage class for further features of model storage.

>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model, if_exists='replace')

Reloading the model for further use. Please note that the label has to be specified again prior to calling predict().

>>> model2 = AutoSupervisedClustering()
>>> model2.set_params(label='class')
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.head(3).collect()
    id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0   30                  3                        0.640378
1   63                  4                        0.611050
2   66                  3                        0.640378
Attributes
model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the "SUMMARY" table generated by the model training. This table contains the summary about the model training.

indicators_: APLArtifactTable

The reference to the "INDICATORS" table generated by the model training. This table contains the various metrics related to the model and its variables.

fit_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table generated by the model training

var_desc_: APLArtifactTable

The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table when a prediction was made

Methods

fit(data[, key, label, features, weight])

Fits the model.

fit_predict(data[, key, label, features, weight])

Fits a clustering model and uses it to generate prediction output on the training dataset.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Reference Guide.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_metrics()

Returns a dictionary containing the metrics about the model.

get_model_info()

Gets information about an existing model.

get_params()

Retrieves attributes of the current object.

get_predict_operation_log()

Retrieves the operation log table after the last prediction.

get_summary()

Retrieves the summary table after model training.

is_fitted()

Checks if the model can be saved.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data)

Predicts which cluster each specified row belongs to.

save_artifact(artifact_df, schema_name, ...)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, ...])

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

set_params(**parameters)

Sets attributes of the current model.

set_params(**parameters)

Sets attributes of the current model.

Parameters
params: dictionary

A dictionary containing attribute names and values

fit(data, key=None, label=None, features=None, weight=None)

Fits the model.

Parameters
data: hana_ml DataFrame

The training dataset

key: str, optional

The name of the ID column. This column will not be used as a feature in the model. It will be output as the row ID when a prediction is made with the model. If key is not provided, an internal key is created, but this is not recommended.

label: str, optional

The name of the label column. If it is not given, the model 'label' attribute will be taken. If that attribute is not defined either, an error will be raised.

features: list of str, optional

The names of the features to be used in the model. If features is not provided, all columns will be taken except the ID and the label columns.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.

Returns
self: object
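
The default feature selection described for fit() (all columns except the ID and label columns) amounts to a simple filter. The helper below is a hypothetical sketch of that rule in plain Python, with made-up column names. It is not the hana_ml implementation.

```python
def default_features(columns, key=None, label=None):
    """Return the columns used as features when 'features' is not provided."""
    excluded = {c for c in (key, label) if c is not None}
    return [c for c in columns if c not in excluded]

columns = ["id", "age", "education", "class"]
print(default_features(columns, key="id", label="class"))  # ['age', 'education']
```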
predict(data)

Predicts which cluster each specified row belongs to.

Parameters
data: hana_ml DataFrame

The set of rows for which to generate cluster predictions. This dataset must have the same structure as the one used in the fit() method.

Returns
hana_ml DataFrame

By default, the ID of the closest cluster and the distance to its center are provided. Users can request different outputs by setting the extra_applyout_settings parameter in the model. The extra_applyout_settings parameter is a dictionary with 'mode' and 'nb_distances' as keys. If mode is set to 'closest_distances', cluster IDs and distances to centroids will be provided from the closest to the furthest cluster. The output columns will be:

  • <The key column name>,

  • CLOSEST_CLUSTER_1,

  • DISTANCE_TO_CLOSEST_CENTROID_1,

  • CLOSEST_CLUSTER_2,

  • DISTANCE_TO_CLOSEST_CENTROID_2,

...

If mode is set to 'all_distances', the distances to the centroids of all clusters will be provided in cluster ID order. The output columns will be:

  • <The key column name>,

  • DISTANCE_TO_CENTROID_1,

  • DISTANCE_TO_CENTROID_2,

...

nb_distances limits the output to the closest clusters. It is only valid when mode is 'closest_distances' (it will be ignored if mode is 'all_distances'). It can be set to 'all' or a positive integer.
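
A settings dictionary can be checked against these rules before it is passed to set_params(). The function below is a hypothetical sketch. It assumes that 'closest_distances' and 'all_distances' are the only modes, since they are the only ones documented here, and it drops nb_distances when the mode ignores it.

```python
def check_applyout_settings(settings):
    """Validate 'mode' and 'nb_distances' and return a normalized copy."""
    mode = settings.get("mode")
    if mode not in ("closest_distances", "all_distances"):
        raise ValueError(f"unknown mode: {mode!r}")
    nb = settings.get("nb_distances")
    if mode == "all_distances" or nb is None:
        # nb_distances is ignored for 'all_distances', or was simply not set.
        return {"mode": mode}
    is_positive_int = isinstance(nb, int) and not isinstance(nb, bool) and nb > 0
    if nb != "all" and not is_positive_int:
        raise ValueError(f"nb_distances must be 'all' or a positive integer, got {nb!r}")
    return {"mode": mode, "nb_distances": nb}

print(check_applyout_settings({"mode": "closest_distances", "nb_distances": 3}))
# {'mode': 'closest_distances', 'nb_distances': 3}
```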

Examples

Retrieves the IDs of the 3 closest clusters and the distances to their centroids:

>>> extra_applyout_settings = {'mode': 'closest_distances', 'nb_distances': 3}
>>> model.set_params(extra_applyout_settings=extra_applyout_settings)
>>> out = model.predict(hana_df)
>>> out.head(3).collect()
    id  CLOSEST_CLUSTER_1  ...  CLOSEST_CLUSTER_3  DISTANCE_TO_CLOSEST_CENTROID_3
0   30                  3  ...                  4                        0.730330
1   63                  4  ...                  1                        0.851054
2   66                  3  ...                  4                        0.730330

Retrieves the distances to all clusters:

>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'})
>>> out = model.predict(hana_df)
>>> out.head(3).collect()
   id  DISTANCE_TO_CENTROID_1  DISTANCE_TO_CENTROID_2  ... DISTANCE_TO_CENTROID_5
0  30                0.994595                0.877414  ...              0.782949
1  63                0.994595                0.985202  ...              0.782949
2  66                0.994595                0.877414  ...              0.782949
fit_predict(data, key=None, label=None, features=None, weight=None)

Fits a clustering model and uses it to generate prediction output on the training dataset.

Parameters
data: hana_ml DataFrame

The input dataset

key: str, optional

The name of the ID column

label: str

The name of the label column

features: list of str, optional

The names of the feature columns. If features is not provided, all non-ID and non-label columns will be taken.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.

Returns
hana_ml DataFrame.
The output is the same as the predict() method.

Notes

See the predict() method for how to obtain different outputs with the 'extra_applyout_settings' parameter.

get_metrics()

Returns a dictionary containing the metrics about the model.

Returns
A dictionary object containing a set of clustering metrics and their values

Examples

>>> model.get_metrics()
{'SimplifiedSilhouette': 0.14668968897882997,
 'RSS': 24462.640041325714,
 'IntraInertia': 3.2233573348587714,
 'Frequency': {
    1: 0.3167862345729914,
    2: 0.35590005772243755,
    3: 0.3273137077045711},
 'IntraInertia': {1: 0.7450335510518645,
     2: 0.708350629565789,
     3: 0.7006679558645009},
 'RSS': {1: 8586.511675872738,
     2: 9171.723951617836,
     3: 8343.554018434477},
 'SimplifiedSilhouette': {1: 0.13324659043317924,
     2: 0.14182734764281074,
     3: 0.1311620470933516},
 'TargetMean': {1: 0.1744734931009441,
      2: 0.022912917070469333,
      3: 0.3895408163265306},
 'TargetStandardDeviation': {1: 0.37951613049526484,
      2: 0.14962591788119842,
      3: 0.48764615116105525},
 'KL': {1: OrderedDict([('hours-per-week', 0.2971627592049324),
             ('occupation', 0.11944355994892383),
             ('relationship', 0.06772624975990414),
             ('education-num', 0.06377345492340795),
             ('education', 0.06377345492340793),
             ...
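The returned dictionary can be explored with plain Python. The sketch below uses an excerpt whose values are copied from the example above. It assumes that 'Frequency' holds each cluster's share of the observations, which the documentation does not state explicitly.

```python
# Excerpt of a get_metrics() result, with values copied from the example above
metrics_excerpt = {
    'Frequency': {1: 0.3167862345729914,
                  2: 0.35590005772243755,
                  3: 0.3273137077045711},
}

frequency = metrics_excerpt['Frequency']
# Cluster with the highest frequency value
largest_cluster = max(frequency, key=frequency.get)
# Under the assumption above, the values should add up to about 1
total_share = sum(frequency.values())
```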
load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Parameters
schema_name: str

The schema name

table_name: str

The table name

oid: str, optional

If the table contains several models, the OID must be given as an identifier. If it is not provided, the whole table is read.

Notice
------
Prior to using a reloaded model for a new prediction, it is necessary to re-specify
the 'label' parameter. Otherwise, the predict() method will fail.
get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

Returns
A pandas DataFrame with detailed information about the current version.

Error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Reference Guide.

Parameters
report_name: str
Returns
Statistical report: hana_ml DataFrame
get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns
The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs of the last model training
get_indicators()

Retrieves the Indicator table after model training.

Returns
The reference to the INDICATORS table: hana_ml DataFrame
This table provides the performance metrics of the last model training
get_model_info()

Gets information about an existing model. This method is especially useful when a trained model was saved and reloaded. After calling this method, the model can provide the summary and metrics again as they were after the last fit.

Returns: list. List of HANA DataFrames corresponding respectively to the following tables:

Summary table, Variable roles table, Variable description table, Indicators_table, Profit Curves_table

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.

Returns
The attribute values of the model: dictionary
get_predict_operation_log()

Retrieves the operation log table after the last prediction.

Returns
The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs about the last prediction
get_summary()

Retrieves the summary table after model training.

Returns
The reference to the SUMMARY table: hana_ml DataFrame
This contains execution summary of the last model training
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters
schema_name: str

The schema name

artifact_df: hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str. Optional.

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )
save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Parameters
schema_name: str

The schema name

table_name: str

Table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str. Optional.

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns
None
The model is saved into a table with the following columns:
  • "OID" NVARCHAR(50), -- Serve as ID

  • "FORMAT" NVARCHAR(50), -- APL technical info

  • "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model
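The three if_exists policies can be pictured with a toy in-memory emulation. This is only a sketch of the semantics described above. The function save_rows and the dictionary of tables are hypothetical, nothing here calls hana_ml or SAP HANA, and raising ValueError for 'fail' is an assumption borrowed from the save_artifact() description.

```python
def save_rows(tables, table_name, rows, if_exists='fail'):
    """Toy emulation of the if_exists policy on a dict of in-memory tables.
    This is not hana_ml code and does not talk to SAP HANA."""
    if if_exists not in ('fail', 'replace', 'append'):
        raise ValueError('unknown if_exists value: %r' % if_exists)
    if table_name in tables:
        if if_exists == 'fail':
            # Assumption: the failure is reported as a ValueError
            raise ValueError('table %r already exists' % table_name)
        if if_exists == 'replace':
            del tables[table_name]  # drop the table before inserting
    tables.setdefault(table_name, []).extend(rows)


tables = {}
save_rows(tables, 'MyModel', [{'OID': 'model_a'}])
save_rows(tables, 'MyModel', [{'OID': 'model_b'}], if_exists='append')
after_append = [row['OID'] for row in tables['MyModel']]
save_rows(tables, 'MyModel', [{'OID': 'model_c'}], if_exists='replace')
after_replace = [row['OID'] for row in tables['MyModel']]
```

With 'append', several models share one table and are told apart by their OID. With 'replace', only the last saved model remains.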

hana_ml.algorithms.apl.time_series

This module contains the SAP HANA APL Time Series algorithm.

The following class is available:

class hana_ml.algorithms.apl.time_series.AutoTimeSeries(conn_context=None, time_column_name=None, target=None, horizon=1, with_extra_predictable=True, last_training_time_point=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, train_data_=None, sort_data=True, **other_params)

Bases: hana_ml.algorithms.apl.apl_base.APLBase

SAP HANA APL Time Series algorithm.

Parameters
target: str

The name of the column containing the time series data points.

time_column_name: str

The name of the column containing the time series time points. The time column is used as table key. It can be overridden by setting the 'key' parameter through the fit() method.

last_training_time_point: str, optional

The last time point used for model training. The training dataset will contain all data points up to this date. By default, this parameter will be set as the last time point until which the target is not null.

horizon: int, optional

The number of forecasts to be generated by the model upon apply. The time series model will be trained to optimize accuracy on the requested horizon only. The default value is 1.

with_extra_predictable: bool, optional

If set to true, all input variables will be used by the model to generate forecasts. If set to false, only the time and target columns will be used. All other variables will be ignored. This parameter is set to true by default.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value types (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals '???', it will be taken as missing.

extra_applyout_settings: dict, optional

Specifies the prediction outputs. See documentation on predict() method for more details.

other_params: dict, optional

Corresponds to the advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:
  • force_negative_forecast

  • force_positive_forecast

  • forecast_fallback_method

  • forecast_max_cyclics

  • forecast_max_lags

  • forecast_method

  • smoothing_cycle_length

See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.

other_train_apl_aliases: dict, optional

Users can provide APL aliases as advanced settings to the model. Unlike 'other_params' described above, users are free to input any possible value; no validation is performed in Python.

Notes

The input dataset, given as a hana_ml DataFrame, must not be a temporary table because the API tries to create a view sorted by the time column, and SAP HANA does not allow users to create a view on a temporary table. Although it is not recommended, users can avoid creating the view by setting the sort_data parameter to False.

When calling the fit_predict() method, the time series model is generated on the fly and not returned. If a model must be saved, please consider using the fit() method instead.

When extra-predictable variables are involved, it is usual to have a single dataset used both for the model training and the forecasting. In this case, the dataset should contain two successive periods:

The first one is used for the model training, ranging from the beginning to the last date where the target value is not null.

The second one is used for the forecasting, ranging from the first date where the target value is null.
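The two periods can be separated with a few lines of plain Python once the rows are available locally. The helper split_periods below is hypothetical. It assumes that the rows are sorted by time and that the forecast period starts right after the last non-null target value. The sample values come from the CASHFLOWS_FULL outputs shown in this documentation.

```python
def split_periods(rows, target):
    """Splits time-ordered rows into (training, forecast) periods.

    Hypothetical helper: the rows are assumed to be sorted by time, and the
    forecast period is assumed to start at the first row following the last
    non-null target value."""
    last_known = max(
        (i for i, row in enumerate(rows) if row[target] is not None),
        default=-1)
    return rows[:last_known + 1], rows[last_known + 1:]


rows = [
    {'Date': '2001-12-27', 'Cash': 5995.423295},
    {'Date': '2001-12-28', 'Cash': 7111.416697},
    {'Date': '2002-01-03', 'Cash': None},
    {'Date': '2002-01-04', 'Cash': None},
]
training, forecast = split_periods(rows, 'Cash')
```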

The content of the output of the get_performance_metrics() method may change depending on the version of SAP HANA APL used with this API. Please refer to the SAP HANA APL documentation to know which metrics will be provided.

Examples

>>> from hana_ml.algorithms.apl.time_series import AutoTimeSeries
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext(HDB_HOST, HDB_PORT, HDB_USER, HDB_PASS)
>>> # -- Creates Hana DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CASHFLOWS_FULL')

Creating and fitting the model

>>> model = AutoTimeSeries(time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(data=hana_df)

Debriefing

>>> model.get_model_components()
{'Trend': 'Polynom( Date)',
 'Cycles': 'PeriodicExtrasPred_MondayMonthInd',
 'Fluctuations': 'AR(46)'}
>>> model.get_performance_metrics()
{'MAPE': [0.12853715702893018, 0.12789963348617622, 0.12969031859857874], ...}

Generating forecasts using the forecast() method. This method generates forecasts using a signature similar to the one used in PAL. There are two usage variants, described below:

1) If the model does not use extra-predictable variables (no exogenous variable), users must simply specify the number of forecasts.

>>> train_df = DataFrame(CONN,
...                     'SELECT "Date" , "Cash" '
...                     'from APL_SAMPLES.CASHFLOWS_FULL ORDER BY 1 LIMIT 100')
>>> model = AutoTimeSeries(time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(forecast_length=3)
>>> out.collect().tail(5)
           Date                            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
98   2001-05-23  3057.812544999999772699132909775  4593.966530              NaN              NaN
99   2001-05-25  3037.539714999999887176132440567  4307.893346              NaN              NaN
100  2001-05-26                              None  4206.023158     -3609.599872     12021.646187
101  2001-05-27                              None  4575.162651     -3392.283802     12542.609104
102  2001-05-28                              None  4830.352462     -3239.507360     12900.212284

2) If the model uses extra-predictable variables, users must provide the values of all extra-predictable variables for each time point of the forecast period. These values must be provided as a hana_ml dataframe with the same structure as the training dataset.

>>> # Trains the model on a dataset with extra-predictable variables
>>> train_df = DataFrame(CONN,
...                     'SELECT * '
...                     'from APL_SAMPLES.CASHFLOWS_FULL '
...                     'WHERE "Cash" is not null')
>>> # Extra-predictable variables' values on the forecast period
>>> forecast_df = DataFrame(CONN,
...                        'SELECT * '
...                        'from APL_SAMPLES.CASHFLOWS_FULL '
...                        'WHERE "Cash" is null LIMIT 5')
>>> model = AutoTimeSeries(time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(data=forecast_df)
>>> out.collect().tail(5)
          Date ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
251  2001-12-29   None  6864.371407      -224.079492     13952.822306
252  2001-12-30   None  6889.515324      -211.264912     13990.295559
253  2001-12-31   None  6914.766513      -187.180923     14016.713949
254  2002-01-01   None  6940.124974              NaN              NaN
255  2002-01-02   None  6965.590706              NaN              NaN

Generating forecasts with the predict() method. The predict() method allows users to apply a fitted model on a dataset different from the training dataset. For example, users can train a model on the first quarter (January to March) and apply it to a dataset covering a different period (March to May).

>>> # Trains the model on the first quarter, from January to March
>>> train_df = DataFrame(CONN,
...                     'SELECT "Date" , "Cash" '
...                     'from APL_SAMPLES.CASHFLOWS_FULL '
...                     "where \"Date\" between '2001-01-01' and '2001-03-31'"
...                     " ORDER BY 1")
>>> model.fit(train_df)
>>> # Forecasts on a shifted period, from March to May
>>> test_df = DataFrame(CONN,
...                    'SELECT "Date", "Cash" '
...                    'from APL_SAMPLES.CASHFLOWS_FULL '
...                    "where \"Date\" between '2001-03-01' and '2001-05-31'"
...                    " ORDER BY 1")
>>> out = model.predict(test_df)
>>> out.collect().tail(5)
          Date                            ACTUAL     PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
60  2001-05-30  3837.196734000000105879735597214   4630.223083              NaN              NaN
61  2001-05-31  2911.884261000000151398126928726   4635.265982              NaN              NaN
62  2001-06-01                              None   4538.516542     -1087.461104     10164.494188
63  2001-06-02                              None   4848.815364     -5090.167255     14787.797983
64  2001-06-03                              None   4853.858263     -5138.553275     14846.269801

Using the fit_predict() method. This method enables the user to fit a model and generate forecasts in a single call, and thus get results faster. However, the model is created on the fly and deleted after use, so the user will not be able to save the resulting model.

>>> out = model.fit_predict(hana_df)
>>> out.collect().tail(5)
           Date            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999  6055.761105              NaN              NaN
250  2001-12-28  7111.41669699999  6314.336098              NaN              NaN
251  2002-01-03              None  7033.880804      4529.462710      9538.298899
252  2002-01-04              None  6464.557223      3965.343397      8963.771049
253  2002-01-07              None  6469.141663      3961.414900      8976.868427

Breaking down the time series into trend, cycles, fluctuations and residuals components. If the parameter extra_applyout_settings is set to {'ExtraMode': True}, anytime a forecast method is called, predict(), forecast() or fit_predict(), the output will contain time series components and their corresponding residuals. The prediction columns are suffixed by the horizon number. For instance, 'Cycles_RESIDUALS_3' means the residual of the cycle component in the third horizon.

>>> model.fit(train_df)
>>> model.set_params(extra_applyout_settings={'ExtraMode': True})
>>> out = model.predict(hana_df)
>>> out.collect().tail(5)
               Date              ACTUAL        ...  Cycles_RESIDUALS_3  Fluctuations_RESIDUALS_3
249  2001-12-27  5995.42329499392507553        ...               32.51                  4.48e-13
250  2001-12-28  7111.41669699455205917        ...             -644.77                  1.14e-13
251  2002-01-03                    None        ...                 NaN                       NaN
252  2002-01-04                    None        ...                 NaN                       NaN
253  2002-01-07                    None        ...                 NaN                       NaN
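
When post-processing such an output, it can help to separate the horizon suffix from the rest of the column name. The helper split_horizon_suffix below is a hypothetical sketch. It only relies on the naming rule stated above, namely that the prediction columns end with an underscore followed by the horizon number.

```python
def split_horizon_suffix(column):
    """Returns (base_name, horizon) for a column such as 'Cycles_RESIDUALS_3'.
    Columns without a numeric suffix, such as 'ACTUAL', yield (column, None)."""
    base, sep, suffix = column.rpartition('_')
    if sep and suffix.isdigit():
        return base, int(suffix)
    return column, None


parsed = [split_horizon_suffix(name) for name in
          ['Date', 'ACTUAL', 'Cycles_RESIDUALS_3', 'Fluctuations_RESIDUALS_3']]
```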
Attributes
model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the "SUMMARY" table generated by the model training. This table contains the summary about the model training.

indicators_: APLArtifactTable

The reference to the "INDICATORS" table generated by the model training. This table contains the various metrics related to the model and its variables.

fit_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table generated by the model training

var_desc_: APLArtifactTable

The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table that is produced when making predictions.

train_data_: hana_ml DataFrame

The train dataset

sort_data: bool

If True, a temporary view is created on the dataset to sort data by time. However, users can directly provide a view with sorted dates. In this case, they must set sort_data to False to avoid creating a new view. The default value is True. WARNING: it is recommended to leave this parameter at its default value so that the data is guaranteed to be read in sorted order. If the data is not sorted, the model will fail.

Methods

fit(data[, key, features])

Fits the model.

fit_predict(data[, key, features, horizon])

Fits a model and generates forecasts in a single call to the FORECAST APL function.

forecast([forecast_length, data])

Uses the fitted model to generate out-of-sample forecasts.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Reference Guide.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_horizon_wide_metric([metric_name])

Returns value of performance metric (MAPE, sMAPE, ...) averaged on the forecast horizon.

get_indicators()

Retrieves the Indicator table after model training.

get_model_components()

Returns a dictionary containing the description of the model components, that is trend, cycles and fluctuations, used by the model to generate forecasts.

get_model_info()

Get information about an existing model.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns a dictionary containing the performance metrics of the model.

get_predict_operation_log()

Retrieves the operation log table after the last prediction.

get_summary()

Retrieves the summary table after model training.

is_fitted()

Checks if the model can be saved.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data[, apply_horizon, ...])

Uses the fitted model to generate forecasts.

save_artifact(artifact_df, schema_name, ...)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, ...])

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

set_params(**parameters)

Sets attributes of the current model.

set_params(**parameters)

Sets attributes of the current model.

Parameters
parameters: dict

Contains attribute names and values in the form of keyword arguments

fit(data, key=None, features=None)

Fits the model.

Parameters
data: hana_ml DataFrame

The training dataset

key: str, optional

The column used as row identifier of the dataset. This column corresponds to the time column name. As a result, setting this parameter will overwrite the time_column_name model setting.

features: list of str, optional

The names of the feature columns, meaning the date column and the extra-predictable variables. If features is not provided, it defaults to all columns except the target column.

Returns
self: object
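
The defaults described for key and features can be summarized in a small sketch. The helper resolve_fit_columns is hypothetical and only mirrors the documented rules. It is not how hana_ml is implemented. The column names are taken from the CASHFLOWS_FULL sample shown in this documentation.

```python
def resolve_fit_columns(columns, target, time_column_name, key=None,
                        features=None):
    """Hypothetical helper mirroring the defaults described for fit().
    Returns (time_column, feature_columns)."""
    # A key given to fit() takes precedence over the model's time column
    time_column = key if key is not None else time_column_name
    if features is None:
        # Default: every column except the target
        features = [name for name in columns if name != target]
    return time_column, list(features)


columns = ['Date', 'WorkingDaysIndices', 'BeforeLastWMonth', 'Cash']
time_column, features = resolve_fit_columns(columns, 'Cash', 'Date')
```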
predict(data, apply_horizon=None, apply_last_time_point=None)

Uses the fitted model to generate forecasts.

Parameters
data: hana_ml DataFrame

The input dataset used for predictions

apply_horizon: int, optional

The number of forecasts to generate. By default, the number of forecasts is the horizon on which the model was trained.

apply_last_time_point: str, optional

The time point corresponding to the start of the forecast period. Forecasts will be generated starting from the next time point after the 'apply_last_time_point'. By default, this parameter is set to the value of 'last_training_time_point' known from the model training.

Returns
hana_ml DataFrame
By default the output contains the following columns:
  • <the name of the time column>

  • ACTUAL: the actual value of time series

  • PREDICTED: the forecast value

  • LOWER_INT_95PCT: the lower limit of 95% confidence interval

  • UPPER_INT_95PCT: the upper limit of 95% confidence interval

If ExtraMode is set to True, the output DataFrame will also contain the breakdown of the time series into trend, cycles, fluctuations, and residual components.
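
Once the output has been collected, the default columns can be processed locally. The sketch below works on a few rows shaped like the default output, with values copied from the sample outputs in this documentation. It represents missing values as None for simplicity, whereas the collected pandas output shows NaN. Selecting the rows without an actual value as the out-of-sample forecasts is an assumption that holds for the sample output.

```python
# Rows shaped like the default predict() output; the values are copied from
# the sample output shown in this documentation.
rows = [
    {'Date': '2001-12-28', 'ACTUAL': 7111.41669699999,
     'PREDICTED': 6314.336098,
     'LOWER_INT_95PCT': None, 'UPPER_INT_95PCT': None},
    {'Date': '2002-01-03', 'ACTUAL': None, 'PREDICTED': 7033.880804,
     'LOWER_INT_95PCT': 4529.462710, 'UPPER_INT_95PCT': 9538.298899},
    {'Date': '2002-01-04', 'ACTUAL': None, 'PREDICTED': 6464.557223,
     'LOWER_INT_95PCT': 3965.343397, 'UPPER_INT_95PCT': 8963.771049},
]

# Out-of-sample forecasts: the rows that have no actual value
forecasts = [row for row in rows if row['ACTUAL'] is None]
# Width of the 95% confidence interval for each forecast
widths = {row['Date']: row['UPPER_INT_95PCT'] - row['LOWER_INT_95PCT']
          for row in forecasts}
```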

Examples

Default output

>>> out = model.predict(hana_df)
>>> out.collect().tail(5)
           Date            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999  6055.761105              NaN              NaN
250  2001-12-28  7111.41669699999  6314.336098              NaN              NaN
251  2002-01-03              None   7033.88080       4529.46271       9538.29889
252  2002-01-04              None   6464.55722       3965.34339       8963.77104
253  2002-01-07              None   6469.14166       3961.41490       8976.86842

Retrieving forecasts and components (predicted, trend, cycles and fluctuations). The output columns are suffixed with the horizon index. For example, Trend_1 means the trend component of the first horizon.

>>> model.set_params(extra_applyout_settings={'ExtraMode': True})
>>> out = model.predict(hana_df)
>>> out.collect().tail(5)
           Date                               ACTUAL  PREDICTED_1      Trend_1
249  2001-12-27  5995.423294999999598076101392507553  6055.761105  6814.405390   ...
250  2001-12-28  7111.416696999999658146407455205917  6314.336098  6839.334762   ...
251  2002-01-03                                 None  7033.880804  6991.163710   ...
252  2002-01-04                                 None  6464.557223  7016.843985   ...
253  2002-01-07                                 None  6469.141663  7094.528433   ...
fit_predict(data, key=None, features=None, horizon=None)

Fits a model and generates forecasts in a single call to the FORECAST APL function. This method offers a faster way to perform the model training and forecasting.

However, the user will not have access to the model used internally since it is deleted after the computation of the forecasts.

Parameters
data: hana_ml DataFrame

The input time series dataset

key: str, optional

The date column name. By default, it is equal to the model parameter time_column_name. If it is given, the model parameter time_column_name will be overwritten.

features: list of str, optional

The column names corresponding to the extra-predictable variables (exogenous variables). If features is not provided, it is equal to all columns except the target column.

horizon: int, optional

The number of forecasts to generate. It defaults to the horizon parameter of the model.

Returns
hana_ml DataFrame

The output is the same as the predict() method.

forecast(forecast_length=None, data=None)

Uses the fitted model to generate out-of-sample forecasts. The model must already be fitted on a training dataset. This method forecasts over a number of steps after the end of the training dataset. When there are extra-predictable variables (exogenous variables), the input parameter data is required. It must contain the values of the extra-predictable variables for the forecast period. If there are no extra-predictable variables, only the forecast_length parameter is needed.

Parameters
forecast_length: int, optional

The number of forecasts to generate from the end of the train dataset. This parameter is by default the horizon specified in the model parameter.

data: hana_ml DataFrame, optional

The time series with extra-predictable variables used for forecasting. This parameter is required if extra-predictable variables are used in the model. When this parameter is given, the parameter 'forecast_length' is ignored.

Returns
hana_ml DataFrame

The output is the same as the predict() method.

Examples

Case where there is no extra-predictable variable:

>>> train_df = DataFrame(CONN,
...                      'SELECT "Date" , "Cash" '
...                      'from APL_SAMPLES.CASHFLOWS_FULL '
...                      'where "Cash" is not null '
...                      'ORDER BY 1')
>>> print(train_df.collect().tail(5))
            Date         Cash
246  2001-12-20  6382.441052
247  2001-12-21  5652.882539
248  2001-12-26  5081.372996
249  2001-12-27  5995.423295
250  2001-12-28  7111.416697
>>> model = AutoTimeSeries(CONN, time_column_name='Date',
...                        target='Cash',
...                        horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(forecast_length=3)
>>> out.collect().tail(5)
           Date                        ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999901392507553  6814.405390              NaN              NaN
250  2001-12-28  7111.41669699999907455205917  6839.334762              NaN              NaN
251  2001-12-29                          None  6864.371407      -224.079492     13952.822306
252  2001-12-30                          None  6889.515324      -211.264912     13990.295559
253  2001-12-31                          None  6914.766513      -187.180923     14016.713949

Case where there are extra-predictable variables:

>>> train_df = DataFrame(CONN,
...                     'SELECT * '
...                     'from APL_SAMPLES.CASHFLOWS_FULL '
...                     'WHERE "Cash" is not null '
...                     'ORDER BY 1')
>>> print(train_df.collect().tail(5))
           Date  WorkingDaysIndices     ...       BeforeLastWMonth         Cash
246  2001-12-20                  13     ...                      1  6382.441052
247  2001-12-21                  14     ...                      1  5652.882539
248  2001-12-26                  15     ...                      0  5081.372996
249  2001-12-27                  16     ...                      0  5995.423295
250  2001-12-28                  17     ...                      0  7111.416697
>>> # Extra-predictable variables to be provided for the forecast period
>>> forecast_df = DataFrame(CONN,
...                        'SELECT * '
...                        'from APL_SAMPLES.CASHFLOWS_FULL '
...                        'WHERE "Cash" is null '
...                        'ORDER BY 1 '
...                        'LIMIT 3')
>>> print(forecast_df.collect())
         Date  WorkingDaysIndices  ...   BeforeLastWMonth  Cash
0  2002-01-03                   0  ...                  0  None
1  2002-01-04                   1  ...                  0  None
2  2002-01-07                   2  ...                  0  None
>>> model = AutoTimeSeries(CONN,
...                        time_column_name='Date',
...                        target='Cash',
...                        horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(data=forecast_df)
>>> out.collect().tail(5)
           Date                          ACTUAL  PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.4232949999996101392507553    6814.41              NaN              NaN
250  2001-12-28  7111.4166969999996407455205917    6839.33              NaN              NaN
251  2001-12-29                            None    6864.37          -224.08         13952.82
252  2001-12-30                            None    6889.52          -211.26         13990.30
253  2001-12-31                            None    6914.77          -187.18         14016.71
get_model_components()

Returns a dictionary containing the description of the model components, that is trend, cycles and fluctuations, used by the model to generate forecasts.

Returns
A dictionary with 3 possible keys: 'Trend', 'Cycles', 'Fluctuations'. For example:
>>> model.get_model_components()
{
    "Trend": "Linear(TIME)",
     "Cycles": None,
     "Fluctuations": "AR(36)"
}
get_performance_metrics()

Returns a dictionary containing the performance metrics of the model. The metrics are provided for each forecast horizon.

Returns
Dictionary

The dictionary contains the performance metrics of the current model. Each metric is associated to a list containing <horizon> elements. This list contains the values of the metric measured for horizon 1 to <horizon>.
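
The per-horizon lists can be read as follows. This sketch uses the MAPE values from the class-level example in this documentation. The plain arithmetic mean computed at the end is only an illustration of averaging over the horizon and is not guaranteed to match the value returned by get_horizon_wide_metric().

```python
# Shape of a get_performance_metrics() result for horizon=3; the MAPE values
# are copied from the class-level example in this documentation.
performance_metrics = {
    'MAPE': [0.12853715702893018, 0.12789963348617622, 0.12969031859857874],
}

mape_per_horizon = performance_metrics['MAPE']
# Index 0 holds horizon 1, index 1 holds horizon 2, and so on
mape_by_horizon = {h: v for h, v in enumerate(mape_per_horizon, start=1)}
# Plain arithmetic mean over the horizons (an illustration only)
mean_mape = sum(mape_per_horizon) / len(mape_per_horizon)
```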

get_horizon_wide_metric(metric_name='MAPE')

Returns value of performance metric (MAPE, sMAPE, ...) averaged on the forecast horizon.

Parameters
metric_name: str

Default value equals 'MAPE'. Possible values: 'MAPE', 'MPE', 'MeanAbsoluteError', 'RootMeanSquareError', 'SMAPE', 'L1', 'L2', 'P2', 'R2', 'U2'

Returns
float

The metric value averaged on the forecast horizon. It is based on validation partition.

load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Parameters
schema_name: str

The schema name

table_name: str

The table name

oidstr, optional

If the table contains several models, the OID must be given as an identifier. If it is not provided, the whole table is read.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

Returns
A pandas DataFrame with detailed information about the current version.

Error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Reference Guide.

Parameters
report_name: str
Returns
Statistical report: hana_ml DataFrame
get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns
The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs of the last model training.
get_indicators()

Retrieves the Indicator table after model training.

Returns
The reference to the INDICATORS table: hana_ml DataFrame
This table provides the performance metrics of the last model training.
get_model_info()

Gets information about an existing model. This method is especially useful when a trained model was saved and reloaded. After calling this method, the model can provide its summary and metrics again as they were after the last fit.

Returns: list. List of HANA DataFrames corresponding respectively to the following tables:

Summary table, Variable roles table, Variable description table, Indicators table, Profit curves table

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.

Returns
The attribute-values of the model: dictionary
get_predict_operation_log()

Retrieves the operation log table after the last prediction.

Returns
The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs about the last prediction.
get_summary()

Retrieves the summary table after model training.

Returns
The reference to the SUMMARY table: hana_ml DataFrame
This table contains the execution summary of the last model training.
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters
schema_name: str

The schema name

artifact_df: hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )
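
The three if_exists behaviors can be pictured with a small in-memory stand-in. This is a minimal sketch, not the library's implementation: the save_rows helper, the dictionary of tables, and the sample rows are all hypothetical, and the real method writes to SAP HANA tables.

```python
def save_rows(storage, table_name, rows, if_exists="fail"):
    """Mimic the three if_exists behaviors on an in-memory dict of tables."""
    if if_exists not in ("fail", "replace", "append"):
        raise ValueError(f"unsupported if_exists value: {if_exists!r}")
    if table_name not in storage:
        # The table does not exist yet: every mode simply creates it
        storage[table_name] = list(rows)
    elif if_exists == "fail":
        raise ValueError(f"table {table_name!r} already exists")
    elif if_exists == "replace":
        # Drop the previous content before inserting the new rows
        storage[table_name] = list(rows)
    else:
        # append: keep the previous content and add the new rows
        storage[table_name].extend(rows)
    return storage[table_name]


tables = {}
save_rows(tables, "MyModel_Indicators", [("oid-1", "AUC", 0.91)])
save_rows(tables, "MyModel_Indicators", [("oid-2", "AUC", 0.93)], if_exists="append")
appended = list(tables["MyModel_Indicators"])
save_rows(tables, "MyModel_Indicators", [("oid-3", "AUC", 0.95)], if_exists="replace")
replaced = list(tables["MyModel_Indicators"])
```

With 'append', a distinct OID per call (here the hypothetical 'oid-1' and 'oid-2') is what keeps rows from different saves distinguishable in the same table.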
save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Parameters
schema_name: str

The schema name

table_name: str

Table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns
None
The model is saved into a table with the following columns:
  • "OID" NVARCHAR(50), -- Serve as ID

  • "FORMAT" NVARCHAR(50), -- APL technical info

  • "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model

hana_ml.algorithms.apl.gradient_boosting_classification

This module provides the SAP HANA APL gradient boosting classification algorithm.

The following classes are available:

class hana_ml.algorithms.apl.gradient_boosting_classification.GradientBoostingClassifier(conn_context=None, early_stopping_patience=None, eval_metric=None, learning_rate=None, max_depth=None, max_iterations=None, number_of_jobs=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: hana_ml.algorithms.apl.gradient_boosting_classification._GradientBoostingClassifierBase

SAP HANA APL Gradient Boosting Multiclass Classifier algorithm.

Parameters
conn_context: ConnectionContext, optional

The connection object to an SAP HANA database. This parameter is not needed anymore. It will be set automatically when a dataset is used in fit() or predict().

early_stopping_patience: int, optional

If the performance does not improve after early_stopping_patience iterations, the model training will stop before reaching max_iterations. Please refer to APL documentation for default value.

eval_metric: str, optional

The name of the metric used to evaluate the model performance on the validation dataset along the boosting iterations. The possible values are 'MultiClassClassificationError' and 'MultiClassLogLoss'. Please refer to APL documentation for default value.

learning_rate: float, optional

The weight parameter controlling the model regularization to avoid overfitting risk. A small value improves the model generalization to unseen dataset at the expense of the computational cost. Please refer to APL documentation for default value.

max_depth: int, optional

The maximum depth of the decision trees added as a base learner to the model at each boosting iteration. Please refer to APL documentation for default value.

max_iterations: int, optional

The maximum number of boosting iterations to fit the model. The default value is 1000.

number_of_jobs: int, optional

The number of threads allocated to the model training and apply parallelization. Please refer to APL documentation for default value.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value type (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals to '???', it will be taken as missing.

extra_applyout_settings: dict, optional

Determines the output of the predict() method. The possible values are:

  • By default (None value): the default output.
    • <KEY>: the key column if provided in the dataset

    • TRUE_LABEL: the class label if provided in the dataset

    • PREDICTED: the predicted label

    • PROBABILITY: the probability of the prediction (confidence)

  • {'APL/ApplyExtraMode': 'AllProbabilities'}: the probabilities for each class.
    • <KEY>: the key column if provided in the dataset

    • TRUE_LABEL: the class label if provided in the dataset

    • PREDICTED: the predicted label

    • PROBA_<label_value1>: the probability for the class <label_value1>

    • ...

    • PROBA_<label_valueN>: the probability for the class <label_valueN>

  • {'APL/ApplyExtraMode': 'Individual Contributions'}: the feature importance for every sample.
    • <KEY>: the key column if provided in the dataset

    • TRUE_LABEL: the class label if provided in the dataset

    • PREDICTED: the predicted label

    • gb_contrib_<VAR1>: the contribution of the variable VAR1 to the score

    • ...

    • gb_contrib_<VARN>: the contribution of the variable VARN to the score

    • gb_contrib_constant_bias: the constant bias contribution to the score

other_params: dict, optional

Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • 'cutting_strategy'

  • 'interactions'

  • 'interactions_max_kept'

  • 'variable_selection_max_nb_of_final_variables'

See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.

other_train_apl_aliases: dict, optional

Users can provide APL aliases as advanced settings to the model. Users are free to input any possible value. Please see Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.

Notes

It is highly recommended to specify a key column in the training dataset. If not, once the model is trained, it won't be possible anymore to have a key defined in any input dataset. The key is particularly useful to join the predictions output to the input dataset.

If no variable description is provided, SAP HANA APL guesses it by reading the first 100 rows, so the result may be incorrect. The user can override the guessed description by explicitly setting the variable_storages, variable_value_types and variable_missing_strings parameters. For example:

model.set_params(
        variable_storages = {
            'ID': 'integer',
            'sepal length (cm)': 'number'
            })
model.set_params(
        variable_value_types = {
            'sepal length (cm)': 'continuous'
            })
model.set_params(
        variable_missing_strings = {
            'sepal length (cm)': '-1'
            })

Examples

>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingClassifier
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN,
...                     'SELECT "id", "class", "capital-gain", '
...                     '"native-country" from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = GradientBoostingClassifier()
>>> model.fit(hana_df, label='native-country', key='id')

Getting variable interactions

>>> model.set_params(other_train_apl_aliases={
...     'APL/Interactions': 'true',
...     'APL/InteractionsMaxKept': '3'
... })
>>> model.fit(hana_df, label='native-country', key='id')
>>> # Checks interaction info in INDICATORS table
>>> output = model.get_indicators().filter("KEY LIKE 'Interaction%'").collect()

Debriefing

>>> # Global performance metrics of the model
>>> model.get_performance_metrics()
{'BalancedErrorRate': 0.9761904761904762, 'BalancedClassificationRate': 0.023809523809523808,
...
>>> # Performance metrics of the model for each class
>>> model.get_metrics_per_class()
{'Precision': {'Cambodia': 0.0, 'Canada': 0.0, 'China': 0.0, 'Columbia': 0.0...
>>> model.get_feature_importances()
{'Gain': OrderedDict([('class', 0.7713800668716431), ('capital-gain', 0.22861991822719574)])}

Making predictions

>>> # Default output
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
    id     TRUE_LABEL      PREDICTED  PROBABILITY
0   30  United-States  United-States     0.89051
1   63  United-States  United-States     0.89051
2   66  United-States  United-States     0.89051
>>> # All probabilities
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'AllProbabilities'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
          id     TRUE_LABEL      PREDICTED      PROBA_?     PROBA_Cambodia  ...
35194  19272  United-States  United-States    0.016803            0.000595  ...
20186  39624  United-States  United-States    0.017564            0.001063  ...
43892  38759  United-States  United-States    0.019812            0.000353  ...
>>> # Individual Contributions
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
   id     TRUE_LABEL      PREDICTED  gb_contrib_class  gb_contrib_capital-gain  ...
0  30  United-States  United-States         -0.025366                -0.014416  ...
1  63  United-States  United-States         -0.025366                -0.014416  ...
2  66  United-States  United-States         -0.025366                -0.014416  ...

Saving the model in the schema named 'MODEL_STORAGE'

>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model, if_exists='replace')

Reloading the model for new predictions

>>> model2 = model_storage.load_model(name='My model name')
>>> out2 = model2.predict(data=hana_df)

Please see model_storage class for further features of model storage.

Attributes
label: str

The target column name. This attribute is set when the fit() method is called.

model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the "SUMMARY" table generated by the model training. This table contains the content of the model training summary.

indicators_: APLArtifactTable

The reference to the "INDICATORS" table generated by the model training. This table contains various metrics related to the model and model variables.

fit_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table generated by the model training

var_desc_: APLArtifactTable

The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table when a prediction was made

Methods

build_report()

Build model report.

fit(data[, key, features, label, weight, ...])

Fits the model.

generate_html_report(filename[, metric_sampling])

Saves the model report as an HTML file.

generate_notebook_iframe_report([...])

Render model report as a notebook iframe.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Reference Guide.

get_evalmetrics()

Returns the values of the evaluation metric at each iteration.

get_feature_importances()

Returns the feature importances.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_metrics_per_class()

Returns the performance for each class.

get_model_info()

Get information about an existing model.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns the performance metrics of the last trained model.

get_predict_operation_log()

Retrieves the operation log table after the last prediction.

get_summary()

Retrieves the summary table after model training.

is_fitted()

Checks if the model can be saved.

load_model(schema_name, table_name[, oid])

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

predict(data)

Makes predictions with the fitted model.

save_artifact(artifact_df, schema_name, ...)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, ...])

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

score(data)

Returns the mean accuracy on the provided test dataset.

set_metric_samplings([roc_sampling, ...])

Set metric samplings to report builder.

set_params(**parameters)

Sets attributes of the current model.

set_params(**parameters)

Sets attributes of the current model.

Parameters
parameters: dict

The names and values of the attributes to change

get_metrics_per_class()

Returns the performance for each class.

Returns
A dictionary.
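
The class example shows the returned dictionary keyed first by metric name and then by class label. As an illustration of that shape only, the sketch below computes a per-class precision from true and predicted labels. The helper name and the sample labels are hypothetical, and the exact set of metrics and their definitions in SAP HANA APL are not assumed here.

```python
from collections import Counter


def precision_per_class(true_labels, predicted_labels):
    """Precision per class: correct predictions of a class / all predictions of it."""
    predicted_counts = Counter(predicted_labels)
    correct_counts = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    classes = sorted(set(true_labels) | set(predicted_labels))
    return {
        "Precision": {
            # A class that is never predicted gets 0.0 by convention in this sketch
            c: (correct_counts[c] / predicted_counts[c]) if predicted_counts[c] else 0.0
            for c in classes
        }
    }


true_labels = ["Canada", "China", "United-States", "United-States"]
predicted = ["United-States", "United-States", "United-States", "United-States"]
metrics = precision_per_class(true_labels, predicted)
```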
build_report()

Build model report.

set_metric_samplings(roc_sampling=None, other_samplings: Optional[dict] = None)

Set metric samplings to report builder.

Parameters
roc_sampling: Sampling, optional

ROC sampling.

other_samplings: dict, optional

Each key is a column name of the metric table:

  • CUMGAINS

  • RANDOM_CUMGAINS

  • PERF_CUMGAINS

  • LIFT

  • RANDOM_LIFT

  • PERF_LIFT

  • CUMLIFT

  • RANDOM_CUMLIFT

  • PERF_CUMLIFT

Each value is a Sampling object.

Examples

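Creating the metric samplings:

>>> from hana_ml.algorithms.pal.preprocessing import Sampling
>>> roc_sampling = Sampling(method='every_nth', interval=2)
>>> other_samplings = dict(CUMGAINS=Sampling(method='every_nth', interval=2),
...                        LIFT=Sampling(method='every_nth', interval=2),
...                        CUMLIFT=Sampling(method='every_nth', interval=2))
>>> model.set_metric_samplings(roc_sampling, other_samplings)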
fit(data, key=None, features=None, label=None, weight=None, build_report=False)

Fits the model.

Parameters
data: DataFrame

The training dataset

key: str, optional

The name of the ID column. This column will not be used as a feature in the model. It will be output as row-id when prediction is made with the model. If key is not provided, an internal key is created. But this is not the recommended usage. See notes below.

features: list of str, optional

The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.

label: str, optional

The name of the label column. Default is the last column.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.

build_report: bool, optional

Whether to build report or not. Defaults to False.

Returns
self: object

Notes

It is highly recommended to specify a key column in the training dataset. If not, once the model is trained, it won't be possible anymore to have a key defined in any input dataset. That is particularly inconvenient to join the predictions output to the input dataset.
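
To show why a key matters, here is a minimal sketch of joining prediction rows back onto input rows through a shared key column. It works on plain Python dictionaries rather than hana_ml DataFrames; the join_on_key helper and the sample rows are hypothetical.

```python
def join_on_key(input_rows, prediction_rows, key="id"):
    """Inner-join predictions onto input rows using the shared key column."""
    predictions_by_key = {row[key]: row for row in prediction_rows}
    joined = []
    for row in input_rows:
        match = predictions_by_key.get(row[key])
        if match is not None:
            # Merge the input columns with the prediction columns
            joined.append({**row, **match})
    return joined


input_rows = [{"id": 30, "capital-gain": 0}, {"id": 63, "capital-gain": 5178}]
prediction_rows = [
    {"id": 63, "PREDICTED": "United-States"},
    {"id": 30, "PREDICTED": "United-States"},
]
joined = join_on_key(input_rows, prediction_rows)
```

Without a key column, nothing guarantees that the rows of the prediction output can be matched back to the rows of the input.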

generate_html_report(filename, metric_sampling=False)

Saves the model report as an HTML file.

Parameters
filename: str

HTML file name.

metric_sampling: bool, optional

Whether the metric table needs to be sampled.

generate_notebook_iframe_report(metric_sampling=False)

Render model report as a notebook iframe.

Parameters
metric_sampling: bool, optional

Whether the metric table needs to be sampled.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

Returns
A pandas DataFrame with detailed information about the current version.

An error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

Returns
The best iteration: int
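
The interplay between early_stopping_patience, max_iterations and the best iteration can be sketched in plain Python. This is a generic illustration of patience-based early stopping, not SAP HANA APL's actual training loop: it assumes a metric where lower is better (such as a log loss), and the function name and loss values are hypothetical.

```python
def best_iteration_with_patience(validation_losses, patience):
    """Return (best_iteration, stopped_at), both 1-based.

    Training stops once `patience` consecutive iterations fail to improve
    on the best (lowest) validation loss seen so far.
    """
    best_loss = float("inf")
    best_iter = 0
    for iteration, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            best_loss, best_iter = loss, iteration
        elif iteration - best_iter >= patience:
            return best_iter, iteration
    # No early stop: all iterations were run
    return best_iter, len(validation_losses)


losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.39]
best_iter, stopped_at = best_iteration_with_patience(losses, patience=3)
```

In this run, training stops at iteration 6, so the lower loss at iteration 7 is never reached. A larger patience trades training time for the chance of finding such later improvements.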
get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Reference Guide.

Parameters
report_name: str
Returns
Statistical report: hana_ml DataFrame
get_evalmetrics()

Returns the values of the evaluation metric at each iteration. These values are based on the estimation dataset.

Returns
A dictionary:

{'<MetricName>': <List of values>}

get_feature_importances()

Returns the feature importances.

Returns
feature importances: dict

{<importance_metric>: OrderedDict({<feature_name>: <value>})}
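
A returned structure of this shape can be ranked with standard Python tools. The sketch below reuses the values printed in the class example; the top_features helper is hypothetical and not part of hana_ml.

```python
from collections import OrderedDict

# Values copied from the class example; the structure mirrors the documented return value
importances = {
    "Gain": OrderedDict([
        ("class", 0.7713800668716431),
        ("capital-gain", 0.22861991822719574),
    ])
}


def top_features(feature_importances, metric="Gain", n=5):
    """Return the n most important (feature, value) pairs, highest first."""
    ranked = sorted(
        feature_importances[metric].items(),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:n]


top = top_features(importances, n=1)
total = sum(importances["Gain"].values())
```

In this example the 'Gain' values sum to approximately 1. Whether that holds for every importance metric is not stated in this document.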

get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns
The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs of the last model training.
get_indicators()

Retrieves the Indicator table after model training.

Returns
The reference to the INDICATORS table: hana_ml DataFrame
This table provides the performance metrics of the last model training.
get_model_info()

Gets information about an existing model. This method is especially useful when a trained model was saved and reloaded. After calling this method, the model can provide its summary and metrics again as they were after the last fit.

Returns: list. List of HANA DataFrames corresponding respectively to the following tables:

Summary table, Variable roles table, Variable description table, Indicators table, Profit curves table

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.

Returns
The attribute-values of the model: dictionary
get_performance_metrics()

Returns the performance metrics of the last trained model.

Returns
A dictionary with metric name as key and metric value as value.
get_predict_operation_log()

Retrieves the operation log table after the last prediction.

Returns
The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs about the last prediction.
get_summary()

Retrieves the summary table after model training.

Returns
The reference to the SUMMARY table: hana_ml DataFrame
This table contains the execution summary of the last model training.
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(schema_name, table_name, oid=None)

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Loads the model from a table.

Parameters
schema_name: str

The schema name

table_name: str

The table name

oid: str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.

predict(data)

Makes predictions with the fitted model. It is possible to add special outputs, such as variable individual contributions, through the 'extra_applyout_settings' parameter in the model. This parameter is described with examples in the class section.

Parameters
data: hana_ml DataFrame

The input dataset used for prediction

Returns
Prediction output: hana_ml DataFrame
The default output is (if the model 'extra_applyout_settings' parameter is unset):
  • ID: the key column

  • TRUE_LABEL: the true label if it is given in the input dataset

  • PREDICTED: the predicted label

  • PROBABILITY: the probability of the predicted label

In multinomial classification, users can request the probabilities of all classes by
setting the parameter 'extra_applyout_settings' to
{'APL/ApplyExtraMode': 'AllProbabilities'}.
The output will be:
  • ID: the key column

  • TRUE_LABEL: the true label if it is given in the input dataset

  • PREDICTED: the predicted label

  • PROBA_<class_1>: the probability of the class <class_1>

  • ...

  • PROBA_<class_n>: the probability of the class <class_n>

To get the individual contributions of each variable for each individual sample,
the 'extra_applyout_settings' parameter must be set to
{'APL/ApplyExtraMode': 'Individual Contributions'}.
The output will contain the following columns:
  • ID: the key column

  • TRUE_LABEL: the actual label

  • PREDICTED: the predicted label

  • gb_contrib_<VAR1>: the contribution of the variable <VAR1> to the score

  • ...

  • gb_contrib_<VARN>: the contribution of the variable <VARN> to the score

  • gb_contrib_constant_bias: the constant bias contribution to the score

Users can also set APL/ApplyExtraMode to other values, for instance:
extra_applyout_settings = {'APL/ApplyExtraMode': 'BestProbabilityAndDecision'}.
New SAP HANA APL settings may be provided over time, so please check the SAP HANA APL
documentation to know which settings are available:
See Function Reference > Predictive Model Services > APPLY_MODEL > OPERATION_CONFIG
Parameters in the SAP HANA APL Reference Guide.
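
Once an 'AllProbabilities' output has been collected, its PROBA_<class> columns can be post-processed on the client side. The following sketch picks the most probable class from one row represented as a dictionary, for instance one record of applyout_df.collect().to_dict("records"). The helper name and the sample row are hypothetical, and it is an assumption, not a documented guarantee, that PREDICTED always matches the class with the highest probability.

```python
def most_probable_class(row, prefix="PROBA_"):
    """Return (label, probability) for the highest PROBA_<label> entry in a row."""
    probabilities = {
        column[len(prefix):]: value
        for column, value in row.items()
        if column.startswith(prefix)
    }
    if not probabilities:
        raise ValueError("row contains no PROBA_<label> columns")
    label = max(probabilities, key=probabilities.get)
    return label, probabilities[label]


# Hypothetical row shaped like one record of the AllProbabilities output
row = {
    "id": 30,
    "TRUE_LABEL": "United-States",
    "PROBA_Canada": 0.05,
    "PROBA_Mexico": 0.06,
    "PROBA_United-States": 0.89,
}
label, probability = most_probable_class(row)
```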
save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters
schema_name: str

The schema name

artifact_df: hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )
save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Parameters
schema_name: str

The schema name

table_name: str

Table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns
None
The model is saved into a table with the following columns:
  • "OID" NVARCHAR(50), -- Serve as ID

  • "FORMAT" NVARCHAR(50), -- APL technical info

  • "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model

score(data)

Returns the mean accuracy on the provided test dataset.

Parameters
data: hana_ml DataFrame

The test dataset used to compute the score. The labels must be provided in the dataset.

Returns
mean average accuracy: float
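
As an illustration of what a mean accuracy measures, the sketch below computes the share of rows whose PREDICTED value equals TRUE_LABEL. It is a plain-Python approximation on hypothetical rows shaped like the default predict() output, not the library's own computation.

```python
def mean_accuracy(rows):
    """Fraction of rows whose PREDICTED value equals TRUE_LABEL."""
    if not rows:
        raise ValueError("cannot compute accuracy on an empty dataset")
    hits = sum(1 for row in rows if row["PREDICTED"] == row["TRUE_LABEL"])
    return hits / len(rows)


# Hypothetical rows shaped like the default predict() output
rows = [
    {"id": 41211, "TRUE_LABEL": 0, "PREDICTED": 0},
    {"id": 36020, "TRUE_LABEL": 1, "PREDICTED": 1},
    {"id": 6601, "TRUE_LABEL": 0, "PREDICTED": 1},
]
accuracy = mean_accuracy(rows)
```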
class hana_ml.algorithms.apl.gradient_boosting_classification.GradientBoostingBinaryClassifier(conn_context=None, early_stopping_patience=None, eval_metric=None, learning_rate=None, max_depth=None, max_iterations=None, number_of_jobs=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: hana_ml.algorithms.apl.gradient_boosting_classification._GradientBoostingClassifierBase

SAP HANA APL Gradient Boosting Binary Classifier algorithm. It is very similar to GradientBoostingClassifier, the multiclass classifier. Its particularity lies in the provided metrics which are specific to binary classification.

Parameters
conn_context: ConnectionContext, optional

The connection object to an SAP HANA database. This parameter is not needed anymore. It will be set automatically when a dataset is used in fit() or predict().

early_stopping_patience: int, optional

If the performance does not improve after early_stopping_patience iterations, the model training will stop before reaching max_iterations. Please refer to APL documentation for default value.

eval_metric: str, optional

The name of the metric used to evaluate the model performance on the validation dataset along the boosting iterations. The possible values are 'LogLoss', 'AUC' and 'ClassificationError'. Please refer to APL documentation for default value.

learning_rate: float, optional

The weight parameter controlling the model regularization to avoid overfitting risk. A small value improves the model generalization to unseen dataset at the expense of the computational cost. Please refer to APL documentation for default value.

max_depth: int, optional

The maximum depth of the decision trees added as a base learner to the model at each boosting iteration. The default value is 4.

max_iterations: int, optional

The maximum number of boosting iterations to fit the model. Please refer to APL documentation for default value.

number_of_jobs: int, optional

The number of threads allocated to the model training and apply parallelization. Please refer to APL documentation for default value.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value type (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals to '???', it will be taken as missing.

extra_applyout_settings: dict, optional

Determines the output of the predict() method. The possible values are:

  • By default (None value): the default output.
    • <KEY>: the key column if provided in the dataset

    • TRUE_LABEL: the class label if provided in the dataset

    • PREDICTED: the predicted label

    • PROBABILITY: the probability of the prediction (confidence)

  • {'APL/ApplyExtraMode': 'Individual Contributions'}: the individual contributions of each variable to the score. The output is:
    • <KEY>: the key column if provided in the dataset

    • TRUE_LABEL: the class label if provided in the dataset

    • gb_contrib_<VAR1>: the contribution of the variable VAR1 to the score

    • ...

    • gb_contrib_<VARN>: the contribution of the variable VARN to the score

    • gb_contrib_constant_bias: the constant bias contribution to the score

other_params: dict, optional

Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • 'correlations_lower_bound'

  • 'correlations_max_kept'

  • 'cutting_strategy'

  • 'target_key'

  • 'interactions'

  • 'interactions_max_kept'

  • 'variable_selection_max_nb_of_final_variables'

See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.

other_train_apl_aliases: dict, optional

Contains the APL alias for model training. The list of possible aliases depends on the APL version. Please refer to HANA APL documentation about aliases.

Examples

>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext(HDB_HOST, HDB_PORT, HDB_USER, HDB_PASS)
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN, 'SELECT * from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(hana_df, label='class', key='id')

Getting variable interactions

>>> model.set_params(other_train_apl_aliases={
...     'APL/Interactions': 'true',
...     'APL/InteractionsMaxKept': '3'
... })
>>> model.fit(hana_df, label='class', key='id')
>>> # Checks interaction info in INDICATORS table
>>> output = model.get_indicators().filter("KEY LIKE 'Interaction%'").collect()

Debriefing

>>> # Global performance metrics of the model
>>> model.get_performance_metrics()
{'LogLoss': 0.2567069689038737, 'PredictivePower': 0.8529, 'PredictionConfidence': 0.9759,
...}
>>> model.get_feature_importances()
{'Gain': OrderedDict([('relationship', 0.3866586685180664),
                      ('education-num', 0.1502334326505661)...

Making predictions

>>> # Default output
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().sample(3) # returns the output as a pandas DataFrame
          id  TRUE_LABEL  PREDICTED  PROBABILITY
44903  41211           0          0    0.871326
47878  36020           1          1    0.993455
17549   6601           0          1    0.673872
>>> # Individual Contributions
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().sample(3) # returns the output as a pandas DataFrame
      id  TRUE_LABEL  gb_contrib_age  gb_contrib_workclass  gb_contrib_fnlwgt  ...
0  18448           0       -1.098452             -0.001238           0.060850  ...
1  18457           0       -0.731512             -0.000448           0.020060  ...
2  18540           0       -0.024523              0.027065           0.158083  ...

Saving the model in the schema named 'MODEL_STORAGE'. Please see the model_storage class for further features of model storage.

>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model, if_exists='replace')

Reloading the model for new predictions

>>> model2 = model_storage.load_model(name='My model name')
>>> out2 = model2.predict(data=hana_df)

Attributes
label: str

The target column name. This attribute is set when the fit() method is called.

model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the "SUMMARY" table generated by the model training. This table contains the content of the model training summary.

indicators_: APLArtifactTable

The reference to the "INDICATORS" table generated by the model training. This table contains various metrics related to the model and model variables.

fit_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table generated by the model training

var_desc_: APLArtifactTable

The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table when a prediction was made

Methods

build_report()

Build model report.

fit(data[, key, features, label, weight, ...])

Fits the model.

generate_html_report(filename[, metric_sampling])

Save model report as a html file.

generate_notebook_iframe_report([...])

Render model report as a notebook iframe.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Reference Guide.

get_evalmetrics()

Returns the values of the evaluation metric at each iteration.

get_feature_importances()

Returns the feature importances.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_model_info()

Get information about an existing model.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns the performance metrics of the last trained model.

get_predict_operation_log()

Retrieves the operation log table after a prediction.

get_summary()

Retrieves the summary table after model training.

is_fitted()

Checks if the model can be saved.

load_model(schema_name, table_name[, oid])

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

predict(data)

Makes predictions with the fitted model.

save_artifact(artifact_df, schema_name, ...)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, ...])

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

score(data)

Returns the mean accuracy on the provided test dataset.

set_metric_samplings([roc_sampling, ...])

Set metric samplings to report builder.

set_params(**parameters)

Sets attributes of the current model.

set_params(**parameters)

Sets attributes of the current model.

Parameters
parameters: dict

The attribute names and values

get_performance_metrics()

Returns the performance metrics of the last trained model.

Returns
A dictionary with metric name as key and metric value as value.
build_report()

Build model report.

fit(data, key=None, features=None, label=None, weight=None, build_report=False)

Fits the model.

Parameters
data: DataFrame

The training dataset

key: str, optional

The name of the ID column. This column will not be used as feature in the model. It will be output as row-id when prediction is made with the model. If key is not provided, an internal key is created. But this is not the recommended usage. See notes below.

features: list of str, optional

The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.

label: str, optional

The name of the label column. Default is the last column.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.

build_report: bool, optional

Whether to build report or not. Defaults to False.

Returns
self: object

Notes

It is highly recommended to specify a key column in the training dataset. If not, once the model is trained, it won't be possible anymore to have a key defined in any input dataset. That is particularly inconvenient to join the predictions output to the input dataset.
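
To illustrate why the key matters, here is a minimal sketch in plain Python with no SAP HANA connection involved. The rows and the `join_on_key` helper are hypothetical; they only mimic an input dataset and a prediction output that share an `id` column, which is what makes the join possible.

```python
# Hypothetical input rows and prediction output, both carrying the key column 'id'.
input_rows = [
    {"id": 1, "age": 39},
    {"id": 2, "age": 50},
    {"id": 3, "age": 38},
]
applyout_rows = [
    {"id": 3, "PREDICTED": 0},
    {"id": 1, "PREDICTED": 1},
    {"id": 2, "PREDICTED": 0},
]


def join_on_key(left, right, key):
    """Inner-join two lists of dicts on a shared key column."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


joined = join_on_key(input_rows, applyout_rows, key="id")
```

Without a shared key column, the two row sets could only be matched by position, which is not reliable when the output order differs from the input order.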

generate_html_report(filename, metric_sampling=False)

Save model report as a html file.

Parameters
filename: str

HTML file name.

metric_sampling: bool, optional

Whether the metric table needs to be sampled.

generate_notebook_iframe_report(metric_sampling=False)

Render model report as a notebook iframe.

Parameters
metric_sampling: bool, optional

Whether the metric table needs to be sampled.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

A pandas DataFrame with detailed information about the current version.

Error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

Returns
The best iteration: int
get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Reference Guide.

Parameters
report_name: str
Returns
Statistical report: hana_ml DataFrame
get_evalmetrics()

Returns the values of the evaluation metric at each iteration. These values are based on the estimation dataset.

Returns
A dictionary:

{'<MetricName>': <List of values>}
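
As a sketch of how such a dictionary might be inspected, the snippet below locates the iteration with the lowest metric value. The metric values are hypothetical, the helper name is not part of hana_ml, and counting iterations from 1 is an assumption.

```python
# Hypothetical return value shaped like {'<MetricName>': <List of values>}.
evalmetrics = {"LogLoss": [0.62, 0.48, 0.41, 0.39, 0.40]}


def lowest_value_iteration(metrics, metric_name):
    """Return (iteration, value) for the smallest value.

    Iterations are counted from 1 here, which is an assumption.
    """
    values = metrics[metric_name]
    best_index = min(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


iteration, value = lowest_value_iteration(evalmetrics, "LogLoss")
```

This only applies to metrics where lower is better; for a metric where higher is better, `max` would be used instead.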

get_feature_importances()

Returns the feature importances.

Returns
feature importances: dict

{<importance_metric>: OrderedDict({<feature_name>: <value>})}
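
The nested structure can be walked with ordinary dictionary operations. The following is a small sketch on a hand-written literal; the two 'Gain' values are taken from the class example above and rounded, and `top_features` is a hypothetical helper, not a hana_ml function.

```python
from collections import OrderedDict

# Literal shaped like the documented return value (values rounded for readability).
importances = {
    "Gain": OrderedDict([
        ("relationship", 0.3867),
        ("education-num", 0.1502),
    ])
}


def top_features(importances, metric, n):
    """Return the n (feature, value) pairs with the largest importance."""
    ranked = sorted(importances[metric].items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


top = top_features(importances, "Gain", 1)
```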

get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns
The reference to OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs of the last model training
get_indicators()

Retrieves the Indicator table after model training.

Returns
The reference to INDICATORS table: hana_ml DataFrame
This table provides the performance metrics of the last model training
get_model_info()

Get information about an existing model. This method is especially useful when a trained model was saved and reloaded. After having called this method, the model can provide summary and metrics again as there were in the last fit.

Returns: list. List of HANA DataFrames respectively corresponding to the following tables:

Summary table, Variable roles table, Variable description table, Indicators_table, Profit Curves_table

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.

Returns
The attribute-values of the model: dictionary
get_predict_operation_log()

Retrieves the operation log table after a prediction.

Returns
The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs about the last prediction
get_summary()

Retrieves the summary table after model training.

Returns
The reference to the SUMMARY table: hana_ml DataFrame
This contains execution summary of the last model training
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(schema_name, table_name, oid=None)

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Loads the model from a table.

Parameters
schema_name: str

The schema name

table_name: str

The table name

oid: str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.
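
The effect of oid can be pictured as a row filter on the OID column. The sketch below is plain Python on hypothetical rows; it is not the library's implementation, and the 'FORMAT' and 'LOB' values are placeholders.

```python
# Hypothetical content of a model table holding two models.
model_table = [
    {"OID": "model_a", "FORMAT": "bin", "LOB": "..."},
    {"OID": "model_b", "FORMAT": "bin", "LOB": "..."},
]


def select_model_rows(rows, oid=None):
    """Return only the rows of the given OID, or every row when oid is None."""
    if oid is None:
        return list(rows)
    return [row for row in rows if row["OID"] == oid]


one_model = select_model_rows(model_table, oid="model_b")
whole_table = select_model_rows(model_table)
```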

predict(data)

Makes predictions with the fitted model. It is possible to add special outputs, such as variable individual contributions, through the 'extra_applyout_settings' parameter in the model. This parameter is described with examples in the class section.

Parameters
data: hana_ml DataFrame

The input dataset used for prediction

Returns
Prediction output: hana_ml DataFrame
The default output is (if the model 'extra_applyout_settings' parameter is unset):
  • ID: the key column

  • TRUE_LABEL: the true label if it is given in the input dataset

  • PREDICTED: the predicted label

  • PROBABILITY: the probability of the predicted label

In multinomial classification, users can request the probabilities of all classes by
setting the parameter 'extra_applyout_settings' to
{'APL/ApplyExtraMode': 'AllProbabilities'}.
The output will be:
  • ID: the key column

  • TRUE_LABEL: the true label if it is given in the input dataset

  • PREDICTED: the predicted label

  • PROBA_<class_1>: the probability of the class <class_1>

  • ...

  • PROBA_<class_n>: the probability of the class <class_n>

To get the individual contributions of each variable for each individual sample,
the 'extra_applyout_settings' parameter must be set to
{'APL/ApplyExtraMode': 'Individual Contributions'}.
The output will contain the following columns:
  • ID: key column,

  • TRUE_LABEL: the actual label

  • PREDICTED: the predicted label

  • gb_contrib_<VAR1>: the contribution of the variable <VAR1> to the score

  • ...

  • gb_contrib_<VARN>: the contribution of the variable <VARN> to the score

  • gb_contrib_constant_bias: the constant bias contribution to the score

Users can also set APL/ApplyExtraMode with other values, for instance:
'extra_applyout_settings' = {'APL/ApplyExtraMode': 'BestProbabilityAndDecision'}.
New SAP HANA APL settings may be provided over time, so please check the SAP HANA APL
documentation to know which settings are available:
See Function Reference > Predictive Model Services > APPLY_MODEL > OPERATION_CONFIG
Parameters in the SAP HANA APL Reference Guide.
save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters
schema_name: str

The schema name

artifact_df: hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str. Optional.

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )
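
The three if_exists modes described above can be mimicked on an in-memory dictionary. This is only an illustration of the documented semantics; `save_rows` and the sample rows are hypothetical and nothing here talks to SAP HANA.

```python
def save_rows(tables, table_name, rows, if_exists="fail"):
    """Mimic the documented 'fail' / 'replace' / 'append' behavior on a dict."""
    if table_name in tables:
        if if_exists == "fail":
            raise ValueError(f"Table {table_name!r} already exists")
        if if_exists == "replace":
            tables[table_name] = list(rows)  # drop, then insert new values
        elif if_exists == "append":
            tables[table_name].extend(rows)  # keep existing rows, add new ones
        else:
            raise ValueError(f"Unknown if_exists value: {if_exists!r}")
    else:
        tables[table_name] = list(rows)


tables = {}
save_rows(tables, "MyModel_Indicators", [{"KEY": "AUC"}])
save_rows(tables, "MyModel_Indicators", [{"KEY": "LogLoss"}], if_exists="append")
appended_count = len(tables["MyModel_Indicators"])
save_rows(tables, "MyModel_Indicators", [{"KEY": "Gain"}], if_exists="replace")
replaced_count = len(tables["MyModel_Indicators"])
```

With the default 'fail', a second call on the same table name raises an error instead of silently overwriting earlier results.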
save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Parameters
schema_name: str

The schema name

table_name: str

Table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str. Optional.

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns
None
The model is saved into a table with the following columns:
  • "OID" NVARCHAR(50), -- Serve as ID

  • "FORMAT" NVARCHAR(50), -- APL technical info

  • "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model

score(data)

Returns the mean accuracy on the provided test dataset.

Parameters
data: hana_ml DataFrame

The test dataset used to compute the score. The labels must be provided in the dataset.

Returns
mean accuracy: float
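
For reference, mean accuracy is the fraction of rows whose predicted label matches the true label. The sketch below computes it in plain Python on the three rows shown in the default prediction output of the class example; `mean_accuracy` is a hypothetical helper, and whether the library computes the score exactly this way is an assumption.

```python
def mean_accuracy(true_labels, predicted_labels):
    """Fraction of rows whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both sequences must have the same length")
    if not true_labels:
        raise ValueError("At least one row is required")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# TRUE_LABEL and PREDICTED columns of the three sample rows in the class example.
accuracy = mean_accuracy([0, 1, 0], [0, 1, 1])
```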
set_metric_samplings(roc_sampling: Optional[hana_ml.algorithms.pal.preprocessing.Sampling] = None, other_samplings: Optional[dict] = None)

Set metric samplings to report builder.

Parameters
roc_sampling: Sampling, optional

ROC sampling.

other_samplings: dict, optional

Key is column name of metric table.

  • CUMGAINS

  • RANDOM_CUMGAINS

  • PERF_CUMGAINS

  • LIFT

  • RANDOM_LIFT

  • PERF_LIFT

  • CUMLIFT

  • RANDOM_CUMLIFT

  • PERF_CUMLIFT

Value is sampling.

Examples

Creating the metric samplings:

>>> from hana_ml.algorithms.pal.preprocessing import Sampling
>>> roc_sampling = Sampling(method='every_nth', interval=2)
>>> other_samplings = dict(CUMGAINS=Sampling(method='every_nth', interval=2),
...                        LIFT=Sampling(method='every_nth', interval=2),
...                        CUMLIFT=Sampling(method='every_nth', interval=2))

hana_ml.algorithms.apl.gradient_boosting_regression

This module provides the SAP HANA APL gradient boosting regression algorithm.

The following classes are available:

class hana_ml.algorithms.apl.gradient_boosting_regression.GradientBoostingRegressor(conn_context=None, early_stopping_patience=None, eval_metric=None, learning_rate=None, max_depth=None, max_iterations=None, number_of_jobs=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: hana_ml.algorithms.apl.gradient_boosting_base.GradientBoostingBase, hana_ml.visualizers.model_report._UnifiedRegressionReportBuilder

SAP HANA APL Gradient Boosting Regression algorithm.

Parameters
conn_context: ConnectionContext, optional

The connection object to an SAP HANA database. This parameter is not needed anymore. It will be set automatically when a dataset is used in fit() or predict().

early_stopping_patience: int, optional

If the performance does not improve after early_stopping_patience iterations, the model training will stop before reaching max_iterations. Please refer to APL documentation for default value.
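
The general idea of a patience-based stop can be sketched as follows. This is a generic illustration of the technique, not SAP HANA APL's implementation; the helper name and the validation errors are hypothetical, and lower scores are assumed to be better.

```python
def stopping_iteration(validation_scores, patience, max_iterations):
    """Return the 1-based iteration at which training would stop.

    Lower scores are treated as better (assumption, as for an error metric).
    """
    best = float("inf")
    since_improvement = 0
    for iteration, score in enumerate(validation_scores[:max_iterations], start=1):
        if score < best:
            best = score
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= patience:
                return iteration
    return min(len(validation_scores), max_iterations)


# Hypothetical validation errors: no improvement after the third iteration.
stop_at = stopping_iteration(
    [9.8, 9.5, 9.4, 9.45, 9.6, 9.41, 9.7], patience=3, max_iterations=100
)
```

In this run the best score is reached at iteration 3, so training stops three iterations later instead of running up to max_iterations.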

eval_metric: str, optional

The name of the metric used to evaluate the model performance on validation dataset along the boosting iterations. The possible values are 'MAE' and 'RMSE'. Please refer to APL documentation for default value.
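
As a reminder of what the two metrics measure, here is a plain-Python sketch using their standard definitions. The input values are the three TRUE_LABEL / PREDICTED pairs from the default prediction output in the Examples section of this class; the helper names are hypothetical and the snippet does not call SAP HANA APL.

```python
import math


def mean_absolute_error(actual, predicted):
    """MAE: average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_square_error(actual, predicted):
    """RMSE: square root of the average squared difference."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


actual = [27, 33, 65]
predicted = [25, 43, 42]
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_square_error(actual, predicted)
```

RMSE penalizes large individual errors more strongly than MAE, which is why it is the larger of the two values here.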

learning_rate: float, optional

The weight parameter controlling the model regularization to avoid overfitting risk. A small value improves the model generalization to unseen dataset at the expense of the computational cost. Please refer to APL documentation for default value.

max_depth: int, optional

The maximum depth of the decision trees added as a base learner to the model at each boosting iteration. Please refer to APL documentation for default value.

max_iterations: int, optional

The maximum number of boosting iterations to fit the model. Please refer to APL documentation for default value.

number_of_jobs: int, optional

The number of threads allocated to the model training and apply parallelization. Please refer to APL documentation for default value.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value types (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals to '???', it will be taken as missing.

extra_applyout_settings: dict, optional

Determines the output of the predict() method. The possible values are:

  • By default (None value): the default output.

      • <KEY>: the key column if provided in the dataset

      • TRUE_LABEL: the actual value if provided

      • PREDICTED: the predicted value

  • {'APL/ApplyExtraMode': 'Individual Contributions'}: the feature importance for every sample

      • <KEY>: the key column if provided

      • TRUE_LABEL: the actual value if provided

      • PREDICTED: the predicted value

      • gb_contrib_<VAR1>: the contribution of the VAR1 variable to the score

      • ...

      • gb_contrib_<VARN>: the contribution of the VARN variable to the score

      • gb_contrib_constant_bias: the constant bias contribution
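
One way to read such an output row is sketched below in plain Python. The row values are hypothetical, the helpers are not hana_ml functions, and treating the contributions as additive (so that their sum, bias included, gives the score) is an assumption to verify against the SAP HANA APL documentation.

```python
# Hypothetical output row; column names follow the gb_contrib_<VAR> pattern above.
row = {
    "id": 6241,
    "TRUE_LABEL": 21,
    "gb_contrib_workclass": -1.25,
    "gb_contrib_education": 0.5,
    "gb_contrib_constant_bias": 38.5,
}


def total_contribution(row, prefix="gb_contrib_"):
    """Sum every column whose name starts with the contribution prefix."""
    return sum(value for name, value in row.items() if name.startswith(prefix))


def strongest_contributor(row, prefix="gb_contrib_", bias="gb_contrib_constant_bias"):
    """Return the variable column with the largest absolute contribution, bias excluded."""
    candidates = {
        name: value
        for name, value in row.items()
        if name.startswith(prefix) and name != bias
    }
    return max(candidates, key=lambda name: abs(candidates[name]))


score = total_contribution(row)
top_column = strongest_contributor(row)
```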

other_params: dict, optional

Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • 'correlations_lower_bound'

  • 'correlations_max_kept'

  • 'cutting_strategy'

  • 'interactions'

  • 'interactions_max_kept'

  • 'variable_selection_max_nb_of_final_variables'

See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.

other_train_apl_aliases: dict, optional

Users can provide APL aliases as advanced settings to the model. Users are free to input any possible value. Please see Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.

Notes

It is highly recommended to specify a key column in the training dataset. If not, once the model is trained, it won't be possible anymore to have a key defined in any input dataset. The key is particularly useful to join the predictions output to the input dataset.

By default, if not provided, SAP HANA APL guesses the variable description by reading the first 100 rows. But, the results may be incorrect. The user can overwrite the guessed description by explicitly setting the variable_storages, variable_value_types and variable_missing_strings parameters. For example:

model.set_params(
        variable_storages = {
            'ID': 'integer',
            'sepal length (cm)': 'number'
            })
model.set_params(
        variable_value_types = {
            'sepal length (cm)': 'continuous'
            })
model.set_params(
        variable_missing_strings = {
            'sepal length (cm)': '-1'
            })
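
To see why a guess based on the first rows can go wrong, consider the following plain-Python sketch. The `guess_storages` helper is hypothetical and is not SAP HANA APL's actual guessing logic; it only shows how a value that first appears after the sampled rows is missed, and how an explicit override corrects the result.

```python
def guess_storages(rows, sample_size=100):
    """Guess 'integer', 'number' or 'string' per column from the first rows only."""
    guessed = {}
    for row in rows[:sample_size]:
        for name, value in row.items():
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                kind = "string"
            elif isinstance(value, int):
                kind = "integer"
            else:
                kind = "number"
            previous = guessed.get(name)
            if previous is None or previous == kind:
                guessed[name] = kind
            elif {previous, kind} == {"integer", "number"}:
                guessed[name] = "number"  # mixed int/float widens to number
            else:
                guessed[name] = "string"  # anything else falls back to string
    return guessed


# 'sepal length (cm)' only shows a decimal value after the sampled rows.
rows = [{"ID": i, "sepal length (cm)": 5} for i in range(100)]
rows.append({"ID": 100, "sepal length (cm)": 5.1})

guessed = guess_storages(rows)
overrides = {"sepal length (cm)": "number"}  # explicit description wins
variable_storages = {**guessed, **overrides}
```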

Examples

>>> from hana_ml.algorithms.apl.gradient_boosting_regression import GradientBoostingRegressor
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN,
...                     'SELECT "id", "class", "capital-gain", '
...                     '"native-country", "age" from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = GradientBoostingRegressor()
>>> model.fit(hana_df, label='age', key='id')

Getting variable interactions

>>> model.set_params(other_train_apl_aliases={
...     'APL/Interactions': 'true',
...     'APL/InteractionsMaxKept': '3'
... })
>>> model.fit(hana_df, label='age', key='id')
>>> # Checks interaction info in INDICATORS table
>>> output = model.get_indicators().filter("KEY LIKE 'Interaction%'").collect()

Debriefing

>>> # Global performance metrics of the model
>>> model.get_performance_metrics()
{'L1': 7.31774, 'MeanAbsoluteError': 7.31774, 'L2': 9.42497, 'RootMeanSquareError': 9.42497, ...
>>> model.get_feature_importances()
{'Gain': OrderedDict([('class', 0.8728259801864624), ('capital-gain', 0.10493823140859604), ...

Making predictions

>>> # Default output
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
          id  TRUE_LABEL  PREDICTED
39184  21772          27         25
16537   7331          33         43
7908   35226          65         42
>>> # Individual Contributions
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
     id  TRUE_LABEL  gb_contrib_workclass  gb_contrib_fnlwgt  gb_contrib_education  ...
0  6241          21             -1.330736          -0.385088              0.373539  ...
1  6248          18             -0.784536          -2.191791             -1.788672  ...
2  6253          26             -0.773891           0.358133             -0.185864  ...

Saving the model in the schema named 'MODEL_STORAGE'

>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model, if_exists='replace')

Reloading the model for new predictions

>>> model2 = model_storage.load_model(name='My model name')
>>> out2 = model2.predict(data=hana_df)

Please see the model_storage class for further features of model storage.

Attributes
label: str

The target column name. This attribute is set when the fit() method is called. Users don't need to set it explicitly, except if the model is loaded from a table. In this case, this attribute must be set before calling predict().

model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the "SUMMARY" table generated by the model training. This table contains the content of the model training summary.

indicators_: APLArtifactTable

The reference to the "INDICATORS" table generated by the model training. This table contains various metrics related to the model and model variables.

fit_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table generated by the model training

var_desc_: APLArtifactTable

The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the "OPERATION_LOG" table when a prediction was made

Methods

build_report()

Build model report.

fit(data[, key, features, label, weight, ...])

Fits the model.

generate_html_report(filename)

Save model report as a html file.

generate_notebook_iframe_report()

Render model report as a notebook iframe.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Reference Guide.

get_evalmetrics()

Returns the values of the evaluation metric at each iteration.

get_feature_importances()

Returns the feature importances.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_model_info()

Get information about an existing model.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns the performance metrics of the last trained model.

get_predict_operation_log()

Retrieves the operation log table after a prediction.

get_summary()

Retrieves the summary table after model training.

is_fitted()

Checks if the model can be saved.

load_model(schema_name, table_name[, oid])

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

predict(data)

Generates predictions with the fitted model.

save_artifact(artifact_df, schema_name, ...)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, ...])

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

score(data)

Computes the R^2 (Coefficient of determination) indicator on the predictions of the provided dataset.

set_params(**parameters)

Sets attributes of the current model.

set_params(**parameters)

Sets attributes of the current model.

Parameters
parameters: dict

The attribute names and values

get_performance_metrics()

Returns the performance metrics of the last trained model.

Returns
A dictionary with metric name as key and metric value as value.
predict(data)

Generates predictions with the fitted model. It is possible to add special outputs, such as variable individual contributions, through the 'extra_applyout_settings' parameter in the model. This parameter is described with examples in the class section.

Parameters
data: hana_ml DataFrame

The input dataset used for prediction

Returns
Prediction output: hana_ml DataFrame
score(data)

Computes the R^2 (Coefficient of determination) indicator on the predictions of the provided dataset.

Parameters
data: hana_ml DataFrame

The dataset used for prediction. It must contain the actual target values so that the score can be computed.

Returns
R2 indicator: float
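
For reference, the standard definition of the coefficient of determination is R^2 = 1 - SS_res / SS_tot. The sketch below applies it in plain Python to the three TRUE_LABEL / PREDICTED pairs from the default prediction output in the class example; `r2_score` is a hypothetical helper, and whether the library uses exactly this formula is an assumption.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are identical")
    return 1 - ss_res / ss_tot


actual = [27, 33, 65]
predicted = [25, 43, 42]
r2 = r2_score(actual, predicted)
```

A value of 1 means the predictions match the actual values exactly; values close to 0 mean the model does little better than always predicting the mean.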
build_report()

Build model report.

fit(data, key=None, features=None, label=None, weight=None, build_report=False)

Fits the model.

Parameters
data: DataFrame

The training dataset

key: str, optional

The name of the ID column. This column will not be used as feature in the model. It will be output as row-id when prediction is made with the model. If key is not provided, an internal key is created. But this is not the recommended usage. See notes below.

features: list of str, optional

The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.

label: str, optional

The name of the label column. Default is the last column.

weight: str, optional

The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.

build_report: bool, optional

Whether to build report or not. Defaults to False.

Returns
self: object

Notes

It is highly recommended to specify a key column in the training dataset. If not, once the model is trained, it won't be possible anymore to have a key defined in any input dataset. That is particularly inconvenient to join the predictions output to the input dataset.

generate_html_report(filename)

Save model report as a html file.

Parameters
filename: str

HTML file name.

generate_notebook_iframe_report()

Render model report as a notebook iframe.

get_apl_version()

Gets the version and configuration information about the installation of SAP HANA APL.

A pandas DataFrame with detailed information about the current version.

Error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

Returns
The best iteration: int
get_debrief_report(report_name)

Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Reference Guide.

Parameters
report_name: str
Returns
Statistical report: hana_ml DataFrame
get_evalmetrics()

Returns the values of the evaluation metric at each iteration. These values are based on the estimation dataset.

Returns
A dictionary:

{'<MetricName>': <List of values>}

get_feature_importances()

Returns the feature importances.

Returns
feature importances: dict

{<importance_metric>: OrderedDict({<feature_name>: <value>})}

get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns
The reference to OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs of the last model training
get_indicators()

Retrieves the Indicator table after model training.

Returns
The reference to INDICATORS table: hana_ml DataFrame
This table provides the performance metrics of the last model training
get_model_info()

Get information about an existing model. This method is especially useful when a trained model was saved and reloaded. After having called this method, the model can provide summary and metrics again as there were in the last fit.

Returns: list. List of HANA DataFrames respectively corresponding to the following tables:

Summary table, Variable roles table, Variable description table, Indicators_table, Profit Curves_table

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.

Returns
The attribute-values of the model: dictionary
get_predict_operation_log()

Retrieves the operation log table after a prediction.

Returns
The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs about the last prediction
get_summary()

Retrieves the summary table after model training.

Returns
The reference to the SUMMARY table: hana_ml DataFrame
This contains execution summary of the last model training
is_fitted()

Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.

Returns
bool

True if the model is ready to be saved.

load_model(schema_name, table_name, oid=None)

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Loads the model from a table.

Parameters
schema_name: str

The schema name

table_name: str

The table name

oid: str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters
schema_name: str

The schema name

artifact_df: hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str. Optional.

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )
save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Warning

This method is deprecated. Please use hana_ml.model_storage.ModelStorage.

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Parameters
schema_name: str

The schema name

table_name: str

Table name

if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str. Optional.

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns
None
The model is saved into a table with the following columns:
  • "OID" NVARCHAR(50), -- Serve as ID

  • "FORMAT" NVARCHAR(50), -- APL technical info

  • "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model