hana_ml.algorithms.apl package

The Algorithms APL Package consists of the following sections:

hana_ml.algorithms.apl.classification

This module provides the SAP HANA APL binary classification algorithm.

The following classes are available:

class hana_ml.algorithms.apl.classification.AutoClassifier(conn_context, variable_auto_selection=True, polynomial_degree=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: hana_ml.algorithms.apl.robust_regression_base.RobustRegressionBase

SAP HANA APL Binary Classifier algorithm.

Parameters:
conn_context : ConnectionContext

The connection object to an SAP HANA database

variable_auto_selection : bool, optional

When set to True, variable auto-selection is activated. Variable auto-selection maintains the performance of a model while keeping the lowest possible number of variables.

polynomial_degree : int, optional

The polynomial degree of the model. Default is 1.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {‘VAR1’: ‘string’, ‘VAR2’: ‘number’}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value type (continuous, nominal, ordinal). For example, {‘VAR1’: ‘continuous’, ‘VAR2’: ‘nominal’}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {‘VAR1’: ‘???’} means anytime the variable value equals to ‘???’, it will be taken as missing.

extra_applyout_settings: dict, optional

Defines other outputs the model should generate in addition to the predicted values. For example: {‘APL/ApplyReasonCode’:‘3;Mean;Below;False’} will add reason codes in the output when the model is applied. These reason codes provide explanation about the prediction. See OPERATION_CONFIG parameters in APPLY_MODEL function, SAP HANA APL Reference Guide.

other_params: dict, optional

Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • ‘correlations_lower_bound’
  • ‘correlations_max_kept’
  • ‘cutting_strategy’
  • ‘exclude_low_predictive_confidence’
  • ‘risk_fitting’
  • ‘risk_fitting_min_cumulated_frequency’
  • ‘risk_fitting_nb_pdo’
  • ‘risk_fitting_use_weights’
  • ‘risk_gdo’
  • ‘risk_mode’
  • ‘risk_pdo’
  • ‘risk_score’
  • ‘score_bins_count’
  • ‘target_key’
  • ‘variable_selection_best_iteration’
  • ‘variable_selection_min_nb_of_final_variables’
  • ‘variable_selection_max_nb_of_final_variables’
  • ‘variable_selection_mode’
  • ‘variable_selection_nb_variables_removed_by_step’
  • ‘variable_selection_percentage_of_contribution_kept_by_step’
  • ‘variable_selection_quality_bar’
  • ‘variable_selection_quality_criteria’

See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.

Notes

It is highly recommended to provide a dataset with a key to the fit() method. Otherwise, once the model is trained, it will no longer be possible to call the predict() method on a dataset with a key, because the model will not expect the key column.

By default, when the variable description is not given, SAP HANA APL infers it by reading the first 100 rows. This guess is not always correct. By explicitly providing values in these parameters, the user can override the default. For example:

model.set_params(
        variable_storages = {
            'ID': 'integer',
            'sepal length (cm)': 'number'
            })
model.set_params(
        variable_value_types = {
            'sepal length (cm)': 'continuous'
            })
model.set_params(
        variable_missing_strings = {
            'sepal length (cm)': '-1'
            })
model.set_params(
    extra_applyout_settings={
            'APL/ApplyReasonCode':'3;Mean;Below;False'
            })

Examples

>>> from hana_ml.algorithms.apl.classification import AutoClassifier
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = AutoClassifier(conn_context=CONN, variable_auto_selection=True)
>>> model.fit(hana_df, label='class', key='id')

Making the predictions

>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  TRUE_LABEL  PREDICTED  PROBABILITY
0   30           0          0     0.688153
1   63           0          0     0.677693
2   66           0          0     0.700221

Debriefing

>>> model.get_performance_metrics()
OrderedDict([('L1', 0.2522171212463023), ('L2', 0.32254434028379236), ...
>>> model.get_feature_importances()
OrderedDict([('marital-status', 0.2172766583204266), ('capital-gain', 0.19521247617062215),...
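The debrief outputs above are plain OrderedDict objects, so they can be post-processed with ordinary Python. A minimal sketch (the importance values below are illustrative, not real model output):

```python
from collections import OrderedDict

# Values shaped like model.get_feature_importances() output (illustrative numbers)
importances = OrderedDict([
    ('marital-status', 0.2173),
    ('capital-gain', 0.1952),
    ('age', 0.0831),
])

# Keep only the features contributing at least 10% to the model
strong_features = [name for name, contrib in importances.items() if contrib >= 0.10]
```
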

Saving the model

>>> model.save_model(schema_name='MySchema', table_name='MyTable',if_exists='replace')

Reloading model and predicting

>>> model2 = AutoClassifier(conn_context=CONN)
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.collect()
    id  TRUE_LABEL  PREDICTED  PROBABILITY
0   30           0          0     0.688153
1   63           0          0     0.677693
2   66           0          0     0.700221

Attributes:
model_ : hana_ml DataFrame

The trained model content

summary_ : APLArtifactTable

The reference to the “SUMMARY” table generated by the model training. This table contains the content of the model training summary.

indicators_ : APLArtifactTable

The reference to the “INDICATORS” table generated by the model training. This table contains various metrics related to model and model variables.

fit_operation_logs_: APLArtifactTable

The reference to the “OPERATION_LOG” table generated by the model training

var_desc_ : APLArtifactTable

The reference to the “VARIABLE_DESCRIPTION” table that was built during the model training

applyout_ : hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the “OPERATION_LOG” table when a prediction was made

Methods

fit(data[, key, features, label]) Fits the model.
get_feature_importances([method]) Gets the feature importances (MaximumSmartVariableContribution).
get_fit_operation_log() Retrieves the operation log table after the model training.
get_indicators() Retrieves the Indicator table after model training.
get_params() Retrieves attributes of the current object.
get_performance_metrics() Gets the model performance metrics of the last model training.
get_predict_operation_log() Retrieves the operation log table after a prediction.
get_summary() Retrieves the summary table after model training.
load_model(schema_name, table_name[, oid]) Loads the model from a table.
predict(data) Makes predictions with the fitted model.
save_artifact(artifact_df, schema_name, …) Saves an artifact, a temporary table, into a permanent table.
save_model(schema_name, table_name[, …]) Saves the model into a table.
score(data) Returns the mean accuracy on the provided test dataset.
set_params(**parameters) Sets attributes of the current model.

fit(data, key=None, features=None, label=None)

Fits the model.

Parameters:
data : DataFrame

The training dataset

key : str, optional

The name of the ID column. This column will not be used as a feature in the model; it is returned as the row identifier when predictions are made. If key is not provided, an internal key is created, but this usage is not recommended. See notes below.

features : list of str, optional

The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.

label : str, optional

The name of the label column. Default is the last column.

Returns:
self : object

Notes

It is highly recommended to provide a dataset with a key to the fit() method. Otherwise, once the model is trained, it will no longer be possible to call the predict() method on a dataset with a key, because the model will not expect one.

get_feature_importances(method=None)

Gets the feature importances (MaximumSmartVariableContribution).

Parameters:
method : str, optional

The method used to measure the feature contributions. It is only used by the gradient boosting algorithm and is ignored for binary classification and regression.

Returns:
feature importances : An OrderedDict { feature_name : importance_value }
get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns:
The reference to OPERATION_LOG table : hana_ml DataFrame
This table provides detailed logs of the last model training
get_indicators()

Retrieves the Indicator table after model training.

Returns:
The reference to INDICATORS table : hana_ml DataFrame
This table provides the performance metrics of the last model training
get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.

Returns:
The attribute-values of the model : dictionary
get_performance_metrics()

Gets the model performance metrics of the last model training.

Returns:
An OrderedDict with metric name as key and metric value as value.
For example:
OrderedDict([('L1', 8.59885654599923), ('L2', 11.012352163260505), ('LInf', 67.0), ('ErrorMean', 0.33833594458645944), ...
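Since the returned object is an OrderedDict, individual metrics can be looked up by name; a sketch with illustrative values:

```python
from collections import OrderedDict

# Shaped like get_performance_metrics() output (illustrative values)
metrics = OrderedDict([('L1', 8.5989), ('L2', 11.0124), ('LInf', 67.0)])

l2_error = metrics['L2']        # direct lookup by metric name
has_auc = 'AUC' in metrics      # check availability before reading a metric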

get_predict_operation_log()

Retrieves the operation log table after a prediction was made.

Returns:
The reference to the OPERATION_LOG table : hana_ml DataFrame
This table provides detailed logs about the last prediction
get_summary()

Retrieves the summary table after model training.

Returns:
The reference to the SUMMARY table : hana_ml DataFrame
This contains execution summary of the last model training
load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Parameters:
schema_name: str

The schema name

table_name: str

The table name

oid : str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.

predict(data)

Makes predictions with the fitted model. It is possible to add special outputs, such as reason codes, by specifying the extra_applyout_settings parameter in the model. This parameter is explained above in the model class section.

Parameters:
data : hana_ml DataFrame

The dataset used for prediction

Returns:
Prediction output: hana_ml DataFrame
The dataframe contains the following columns:
- KEY : the key column if it was provided in the dataset
- TRUE_LABEL : the class label when it was given in the dataset
- PREDICTED : the predicted label
- PROBABILITY : the probability that the current row is predicted as positive
- SCORING_VALUE : the unnormalized scoring value
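Once collected, the applyout rows can be filtered client-side with ordinary Python. A sketch using hypothetical rows shaped like the columns above:

```python
# Rows shaped like the collected predict() output (hypothetical values)
rows = [
    {'id': 30, 'TRUE_LABEL': 0, 'PREDICTED': 0, 'PROBABILITY': 0.688153},
    {'id': 63, 'TRUE_LABEL': 0, 'PREDICTED': 1, 'PROBABILITY': 0.477693},
    {'id': 66, 'TRUE_LABEL': 0, 'PREDICTED': 0, 'PROBABILITY': 0.700221},
]

# Keep only predictions where the positive-class probability is decisive
confident = [r for r in rows if r['PROBABILITY'] >= 0.6]

# Row ids where the prediction disagrees with the true label
misclassified = [r['id'] for r in rows if r['PREDICTED'] != r['TRUE_LABEL']]
```
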
save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters:
schema_name: str

The schema name

artifact_df : hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’
The behavior when the table already exists:
  • fail: Raises a ValueError
  • replace: Drops the table before inserting new values
  • append: Inserts new values to the existing table
new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )
save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists=’replace’), or an existing table (if_exists=’append’). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Parameters:
schema_name: str

The schema name

table_name: str

Table name

if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’
The behavior when the table already exists:
  • fail: Raises an Error
  • replace: Drops the table before inserting new values
  • append: Inserts new values to the existing table
new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns:
None
The model is saved into a table with the following columns:
  • “OID” NVARCHAR(50), – Serve as ID
  • “FORMAT” NVARCHAR(50), – APL technical info
  • “LOB” CLOB MEMORY THRESHOLD NULL – binary content of the model
score(data)

Returns the mean accuracy on the provided test dataset.

Parameters:
data : hana_ml DataFrame

The test dataset used to compute the score. The labels must be provided in the dataset.

Returns:
mean accuracy : float
set_params(**parameters)

Sets attributes of the current model. This method is implemented for compatibility with Scikit Learn.

Parameters:
params : dictionary

The attribute names and values

hana_ml.algorithms.apl.regression

This module contains SAP HANA APL regression algorithm.

The following classes are available:

class hana_ml.algorithms.apl.regression.AutoRegressor(conn_context, variable_auto_selection=True, polynomial_degree=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: hana_ml.algorithms.apl.robust_regression_base.RobustRegressionBase

SAP HANA APL Regression algorithm.

Parameters:
conn_context : ConnectionContext

The connection object to an SAP HANA database

variable_auto_selection : bool, optional

When set to True, variable auto-selection is activated. Variable auto-selection maintains the performance of a model while keeping the lowest possible number of variables.

polynomial_degree : int, optional

The polynomial degree of the model. Default is 1.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {‘VAR1’: ‘string’, ‘VAR2’: ‘number’}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value type (continuous, nominal, ordinal). For example, {‘VAR1’: ‘continuous’, ‘VAR2’: ‘nominal’}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {‘VAR1’: ‘???’} means anytime the variable value equals to ‘???’, it will be taken as missing.

extra_applyout_settings: dict, optional

Defines other outputs the model should generate in addition to the predicted values. For example: {‘APL/ApplyReasonCode’:‘3;Mean;Below;False’} will add reason codes in the output when the model is applied. These reason codes provide explanation about the prediction. See OPERATION_CONFIG parameters in APPLY_MODEL function, SAP HANA APL Reference Guide.

other_params: dict, optional

Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • ‘correlations_lower_bound’
  • ‘correlations_max_kept’
  • ‘cutting_strategy’
  • ‘exclude_low_predictive_confidence’
  • ‘risk_fitting’
  • ‘risk_fitting_min_cumulated_frequency’
  • ‘risk_fitting_nb_pdo’
  • ‘risk_fitting_use_weights’
  • ‘risk_gdo’
  • ‘risk_mode’
  • ‘risk_pdo’
  • ‘risk_score’
  • ‘score_bins_count’
  • ‘variable_auto_selection’
  • ‘variable_selection_best_iteration’
  • ‘variable_selection_min_nb_of_final_variables’
  • ‘variable_selection_max_nb_of_final_variables’
  • ‘variable_selection_mode’
  • ‘variable_selection_nb_variables_removed_by_step’
  • ‘variable_selection_percentage_of_contribution_kept_by_step’
  • ‘variable_selection_quality_bar’
  • ‘variable_selection_quality_criteria’

See Common APL Aliases for Model Training in SAP HANA APL Reference Guide.

Notes

It is highly recommended to provide a dataset with a key to the fit() method. Otherwise, once the model is trained, it will no longer be possible to call the predict() method on a dataset with a key, because the model will not expect the key column.

Examples

>>> from hana_ml.algorithms.apl.regression import AutoRegressor
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA Database

>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates Hana DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = AutoRegressor(conn_context=CONN, variable_auto_selection=True)
>>> model.fit(hana_df, label='age',
...      features=['workclass', 'fnlwgt', 'education', 'education-num', 'marital-status'],
...      key='id')

Making a prediction

>>> applyout_df = model.predict(hana_df)
>>> print(applyout_df.head(5).collect())
          id  TRUE_LABEL  PREDICTED
0         30          49         42
1         63          48         42
2         66          36         42
3        110          42         42
4        335          53         42

Debriefing

>>> model.get_performance_metrics()
OrderedDict([('L1', 8.59885654599923), ('L2', 11.012352163260505)...
>>> model.get_feature_importances()
OrderedDict([('marital-status', 0.7916100739306074), ('education-num', 0.13524836400650087)

Saving the model

>>> model.save_model(schema_name='MySchema', table_name='MyTable',if_exists='replace')

Reloading the model and making another prediction

>>> model2 = AutoRegressor(conn_context=CONN)
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.head(5).collect()
          id  TRUE_LABEL  PREDICTED
0         30          49         42
1         63          48         42
2         66          36         42
3        110          42         42
4        335          53         42

Methods

fit(data[, key, features, label]) Fits the model.
get_feature_importances([method]) Gets the feature importances (MaximumSmartVariableContribution).
get_fit_operation_log() Retrieves the operation log table after the model training.
get_indicators() Retrieves the Indicator table after model training.
get_params() Retrieves attributes of the current object.
get_performance_metrics() Gets the model performance metrics of the last model training.
get_predict_operation_log() Retrieves the operation log table after a prediction.
get_summary() Retrieves the summary table after model training.
load_model(schema_name, table_name[, oid]) Loads the model from a table.
predict(data) Makes predictions with a fitted model.
save_artifact(artifact_df, schema_name, …) Saves an artifact, a temporary table, into a permanent table.
save_model(schema_name, table_name[, …]) Saves the model into a table.
score(data) Returns the coefficient of determination R^2 of the prediction.
set_params(**parameters) Sets attributes of the current model.

fit(data, key=None, features=None, label=None)

Fits the model.

Parameters:
data : DataFrame

The training dataset

key : str, optional

The name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

The names of the feature columns. If features is not provided, all non-ID and non-label columns will be used.

label : str, optional

The name of the label column. Default is the last column.

Returns:
self : object

Notes

It is highly recommended to provide a dataset with a key to the fit() method. Otherwise, once the model is trained, it will no longer be possible to call the predict() method on a dataset with a key, because the model will not expect one.

get_feature_importances(method=None)

Gets the feature importances (MaximumSmartVariableContribution).

Parameters:
method : str, optional

The method used to measure the feature contributions. It is only used by the gradient boosting algorithm and is ignored for binary classification and regression.

Returns:
feature importances : An OrderedDict { feature_name : importance_value }
get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns:
The reference to OPERATION_LOG table : hana_ml DataFrame
This table provides detailed logs of the last model training
get_indicators()

Retrieves the Indicator table after model training.

Returns:
The reference to INDICATORS table : hana_ml DataFrame
This table provides the performance metrics of the last model training
get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.

Returns:
The attribute-values of the model : dictionary
get_performance_metrics()

Gets the model performance metrics of the last model training.

Returns:
An OrderedDict with metric name as key and metric value as value.
For example:
OrderedDict([('L1', 8.59885654599923), ('L2', 11.012352163260505), ('LInf', 67.0), ('ErrorMean', 0.33833594458645944), ...

get_predict_operation_log()

Retrieves the operation log table after a prediction was made.

Returns:
The reference to the OPERATION_LOG table : hana_ml DataFrame
This table provides detailed logs about the last prediction
get_summary()

Retrieves the summary table after model training.

Returns:
The reference to the SUMMARY table : hana_ml DataFrame
This contains execution summary of the last model training
load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Parameters:
schema_name: str

The schema name

table_name: str

The table name

oid : str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.

predict(data)

Makes predictions with a fitted model. It is possible to add special outputs, such as reason codes, by specifying the extra_applyout_settings parameter in the model. This parameter is explained above in the model class section.

Parameters:
data : hana_ml DataFrame

The dataset used for prediction

Returns:
Prediction output: a hana_ml DataFrame.
The dataframe contains the following columns:
- KEY : the key column if it was provided in the dataset
- TRUE_LABEL : the true value if it was provided in the dataset
- PREDICTED : the predicted value
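The TRUE_LABEL and PREDICTED columns make client-side error checks straightforward once the output is collected; a sketch with hypothetical rows:

```python
# Rows shaped like the collected regression predict() output (hypothetical values)
rows = [
    {'id': 30, 'TRUE_LABEL': 49, 'PREDICTED': 42},
    {'id': 63, 'TRUE_LABEL': 48, 'PREDICTED': 42},
    {'id': 66, 'TRUE_LABEL': 36, 'PREDICTED': 42},
]

# Mean absolute error over the collected rows
mae = sum(abs(r['TRUE_LABEL'] - r['PREDICTED']) for r in rows) / len(rows)
```
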
save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters:
schema_name: str

The schema name

artifact_df : hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’
The behavior when the table already exists:
  • fail: Raises a ValueError
  • replace: Drops the table before inserting new values
  • append: Inserts new values to the existing table
new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )
save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists=’replace’), or an existing table (if_exists=’append’). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Parameters:
schema_name: str

The schema name

table_name: str

Table name

if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’
The behavior when the table already exists:
  • fail: Raises an Error
  • replace: Drops the table before inserting new values
  • append: Inserts new values to the existing table
new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns:
None
The model is saved into a table with the following columns:
  • “OID” NVARCHAR(50), – Serve as ID
  • “FORMAT” NVARCHAR(50), – APL technical info
  • “LOB” CLOB MEMORY THRESHOLD NULL – binary content of the model
score(data)

Returns the coefficient of determination R^2 of the prediction.

Parameters:
data : hana_ml DataFrame

The dataset used to compute the score. It must contain the true values so that the score can be computed.

Returns:
R^2 (coefficient of determination) : float
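The coefficient of determination can be reproduced client-side from collected true values and predictions. A sketch of the standard formula with illustrative numbers (this mirrors, but does not call, the server-side computation):

```python
# Illustrative true values and predictions
y_true = [49, 48, 36, 42, 53]
y_pred = [42, 42, 42, 42, 42]

mean_y = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
ss_tot = sum((t - mean_y) ** 2 for t in y_true)              # total sum of squares
r2 = 1 - ss_res / ss_tot   # can be negative when the model underperforms the mean
```
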
set_params(**parameters)

Sets attributes of the current model. This method is implemented for compatibility with Scikit Learn.

Parameters:
params : dictionary

The attribute names and values