hana_ml.algorithms.apl package

The Algorithms APL Package consists of the following sections:

hana_ml.algorithms.apl.classification

This module provides the SAP HANA APL binary classification algorithm.

The following classes are available:

class hana_ml.algorithms.apl.classification.AutoClassifier(conn_context, variable_auto_selection=True, polynomial_degree=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: hana_ml.algorithms.apl.robust_regression_base.RobustRegressionBase

SAP HANA APL Binary Classifier algorithm.

Parameters

conn_context : ConnectionContext

The connection object to an SAP HANA database

variable_auto_selection : bool, optional

When set to True, variable auto-selection is activated. Variable auto-selection makes it possible to maintain the performance of a model while keeping the number of variables as low as possible.

polynomial_degree : int, optional

The polynomial degree of the model. Default is 1.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value type (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means that any time the variable value equals '???', it will be taken as missing.

extra_applyout_settings: dict, optional

Defines other outputs the model should generate in addition to the predicted values. For example, {'APL/ApplyReasonCode': '3;Mean;Below;False'} will add reason codes to the output when the model is applied. These reason codes explain the prediction. See the OPERATION_CONFIG parameters of the APPLY_MODEL function in the SAP HANA APL Reference Guide.

other_params: dict, optional

Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • ‘correlations_lower_bound’

  • ‘correlations_max_kept’

  • ‘cutting_strategy’

  • ‘exclude_low_predictive_confidence’

  • ‘risk_fitting’

  • ‘risk_fitting_min_cumulated_frequency’

  • ‘risk_fitting_nb_pdo’

  • ‘risk_fitting_use_weights’

  • ‘risk_gdo’

  • ‘risk_mode’

  • ‘risk_pdo’

  • ‘risk_score’

  • ‘score_bins_count’

  • ‘target_key’

  • ‘variable_selection_best_iteration’

  • ‘variable_selection_min_nb_of_final_variables’

  • ‘variable_selection_max_nb_of_final_variables’

  • ‘variable_selection_mode’

  • ‘variable_selection_nb_variables_removed_by_step’

  • ‘variable_selection_percentage_of_contribution_kept_by_step’

  • ‘variable_selection_quality_bar’

  • ‘variable_selection_quality_criteria’

See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.
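Since other_params is a free-form dictionary of keyword arguments, it can help to check the names before building the model. The following is a minimal sketch in plain Python, assuming a hypothetical helper check_other_params that is not part of hana_ml. The set of names is copied from the list above, and the values in the example dictionary are placeholders, not recommended settings.

```python
# Names copied from the parameter list above. The helper is hypothetical
# and is not part of hana_ml.
ALLOWED_OTHER_PARAMS = frozenset({
    'correlations_lower_bound',
    'correlations_max_kept',
    'cutting_strategy',
    'exclude_low_predictive_confidence',
    'risk_fitting',
    'risk_fitting_min_cumulated_frequency',
    'risk_fitting_nb_pdo',
    'risk_fitting_use_weights',
    'risk_gdo',
    'risk_mode',
    'risk_pdo',
    'risk_score',
    'score_bins_count',
    'target_key',
    'variable_selection_best_iteration',
    'variable_selection_min_nb_of_final_variables',
    'variable_selection_max_nb_of_final_variables',
    'variable_selection_mode',
    'variable_selection_nb_variables_removed_by_step',
    'variable_selection_percentage_of_contribution_kept_by_step',
    'variable_selection_quality_bar',
    'variable_selection_quality_criteria',
})


def check_other_params(params):
    """Return the sorted list of keys that are not in the documented list."""
    return sorted(set(params) - ALLOWED_OTHER_PARAMS)


# Placeholder values; 'target_keys' is a deliberate typo for 'target_key'.
advanced = {'score_bins_count': 20, 'target_keys': '1'}
unknown = check_other_params(advanced)
```

If the returned list is empty, the dictionary can be unpacked into the constructor, for example AutoClassifier(conn_context=CONN, **advanced).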

Notes

It is highly recommended to train on a dataset that has a key column and to pass that key to the fit() method. Otherwise, once the model is trained, the predict() method can no longer be used on a dataset with a key, because the model will not expect one.

By default, when the variable description is not given, SAP HANA APL guesses it by reading the first 100 rows. However, this guess is sometimes incorrect. By explicitly providing values for these parameters, the user can override the default guess. For example:

model.set_params(
        variable_storages = {
            'ID': 'integer',
            'sepal length (cm)': 'number'
            })
model.set_params(
        variable_value_types = {
            'sepal length (cm)': 'continuous'
            })
model.set_params(
        variable_missing_strings = {
            'sepal length (cm)': '-1'
            })
model.set_params(
    extra_applyout_settings={
            'APL/ApplyReasonCode':'3;Mean;Below;False'
            })
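To see why a guess based on the first rows can be wrong, consider the toy sketch below. It is not SAP HANA APL's actual detection logic. guess_storage is a hypothetical function that inspects only a sample of the values, so a column whose first 100 values look like integers is labeled 'integer' even though a later value is a decimal number.

```python
def guess_storage(values, sample_size=100):
    """Guess 'integer', 'number' or 'string' from the first sample_size values.

    Toy illustration only; this is not how SAP HANA APL detects storages.
    """
    sample = values[:sample_size]
    try:
        for value in sample:
            int(value)
        return 'integer'
    except ValueError:
        pass
    try:
        for value in sample:
            float(value)
        return 'number'
    except ValueError:
        return 'string'


# 100 integer-looking values followed by one decimal value
column = [str(i) for i in range(100)] + ['5.1']
guessed = guess_storage(column)                  # inspects the first 100 values only
full_scan = guess_storage(column, len(column))   # inspects every value
```

Passing variable_storages explicitly, as in the set_params() calls above, avoids relying on such a guess.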

Examples

>>> from hana_ml.algorithms.apl.classification import AutoClassifier
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = AutoClassifier(conn_context=CONN, variable_auto_selection=True)
>>> model.fit(hana_df, label='class', key='id')

Making the predictions

>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  TRUE_LABEL  PREDICTED  PROBABILITY
0   30           0          0     0.688153
1   63           0          0     0.677693
2   66           0          0     0.700221

Debriefing

>>> model.get_performance_metrics()
OrderedDict([('L1', 0.2522171212463023), ('L2', 0.32254434028379236), ...
>>> model.get_feature_importances()
OrderedDict([('marital-status', 0.2172766583204266), ('capital-gain', 0.19521247617062215),...

Saving the model

>>> model.save_model(schema_name='MySchema', table_name='MyTable',if_exists='replace')

Reloading model and predicting

>>> model2 = AutoClassifier(conn_context=CONN)
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.collect()
    id  TRUE_LABEL  PREDICTED  PROBABILITY
0   30           0          0     0.688153
1   63           0          0     0.677693
2   66           0          0     0.700221

Attributes

model_

(hana_ml DataFrame) The trained model content

summary_

(APLArtifactTable) The reference to the “SUMMARY” table generated by the model training. This table contains the content of the model training summary.

indicators_

(APLArtifactTable) The reference to the “INDICATORS” table generated by the model training. This table contains various metrics related to model and model variables.

fit_operation_logs_: APLArtifactTable

The reference to the “OPERATION_LOG” table generated by the model training

var_desc_

(APLArtifactTable) The reference to the “VARIABLE_DESCRIPTION” table that was built during the model training

applyout_

(hana_ml DataFrame) The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the “OPERATION_LOG” table when a prediction was made

Methods

fit(data[, key, features, label])

Fits the model.

get_feature_importances()

Returns the feature importances (MaximumSmartVariableContribution).

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns the performance metrics of the last trained model.

get_predict_operation_log()

Retrieves the operation log table after the prediction.

get_summary()

Retrieves the summary table after model training.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data)

Makes predictions with the fitted model.

save_artifact(artifact_df, schema_name, …)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, …])

Saves the model into a table.

score(data)

Returns the mean accuracy on the provided test dataset.

set_params(**parameters)

Sets attributes of the current model.

fit(data, key=None, features=None, label=None)

Fits the model.

Parameters

data : DataFrame

The training dataset

key : str, optional

The name of the ID column. This column will not be used as a feature in the model. It will be output as the row ID when a prediction is made with the model. If key is not provided, an internal key is created, but this usage is not recommended. See notes below.

features : list of str, optional

The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.

label : str, optional

The name of the label column. Default is the last column.

Returns

self : object

Notes

It is highly recommended to pass a key to the fit() method. Otherwise, once the model is trained, the predict() method can no longer be used on a dataset with a key, because the model will not expect one.

predict(data)

Makes predictions with the fitted model. It is possible to add special outputs, such as reason codes, by specifying the extra_applyout_settings parameter in the model. This parameter is explained above in the model class section.

Parameters

data : hana_ml DataFrame

The dataset used for prediction

Returns

Prediction output: hana_ml DataFrame

The dataframe contains the following columns:

- KEY : the key column if it was provided in the dataset

- TRUE_LABEL : the class label when it was given in the dataset

- PREDICTED : the predicted label

- PROBABILITY : the probability of the predicted label to be correct (confidence)

- SCORING_VALUE : the unnormalized scoring value

score(data)

Returns the mean accuracy on the provided test dataset.

Parameters

data : hana_ml DataFrame

The test dataset used to compute the score. The labels must be provided in the dataset.

Returns

mean average accuracy: float
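As a rough illustration of what a mean accuracy represents, the sketch below computes the share of rows whose predicted label equals the true label. It is plain Python working on hypothetical lists that stand in for the TRUE_LABEL and PREDICTED columns, not the library's implementation.

```python
def mean_accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError('both sequences must have the same length')
    if not true_labels:
        raise ValueError('at least one row is required')
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return hits / len(true_labels)


true_labels = [0, 0, 1, 1]   # hypothetical TRUE_LABEL column
predicted = [0, 0, 0, 1]     # hypothetical PREDICTED column
accuracy = mean_accuracy(true_labels, predicted)
```

Three of the four hypothetical rows match, so accuracy is 0.75 here.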

get_feature_importances()

Returns the feature importances (MaximumSmartVariableContribution).

Returns

feature importances : An OrderedDict mapping each feature name to its importance value

get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns

The reference to OPERATION_LOG table : hana_ml DataFrame

This table provides detailed logs of the last model training

get_indicators()

Retrieves the Indicator table after model training.

Returns

The reference to INDICATORS table : hana_ml DataFrame

This table provides the performance metrics of the last model training

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.

Returns

The attribute-values of the model : dictionary

get_performance_metrics()

Returns the performance metrics of the last trained model.

Returns

An OrderedDict with metric name as key and metric value as value.

For example:

OrderedDict([('L1', 8.59885654599923), ('L2', 11.012352163260505), ('LInf', 67.0), ('ErrorMean', 0.33833594458645944), …

get_predict_operation_log()

Retrieves the operation log table after the prediction.

Returns

The reference to the OPERATION_LOG table : hana_ml DataFrame

This table provides detailed logs about the last prediction

get_summary()

Retrieves the summary table after model training.

Returns

The reference to the SUMMARY table : hana_ml DataFrame

This contains execution summary of the last model training

load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Parameters

schema_name: str

The schema name

table_name: str

The table name

oid : str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters

schema_name: str

The schema name

artifact_df : hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str, {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )

save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace') or into an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Parameters

schema_name: str

The schema name

table_name: str

Table name

if_exists: str, {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns

None

The model is saved into a table with the following columns:

  • “OID” NVARCHAR(50), – Serve as ID

  • “FORMAT” NVARCHAR(50), – APL technical info

  • “LOB” CLOB MEMORY THRESHOLD NULL – binary content of the model
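The three if_exists behaviors can be pictured with an in-memory stand-in for the database. The sketch below is not how hana_ml stores models. save_rows and the tables dictionary are hypothetical, and the exception type raised for 'fail' is an assumption made for the illustration.

```python
def save_rows(tables, table_name, rows, if_exists='fail'):
    """Mimic the documented if_exists options on a dict of row lists."""
    if if_exists not in ('fail', 'replace', 'append'):
        raise ValueError('unknown if_exists value: %r' % if_exists)
    if table_name in tables:
        if if_exists == 'fail':
            raise ValueError('table %s already exists' % table_name)
        if if_exists == 'replace':
            tables[table_name] = []   # drop the existing content
    else:
        tables[table_name] = []       # the table did not exist yet
    tables[table_name].extend(rows)


tables = {}
save_rows(tables, 'MyTable', [{'OID': 'model_a'}])
save_rows(tables, 'MyTable', [{'OID': 'model_b'}], if_exists='append')
oids_after_append = [row['OID'] for row in tables['MyTable']]
save_rows(tables, 'MyTable', [{'OID': 'model_c'}], if_exists='replace')
oids_after_replace = [row['OID'] for row in tables['MyTable']]
```

With 'append', each model needs its own OID so that load_model() can later select it through the oid argument.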

set_params(**parameters)

Sets attributes of the current model. This method is implemented for compatibility with Scikit Learn.

Parameters

params : dictionary

The attribute names and values

hana_ml.algorithms.apl.regression

This module contains the SAP HANA APL regression algorithm.

The following classes are available:

class hana_ml.algorithms.apl.regression.AutoRegressor(conn_context, variable_auto_selection=True, polynomial_degree=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: hana_ml.algorithms.apl.robust_regression_base.RobustRegressionBase

SAP HANA APL regression algorithm.

Parameters

conn_context : ConnectionContext

The connection object to an SAP HANA database

variable_auto_selection : bool, optional

When set to True, variable auto-selection is activated. Variable auto-selection makes it possible to maintain the performance of a model while keeping the number of variables as low as possible.

polynomial_degree : int, optional

The polynomial degree of the model. Default is 1.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value type (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means that any time the variable value equals '???', it will be taken as missing.

extra_applyout_settings: dict, optional

Defines other outputs the model should generate in addition to the predicted values. For example, {'APL/ApplyReasonCode': '3;Mean;Below;False'} will add reason codes to the output when the model is applied. These reason codes explain the prediction. See the OPERATION_CONFIG parameters of the APPLY_MODEL function in the SAP HANA APL Reference Guide.

other_params: dict, optional

Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • ‘correlations_lower_bound’

  • ‘correlations_max_kept’

  • ‘cutting_strategy’

  • ‘exclude_low_predictive_confidence’

  • ‘risk_fitting’

  • ‘risk_fitting_min_cumulated_frequency’

  • ‘risk_fitting_nb_pdo’

  • ‘risk_fitting_use_weights’

  • ‘risk_gdo’

  • ‘risk_mode’

  • ‘risk_pdo’

  • ‘risk_score’

  • ‘score_bins_count’

  • ‘variable_auto_selection’

  • ‘variable_selection_best_iteration’

  • ‘variable_selection_min_nb_of_final_variables’

  • ‘variable_selection_max_nb_of_final_variables’

  • ‘variable_selection_mode’

  • ‘variable_selection_nb_variables_removed_by_step’

  • ‘variable_selection_percentage_of_contribution_kept_by_step’

  • ‘variable_selection_quality_bar’

  • ‘variable_selection_quality_criteria’

See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.

Notes

It is highly recommended to train on a dataset that has a key column and to pass that key to the fit() method. Otherwise, once the model is trained, the predict() method can no longer be used on a dataset with a key, because the model will not expect one.

Examples

>>> from hana_ml.algorithms.apl.regression import AutoRegressor
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA Database

>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = AutoRegressor(conn_context=CONN, variable_auto_selection=True)
>>> model.fit(hana_df, label='age',
...      features=['workclass', 'fnlwgt', 'education', 'education-num', 'marital-status'],
...      key='id')

Making a prediction

>>> applyout_df = model.predict(hana_df)
>>> print(applyout_df.head(5).collect())
          id  TRUE_LABEL  PREDICTED
0         30          49         42
1         63          48         42
2         66          36         42
3        110          42         42
4        335          53         42

Debriefing

>>> model.get_performance_metrics()
OrderedDict([('L1', 8.59885654599923), ('L2', 11.012352163260505)...
>>> model.get_feature_importances()
OrderedDict([('marital-status', 0.7916100739306074), ('education-num', 0.13524836400650087)

Saving the model

>>> model.save_model(schema_name='MySchema', table_name='MyTable',if_exists='replace')

Reloading the model and making another prediction

>>> model2 = AutoRegressor(conn_context=CONN)
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.head(5).collect()
          id  TRUE_LABEL  PREDICTED
0         30          49         42
1         63          48         42
2         66          36         42
3        110          42         42
4        335          53         42

Methods

fit(data[, key, features, label])

Fits the model.

get_feature_importances()

Returns the feature importances (MaximumSmartVariableContribution).

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns the performance metrics of the last trained model.

get_predict_operation_log()

Retrieves the operation log table after the prediction.

get_summary()

Retrieves the summary table after model training.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data)

Makes prediction with a fitted model.

save_artifact(artifact_df, schema_name, …)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, …])

Saves the model into a table.

score(data)

Returns the coefficient of determination R^2 of the prediction.

set_params(**parameters)

Sets attributes of the current model.

fit(data, key=None, features=None, label=None)

Fits the model.

Parameters

data : DataFrame

The training dataset

key : str, optional

The name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

The names of the feature columns. If features is not provided, all non-ID and non-label columns are used.

label : str, optional

The name of the label column. Default is the last column.

Returns

self : object

Notes

It is highly recommended to pass a key to the fit() method. Otherwise, once the model is trained, the predict() method can no longer be used on a dataset with a key, because the model will not expect one.

predict(data)

Makes predictions with a fitted model. It is possible to add special outputs, such as reason codes, by specifying the extra_applyout_settings parameter in the model. This parameter is explained above in the model class section.

Parameters

data : hana_ml DataFrame

The dataset used for prediction

Returns

Prediction output: a hana_ml DataFrame.

The dataframe contains the following columns:

- KEY : the key column if it was provided in the dataset

- TRUE_LABEL : the true value if it was provided in the dataset

- PREDICTED : the predicted value

score(data)

Returns the coefficient of determination R^2 of the prediction.

Parameters

data : hana_ml DataFrame

The dataset used for prediction. It must contain the true values so that the score can be computed.

Returns

coefficient of determination R^2: float
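For reference, the coefficient of determination is commonly defined as R^2 = 1 - SS_res / SS_tot. The sketch below applies that textbook definition in plain Python to the five TRUE_LABEL and PREDICTED values shown in the AutoRegressor example above. r2_score is a hypothetical helper, and the sketch makes no claim about how the library computes the score internally.

```python
def r2_score(true_values, predicted_values):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    n = len(true_values)
    if n == 0 or n != len(predicted_values):
        raise ValueError('sequences must be non-empty and of equal length')
    mean_true = sum(true_values) / n
    # Residual sum of squares: squared gaps between truth and prediction
    ss_res = sum((t - p) ** 2 for t, p in zip(true_values, predicted_values))
    # Total sum of squares: squared gaps between truth and its mean
    ss_tot = sum((t - mean_true) ** 2 for t in true_values)
    if ss_tot == 0:
        raise ValueError('R^2 is undefined when all true values are equal')
    return 1.0 - ss_res / ss_tot


# TRUE_LABEL and PREDICTED values of the five rows in the example output
true_age = [49, 48, 36, 42, 53]
predicted_age = [42, 42, 42, 42, 42]
r2 = r2_score(true_age, predicted_age)
```

On these five rows the constant prediction lies away from the mean of the true values, so R^2 comes out negative. A perfect prediction gives 1.0, and always predicting the mean of the true values gives 0.0.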

get_feature_importances()

Returns the feature importances (MaximumSmartVariableContribution).

Returns

feature importances : An OrderedDict mapping each feature name to its importance value

get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns

The reference to OPERATION_LOG table : hana_ml DataFrame

This table provides detailed logs of the last model training

get_indicators()

Retrieves the Indicator table after model training.

Returns

The reference to INDICATORS table : hana_ml DataFrame

This table provides the performance metrics of the last model training

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.

Returns

The attribute-values of the model : dictionary

get_performance_metrics()

Returns the performance metrics of the last trained model.

Returns

An OrderedDict with metric name as key and metric value as value.

For example:

OrderedDict([('L1', 8.59885654599923), ('L2', 11.012352163260505), ('LInf', 67.0), ('ErrorMean', 0.33833594458645944), …

get_predict_operation_log()

Retrieves the operation log table after the prediction.

Returns

The reference to the OPERATION_LOG table : hana_ml DataFrame

This table provides detailed logs about the last prediction

get_summary()

Retrieves the summary table after model training.

Returns

The reference to the SUMMARY table : hana_ml DataFrame

This contains execution summary of the last model training

load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Parameters

schema_name: str

The schema name

table_name: str

The table name

oid : str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters

schema_name: str

The schema name

artifact_df : hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str, {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )

save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace') or into an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Parameters

schema_name: str

The schema name

table_name: str

Table name

if_exists: str, {'fail', 'replace', 'append'}, default 'fail'

The behavior when the table already exists:

  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns

None

The model is saved into a table with the following columns:

  • “OID” NVARCHAR(50), – Serve as ID

  • “FORMAT” NVARCHAR(50), – APL technical info

  • “LOB” CLOB MEMORY THRESHOLD NULL – binary content of the model

set_params(**parameters)

Sets attributes of the current model. This method is implemented for compatibility with Scikit Learn.

Parameters

params : dictionary

The attribute names and values

hana_ml.algorithms.apl.clustering

This module provides the SAP HANA APL clustering algorithms.

The following classes are available:

class hana_ml.algorithms.apl.clustering.AutoUnsupervisedClustering(conn_context, nb_clusters=None, nb_clusters_min=None, nb_clusters_max=None, distance=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: hana_ml.algorithms.apl.clustering._AutoClusteringBase

SAP HANA APL unsupervised clustering algorithm.

Parameters

nb_clusters : int, optional, default = 10

The number of clusters to create

nb_clusters_min: int, optional

The minimum number of clusters to create. If the nb_clusters parameter is set, it will be ignored.

nb_clusters_max: int, optional

The maximum number of clusters to create. If the nb_clusters parameter is set, it will be ignored.

distance: str, optional, default = ‘SystemDetermined’

The metric used to measure the distance between data points. The possible values are: ‘L1’, ‘L2’, ‘LInf’, ‘SystemDetermined’.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value types (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means that any time the variable value equals '???', it will be taken as missing.

extra_applyout_settings: dict, optional

Defines the output to generate when applying the model. See documentation on predict() method for more information.

other_params: dict, optional

Corresponds to the advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • calculate_cross_statistics

  • calculate_sql_expressions

  • cutting_strategy

  • encoding_strategy

See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.
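The distance parameter above accepts 'L1', 'L2' and 'LInf'. Assuming these names carry their usual mathematical meaning (city-block, Euclidean and Chebyshev distance), the sketch below shows how the three metrics differ for one pair of points. It is plain Python on made-up coordinates and does not reproduce how SAP HANA APL encodes variables before measuring distances.

```python
def l1_distance(a, b):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def linf_distance(a, b):
    """Chebyshev distance: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


point = [0.0, 0.0]       # made-up data point
centroid = [3.0, 4.0]    # made-up cluster centroid
distances = {
    'L1': l1_distance(point, centroid),
    'L2': l2_distance(point, centroid),
    'LInf': linf_distance(point, centroid),
}
```

For this pair the three metrics give 7.0, 5.0 and 4.0 respectively, which shows that the choice of metric can change which centroid counts as the closest.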

Notes

  • The algorithm may detect fewer clusters than requested.

This happens when a cluster detected on the estimation dataset is not found on the validation dataset. In that case, the cluster is considered unstable and is removed from the model. Users can get the number of clusters actually found from the “INDICATORS” table. For example:

# The actual number of clusters found
d = model_u.get_indicators().collect()
d[d.KEY=='FinalClusterCount'][['KEY','VALUE']]

  • It is highly recommended to use a dataset with a key provided in the fit() method.

Otherwise, once the model is trained, the predict() method can no longer be used on a dataset with a key, because the model will not expect one.

  • By default, when the variable description is not given, SAP HANA APL guesses it by reading the first 100 rows.

However, this guess is sometimes incorrect. By explicitly providing values for these parameters, the user can override the default guess. For example:

model.set_params(
        variable_storages = {
            'ID': 'integer',
            'sepal length (cm)': 'number'
            })
model.set_params(
        variable_value_types = {
            'sepal length (cm)': 'continuous'
            })
model.set_params(
        variable_missing_strings = {
            'sepal length (cm)': '-1'
            })

Examples

>>> from hana_ml.algorithms.apl.clustering import AutoUnsupervisedClustering
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = AutoUnsupervisedClustering(CONN, nb_clusters=5)
>>> model.fit(data=hana_df, key='id')

Debriefing

>>> model.get_metrics()
OrderedDict([('SimplifiedSilhouette', 0.3448029020802121), ('RSS', 4675.706587754118),...
>>> model.get_metrics_by_cluster()
{'Frequency': {1: 0.23053242076908276,
      2: 0.27434649954646656,
      3: 0.09628652318517908,
      4: 0.29919463456199663,
      5: 0.09963992193727494},
     'IntraInertia': {1: 0.6734978174937322,
      2: 0.7202839995396123,
      3: 0.5516800856975772,
      4: 0.6969632183111357,
      5: 0.5809322138167139},
     'RSS': {1: 5648.626195319932,
      2: 7189.15459940487,
      3: 1932.5353401986129,
      4: 7586.444631316713,
      5: 2105.879275085588},
     'SimplifiedSilhouette': {1: 0.1383827622819234,
      2: 0.14716862328457128,
      3: 0.18753797605134545,
      4: 0.13679980173383793,
      5: 0.15481377834381388},
     'KL': {1: OrderedDict([('relationship', 0.4951910610641741),
                   ('marital-status', 0.2776259711735807),
                   ('hours-per-week', 0.20990189265572687),
                   ('education-num', 0.1996353893520096),
                   ('education', 0.19963538935200956),
                   ...

Predicting which cluster a data point belongs to

>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0   30                  3                        0.640378
1   63                  4                        0.611050
2   66                  3                        0.640378
3  110                  4                        0.611050
4  335                  1                        0.851054

Determining the 2 closest clusters

>>> model.set_params(extra_applyout_settings={'mode':'closest_distances', 'nb_distances': 2})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  ...  CLOSEST_CLUSTER_2  DISTANCE_TO_CLOSEST_CENTROID_2
0   30                  3  ...                  4                        0.730330
1   63                  4  ...                  1                        0.851054
2   66                  3  ...                  4                        0.730330
3  110                  4  ...                  1                        0.851054
4  335                  1  ...                  4                        0.906003

Retrieving the distances to all clusters

>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  DISTANCE_TO_CENTROID_1               ... DISTANCE_TO_CENTROID_5
0   30                  3               ...      1.160697
1   63                  4               ...      1.160697
2   66                  3               ...      1.160697
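With the 'all_distances' output, the closest clusters can also be derived on the client side by sorting the distances of each row. A minimal sketch in plain Python follows, assuming one row has already been fetched into a dictionary. The distance values are made up, and closest_clusters is a hypothetical helper, not a hana_ml function.

```python
def closest_clusters(row, n=1, prefix='DISTANCE_TO_CENTROID_'):
    """Return the n (cluster_id, distance) pairs with the smallest distance."""
    pairs = [
        (int(name[len(prefix):]), value)
        for name, value in row.items()
        if name.startswith(prefix)
    ]
    pairs.sort(key=lambda pair: pair[1])   # smallest distance first
    return pairs[:n]


# Made-up row; the column names follow the output shown above
row = {
    'id': 30,
    'DISTANCE_TO_CENTROID_1': 0.91,
    'DISTANCE_TO_CENTROID_2': 1.20,
    'DISTANCE_TO_CENTROID_3': 0.64,
    'DISTANCE_TO_CENTROID_4': 0.73,
    'DISTANCE_TO_CENTROID_5': 1.16,
}
two_closest = closest_clusters(row, n=2)
```

For the made-up row, the two closest clusters are 3 and 4. When only the nearest clusters are needed, the 'closest_distances' mode shown earlier returns them directly.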

Saving the model

>>> model.save_model(schema_name='MySchema', table_name='MyTable',if_exists='replace')

Reloading the model for further use

>>> model2 = AutoUnsupervisedClustering(conn_context=CONN)
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.head(3).collect()
    id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0   30                  3                        0.640378
1   63                  4                        0.611050
2   66                  3                        0.640378

Attributes

model_

(hana_ml DataFrame) The trained model content

summary_

(APLArtifactTable) The reference to the “SUMMARY” table generated by the model training. This table contains the summary about the model training.

indicators_

(APLArtifactTable) The reference to the “INDICATORS” table generated by the model training. This table contains the various metrics related to the model and its variables.

fit_operation_logs_

(APLArtifactTable) The reference to the “OPERATION_LOG” table generated by the model training

var_desc_

(APLArtifactTable) The reference to the “VARIABLE_DESCRIPTION” table that was built during the model training

applyout_

(hana_ml DataFrame) The predictions generated the last time the model was applied

predict_operation_logs_

(APLArtifactTable) The reference to the “OPERATION_LOG” table generated when a prediction was made

Methods

fit(data[, key, features])

Fits the model.

fit_predict(data[, key, features])

Fits a clustering model and uses it to generate prediction output on the training dataset.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_metrics()

Returns a dictionary containing the metrics about the model.

get_params()

Retrieves attributes of the current object.

get_predict_operation_log()

Retrieves the operation log table after the last prediction.

get_summary()

Retrieves the summary table after model training.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data)

Predicts which cluster each specified row belongs to.

save_artifact(artifact_df, schema_name, …)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, …])

Saves the model into a table.

set_params(**parameters)

Sets attributes of the current model.

fit(data, key=None, features=None)

Fits the model.

Parameters

data : hana_ml DataFrame

The training dataset

key : str, optional

The name of the ID column. This column is not used as a feature in the model; it is output as the row ID when predictions are made with the model. If key is not provided, an internal key is created, but this is not recommended.

features : list of str, optional

The names of the features to be used in the model. If features is not provided, all columns will be taken except the ID column.

Returns

self : object

fit_predict(data, key=None, features=None)

Fits a clustering model and uses it to generate prediction output on the training dataset.

Parameters

data : hana_ml DataFrame

The input dataset

key : str, optional

The name of the ID column.

features : list of str, optional

The names of the feature columns. If features is not provided, all non-ID columns will be taken.

Returns

hana_ml DataFrame

The output is the same as the predict() method.

Notes

See the predict() method for how to get different outputs with the ‘extra_applyout_settings’ parameter.

get_metrics()

Returns a dictionary containing the metrics about the model.

Returns

A dictionary object containing a set of clustering metrics and their values

Examples

>>> model.get_metrics()
{'SimplifiedSilhouette': 0.14668968897882997,
 'RSS': 24462.640041325714,
 'IntraInertia': 3.2233573348587714,
 'KL': {1: OrderedDict([('hours-per-week', 0.2971627592049324),
             ('occupation', 0.11944355994892383),
             ('relationship', 0.06772624975990414),
             ('education-num', 0.06377345492340795),
             ('education', 0.06377345492340793),
             ...}
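As an illustration of how the returned dictionary can be navigated, the sketch below builds a small dictionary shaped like the output above (values copied and truncated from that example) and extracts, for each cluster, the variable with the highest ‘KL’ value. The helper name top_kl_variable is hypothetical and is not part of hana_ml; it assumes that ‘KL’ maps a cluster ID to an ordered mapping of variable names to scores.

```python
from collections import OrderedDict

# Hand-built stand-in for the result of model.get_metrics(); the values are
# copied (and truncated) from the example output above.
metrics = {
    'SimplifiedSilhouette': 0.14668968897882997,
    'RSS': 24462.640041325714,
    'IntraInertia': 3.2233573348587714,
    'KL': {
        1: OrderedDict([
            ('hours-per-week', 0.2971627592049324),
            ('occupation', 0.11944355994892383),
            ('relationship', 0.06772624975990414),
        ]),
    },
}


def top_kl_variable(metrics_dict):
    """Return {cluster_id: (variable, score)} for the highest KL score.

    Hypothetical helper, not part of hana_ml.
    """
    result = {}
    for cluster_id, scores in metrics_dict.get('KL', {}).items():
        # max() over (name, score) pairs, compared on the score only
        result[cluster_id] = max(scores.items(), key=lambda item: item[1])
    return result


top = top_kl_variable(metrics)
print(top)
```
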
get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns

The reference to OPERATION_LOG table : hana_ml DataFrame

This table provides detailed logs of the last model training

get_indicators()

Retrieves the Indicator table after model training.

Returns

The reference to INDICATORS table : hana_ml DataFrame

This table provides the performance metrics of the last model training

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with scikit-learn.

Returns

The attribute-values of the model : dictionary

get_predict_operation_log()

Retrieves the operation log table after the last prediction.

Returns

The reference to the OPERATION_LOG table : hana_ml DataFrame

This table provides detailed logs about the last prediction

get_summary()

Retrieves the summary table after model training.

Returns

The reference to the SUMMARY table : hana_ml DataFrame

This contains execution summary of the last model training

load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Parameters

schema_name: str

The schema name

table_name: str

The table name

oid : str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.

predict(data)

Predicts which cluster each specified row belongs to.

Parameters

data : hana_ml DataFrame

The set of rows for which to generate cluster predictions. This dataset must have the same structure as the one used in the fit() method.

Returns

hana_ml DataFrame

By default, the ID of the closest cluster and the distance to its center are provided. Users can request different outputs by setting the extra_applyout_settings parameter in the model. The extra_applyout_settings parameter is a dictionary with ‘mode’ and ‘nb_distances’ as keys. If mode is set to ‘closest_distances’, cluster IDs and distances to centroids will be provided from the closest to the furthest cluster. The output columns will be:

  • <The key column name>,

  • CLOSEST_CLUSTER_1,

  • DISTANCE_TO_CLOSEST_CENTROID_1,

  • CLOSEST_CLUSTER_2,

  • DISTANCE_TO_CLOSEST_CENTROID_2,

If mode is set to ‘all_distances’, the distances to the centroids of all clusters will be provided in cluster ID order. The output columns will be:

  • ID,

  • DISTANCE_TO_CENTROID_1,

  • DISTANCE_TO_CENTROID_2,

nb_distances limits the output to the closest clusters. It is only valid when mode is ‘closest_distances’ (it will be ignored if mode is ‘all_distances’). It can be set to ‘all’ or a positive integer.
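The column naming described above can be summarized with a short sketch. The function expected_applyout_columns is a hypothetical helper, not part of hana_ml: it only mirrors the naming scheme documented here and does not query SAP HANA. It assumes that the first column carries the key column name in both modes, and that a requested nb_distances larger than the number of clusters is capped. Because the default for nb_distances is not stated here, the sketch requires it explicitly in ‘closest_distances’ mode.

```python
def expected_applyout_columns(key, nb_clusters, mode=None, nb_distances=None):
    """List the output column names described in the predict() documentation.

    Hypothetical helper, not part of hana_ml.
    """
    if mode == 'all_distances':
        # One distance column per cluster, in cluster ID order;
        # nb_distances is ignored in this mode.
        return [key] + ['DISTANCE_TO_CENTROID_%d' % i
                        for i in range(1, nb_clusters + 1)]
    if mode == 'closest_distances':
        if nb_distances is None:
            raise ValueError("nb_distances must be 'all' or a positive integer")
        if nb_distances == 'all':
            count = nb_clusters
        else:
            # Assumption: never more ranks than there are clusters
            count = min(int(nb_distances), nb_clusters)
    else:
        count = 1  # default output: the closest cluster only
    columns = [key]
    for rank in range(1, count + 1):
        columns.append('CLOSEST_CLUSTER_%d' % rank)
        columns.append('DISTANCE_TO_CLOSEST_CENTROID_%d' % rank)
    return columns


print(expected_applyout_columns('id', 5))
print(expected_applyout_columns('id', 5, 'closest_distances', 2))
print(expected_applyout_columns('id', 3, 'all_distances'))
```
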

Examples

Retrieves the IDs of the 3 closest clusters and the distances to their centroids:

>>> extra_applyout_settings = {'mode': 'closest_distances', 'nb_distances': 3}
>>> model.set_params(extra_applyout_settings=extra_applyout_settings)
>>> out = model.predict(hana_df)
>>> out.head(3).collect()
    id  CLOSEST_CLUSTER_1  ...  CLOSEST_CLUSTER_3  DISTANCE_TO_CLOSEST_CENTROID_3
0   30                  3  ...                  4                        0.730330
1   63                  4  ...                  1                        0.851054
2   66                  3  ...                  4                        0.730330

Retrieves the distances to all clusters:

>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'})
>>> out = model.predict(hana_df)
>>> out.head(3).collect()
   id  DISTANCE_TO_CENTROID_1  DISTANCE_TO_CENTROID_2  ... DISTANCE_TO_CENTROID_5
0  30                0.994595                0.877414  ...              0.782949
1  63                0.994595                0.985202  ...              0.782949
2  66                0.994595                0.877414  ...              0.782949
save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters

schema_name: str

The schema name

artifact_df : hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’

The behavior when the table already exists:
  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )
save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists=‘replace’), or an existing table (if_exists=‘append’). In the latter case, the user can provide an identifier value (new_oid). The OID must be unique. By default, this OID is set when the model is created in Python (model.id attribute).

Parameters

schema_name: str

The schema name

table_name: str

The table name

if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’

The behavior when the table already exists:
  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns

None

The model is saved into a table with the following columns:

  • “OID” NVARCHAR(50), – Serves as ID

  • “FORMAT” NVARCHAR(50), – APL technical info

  • “LOB” CLOB MEMORY THRESHOLD NULL – binary content of the model
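The if_exists behaviors listed above can be sketched against a local SQLite database. This is only an illustration of the ‘fail’, ‘replace’ and ‘append’ semantics: hana_ml targets SAP HANA and its internals may differ. The helper save_row, the simplified column types, and the choice of ValueError for the ‘fail’ case are assumptions made for this sketch.

```python
import sqlite3


def save_row(conn, table_name, oid, fmt, lob, if_exists='fail'):
    """Mimic the documented if_exists semantics on a local SQLite table.

    Hypothetical helper for illustration only, not part of hana_ml.
    """
    if if_exists not in ('fail', 'replace', 'append'):
        raise ValueError('unknown if_exists value: %r' % if_exists)
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone() is not None
    if exists and if_exists == 'fail':
        raise ValueError('table %s already exists' % table_name)
    if exists and if_exists == 'replace':
        # Drop the table before inserting new values
        conn.execute('DROP TABLE "%s"' % table_name)
        exists = False
    if not exists:
        conn.execute(
            'CREATE TABLE "%s" ("OID" TEXT, "FORMAT" TEXT, "LOB" TEXT)'
            % table_name
        )
    # For 'append', the row is simply added to the existing table
    conn.execute(
        'INSERT INTO "%s" ("OID", "FORMAT", "LOB") VALUES (?, ?, ?)' % table_name,
        (oid, fmt, lob),
    )


conn = sqlite3.connect(':memory:')
save_row(conn, 'MyTable', 'model-1', 'txt', 'content-1')
save_row(conn, 'MyTable', 'model-2', 'txt', 'content-2', if_exists='append')
oids = [row[0] for row in conn.execute('SELECT "OID" FROM "MyTable" ORDER BY "OID"')]
print(oids)
```

Keeping several models in one table, each with its own OID, is what makes the oid argument of load_model() necessary when reloading.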

set_params(**parameters)

Sets attributes of the current model.

Parameters

params : dictionary

The set of parameters with their new values

class hana_ml.algorithms.apl.clustering.AutoSupervisedClustering(conn_context, label=None, nb_clusters=None, nb_clusters_min=None, nb_clusters_max=None, distance=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: hana_ml.algorithms.apl.clustering._AutoClusteringBase

SAP HANA APL Supervised Clustering algorithm. Clusters are determined with respect to a label variable.

Parameters

label: str, optional

The name of the label column

nb_clusters : int, optional, default = 10

The number of clusters to create

nb_clusters_min: int, optional

The minimum number of clusters to create. If the nb_clusters parameter is set, it will be ignored.

nb_clusters_max: int, optional

The maximum number of clusters to create. If the nb_clusters parameter is set, it will be ignored.

distance: str, optional, default = ‘SystemDetermined’

The metric used to measure the distance between data points. The possible values are: ‘L1’, ‘L2’, ‘LInf’, ‘SystemDetermined’.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {‘VAR1’: ‘string’, ‘VAR2’: ‘number’}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value types (continuous, nominal, ordinal). For example, {‘VAR1’: ‘continuous’, ‘VAR2’: ‘nominal’}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {‘VAR1’: ‘???’} means anytime the variable value equals ‘???’, it will be taken as missing.

extra_applyout_settings: dict optional

Defines the output to generate when applying the model. See documentation on predict() method for more information.

other_params: dict optional

Corresponds to the advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • calculate_cross_statistics

  • calculate_sql_expressions

  • cutting_strategy

  • encoding_strategy

See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.
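For readers unfamiliar with the values accepted by the distance parameter, the sketch below shows the usual mathematical definitions of the L1 (Manhattan), L2 (Euclidean) and LInf (Chebyshev) distances on plain numeric tuples. It assumes that ‘L1’, ‘L2’ and ‘LInf’ refer to these standard norms; how SAP HANA APL encodes variables before measuring distances is not covered here, so these functions are not expected to reproduce the distances reported by a trained model.

```python
import math


def l1_distance(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def linf_distance(a, b):
    """Chebyshev distance: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


point = (1.0, 2.0)
centroid = (4.0, 6.0)
print(l1_distance(point, centroid))    # 7.0
print(l2_distance(point, centroid))    # 5.0
print(linf_distance(point, centroid))  # 4.0
```
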

Notes

  • The algorithm may detect fewer clusters than requested.

This happens when a cluster detected on the estimation dataset was not found on the validation dataset. In that case, this cluster will be considered unstable and will then be removed from the model. Users can get the number of clusters actually found in the “INDICATORS” table. For example,

# The actual number of clusters found
d = model_u.get_indicators().collect()
d[d.KEY=='FinalClusterCount'][['KEY','VALUE']]
  • It is highly recommended to use a dataset with a key provided in the fit() method.

Otherwise, once the model is trained, it is no longer possible to use the predict() method with a key, because the model will not expect one.

  • By default, when no variable description is given, SAP HANA APL guesses it by reading the first 100 rows. This guess is sometimes incorrect. By explicitly providing values for these parameters, the user can override the default guess. For example:

model.set_params(
        variable_storages = {
            'ID': 'integer',
            'sepal length (cm)': 'number'
            })
model.set_params(
        variable_value_types = {
            'sepal length (cm)': 'continuous'
            })
model.set_params(
        variable_missing_strings = {
            'sepal length (cm)': '-1'
            })
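To see why a guess based on the first rows can be wrong, consider the following sketch. It is illustrative only and is not SAP HANA APL's actual guessing logic: the hypothetical function guess_storage inspects only a sample of the values, so a value that appears after the sampled rows can contradict the guessed type. This is the kind of situation in which an explicit variable_storages entry is useful.

```python
def guess_storage(values, sample_size=100):
    """Guess 'integer', 'number' or 'string' from the first rows only.

    Illustrative only: this is not APL's actual guessing logic.
    """
    sample = values[:sample_size]

    def all_parse(cast):
        # True when every sampled value can be converted with cast()
        for value in sample:
            try:
                cast(value)
            except (TypeError, ValueError):
                return False
        return True

    if all_parse(int):
        return 'integer'
    if all_parse(float):
        return 'number'
    return 'string'


# 100 integer-looking values followed by a decimal value outside the sample
column = [str(i) for i in range(100)] + ['3.5']
print(guess_storage(column))                   # integer (the sample misses '3.5')
print(guess_storage(column, sample_size=101))  # number
```
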

Examples

>>> from hana_ml.algorithms.apl.clustering import AutoSupervisedClustering
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates Hana DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = AutoSupervisedClustering(CONN, nb_clusters=5)
>>> model.fit(data=hana_df, key='id', label='class')

Debriefing

>>> model.get_metrics()
OrderedDict([('SimplifiedSilhouette', 0.3448029020802121), ('RSS', 4675.706587754118),...
>>> model.get_metrics_by_cluster()
{'Frequency': {1: 0.15139770759462357,
  2: 0.39707539649817214,
  3: 0.21549710013468568,
  4: 0.12949066820593166,
  5: 0.10653912756658696},
 'IntraInertia': {1: 0.1604412809425719,
  2: 0.10561882166246073,
  3: 0.12004212490063185,
  4: 0.21030892961293207,
  5: 0.08625667904000194},
 'RSS': {1: 883.710575431686,
  2: 1525.7694977359076,
  3: 941.1302592209537,
  4: 990.765367406523,
  5: 334.3308879590475},
 'SimplifiedSilhouette': {1: 0.3355726073943343,
  2: 0.4231738907945281,
  3: 0.2448648428415369,
  4: 0.38136325589137554,
  5: 0.22353657540054947},
 'TargetMean': {1: 0.1744734931009441,
  2: 0.022912917070469333,
  3: 0.3895408163265306,
  4: 0.7537677775419231,
  5: 0.21207430340557276},
 'TargetStandardDeviation': {1: 0.37951613049526484,
  2: 0.14962591788119842,
  3: 0.48764615116105525,
  4: 0.4308154072006165,
  5: 0.40877719266198526},
 'KL': {1: OrderedDict([('relationship', 0.6840012706191696),
               ('education', 0.675109873839992),
               ('education-num', 0.6751098738399919),
               ('marital-status', 0.5806503390741476),
               ('occupation', 0.46891689485806354),
               ('sex', 0.08802303491483551),
               ('capital-gain', 0.08794254258565125),
               ...

Predicting which cluster a data point belongs to

>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0   30                  3                        0.640378
1   63                  4                        0.611050
2   66                  3                        0.640378
3  110                  4                        0.611050
4  335                  1                        0.851054

Determining the 2 closest clusters

>>> model.set_params(extra_applyout_settings={'mode':'closest_distances', 'nb_distances': 2})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  ...  CLOSEST_CLUSTER_2  DISTANCE_TO_CLOSEST_CENTROID_2
0   30                  3  ...                  4                        0.730330
1   63                  4  ...                  1                        0.851054
2   66                  3  ...                  4                        0.730330
3  110                  4  ...                  1                        0.851054
4  335                  1  ...                  4                        0.906003

Retrieving the distances to all clusters

>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect() # returns the output as a pandas DataFrame
    id  DISTANCE_TO_CENTROID_1               ... DISTANCE_TO_CENTROID_5
0   30                0.851054               ...      1.160697
1   63                0.751054               ...      1.160697
2   66                0.906003               ...      1.160697

Saving the model

>>> model.save_model(schema_name='MySchema', table_name='MyTable', if_exists='replace')

Reloading the model for further use. Note that the label has to be specified again prior to calling predict().

>>> model2 = AutoSupervisedClustering(conn_context=CONN)
>>> model2.set_params(label='class')
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.head(3).collect()
    id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0   30                  3                        0.640378
1   63                  4                        0.611050
2   66                  3                        0.640378

Attributes

model_

(hana_ml DataFrame) The trained model content

summary_

(APLArtifactTable) The reference to the “SUMMARY” table generated by the model training. This table contains the summary about the model training.

indicators_

(APLArtifactTable) The reference to the “INDICATORS” table generated by the model training. This table contains the various metrics related to the model and its variables.

fit_operation_logs_

(APLArtifactTable) The reference to the “OPERATION_LOG” table generated by the model training

var_desc_

(APLArtifactTable) The reference to the “VARIABLE_DESCRIPTION” table that was built during the model training

applyout_

(hana_ml DataFrame) The predictions generated the last time the model was applied

predict_operation_logs_

(APLArtifactTable) The reference to the “OPERATION_LOG” table generated when a prediction was made

Methods

fit(data[, key, label, features])

Fits the model.

fit_predict(data[, key, label, features])

Fits a clustering model and uses it to generate prediction output on the training dataset.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_metrics()

Returns a dictionary containing the metrics about the model.

get_params()

Retrieves attributes of the current object.

get_predict_operation_log()

Retrieves the operation log table after the last prediction.

get_summary()

Retrieves the summary table after model training.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data)

Predicts which cluster each specified row belongs to.

save_artifact(artifact_df, schema_name, …)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, …])

Saves the model into a table.

set_params(**parameters)

Sets attributes of the current model

set_params(**parameters)

Sets attributes of the current model

Parameters

params : dictionary

The attribute names with their new values

fit(data, key=None, label=None, features=None)

Fits the model.

Parameters

data : hana_ml DataFrame

The training dataset

key : str, optional

The name of the ID column. This column is not used as a feature in the model; it is output as the row ID when predictions are made with the model. If key is not provided, an internal key is created, but this is not recommended.

label : str, optional

The name of the label column. If it is not given, the model ‘label’ attribute will be used. If the latter is not defined, an error will be raised.

features : list of str, optional

The names of the features to be used in the model. If features is not provided, all columns will be taken except the ID and the label columns.

Returns

self : object

predict(data)

Predicts which cluster each specified row belongs to.

Parameters

data : hana_ml DataFrame

The set of rows for which to generate cluster predictions. This dataset must have the same structure as the one used in the fit() method.

Returns

hana_ml DataFrame

By default, the ID of the closest cluster and the distance to its center are provided. Users can request different outputs by setting the extra_applyout_settings parameter in the model. The extra_applyout_settings parameter is a dictionary with ‘mode’ and ‘nb_distances’ as keys. If mode is set to ‘closest_distances’, cluster IDs and distances to centroids will be provided from the closest to the furthest cluster. The output columns will be:

  • <The key column name>,

  • CLOSEST_CLUSTER_1,

  • DISTANCE_TO_CLOSEST_CENTROID_1,

  • CLOSEST_CLUSTER_2,

  • DISTANCE_TO_CLOSEST_CENTROID_2,

If mode is set to ‘all_distances’, the distances to the centroids of all clusters will be provided in cluster ID order. The output columns will be:

  • ID,

  • DISTANCE_TO_CENTROID_1,

  • DISTANCE_TO_CENTROID_2,

nb_distances limits the output to the closest clusters. It is only valid when mode is ‘closest_distances’ (it will be ignored if mode is ‘all_distances’). It can be set to ‘all’ or a positive integer.
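The relationship between the two modes can be illustrated as follows: given the distances from one row to every centroid (the ‘all_distances’ view), ranking them from closest to furthest yields the ‘closest_distances’ view, optionally truncated by nb_distances. The function closest_clusters is a hypothetical helper, not part of hana_ml, and the distances used are made-up values.

```python
def closest_clusters(distances, nb_distances='all'):
    """Rank clusters by distance, closest first.

    distances: {cluster_id: distance_to_centroid} for a single row.
    Returns a dict keyed like the 'closest_distances' output columns.
    Hypothetical helper, not part of hana_ml.
    """
    ranked = sorted(distances.items(), key=lambda item: item[1])
    if nb_distances != 'all':
        ranked = ranked[:int(nb_distances)]
    row = {}
    for rank, (cluster_id, distance) in enumerate(ranked, start=1):
        row['CLOSEST_CLUSTER_%d' % rank] = cluster_id
        row['DISTANCE_TO_CLOSEST_CENTROID_%d' % rank] = distance
    return row


# Made-up distances for one row, keyed by cluster ID
row = closest_clusters({1: 0.85, 2: 0.99, 3: 0.64, 4: 0.73}, nb_distances=2)
print(row)
```
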

Examples

Retrieves the IDs of the 3 closest clusters and the distances to their centroids:

>>> extra_applyout_settings = {'mode': 'closest_distances', 'nb_distances': 3}
>>> model.set_params(extra_applyout_settings=extra_applyout_settings)
>>> out = model.predict(hana_df)
>>> out.head(3).collect()
    id  CLOSEST_CLUSTER_1  ...  CLOSEST_CLUSTER_3  DISTANCE_TO_CLOSEST_CENTROID_3
0   30                  3  ...                  4                        0.730330
1   63                  4  ...                  1                        0.851054
2   66                  3  ...                  4                        0.730330

Retrieves the distances to all clusters:

>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'})
>>> out = model.predict(hana_df)
>>> out.head(3).collect()
   id  DISTANCE_TO_CENTROID_1  DISTANCE_TO_CENTROID_2  ... DISTANCE_TO_CENTROID_5
0  30                0.994595                0.877414  ...              0.782949
1  63                0.994595                0.985202  ...              0.782949
2  66                0.994595                0.877414  ...              0.782949
fit_predict(data, key=None, label=None, features=None)

Fits a clustering model and uses it to generate prediction output on the training dataset.

Parameters

data : hana_ml DataFrame

The input dataset

key : str, optional

The name of the ID column

label : str, optional

The name of the label column

features : list of str, optional

The names of the feature columns. If features is not provided, all non-ID and non-label columns will be taken.

Returns

hana_ml DataFrame

The output is the same as the predict() method.

Notes

See the predict() method for how to get different outputs with the ‘extra_applyout_settings’ parameter.

get_metrics()

Returns a dictionary containing the metrics about the model.

Returns

A dictionary object containing a set of clustering metrics and their values

Examples

>>> model.get_metrics()
{'SimplifiedSilhouette': 0.14668968897882997,
 'RSS': 24462.640041325714,
 'IntraInertia': 3.2233573348587714,
 'Frequency': {
    1: 0.3167862345729914,
    2: 0.35590005772243755,
    3: 0.3273137077045711},
 'IntraInertia': {1: 0.7450335510518645,
     2: 0.708350629565789,
     3: 0.7006679558645009},
 'RSS': {1: 8586.511675872738,
     2: 9171.723951617836,
     3: 8343.554018434477},
 'SimplifiedSilhouette': {1: 0.13324659043317924,
     2: 0.14182734764281074,
     3: 0.1311620470933516},
 'TargetMean': {1: 0.1744734931009441,
      2: 0.022912917070469333,
      3: 0.3895408163265306},
 'TargetStandardDeviation': {1: 0.37951613049526484,
      2: 0.14962591788119842,
      3: 0.48764615116105525},
 'KL': {1: OrderedDict([('hours-per-week', 0.2971627592049324),
             ('occupation', 0.11944355994892383),
             ('relationship', 0.06772624975990414),
             ('education-num', 0.06377345492340795),
             ('education', 0.06377345492340793),
             ...
load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Parameters

schema_name: str

The schema name

table_name: str

The table name

oid : str, optional

If the table contains several models, the OID must be given as an identifier. If it is not provided, the whole table is read.

Notes

Prior to using a reloaded model for a new prediction, it is necessary to re-specify the ‘label’ parameter. Otherwise, the predict() method will fail.

get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns

The reference to OPERATION_LOG table : hana_ml DataFrame

This table provides detailed logs of the last model training

get_indicators()

Retrieves the Indicator table after model training.

Returns

The reference to INDICATORS table : hana_ml DataFrame

This table provides the performance metrics of the last model training

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with scikit-learn.

Returns

The attribute-values of the model : dictionary

get_predict_operation_log()

Retrieves the operation log table after the last prediction.

Returns

The reference to the OPERATION_LOG table : hana_ml DataFrame

This table provides detailed logs about the last prediction

get_summary()

Retrieves the summary table after model training.

Returns

The reference to the SUMMARY table : hana_ml DataFrame

This contains execution summary of the last model training

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters

schema_name: str

The schema name

artifact_df : hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’

The behavior when the table already exists:
  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )
save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists=‘replace’), or an existing table (if_exists=‘append’). In the latter case, the user can provide an identifier value (new_oid). The OID must be unique. By default, this OID is set when the model is created in Python (model.id attribute).

Parameters

schema_name: str

The schema name

table_name: str

The table name

if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’

The behavior when the table already exists:
  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns

None

The model is saved into a table with the following columns:

  • “OID” NVARCHAR(50), – Serves as ID

  • “FORMAT” NVARCHAR(50), – APL technical info

  • “LOB” CLOB MEMORY THRESHOLD NULL – binary content of the model

hana_ml.algorithms.apl.time_series

This module contains the SAP HANA APL Time Series algorithm.

The following class is available:

class hana_ml.algorithms.apl.time_series.AutoTimeSeries(conn_context, time_column_name=None, target=None, horizon=1, with_extra_predictable=True, last_training_time_point=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, train_data_=None, **other_params)

Bases: hana_ml.algorithms.apl.apl_base.APLBase

SAP HANA APL Time Series algorithm.

Parameters

target: str

The name of the column containing the time series data points.

time_column_name: str

The name of the column containing the time series time points. The time column is used as table key. It can be overridden by setting the ‘key’ parameter through the fit() method.

last_training_time_point: str, optional

The last time point used for model training. The training dataset will contain all data points up to this date. By default, this parameter is set to the last time point for which the target is not null.

horizon: int, optional

The number of forecasts to be generated by the model upon apply. The time series model will be trained to optimize accuracy on the requested horizon only. The default value is 1.

with_extra_predictable: bool, optional

If set to True, all input variables will be used by the model to generate forecasts. If set to False, only the time and target columns will be used and all other variables will be ignored. The default is True.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {‘VAR1’: ‘string’, ‘VAR2’: ‘number’}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value types (continuous, nominal, ordinal). For example, {‘VAR1’: ‘continuous’, ‘VAR2’: ‘nominal’}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {‘VAR1’: ‘???’} means anytime the variable value equals ‘???’, it will be taken as missing.

extra_applyout_settings: dict, optional

Specifies the prediction outputs. See documentation on predict() method for more details.

other_params: dict, optional

Corresponds to the advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:

  • force_negative_forecast

  • force_positive_forecast

  • forecast_fallback_method

  • forecast_max_cyclics

  • forecast_max_lags

  • forecast_method

  • smoothing_cycle_length

See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.

Notes

The input dataset, given as a hana_ml DataFrame, must not be a temporary table because the API tries to create a view sorted by the time column. SAP HANA does not allow users to create a view on a temporary table.

When calling the fit_predict() method, the time series model is generated on the fly and not returned. If a model must be saved, please consider using the fit() method instead.

When extra-predictable variables are involved, it is usual to have a single dataset used both for the model training and the forecasting. In this case, the dataset should contain two successive periods:

The first one is used for the model training, ranging from the beginning to the last date where the target value is not null.

The second one is used for the forecasting, starting from the first date where the target value is null.
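The split between the two periods can be pictured with the following plain-Python sketch. It assumes the rows are already sorted by the time column; split_periods and the sample rows are hypothetical and only illustrate the rule stated above.

```python
def split_periods(rows, target):
    """Split time-sorted rows at the first row whose target is null (None).

    Returns (training_rows, forecast_rows).
    """
    for index, row in enumerate(rows):
        if row[target] is None:
            return rows[:index], rows[index:]
    return rows, []


# Hypothetical rows, sorted by date.
rows = [
    {'Date': '2001-12-27', 'Cash': 5995.42},
    {'Date': '2001-12-28', 'Cash': 7111.42},
    {'Date': '2002-01-03', 'Cash': None},
    {'Date': '2002-01-04', 'Cash': None},
]
training_rows, forecast_rows = split_periods(rows, 'Cash')
print(len(training_rows), len(forecast_rows))
```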

The content of the output of the get_performance_metrics() method may change depending on the version of SAP HANA APL used with this API. Please refer to the SAP HANA APL documentation to know which metrics will be provided.

Examples

>>> from hana_ml.algorithms.apl.time_series import AutoTimeSeries
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext(HDB_HOST, HDB_PORT, HDB_USER, HDB_PASS)
>>> # -- Creates Hana DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CASHFLOWS_FULL')

Creating and fitting the model

>>> model = AutoTimeSeries(CONN, time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(data=hana_df)

Debriefing

>>> model.get_model_components()
{'Trend': 'Polynom( Date)',
 'Cycles': 'PeriodicExtrasPred_MondayMonthInd',
 'Fluctuations': 'AR(46)'}
>>> model.get_performance_metrics()
{'MAPE': [0.12853715702893018, 0.12789963348617622, 0.12969031859857874], ...}

Generating forecasts using the forecast() method. This method is used to generate forecasts using a signature similar to the one used in PAL. There are two variants of usage, as described below:

1) If the model does not use extra-predictable variables (no exogenous variable), users must simply specify the number of forecasts.

>>> train_df = DataFrame(CONN,
...                     'SELECT "Date" , "Cash" '
...                     'from APL_SAMPLES.CASHFLOWS_FULL ORDER BY 1 LIMIT 100')
>>> model = AutoTimeSeries(CONN, time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(forecast_length=3)
>>> out.collect().tail(5)
           Date                            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
98   2001-05-23  3057.812544999999772699132909775  4593.966530              NaN              NaN
99   2001-05-25  3037.539714999999887176132440567  4307.893346              NaN              NaN
100  2001-05-26                              None  4206.023158     -3609.599872     12021.646187
101  2001-05-27                              None  4575.162651     -3392.283802     12542.609104
102  2001-05-28                              None  4830.352462     -3239.507360     12900.212284

2) If the model uses extra-predictable variables, users must provide the values of all extra-predictable variables for each time point of the forecast period. These values must be provided as a hana_ml dataframe with the same structure as the training dataset.

>>> # Trains the dataset with extra-predictable variables
>>> train_df = DataFrame(CONN,
...                     'SELECT * '
...                     'from APL_SAMPLES.CASHFLOWS_FULL '
...                     'WHERE "Cash" is not null')
>>> # Extra-predictable variables' values on the forecast period
>>> forecast_df = DataFrame(CONN,
...                        'SELECT * '
...                        'from APL_SAMPLES.CASHFLOWS_FULL '
...                        'WHERE "Cash" is null LIMIT 5')
>>> model = AutoTimeSeries(CONN, time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(data=forecast_df)
>>> out.collect().tail(5)
          Date ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
251  2001-12-29   None  6864.371407      -224.079492     13952.822306
252  2001-12-30   None  6889.515324      -211.264912     13990.295559
253  2001-12-31   None  6914.766513      -187.180923     14016.713949
254  2002-01-01   None  6940.124974              NaN              NaN
255  2002-01-02   None  6965.590706              NaN              NaN

Generating forecasts with the predict() method. The predict() method allows users to apply a fitted model to a dataset different from the training dataset. For example, users can train a model on the first quarter (January to March) and apply it to a dataset covering a different period (March to May).

>>> # Trains the model on the first quarter, from January to March
>>> train_df = DataFrame(CONN,
...                     'SELECT "Date" , "Cash" '
...                     'from APL_SAMPLES.CASHFLOWS_FULL '
...                     "where \"Date\" between '2001-01-01' and '2001-03-31'"
...                     " ORDER BY 1")
>>> model.fit(train_df)
>>> # Forecasts on a shifted period, from March to May
>>> test_df = DataFrame(CONN,
...                    'SELECT "Date", "Cash" '
...                    'from APL_SAMPLES.CASHFLOWS_FULL '
...                    "where \"Date\" between '2001-03-01' and '2001-05-31'"
...                    " ORDER BY 1")
>>> out = model.predict(test_df)
>>> out.collect().tail(5)
          Date                            ACTUAL     PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
60  2001-05-30  3837.196734000000105879735597214   4630.223083              NaN              NaN
61  2001-05-31  2911.884261000000151398126928726   4635.265982              NaN              NaN
62  2001-06-01                              None   4538.516542     -1087.461104     10164.494188
63  2001-06-02                              None   4848.815364     -5090.167255     14787.797983
64  2001-06-03                              None   4853.858263     -5138.553275     14846.269801

Using the fit_predict() method. This method enables the user to fit a model and generate forecasts in a single call, and thus get results faster. However, the model is created on the fly and deleted after use, so the user will not be able to save the resulting model.

>>> out = model.fit_predict(hana_df)
>>> out.collect().tail(5)
           Date            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999  6055.761105              NaN              NaN
250  2001-12-28  7111.41669699999  6314.336098              NaN              NaN
251  2002-01-03              None  7033.880804      4529.462710      9538.298899
252  2002-01-04              None  6464.557223      3965.343397      8963.771049
253  2002-01-07              None  6469.141663      3961.414900      8976.868427

Breaking down the time series into trend, cycles, fluctuations and residuals components. If the parameter extra_applyout_settings is set to {‘ExtraMode’: True}, anytime a forecast method is called, predict(), forecast() or fit_predict(), the output will contain time series components and their corresponding residuals. The prediction columns are suffixed by the horizon number. For instance, ‘Cycles_RESIDUALS_3’ means the residual of the cycle component in the third horizon.

>>> model.fit(train_df)
>>> model.set_params(extra_applyout_settings={'ExtraMode': True})
>>> out = model.predict(hana_df)
>>> out.collect().tail(5)
               Date              ACTUAL        ...  Cycles_RESIDUALS_3  Fluctuations_RESIDUALS_3
249  2001-12-27  5995.42329499392507553        ...               32.51                  4.48e-13
250  2001-12-28  7111.41669699455205917        ...             -644.77                  1.14e-13
251  2002-01-03                    None        ...                 NaN                       NaN
252  2002-01-04                    None        ...                 NaN                       NaN
253  2002-01-07                    None        ...                 NaN                       NaN
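Since the component columns carry the horizon number as a suffix, a small helper can recover it from a column name. The function below is a hypothetical sketch based only on the naming pattern described above (for instance ‘Cycles_RESIDUALS_3’); it is not part of hana_ml.

```python
def split_horizon_suffix(column_name):
    """Return (base_name, horizon) when the name ends with '_<number>',
    otherwise (column_name, None)."""
    base, _, suffix = column_name.rpartition('_')
    if base and suffix.isdigit():
        return base, int(suffix)
    return column_name, None


for name in ['Cycles_RESIDUALS_3', 'Fluctuations_RESIDUALS_3', 'Date', 'ACTUAL']:
    print(name, split_horizon_suffix(name))
```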

Saving the model

>>> model.save_model(schema_name='MySchema', table_name='MyTable', if_exists='replace')

Reloading the model

>>> model2 = AutoTimeSeries(conn_context=CONN)
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')

Predicting with the reloaded model

>>> # It is required to specify some attributes again
>>> model2.set_params(time_column_name='Date', target='Cash')
>>> hana_df = DataFrame(CONN,
...                     'SELECT "Date" , "Cash" '
...                     'from APL_SAMPLES.CASHFLOWS_FULL '
...                     'ORDER BY 1')
>>> out = model2.predict(hana_df, apply_horizon=3)
>>> out.collect().tail(5)
           Date            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999  6055.761105              NaN              NaN
250  2001-12-28  7111.41669699999  6314.336098              NaN              NaN
251  2002-01-03              None  7033.880804      4529.462710      9538.298899
252  2002-01-04              None  6464.557223      3965.343397      8963.771049
253  2002-01-07              None  6469.141663      3961.414900      8976.868427

Before calling the forecast() method on a reloaded model, users must set the training dataset again (train_data_ parameter).

>>> model2.set_params(train_data_=hana_df,
...                   time_column_name='Date',
...                   target='Cash')
>>> out = model2.forecast(forecast_length=3)
>>> out.collect().tail(5)
           Date            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999  6055.761105              NaN              NaN
250  2001-12-28  7111.41669699999  6314.336098              NaN              NaN
251  2002-01-03              None  7033.880804      4529.462710      9538.298899
252  2002-01-04              None  6464.557223      3965.343397      8963.771049
253  2002-01-07              None  6469.141663      3961.414900      8976.868427

Attributes

model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the “SUMMARY” table generated by the model training. This table contains the summary about the model training.

indicators_: APLArtifactTable

The reference to the “INDICATORS” table generated by the model training. This table contains the various metrics related to the model and its variables.

fit_operation_logs_: APLArtifactTable

The reference to the “OPERATION_LOG” table generated by the model training

var_desc_: APLArtifactTable

The reference to the “VARIABLE_DESCRIPTION” table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the “OPERATION_LOG” table that is produced when making predictions.

train_data_: hana_ml DataFrame

The train dataset

Methods

fit(data[, key, features])

Fits the model.

fit_predict(data[, key, features, horizon])

Fits a model and generates forecasts in a single call to the FORECAST APL function.

forecast([forecast_length, data])

Uses the fitted model to generate out-of-sample forecasts.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_model_components()

Returns a dictionary containing the description of the model components, that is trend, cycles and fluctuations, used by the model to generate forecasts.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns a dictionary containing the performance metrics of the model.

get_predict_operation_log()

Retrieves the operation log table produced by the last prediction.

get_summary()

Retrieves the summary table after model training.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data[, apply_horizon, …])

Uses the fitted model to generate forecasts.

save_artifact(artifact_df, schema_name, …)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, …])

Saves the model into a table.

set_params(**parameters)

Sets attributes of the current model.

set_params(**parameters)

Sets attributes of the current model.

Parameters

parameters: dict

Contains attribute names and values in the form of keyword arguments

fit(data, key=None, features=None)

Fits the model.

Parameters

data: hana_ml DataFrame

The training dataset

key: str, optional

The column used as row identifier of the dataset. This column corresponds to the time column name. As a result, setting this parameter will overwrite the time_column_name model setting.

features: list of str, optional

The names of the feature columns, meaning the date column and the extra-predictive variables. If features is not provided, it defaults to all columns except the target column.
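The default described above can be sketched in plain Python as follows. The helper default_features is hypothetical; the column names are taken from the CASHFLOWS_FULL examples of this page.

```python
def default_features(columns, target):
    """Return all column names except the target, preserving their order."""
    return [name for name in columns if name != target]


columns = ['Date', 'WorkingDaysIndices', 'BeforeLastWMonth', 'Cash']
features = default_features(columns, target='Cash')
print(features)
```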

Returns

self: object

predict(data, apply_horizon=None, apply_last_time_point=None)

Uses the fitted model to generate forecasts.

Parameters

data: hana_ml DataFrame

The input dataset used for predictions

apply_horizon: int, optional

The number of forecasts to generate. By default, the number of forecasts is the horizon on which the model was trained.

apply_last_time_point: str, optional

The time point corresponding to the start of the forecast period. Forecasts will be generated starting from the next time point after the ‘apply_last_time_point’. By default, this parameter is set to the value of ‘last_training_time_point’ known from the model training.

Returns

hana_ml DataFrame

By default the output contains the following columns:
  • <the name of the time column>

  • ACTUAL: the actual value of time series

  • PREDICTED: the forecast value

  • LOWER_INT_95PCT: the lower limit of 95% confidence interval

  • UPPER_INT_95PCT: the upper limit of 95% confidence interval

If ExtraMode is set to true, the output dataframe will also contain the breakdown of the time series into trend, cycles, fluctuations and residuals components.
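Once the output has been collected, the default columns can be post-processed locally. The sketch below works on plain dictionaries standing in for collected rows and computes the width of the 95% interval for the forecast rows. It assumes, as in the examples of this page, that forecast rows have ACTUAL set to None and that rows without an interval carry NaN bounds; forecast_intervals is a hypothetical helper.

```python
import math


def forecast_intervals(rows):
    """Return (date, predicted, interval_width) for the forecast rows
    that have a confidence interval."""
    intervals = []
    for row in rows:
        lower = row['LOWER_INT_95PCT']
        upper = row['UPPER_INT_95PCT']
        if row['ACTUAL'] is None and not (math.isnan(lower) or math.isnan(upper)):
            intervals.append((row['Date'], row['PREDICTED'], upper - lower))
    return intervals


NAN = float('nan')
rows = [
    {'Date': '2001-12-28', 'ACTUAL': 7111.41669699999, 'PREDICTED': 6314.336098,
     'LOWER_INT_95PCT': NAN, 'UPPER_INT_95PCT': NAN},
    {'Date': '2002-01-03', 'ACTUAL': None, 'PREDICTED': 7033.880804,
     'LOWER_INT_95PCT': 4529.462710, 'UPPER_INT_95PCT': 9538.298899},
]
intervals = forecast_intervals(rows)
print(intervals)
```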

Examples

Default output

>>> out = model.predict(hana_df)
>>> out.collect().tail(5)
           Date            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999  6055.761105              NaN              NaN
250  2001-12-28  7111.41669699999  6314.336098              NaN              NaN
251  2002-01-03              None   7033.88080       4529.46271       9538.29889
252  2002-01-04              None   6464.55722       3965.34339       8963.77104
253  2002-01-07              None   6469.14166       3961.41490       8976.86842

Retrieving forecasts and components (predicted, trend, cycles and fluctuations). The output columns are suffixed with the horizon index. For example, Trend_1 means the trend component of the first horizon.

>>> model.set_params(extra_applyout_settings={'ExtraMode': True})
>>> out = model.predict(hana_df)
>>> out.collect().tail(5)
           Date                               ACTUAL  PREDICTED_1      Trend_1   ...
249  2001-12-27  5995.423294999999598076101392507553  6055.761105  6814.405390   ...
250  2001-12-28  7111.416696999999658146407455205917  6314.336098  6839.334762   ...
251  2002-01-03                                 None  7033.880804  6991.163710   ...
252  2002-01-04                                 None  6464.557223  7016.843985   ...
253  2002-01-07                                 None  6469.141663  7094.528433   ...

fit_predict(data, key=None, features=None, horizon=None)

Fits a model and generates forecasts in a single call to the FORECAST APL function. This method offers a faster way to perform the model training and forecasting.

However, the user will not have access to the model used internally since it is deleted after the computation of the forecasts.

Parameters

data: hana_ml DataFrame

The input time series dataset

key: str, optional

The date column name. By default, it is equal to the model parameter time_column_name. If it is given, the model parameter time_column_name will be overwritten.

features: list of str, optional

The column names corresponding to the extra-predictable variables (exogenous variables). If features is not provided, it is equal to all columns except the target column.

horizon: int, optional

The number of forecasts to generate. The default value equals the horizon parameter of the model.

Returns

hana_ml DataFrame

The output is the same as the predict() method.

forecast(forecast_length=None, data=None)

Uses the fitted model to generate out-of-sample forecasts. The model must already be fitted with a given dataset (training dataset). This method forecasts over a number of steps after the end of the training dataset. When there are extra-predictable variables (exogenous variables), the input parameter data is required. It must contain the values of the extra-predictable variables for the forecast period. If there are no extra-predictable variables, only the forecast_length parameter is needed.

Parameters

forecast_length: int, optional

The number of forecasts to generate from the end of the train dataset. This parameter is by default the horizon specified in the model parameter.

data: hana_ml DataFrame, optional

The time series with extra-predictable variables used for forecasting. This parameter is required if extra-predictive variables are used in the model. When this parameter is given, the parameter ‘forecast_length’ is ignored.
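The precedence between the two parameters can be summarized by the sketch below. It is a hypothetical illustration, not the library's implementation: resolve_forecast_request is an invented name, and taking the number of steps from the number of rows in data is an assumption of this sketch.

```python
def resolve_forecast_request(model_horizon, forecast_length=None, data=None):
    """Decide which input drives the forecast.

    - If data is given, forecast_length is ignored.
    - Otherwise forecast_length is used, defaulting to the model horizon.
    """
    if data is not None:
        # Assumption: one forecast per row of extra-predictable values.
        return {'source': 'data', 'steps': len(data)}
    steps = forecast_length if forecast_length is not None else model_horizon
    return {'source': 'forecast_length', 'steps': steps}


print(resolve_forecast_request(model_horizon=3))
print(resolve_forecast_request(model_horizon=3, forecast_length=5))
print(resolve_forecast_request(model_horizon=3, forecast_length=5,
                               data=[{'Date': '2002-01-03'}, {'Date': '2002-01-04'}]))
```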

Returns

hana_ml DataFrame

The output is the same as the predict() method.

Examples

Case where there is no extra-predictable variable:

>>> train_df = DataFrame(CONN,
...                      'SELECT "Date" , "Cash" '
...                      'from APL_SAMPLES.CASHFLOWS_FULL '
...                      'where "Cash" is not null '
...                      'ORDER BY 1')
>>> print(train_df.collect().tail(5))
            Date         Cash
246  2001-12-20  6382.441052
247  2001-12-21  5652.882539
248  2001-12-26  5081.372996
249  2001-12-27  5995.423295
250  2001-12-28  7111.416697
>>> model = AutoTimeSeries(CONN, time_column_name='Date',
...                        target='Cash',
...                        horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(forecast_length=3)
>>> out.collect().tail(5)
           Date                        ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999901392507553  6814.405390              NaN              NaN
250  2001-12-28  7111.41669699999907455205917  6839.334762              NaN              NaN
251  2001-12-29                          None  6864.371407      -224.079492     13952.822306
252  2001-12-30                          None  6889.515324      -211.264912     13990.295559
253  2001-12-31                          None  6914.766513      -187.180923     14016.713949

Case where there are extra-predictable variables:

>>> train_df = DataFrame(CONN,
...                     'SELECT * '
...                     'from APL_SAMPLES.CASHFLOWS_FULL '
...                     'WHERE "Cash" is not null '
...                     'ORDER BY 1')
>>> print(train_df.collect().tail(5))
           Date  WorkingDaysIndices     ...       BeforeLastWMonth         Cash
246  2001-12-20                  13     ...                      1  6382.441052
247  2001-12-21                  14     ...                      1  5652.882539
248  2001-12-26                  15     ...                      0  5081.372996
249  2001-12-27                  16     ...                      0  5995.423295
250  2001-12-28                  17     ...                      0  7111.416697
>>> # Extra-predictable variables to be provided for the forecast period
>>> forecast_df = DataFrame(CONN,
...                        'SELECT * '
...                        'from APL_SAMPLES.CASHFLOWS_FULL '
...                        'WHERE "Cash" is null '
...                        'ORDER BY 1 '
...                        'LIMIT 3')
>>> print(forecast_df.collect())
         Date  WorkingDaysIndices  ...   BeforeLastWMonth  Cash
0  2002-01-03                   0  ...                  0  None
1  2002-01-04                   1  ...                  0  None
2  2002-01-07                   2  ...                  0  None
>>> model = AutoTimeSeries(CONN,
...                        time_column_name='Date',
...                        target='Cash',
...                        horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(data=forecast_df)
>>> out.collect().tail(5)
           Date                          ACTUAL  PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.4232949999996101392507553    6814.41              NaN              NaN
250  2001-12-28  7111.4166969999996407455205917    6839.33              NaN              NaN
251  2001-12-29                            None    6864.37          -224.08         13952.82
252  2001-12-30                            None    6889.52          -211.26         13990.30
253  2001-12-31                            None    6914.77          -187.18         14016.71

get_model_components()

Returns a dictionary containing the description of the model components, that is trend, cycles and fluctuations, used by the model to generate forecasts.

Returns

A dictionary with 3 possible keys: ‘Trend’, ‘Cycles’, ‘Fluctuations’. For example:

>>> model.get_model_components()
{
    "Trend": "Linear(TIME)",
    "Cycles": None,
    "Fluctuations": "AR(36)"
}

get_performance_metrics()

Returns a dictionary containing the performance metrics of the model. The metrics are provided for each forecast horizon.

Returns

Dictionary

The dictionary contains the performance metrics of the current model. Each metric is associated with a list containing <horizon> elements. This list contains the values of the metric measured for horizon 1 to <horizon>.
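Because each list is ordered from horizon 1 to <horizon>, the value for a given horizon sits at index horizon - 1. The sketch below illustrates this with the MAPE values shown in the class examples; metric_at_horizon is a hypothetical helper, not a hana_ml function.

```python
def metric_at_horizon(metrics, name, horizon):
    """Return the value of a metric for a 1-based forecast horizon."""
    values = metrics[name]
    if not 1 <= horizon <= len(values):
        raise IndexError('horizon must be between 1 and %d' % len(values))
    return values[horizon - 1]


# Shape of the dictionary returned by get_performance_metrics() for horizon=3.
metrics = {'MAPE': [0.12853715702893018, 0.12789963348617622, 0.12969031859857874]}
print(metric_at_horizon(metrics, 'MAPE', 2))
```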

load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Parameters

schema_name: str

The schema name

table_name: str

The table name

oid : str, optional

If the table contains several models, the OID must be given as an identifier. If it is not provided, the whole table is read.

get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns

The reference to OPERATION_LOG table : hana_ml DataFrame

This table provides detailed logs of the last model training

get_indicators()

Retrieves the Indicator table after model training.

Returns

The reference to INDICATORS table : hana_ml DataFrame

This table provides the performance metrics of the last model training

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.

Returns

The attribute-values of the model : dictionary

get_predict_operation_log()

Retrieves the operation log table produced by the last prediction.

Returns

The reference to the OPERATION_LOG table : hana_ml DataFrame

This table provides detailed logs about the last prediction

get_summary()

Retrieves the summary table after model training.

Returns

The reference to the SUMMARY table : hana_ml DataFrame

This contains execution summary of the last model training

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters

artifact_df : hana_ml DataFrame

The artifact created after the fit or predict method is called

schema_name: str

The schema name

table_name: str

The table name

if_exists: str. {‘fail’, ‘replace’, ‘append’}, default ‘fail’

The behavior when the table already exists:
  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str. Optional.

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
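The three if_exists behaviors can be pictured with the toy model below, where a dictionary of lists stands in for the database. This is only an illustration of the semantics described above, not how hana_ml or SAP HANA implements them; save_rows is a hypothetical name.

```python
def save_rows(tables, table_name, rows, if_exists='fail'):
    """Store rows in an in-memory 'table', following the if_exists semantics."""
    if if_exists not in ('fail', 'replace', 'append'):
        raise ValueError('if_exists must be fail, replace or append')
    if table_name not in tables:
        tables[table_name] = list(rows)
    elif if_exists == 'fail':
        raise ValueError('Table %s already exists' % table_name)
    elif if_exists == 'replace':
        # Drop the existing content before inserting the new values.
        tables[table_name] = list(rows)
    else:
        # Append the new values to the existing content.
        tables[table_name].extend(rows)
    return tables[table_name]


tables = {}
save_rows(tables, 'MyModel_Indicators', [('oid1', 'MAPE')])
save_rows(tables, 'MyModel_Indicators', [('oid2', 'MAPE')], if_exists='append')
print(tables['MyModel_Indicators'])
```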

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )

save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists=’replace’), or an existing table (if_exists=’append’). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Parameters

schema_name: str

The schema name

table_name: str

Table name

if_exists: str. {‘fail’, ‘replace’, ‘append’}, default ‘fail’

The behavior when the table already exists:
  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str. Optional.

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns

None

The model is saved into a table with the following columns:

  • “OID” NVARCHAR(50), – Serves as ID

  • “FORMAT” NVARCHAR(50), – APL technical info

  • “LOB” CLOB MEMORY THRESHOLD NULL – binary content of the model

hana_ml.algorithms.apl.gradient_boosting_classification

This module provides the SAP HANA APL gradient boosting classification algorithm.

The following classes are available:

class hana_ml.algorithms.apl.gradient_boosting_classification.GradientBoostingClassifier(conn_context, early_stopping_patience=10, eval_metric='MultiClassLogLoss', learning_rate=0.05, max_depth=4, max_iterations=1000, number_of_jobs=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None)

Bases: hana_ml.algorithms.apl.gradient_boosting_classification._GradientBoostingClassifierBase

SAP HANA APL Gradient Boosting Multiclass Classifier algorithm.

Parameters

conn_context: ConnectionContext

The connection object to an SAP HANA database

early_stopping_patience: int, optional

If the performance does not improve after early_stopping_patience iterations, the model training will stop before reaching max_iterations. The default value is 10.
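The stopping rule can be sketched generically as follows. This is a textbook early-stopping loop over a list of validation metric values where lower is better; it is not the APL implementation, and run_with_early_stopping is a hypothetical name.

```python
def run_with_early_stopping(validation_losses, patience=10, max_iterations=1000):
    """Return (best_iteration, last_iteration) for a sequence of validation losses.

    Training stops when the loss has not improved for `patience` iterations,
    or when `max_iterations` is reached. Iterations are 1-based.
    """
    best_loss = float('inf')
    best_iteration = 0
    last_iteration = 0
    for last_iteration, loss in enumerate(validation_losses[:max_iterations], start=1):
        if loss < best_loss:
            best_loss = loss
            best_iteration = last_iteration
        elif last_iteration - best_iteration >= patience:
            break
    return best_iteration, last_iteration


losses = [0.9, 0.7, 0.6, 0.65, 0.64, 0.63, 0.62]
print(run_with_early_stopping(losses, patience=3))
```

With a patience of 3, the loop stops three iterations after the best one instead of consuming the whole list.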

eval_metric: str, optional

The name of the metric used to evaluate the model performance on validation dataset along the boosting iterations. The possible values are ‘MultiClassClassificationError’ and ‘MultiClassLogLoss’. The default value is ‘MultiClassLogLoss’.

learning_rate: float, optional

The weight parameter controlling the model regularization to avoid overfitting risk. A small value improves the model generalization to unseen dataset at the expense of the computational cost. The default value is 0.05.

max_depth: int, optional

The maximum depth of the decision trees added as a base learner to the model at each boosting iteration. The default value is 4.

max_iterations: int, optional

The maximum number of boosting iterations to fit the model. The default value is 1000.

number_of_jobs: int, optional

The number of threads allocated to the model training and apply parallelization. The default value is 4.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {‘VAR1’: ‘string’, ‘VAR2’: ‘number’}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value type (continuous, nominal, ordinal). For example, {‘VAR1’: ‘continuous’, ‘VAR2’: ‘nominal’}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {‘VAR1’: ‘???’} means anytime the variable value equals ‘???’, it will be taken as missing.

extra_applyout_settings: dict, optional

Determines the output of the predict() method. The possible values are:

  • By default (None value): the default output.
    • <KEY>: the key column if provided in the dataset

    • TRUE_LABEL: the class label if provided in the dataset

    • PREDICTED: the predicted label

    • PROBABILITY: the probability of the prediction (confidence)

  • {‘APL/ApplyExtraMode’: ‘AllProbabilities’}: the probabilities for each class.
    • <KEY>: the key column if provided in the dataset

    • TRUE_LABEL: the class label if provided in the dataset

    • PREDICTED: the predicted label

    • PROBA_<label_value1>: the probability for the class <label_value1>

    • …

    • PROBA_<label_valueN>: the probability for the class <label_valueN>

  • {‘APL/ApplyExtraMode’: ‘Individual Contributions’}: the feature importance for every sample.
    • <KEY>: the key column if provided in the dataset

    • TRUE_LABEL: the class label if provided in the dataset

    • PREDICTED: the predicted label

    • gb_contrib_<VAR1>: the contribution of the variable VAR1 to the score

    • …

    • gb_contrib_<VARN>: the contribution of the variable VARN to the score

    • gb_contrib_constant_bias: the constant bias contribution to the score
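With the ‘AllProbabilities’ output, the most probable class of a row can be recovered locally from the PROBA_<label> columns. The sketch below works on a plain dictionary standing in for one collected row; most_probable_label is a hypothetical helper and the row values are illustrative.

```python
def most_probable_label(row, prefix='PROBA_'):
    """Return (label, probability) for the highest PROBA_<label> value in a row."""
    probabilities = {
        name[len(prefix):]: value
        for name, value in row.items()
        if name.startswith(prefix)
    }
    label = max(probabilities, key=probabilities.get)
    return label, probabilities[label]


# Illustrative row; only the PROBA_ columns are used.
row = {
    'id': 30,
    'TRUE_LABEL': 'United-States',
    'PROBA_United-States': 0.890425,
    'PROBA_Vietnam': 0.002123,
    'PROBA_Yugoslavia': 0.000471,
}
print(most_probable_label(row))
```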

Notes

It is highly recommended to specify a key column in the training dataset. If not, once the model is trained, it won’t be possible anymore to have a key defined in any input dataset. The key is particularly useful to join the predictions output to the input dataset.

If no variable description is provided, SAP HANA APL guesses it by reading the first 100 rows, but the result may be incorrect. The user can override the guessed description by explicitly setting the variable_storages, variable_value_types and variable_missing_strings parameters. For example:

model.set_params(
        variable_storages = {
            'ID': 'integer',
            'sepal length (cm)': 'number'
            })
model.set_params(
        variable_value_types = {
            'sepal length (cm)': 'continuous'
            })
model.set_params(
        variable_missing_strings = {
            'sepal length (cm)': '-1'
            })

Examples

>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingClassifier
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext(HDB_HOST, HDB_PORT, HDB_USER, HDB_PASS)
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN,
...           'SELECT "id", "class", "capital-gain", "native-country" from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = GradientBoostingClassifier(conn_context=CONN)
>>> model.fit(hana_df, label='native-country', key='id')

Debriefing

>>> # Global performance metrics of the model
>>> model.get_performance_metrics()
{'BalancedErrorRate': 0.9761904761904762, 'BalancedClassificationRate': ...}
>>> # Performance metrics of the model for each class
>>> model.get_metrics_per_class()
{'Precision': {'Cambodia': 0.0, 'Canada': 0.0, 'China': 0.0, 'Columbia': 0.0, ...}
>>> model.get_feature_importances()
{'ExactSHAP': OrderedDict([('class', 0.7858160138130188), ('capital-gain', 0.21418397128582)])}

Making predictions

>>> # Default output
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
   id     TRUE_LABEL      PREDICTED  PROBABILITY
0  30  United-States  United-States     0.890425
1  63  United-States  United-States     0.890425
2  66  United-States  United-States     0.890425
...
>>> # All probabilities
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'AllProbabilities'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
   id     TRUE_LABEL        ...        PROBA_Vietnam  PROBA_Yugoslavia
0  30  United-States        ...             0.002123          0.000471
1  63  United-States        ...             0.002123          0.000471
2  66  United-States        ...             0.002123          0.000471
...
>>> # Individual Contributions
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
   id     TRUE_LABEL      PREDICTED  gb_contrib_class  gb_contrib_capital-gain  ...
0  30  United-States  United-States         -0.025366                -0.014416  ...
1  63  United-States  United-States         -0.025366                -0.014416  ...
2  66  United-States  United-States         -0.025366                -0.014416  ...
...

Attributes

label: str

The target column name. This attribute is set when the fit() method is called.

model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the “SUMMARY” table generated by the model training. This table contains the content of the model training summary.

indicators_: APLArtifactTable

The reference to the “INDICATORS” table generated by the model training. This table contains various metrics related to the model and model variables.

fit_operation_logs_: APLArtifactTable

The reference to the “OPERATION_LOG” table generated by the model training

var_desc_: APLArtifactTable

The reference to the “VARIABLE_DESCRIPTION” table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the “OPERATION_LOG” table when a prediction was made

Methods

fit(data[, key, features, label])

Fits the model.

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

get_evalmetrics()

Returns the values of the evaluation metric at each iteration.

get_feature_importances()

Returns the feature importances.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_metrics_per_class()

Returns the performance for each class.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns the performance metrics of the last trained model.

get_predict_operation_log()

Retrieves the operation log table after the last prediction.

get_summary()

Retrieves the summary table after model training.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data)

Makes predictions with the fitted model.

save_artifact(artifact_df, schema_name, …)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, …])

Saves the model into a table.

score(data)

Returns the mean accuracy on the provided test dataset.

set_params(**parameters)

Sets attributes of the current model.

set_params(**parameters)

Sets attributes of the current model.

Parameters

parameters: dict

The names and values of the attributes to change

get_metrics_per_class()

Returns the performance for each class.

Returns

A dictionary.

fit(data, key=None, features=None, label=None)

Fits the model.

Parameters

data : DataFrame

The training dataset

key : str, optional

The name of the ID column. This column is not used as a feature in the model. It is output as the row ID when predictions are made with the model. If key is not provided, an internal key is created, but this is not recommended. See the notes below.

features : list of str, optional

The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.

label : str, optional

The name of the label column. Default is the last column.

Returns

self : object

Notes

It is highly recommended to specify a key column in the training dataset. Otherwise, once the model is trained, a key can no longer be defined in any input dataset, which makes it particularly inconvenient to join the prediction output to the input dataset.

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

Returns

The best iteration: int

get_evalmetrics()

Returns the values of the evaluation metric at each iteration. These values are based on the estimation dataset.

Returns

A dictionary:

{‘<MetricName>’: <List of values>}
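
As an illustration only, the following sketch shows how a dictionary of this shape could be inspected to find the position of the lowest (or highest) metric value. The helper name best_position and the sample values are hypothetical and not part of hana_ml. Because the returned values are based on the estimation dataset, the position found here may differ from the result of get_best_iteration(), which refers to the validation dataset.

```python
def best_position(evalmetrics, metric_name, lower_is_better=True):
    """Return (list position, value) of the best entry in a metric series.

    The position is a 0-based index into the list of values.
    """
    values = evalmetrics[metric_name]
    pick = min if lower_is_better else max
    position = pick(range(len(values)), key=values.__getitem__)
    return position, values[position]


# Hypothetical values shaped like {'<MetricName>': <List of values>}
sample = {'LogLoss': [0.69, 0.52, 0.41, 0.38, 0.39, 0.40]}
print(best_position(sample, 'LogLoss'))
```

For metrics where a higher value is better, such as AUC, pass lower_is_better=False.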

get_feature_importances()

Returns the feature importances.

Returns

feature importances : dict

{ <importance_metric> : OrderedDict({ <feature_name> : <value> }) }
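
The nested structure can be processed with plain Python. The sketch below ranks the features of one importance metric by value. The helper top_features and the sample numbers are hypothetical and only mimic the documented shape of the result.

```python
from collections import OrderedDict


def top_features(importances, metric, n=3):
    """Return the n (feature, value) pairs with the largest importance."""
    ranked = sorted(importances[metric].items(),
                    key=lambda item: item[1],
                    reverse=True)
    return ranked[:n]


# Hypothetical values shaped like the dictionary described above
sample = {'ExactSHAP': OrderedDict([('age', 0.17),
                                    ('relationship', 0.15),
                                    ('education-num', 0.12),
                                    ('fnlwgt', 0.02)])}
for name, value in top_features(sample, 'ExactSHAP', n=2):
    print(f'{name}: {value:.2f}')
```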

get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns

The reference to OPERATION_LOG table : hana_ml DataFrame

This table provides detailed logs of the last model training

get_indicators()

Retrieves the Indicator table after model training.

Returns

The reference to INDICATORS table : hana_ml DataFrame

This table provides the performance metrics of the last model training

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.

Returns

The attribute-values of the model : dictionary

get_performance_metrics()

Returns the performance metrics of the last trained model.

Returns

A dictionary with metric name as key and metric value as value.

get_predict_operation_log()

Retrieves the operation log table after the last prediction.

Returns

The reference to the OPERATION_LOG table : hana_ml DataFrame

This table provides detailed logs about the last prediction

get_summary()

Retrieves the summary table after model training.

Returns

The reference to the SUMMARY table : hana_ml DataFrame

This contains execution summary of the last model training

load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Parameters

schema_name: str

The schema name

table_name: str

The table name

oid : str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.

predict(data)

Makes predictions with the fitted model. It is possible to add special outputs, such as variable individual contributions, through the ‘extra_applyout_settings’ parameter in the model. This parameter is described with examples in the class section.

Parameters

data: hana_ml DataFrame

The input dataset used for prediction

Returns

Prediction output: hana_ml DataFrame

The default output is (if the model ‘extra_applyout_settings’ parameter is unset):

  • ID: the key column

  • TRUE_LABEL: the true label if it is given in the input dataset

  • PREDICTED: the predicted label

  • PROBABILITY: the probability of the predicted label

In multinomial classification, users can request the probabilities of all classes by setting the parameter ‘extra_applyout_settings’ to {‘APL/ApplyExtraMode’: ‘AllProbabilities’}.

The output will be:

  • ID: the key column

  • TRUE_LABEL: the true label if it is given in the input dataset

  • PREDICTED: the predicted label

  • PROBA_<class_1>: the probability of the class <class_1>

  • …

  • PROBA_<class_n>: the probability of the class <class_n>

To get the individual contributions of each variable for each individual sample, the ‘extra_applyout_settings’ parameter must be set to {‘APL/ApplyExtraMode’: ‘Individual Contributions’}.

The output will contain the following columns:

  • ID: the key column

  • TRUE_LABEL: the actual label

  • PREDICTED: the predicted label

  • gb_contrib_<VAR1>: the contribution of the variable <VAR1> to the score

  • …

  • gb_contrib_<VARN>: the contribution of the variable <VARN> to the score

  • gb_contrib_constant_bias: the constant bias contribution to the score

Users can also set APL/ApplyExtraMode to other values, for instance:

‘extra_applyout_settings’ = {‘APL/ApplyExtraMode’: ‘BestProbabilityAndDecision’}.

New SAP HANA APL settings may be provided over time, so please check the SAP HANA APL documentation to see which settings are available. See Function Reference > Predictive Model Services > APPLY_MODEL > OPERATION_CONFIG Parameters in the SAP HANA APL Reference Guide.
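
Once the output of the ‘Individual Contributions’ mode has been collected into Python, each row can be examined to see which variables influenced the score the most. The following sketch assumes that a row is available as a dictionary keyed by column name. The helper top_contributions is hypothetical, and the constant bias value in the sample row is invented for illustration.

```python
def top_contributions(row, n=3, prefix='gb_contrib_',
                      bias_column='gb_contrib_constant_bias'):
    """Rank the per-variable contributions of one row by absolute value.

    The constant bias column is excluded because it is not tied to a variable.
    """
    contributions = {
        column[len(prefix):]: value
        for column, value in row.items()
        if column.startswith(prefix) and column != bias_column
    }
    return sorted(contributions.items(),
                  key=lambda item: abs(item[1]),
                  reverse=True)[:n]


# Sample row: the bias value is hypothetical
row = {
    'id': 30,
    'TRUE_LABEL': 'United-States',
    'PREDICTED': 'United-States',
    'gb_contrib_class': -0.025366,
    'gb_contrib_capital-gain': -0.014416,
    'gb_contrib_constant_bias': 0.5,
}
print(top_contributions(row, n=2))
```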

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters

schema_name: str

The schema name

artifact_df : hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’

The behavior when the table already exists:

  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values into the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )

save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists=’replace’), or an existing table (if_exists=’append’). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Parameters

schema_name: str

The schema name

table_name: str

The table name

if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’

The behavior when the table already exists:

  • fail: Raises an error

  • replace: Drops the table before inserting new values

  • append: Inserts new values into the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns

None

The model is saved into a table with the following columns:

  • “OID” NVARCHAR(50), – serves as the ID

  • “FORMAT” NVARCHAR(50), – APL technical info

  • “LOB” CLOB MEMORY THRESHOLD NULL – binary content of the model
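
To make the three if_exists modes more concrete, the sketch below reproduces their semantics against a local in-memory SQLite database. It is only a stand-in: it does not use hana_ml or SAP HANA, the helper save_rows is hypothetical, the column types are simplified to TEXT, and raising ValueError in the ‘fail’ case is an assumption for this sketch.

```python
import sqlite3


def save_rows(conn, table_name, rows, if_exists='fail'):
    """Store (oid, format, lob) rows, mimicking the three if_exists modes."""
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?",
        (table_name,),
    ).fetchone() is not None
    if exists and if_exists == 'fail':
        raise ValueError(f'Table {table_name} already exists')
    if exists and if_exists == 'replace':
        # Drop the table before inserting new values
        conn.execute(f'DROP TABLE "{table_name}"')
        exists = False
    if not exists:
        # The primary key keeps every OID unique within the table
        conn.execute(
            f'CREATE TABLE "{table_name}" '
            '("OID" TEXT PRIMARY KEY, "FORMAT" TEXT, "LOB" TEXT)'
        )
    conn.executemany(f'INSERT INTO "{table_name}" VALUES (?, ?, ?)', rows)


conn = sqlite3.connect(':memory:')
save_rows(conn, 'MODELS', [('model-1', 'bin', '...')])
# Appending a second model requires a different OID
save_rows(conn, 'MODELS', [('model-2', 'bin', '...')], if_exists='append')
print(conn.execute('SELECT "OID" FROM "MODELS" ORDER BY "OID"').fetchall())
```

Storing several models in one table in this way is the situation in which load_model needs the oid argument to select a single model.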

score(data)

Returns the mean accuracy on the provided test dataset.

Parameters

data: hana_ml DataFrame

The test dataset used to compute the score. The labels must be provided in the dataset.

Returns

mean accuracy: float
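
For reference, mean accuracy is the fraction of rows whose predicted label equals the true label. The sketch below computes it on two plain Python lists. The helper mean_accuracy and the sample labels are hypothetical and only illustrate the definition, not how the library computes the value internally.

```python
def mean_accuracy(true_labels, predicted_labels):
    """Fraction of rows whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError('Both sequences must have the same length')
    if not true_labels:
        raise ValueError('At least one row is required')
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


true_labels = ['Canada', 'China', 'Canada', 'Vietnam']
predicted = ['Canada', 'Canada', 'Canada', 'Vietnam']
print(mean_accuracy(true_labels, predicted))  # 3 of the 4 rows match
```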

class hana_ml.algorithms.apl.gradient_boosting_classification.GradientBoostingBinaryClassifier(conn_context, early_stopping_patience=10, eval_metric='LogLoss', learning_rate=0.05, max_depth=4, max_iterations=1000, number_of_jobs=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None)

Bases: hana_ml.algorithms.apl.gradient_boosting_classification._GradientBoostingClassifierBase

SAP HANA APL Gradient Boosting Binary Classifier algorithm. It is very similar to GradientBoostingClassifier, the multiclass classifier. Its particularity lies in the provided metrics which are specific to binary classification.

Parameters

conn_context: ConnectionContext

The connection object to an SAP HANA database

early_stopping_patience: int, optional

If the performance does not improve after early_stopping_patience iterations, the model training will stop before reaching max_iterations. The default value is 10.

eval_metric: str, optional

The name of the metric used to evaluate the model performance on the validation dataset along the boosting iterations. The possible values are ‘LogLoss’, ‘AUC’ and ‘ClassificationError’. The default value is ‘LogLoss’.

learning_rate: float, optional

The weight parameter controlling the model regularization to avoid the risk of overfitting. A small value improves the model generalization to unseen data at the expense of computational cost. The default value is 0.05.

max_depth: int, optional

The maximum depth of the decision trees added as a base learner to the model at each boosting iteration. The default value is 4.

max_iterations: int, optional

The maximum number of boosting iterations to fit the model. The default value is 1000.

number_of_jobs: int, optional

The number of threads allocated to the model training and apply parallelization. The default value is 4.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {‘VAR1’: ‘string’, ‘VAR2’: ‘number’}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value type (continuous, nominal, ordinal). For example, {‘VAR1’: ‘continuous’, ‘VAR2’: ‘nominal’}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {‘VAR1’: ‘???’} means that whenever the variable value equals ‘???’, it will be taken as missing.

extra_applyout_settings: dict, optional

Determines the output of the predict() method. The possible values are:

  • By default (None value): the default output.

    • <KEY>: the key column if provided in the dataset

    • TRUE_LABEL: the class label if provided in the dataset

    • PREDICTED: the predicted label

    • PROBABILITY: the probability of the prediction (confidence)

  • {‘APL/ApplyExtraMode’: ‘Individual Contributions’}: the individual contributions of each variable to the score. The output is:

    • <KEY>: the key column if provided in the dataset

    • TRUE_LABEL: the class label if provided in the dataset

    • gb_contrib_<VAR1>: the contribution of the variable VAR1 to the score

    • …

    • gb_contrib_<VARN>: the contribution of the variable VARN to the score

    • gb_contrib_constant_bias: the constant bias contribution to the score
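
The interplay between early_stopping_patience and max_iterations can be pictured with a generic early-stopping loop. The sketch below is a simplified illustration of the idea and makes no claim about how SAP HANA APL implements it. The helper stopping_iteration and the loss values are hypothetical, and a lower metric value is assumed to be better.

```python
def stopping_iteration(validation_losses, patience=10, max_iterations=1000):
    """Return (stop_iteration, best_iteration), both counted from 1.

    Training stops once `patience` iterations have passed without improvement,
    or when the series or max_iterations is exhausted.
    """
    best_loss = float('inf')
    best_iter = 0
    limited = validation_losses[:max_iterations]
    for iteration, loss in enumerate(limited, start=1):
        if loss < best_loss:
            best_loss, best_iter = loss, iteration
        elif iteration - best_iter >= patience:
            return iteration, best_iter
    return len(limited), best_iter


# Hypothetical validation losses: the best value is reached at iteration 3
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.46, 0.48]
print(stopping_iteration(losses, patience=3))
```

With a patience of 3, the loop stops at iteration 6 because iterations 4 to 6 bring no improvement over iteration 3.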

Examples

>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext(HDB_HOST, HDB_PORT, HDB_USER, HDB_PASS)
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN, 'SELECT * from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = GradientBoostingBinaryClassifier(conn_context=CONN)
>>> model.fit(hana_df, label='class', key='id')

Debriefing

>>> # Global performance metrics of the model
>>> model.get_performance_metrics()
{'LogLoss': 0.2595688879229439, 'PredictivePower': 0.8551, 'PredictionConfidence': 0.9801, ...}
>>> model.get_feature_importances()
{'ExactSHAP': OrderedDict([('age', 0.17464688420295715), ('relationship', 0.14576564729213715), ('education-num', 0.1502334326505661)...}

Making predictions

>>> # Default output
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().sample(3) # returns the output as a pandas DataFrame
          id  TRUE_LABEL  PREDICTED  PROBABILITY
3979   42472           0          0     0.983028
9435   37212           0          0     0.924352
42594  29297           1          1     0.999609
...
>>> # Individual Contributions
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().sample(3) # returns the output as a pandas DataFrame
      id  TRUE_LABEL  gb_contrib_age  gb_contrib_workclass  gb_contrib_fnlwgt  ...
0  18448           0       -1.098452             -0.001238           0.060850  ...
1  18457           0       -0.731512             -0.000448           0.020060  ...
2  18540           0       -0.024523              0.027065           0.158083  ...
...

Attributes

label: str

The target column name. This attribute is set when the fit() method is called.

model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the “SUMMARY” table generated by the model training. This table contains the content of the model training summary.

indicators_: APLArtifactTable

The reference to the “INDICATORS” table generated by the model training. This table contains various metrics related to the model and model variables.

fit_operation_logs_: APLArtifactTable

The reference to the “OPERATION_LOG” table generated by the model training

var_desc_: APLArtifactTable

The reference to the “VARIABLE_DESCRIPTION” table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the “OPERATION_LOG” table when a prediction was made

Methods

fit(data[, key, features, label])

Fits the model.

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

get_evalmetrics()

Returns the values of the evaluation metric at each iteration.

get_feature_importances()

Returns the feature importances.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns the performance metrics of the last trained model.

get_predict_operation_log()

Retrieves the operation log table after the last prediction.

get_summary()

Retrieves the summary table after model training.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data)

Makes predictions with the fitted model.

save_artifact(artifact_df, schema_name, …)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, …])

Saves the model into a table.

score(data)

Returns the mean accuracy on the provided test dataset.

set_params(**parameters)

Sets attributes of the current model.

set_params(**parameters)

Sets attributes of the current model.

Parameters

parameters: dict

The attribute names and values

get_performance_metrics()

Returns the performance metrics of the last trained model.

Returns

A dictionary with metric name as key and metric value as value.
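
One of the returned metrics is LogLoss. As a reminder of what this metric measures, the sketch below computes the standard binary log loss from 0/1 labels and the predicted probabilities of class 1. The helper binary_log_loss and the inputs are hypothetical. Note that the PROBABILITY column of the default predict() output is the probability of the predicted label, so it would first have to be converted to the probability of class 1.

```python
import math


def binary_log_loss(true_labels, positive_probabilities, eps=1e-15):
    """Average negative log-likelihood of 0/1 labels."""
    total = 0.0
    for label, prob in zip(true_labels, positive_probabilities):
        prob = min(max(prob, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -math.log(prob) if label == 1 else -math.log(1.0 - prob)
    return total / len(true_labels)


# Hypothetical labels and probabilities of class 1
print(round(binary_log_loss([1, 0, 1], [0.9, 0.2, 0.6]), 4))
```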

fit(data, key=None, features=None, label=None)

Fits the model.

Parameters

data : DataFrame

The training dataset

key : str, optional

The name of the ID column. This column is not used as a feature in the model. It is output as the row ID when predictions are made with the model. If key is not provided, an internal key is created, but this is not recommended. See the notes below.

features : list of str, optional

The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.

label : str, optional

The name of the label column. Default is the last column.

Returns

self : object

Notes

It is highly recommended to specify a key column in the training dataset. Otherwise, once the model is trained, a key can no longer be defined in any input dataset, which makes it particularly inconvenient to join the prediction output to the input dataset.

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

Returns

The best iteration: int

get_evalmetrics()

Returns the values of the evaluation metric at each iteration. These values are based on the estimation dataset.

Returns

A dictionary:

{‘<MetricName>’: <List of values>}

get_feature_importances()

Returns the feature importances.

Returns

feature importances : dict

{ <importance_metric> : OrderedDict({ <feature_name> : <value> }) }

get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns

The reference to OPERATION_LOG table : hana_ml DataFrame

This table provides detailed logs of the last model training

get_indicators()

Retrieves the Indicator table after model training.

Returns

The reference to INDICATORS table : hana_ml DataFrame

This table provides the performance metrics of the last model training

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.

Returns

The attribute-values of the model : dictionary

get_predict_operation_log()

Retrieves the operation log table after the last prediction.

Returns

The reference to the OPERATION_LOG table : hana_ml DataFrame

This table provides detailed logs about the last prediction

get_summary()

Retrieves the summary table after model training.

Returns

The reference to the SUMMARY table : hana_ml DataFrame

This contains execution summary of the last model training

load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Parameters

schema_name: str

The schema name

table_name: str

The table name

oid : str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.

predict(data)

Makes predictions with the fitted model. It is possible to add special outputs, such as variable individual contributions, through the ‘extra_applyout_settings’ parameter in the model. This parameter is described with examples in the class section.

Parameters

data: hana_ml DataFrame

The input dataset used for prediction

Returns

Prediction output: hana_ml DataFrame

The default output is (if the model ‘extra_applyout_settings’ parameter is unset):

  • ID: the key column

  • TRUE_LABEL: the true label if it is given in the input dataset

  • PREDICTED: the predicted label

  • PROBABILITY: the probability of the predicted label

In multinomial classification, users can request the probabilities of all classes by setting the parameter ‘extra_applyout_settings’ to {‘APL/ApplyExtraMode’: ‘AllProbabilities’}.

The output will be:

  • ID: the key column

  • TRUE_LABEL: the true label if it is given in the input dataset

  • PREDICTED: the predicted label

  • PROBA_<class_1>: the probability of the class <class_1>

  • …

  • PROBA_<class_n>: the probability of the class <class_n>

To get the individual contributions of each variable for each individual sample, the ‘extra_applyout_settings’ parameter must be set to {‘APL/ApplyExtraMode’: ‘Individual Contributions’}.

The output will contain the following columns:

  • ID: the key column

  • TRUE_LABEL: the actual label

  • PREDICTED: the predicted label

  • gb_contrib_<VAR1>: the contribution of the variable <VAR1> to the score

  • …

  • gb_contrib_<VARN>: the contribution of the variable <VARN> to the score

  • gb_contrib_constant_bias: the constant bias contribution to the score

Users can also set APL/ApplyExtraMode to other values, for instance:

‘extra_applyout_settings’ = {‘APL/ApplyExtraMode’: ‘BestProbabilityAndDecision’}.

New SAP HANA APL settings may be provided over time, so please check the SAP HANA APL documentation to see which settings are available. See Function Reference > Predictive Model Services > APPLY_MODEL > OPERATION_CONFIG Parameters in the SAP HANA APL Reference Guide.

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters

schema_name: str

The schema name

artifact_df : hana_ml DataFrame

The artifact created after fit or predict methods are called

table_name: str

The table name

if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’

The behavior when the table already exists:

  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values into the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )

save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists=’replace’), or an existing table (if_exists=’append’). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Parameters

schema_name: str

The schema name

table_name: str

The table name

if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’

The behavior when the table already exists:

  • fail: Raises an error

  • replace: Drops the table before inserting new values

  • append: Inserts new values into the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns

None

The model is saved into a table with the following columns:

  • “OID” NVARCHAR(50), – serves as the ID

  • “FORMAT” NVARCHAR(50), – APL technical info

  • “LOB” CLOB MEMORY THRESHOLD NULL – binary content of the model

score(data)

Returns the mean accuracy on the provided test dataset.

Parameters

data: hana_ml DataFrame

The test dataset used to compute the score. The labels must be provided in the dataset.

Returns

mean accuracy: float

hana_ml.algorithms.apl.gradient_boosting_regression

This module provides the SAP HANA APL gradient boosting regression algorithm.

The following classes are available:

class hana_ml.algorithms.apl.gradient_boosting_regression.GradientBoostingRegressor(conn_context, early_stopping_patience=10, eval_metric='RMSE', learning_rate=0.05, max_depth=4, max_iterations=1000, number_of_jobs=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)

Bases: hana_ml.algorithms.apl.gradient_boosting_base.GradientBoostingBase

SAP HANA APL Gradient Boosting Regression algorithm.

Parameters

conn_context: ConnectionContext

The connection object to an SAP HANA database

early_stopping_patience: int, optional

If the performance does not improve after early_stopping_patience iterations, the model training will stop before reaching max_iterations. The default value is 10.

eval_metric: str, optional

The name of the metric used to evaluate the model performance on the validation dataset along the boosting iterations. The possible values are ‘MAE’ and ‘RMSE’. The default value is ‘RMSE’.

learning_rate: float, optional

The weight parameter controlling the model regularization to avoid the risk of overfitting. A small value improves the model generalization to unseen data at the expense of computational cost. The default value is 0.05.

max_depth: int, optional

The maximum depth of the decision trees added as a base learner to the model at each boosting iteration. The default value is 4.

max_iterations: int, optional

The maximum number of boosting iterations to fit the model. The default value is 1000.

number_of_jobs: int, optional

The number of threads allocated to the model training and apply parallelization. The default value is 4.

variable_storages: dict, optional

Specifies the variable data types (string, integer, number). For example, {‘VAR1’: ‘string’, ‘VAR2’: ‘number’}. See notes below for more details.

variable_value_types: dict, optional

Specifies the variable value types (continuous, nominal, ordinal). For example, {‘VAR1’: ‘continuous’, ‘VAR2’: ‘nominal’}. See notes below for more details.

variable_missing_strings: dict, optional

Specifies the variable values that will be taken as missing. For example, {‘VAR1’: ‘???’} means that whenever the variable value equals ‘???’, it will be taken as missing.

extra_applyout_settings: dict, optional

Determines the output of the predict() method. The possible values are:

  • By default (None value): the default output.

    • <KEY>: the key column if provided in the dataset

    • TRUE_LABEL: the actual value if provided

    • PREDICTED: the predicted value

  • {‘APL/ApplyExtraMode’: ‘Individual Contributions’}: the feature importance for every sample

    • <KEY>: the key column if provided

    • TRUE_LABEL: the actual value if provided

    • PREDICTED: the predicted value

    • gb_contrib_<VAR1>: the contribution of the VAR1 variable to the score

    • …

    • gb_contrib_<VARN>: the contribution of the VARN variable to the score

    • gb_contrib_constant_bias: the constant bias contribution

Notes

It is highly recommended to specify a key column in the training dataset. Otherwise, once the model is trained, a key can no longer be defined in any input dataset. The key is particularly useful for joining the prediction output to the input dataset.

If the variable description is not provided, SAP HANA APL guesses it by reading the first 100 rows, but the result may be incorrect. The user can override the guessed description by explicitly setting the variable_storages, variable_value_types and variable_missing_strings parameters. For example:

model.set_params(
        variable_storages = {
            'ID': 'integer',
            'sepal length (cm)': 'number'
            })
model.set_params(
        variable_value_types = {
            'sepal length (cm)': 'continuous'
            })
model.set_params(
        variable_missing_strings = {
            'sepal length (cm)': '-1'
            })
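
The following sketch illustrates why a description guessed from only the first rows can be wrong. It uses a deliberately naive guessing routine written for this example. It is not the algorithm used by SAP HANA APL, and the helper guess_storage and the sample column are hypothetical.

```python
def guess_storage(values, sample_size=100):
    """Guess 'integer', 'number' or 'string' from the first sample_size values."""
    def kind(text):
        try:
            int(text)
            return 'integer'
        except ValueError:
            pass
        try:
            float(text)
            return 'number'
        except ValueError:
            return 'string'

    kinds = {kind(value) for value in values[:sample_size]}
    if 'string' in kinds:
        return 'string'
    if 'number' in kinds:
        return 'number'
    return 'integer'


# Hypothetical column: 100 numeric-looking codes followed by an alphanumeric one
column = [str(10000 + i) for i in range(100)] + ['SW1A 1AA']
print(guess_storage(column))                           # first 100 values only
print(guess_storage(column, sample_size=len(column)))  # every value
```

In this toy case the sampled guess is ‘integer’ while the full column requires ‘string’, which is the kind of mismatch that an explicit variable_storages entry avoids.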

Examples

>>> from hana_ml.algorithms.apl.gradient_boosting_regression import GradientBoostingRegressor
>>> from hana_ml.dataframe import ConnectionContext, DataFrame

Connecting to SAP HANA

>>> CONN = ConnectionContext(HDB_HOST, HDB_PORT, HDB_USER, HDB_PASS)
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN,
...                     'SELECT "id", "class", "capital-gain", '
...                     '"native-country", "age" from APL_SAMPLES.CENSUS')

Creating and fitting the model

>>> model = GradientBoostingRegressor(conn_context=CONN)
>>> model.fit(hana_df, label='age', key='id')

Debriefing

>>> # Global performance metrics of the model
>>> model.get_performance_metrics()
{'L1': 10.514114510321283, 'MeanAbsoluteError': 10.514114510321283, 'L2': 13.03664860 ...}
>>> model.get_feature_importances()
{'ExactSHAP': OrderedDict([('class', 0.7681519389152527), ('capital-gain',  ...}

Making predictions

>>> # Default output
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
          id  TRUE_LABEL  PREDICTED
39184  21772          27         25
16537   7331          33         43
7908   35226          65         42
...
>>> # Individual Contributions
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
     id  TRUE_LABEL  gb_contrib_workclass  gb_contrib_fnlwgt  gb_contrib_education  ...
0  6241          21             -1.330736          -0.385088              0.373539  ...
1  6248          18             -0.784536          -2.191791             -1.788672  ...
2  6253          26             -0.773891           0.358133             -0.185864  ...
...

Attributes

label: str

The target column name. This attribute is set when the fit() method is called. Users don’t need to set it explicitly, except if the model is loaded from a table. In this case, this attribute must be set before calling predict().

model_: hana_ml DataFrame

The trained model content

summary_: APLArtifactTable

The reference to the “SUMMARY” table generated by the model training. This table contains the content of the model training summary.

indicators_: APLArtifactTable

The reference to the “INDICATORS” table generated by the model training. This table contains various metrics related to the model and model variables.

fit_operation_logs_: APLArtifactTable

The reference to the “OPERATION_LOG” table generated by the model training

var_desc_: APLArtifactTable

The reference to the “VARIABLE_DESCRIPTION” table that was built during the model training

applyout_: hana_ml DataFrame

The predictions generated the last time the model was applied

predict_operation_logs_: APLArtifactTable

The reference to the “OPERATION_LOG” table when a prediction was made

Methods

fit(data[, key, features, label])

Fits the model.

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

get_evalmetrics()

Returns the values of the evaluation metric at each iteration.

get_feature_importances()

Returns the feature importances.

get_fit_operation_log()

Retrieves the operation log table after the model training.

get_indicators()

Retrieves the Indicator table after model training.

get_params()

Retrieves attributes of the current object.

get_performance_metrics()

Returns the performance metrics of the last trained model.

get_predict_operation_log()

Retrieves the operation log table after a prediction.

get_summary()

Retrieves the summary table after model training.

load_model(schema_name, table_name[, oid])

Loads the model from a table.

predict(data)

Generates predictions with the fitted model.

save_artifact(artifact_df, schema_name, …)

Saves an artifact, a temporary table, into a permanent table.

save_model(schema_name, table_name[, …])

Saves the model into a table.

score(data)

Computes the R^2 (Coefficient of determination) indicator on the predictions of the provided dataset.

set_params(**parameters)

Sets attributes of the current model.

set_params(**parameters)

Sets attributes of the current model.

Parameters

parameters: dict

The attribute names and values

get_performance_metrics()

Returns the performance metrics of the last trained model.

Returns

A dictionary with metric name as key and metric value as value.

predict(data)

Generates predictions with the fitted model. It is possible to add special outputs, such as variable individual contributions, through the ‘extra_applyout_settings’ parameter in the model. This parameter is described with examples in the class section.

Parameters

data: hana_ml DataFrame

The input dataset used for prediction

Returns

Prediction output: hana_ml DataFrame

score(data)

Computes the R^2 (Coefficient of determination) indicator on the predictions of the provided dataset.

Parameters

data: hana_ml DataFrame

The dataset used for prediction. It must contain the actual target values so that the score can be computed.

Returns

R2 indicator: float
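
For reference, the coefficient of determination is commonly defined as 1 - SS_res / SS_tot. The sketch below computes that standard definition in plain Python on made-up values. It assumes score() follows the usual formula, which this page does not spell out, and the function name r2_score_sketch is hypothetical.

```python
def r2_score_sketch(y_true, y_pred):
    """Standard R^2: 1 - (residual sum of squares / total sum of squares)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the predictions against the actual values.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared deviation of the actual values from their own mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the target is constant")
    return 1.0 - ss_res / ss_tot


# Made-up actual and predicted target values, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]
r2 = r2_score_sketch(actual, predicted)
print(round(r2, 3))  # 0.975
```

A value of 1.0 means the predictions match the actual values exactly; values near 0 mean the model does no better than predicting the mean.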

fit(data, key=None, features=None, label=None)

Fits the model.

Parameters

data : DataFrame

The training dataset

key : str, optional

The name of the ID column. This column is not used as a feature in the model. It is output as the row ID when predictions are made with the model. If key is not provided, an internal key is created, but this is not the recommended usage. See notes below.

features : list of str, optional

The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.

label : str, optional

The name of the label column. Default is the last column.

Returns

self : object

Notes

It is highly recommended to specify a key column in the training dataset. If no key is given at training time, it is no longer possible to define a key in any input dataset once the model is trained. This makes it particularly inconvenient to join the prediction output back to the input dataset.
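
One way to enforce the recommendation above is a thin wrapper that refuses to train without an explicit key. This is only a sketch: fit_with_key is a hypothetical helper, not part of hana_ml, and it relies solely on the fit(data, key=None, features=None, label=None) signature documented above.

```python
def fit_with_key(model, train_df, key, label, features=None):
    """Fit `model`, refusing to proceed without an explicit key column.

    `model` is any object exposing the documented
    fit(data, key=None, features=None, label=None) method.
    """
    if not key:
        raise ValueError(
            "Pass the name of the ID column as `key`; without it, "
            "predictions cannot easily be joined back to the input rows."
        )
    # fit() is documented to return the model itself.
    return model.fit(train_df, key=key, features=features, label=label)
```

A call would look like fit_with_key(model, hana_df, key='id', label='<target column>'), where the label name depends on the dataset.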

get_best_iteration()

Returns the iteration that has provided the best performance on the validation dataset during the model training.

Returns

The best iteration: int

get_evalmetrics()

Returns the values of the evaluation metric at each iteration. These values are based on the estimation dataset.

Returns

A dictionary:

{‘<MetricName>’: <List of values>}
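
The returned dictionary can be inspected with ordinary Python. The sketch below finds where each metric reaches its smallest value. It assumes a lower-is-better metric, the values are made up, and the helper name is hypothetical. The reported position is a plain list index, and how it maps to the iteration numbering is not stated here. Because these values are based on the estimation dataset, the result may differ from get_best_iteration(), which is based on the validation dataset.

```python
def lowest_value_per_metric(evalmetrics):
    """Map each metric name to (list position, value) of its smallest value."""
    result = {}
    for name, values in evalmetrics.items():
        if not values:
            continue  # nothing recorded for this metric
        position = min(range(len(values)), key=values.__getitem__)
        result[name] = (position, values[position])
    return result


# Made-up values in the documented {'<MetricName>': <List of values>} shape.
sample_metrics = {"RMSE": [0.92, 0.61, 0.48, 0.45, 0.47]}
print(lowest_value_per_metric(sample_metrics))  # {'RMSE': (3, 0.45)}
```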

get_feature_importances()

Returns the feature importances.

Returns

feature importances : dict

{ <importance_metric> : OrderedDictionary({ <feature_name> : <value> }) }
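
As an illustration of working with this nested structure, the sketch below ranks features for one importance metric. The metric name and the importance values are invented for the example, and top_features is a hypothetical helper.

```python
from collections import OrderedDict


def top_features(importances, metric, n=3):
    """Return the n (feature_name, value) pairs with the highest importance."""
    ranked = sorted(
        importances[metric].items(),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:n]


# Invented data in the documented shape: {metric: OrderedDict({feature: value})}.
sample_importances = {
    "hypothetical_metric": OrderedDict(
        [("workclass", 0.12), ("fnlwgt", 0.05), ("education", 0.31)]
    )
}
print(top_features(sample_importances, "hypothetical_metric", n=2))
# [('education', 0.31), ('workclass', 0.12)]
```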

get_fit_operation_log()

Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.

Returns

The reference to OPERATION_LOG table : hana_ml DataFrame

This table provides detailed logs of the last model training

get_indicators()

Retrieves the Indicator table after model training.

Returns

The reference to INDICATORS table : hana_ml DataFrame

This table provides the performance metrics of the last model training

get_params()

Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.

Returns

The attribute-values of the model : dictionary

get_predict_operation_log()

Retrieves the operation log table after a prediction.

Returns

The reference to the OPERATION_LOG table : hana_ml DataFrame

This table provides detailed logs about the last prediction

get_summary()

Retrieves the summary table after model training.

Returns

The reference to the SUMMARY table : hana_ml DataFrame

This table contains the execution summary of the last model training

load_model(schema_name, table_name, oid=None)

Loads the model from a table.

Parameters

schema_name: str

The schema name

table_name: str

The table name

oid : str, optional

If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.
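
A save-and-reload round trip might look like the sketch below. It uses only the documented save_model(), load_model() and label members; the helper name save_and_reload is hypothetical, and both model arguments are assumed to be instances of the same model class.

```python
def save_and_reload(trained_model, new_model, schema_name, table_name):
    """Persist `trained_model`, then load it into `new_model`."""
    # 'replace' drops an existing table before inserting the model.
    trained_model.save_model(
        schema_name=schema_name,
        table_name=table_name,
        if_exists="replace",
    )
    # No oid is passed, so the whole table is read.
    new_model.load_model(schema_name=schema_name, table_name=table_name)
    # Per the `label` attribute description, a model loaded from a table
    # needs its label set before predict() is called.
    new_model.label = trained_model.label
    return new_model
```

If several models share one table, save with if_exists='append' and pass the matching oid to load_model() instead.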

save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)

Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.

Parameters

artifact_df : hana_ml DataFrame

The artifact created after fit or predict methods are called

schema_name: str

The schema name

table_name: str

The table name

if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’

The behavior when the table already exists:
  • fail: Raises a ValueError

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Examples

>>> myModel.save_artifact(
...             artifact_df=myModel.indicators_,
...             schema_name='MySchema',
...             table_name='MyModel_Indicators',
...             if_exists='replace'
...             )

save_model(schema_name, table_name, if_exists='fail', new_oid=None)

Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).

Parameters

schema_name: str

The schema name

table_name: str

The table name

if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’

The behavior when the table already exists:
  • fail: Raises an Error

  • replace: Drops the table before inserting new values

  • append: Inserts new values to the existing table

new_oid: str, optional

If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.

Returns

None

The model is saved into a table with the following columns:

  • “OID” NVARCHAR(50), – Serves as the ID

  • “FORMAT” NVARCHAR(50), – APL technical info

  • “LOB” CLOB MEMORY THRESHOLD NULL – binary content of the model