hana_ml.algorithms.apl package¶
The Algorithms APL Package consists of the following sections:
hana_ml.algorithms.apl.classification¶
This module provides the SAP HANA APL binary classification algorithm.
The following classes are available:
- class hana_ml.algorithms.apl.classification.AutoClassifier(conn_context, variable_auto_selection=True, polynomial_degree=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)¶
Bases: hana_ml.algorithms.apl.robust_regression_base.RobustRegressionBase
SAP HANA APL Binary Classifier algorithm.
Parameters: - conn_context : ConnectionContext
The connection object to an SAP HANA database
- variable_auto_selection : bool, optional
When set to True, variable auto-selection is activated. Variable auto-selection helps maintain model performance while keeping the number of variables as low as possible.
- polynomial_degree : int, optional
The polynomial degree of the model. Default is 1.
- variable_storages: dict, optional
Specifies the variable data types (string, integer, number). For example, {‘VAR1’: ‘string’, ‘VAR2’: ‘number’}. See notes below for more details.
- variable_value_types: dict, optional
Specifies the variable value type (continuous, nominal, ordinal). For example, {‘VAR1’: ‘continuous’, ‘VAR2’: ‘nominal’}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {‘VAR1’: ‘???’} means anytime the variable value equals to ‘???’, it will be taken as missing.
- extra_applyout_settings : dict, optional
Defines other outputs the model should generate in addition to the predicted values. For example, {'APL/ApplyReasonCode': '3;Mean;Below;False'} will add reason codes in the output when the model is applied. These reason codes provide an explanation of each prediction. See the OPERATION_CONFIG parameters of the APPLY_MODEL function in the SAP HANA APL Reference Guide.
- other_params : dict, optional
Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are listed below; a usage sketch follows the list:
- ‘correlations_lower_bound’
- ‘correlations_max_kept’
- ‘cutting_strategy’
- ‘exclude_low_predictive_confidence’
- ‘risk_fitting’
- ‘risk_fitting_min_cumulated_frequency’
- ‘risk_fitting_nb_pdo’
- ‘risk_fitting_use_weights’
- ‘risk_gdo’
- ‘risk_mode’
- ‘risk_pdo’
- ‘risk_score’
- ‘score_bins_count’
- ‘target_key’
- ‘variable_selection_best_iteration’
- ‘variable_selection_min_nb_of_final_variables’
- ‘variable_selection_max_nb_of_final_variables’
- ‘variable_selection_mode’
- ‘variable_selection_nb_variables_removed_by_step’
- ‘variable_selection_percentage_of_contribution_kept_by_step’
- ‘variable_selection_quality_bar’
- ‘variable_selection_quality_criteria’
See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.
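For illustration, a minimal sketch of passing such advanced settings at construction time (with **other_params they are given as extra keyword arguments). The alias names come from the list above; the values, and the CONN connection object, are purely illustrative assumptions (see the Examples section below for how CONN is created):
>>> # Hypothetical values; check the SAP HANA APL Reference Guide for valid settings.
>>> model = AutoClassifier(conn_context=CONN,
...                        variable_auto_selection=True,
...                        correlations_max_kept=1024,
...                        variable_selection_max_nb_of_final_variables=10)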
Notes
It is highly recommended to use a dataset with a key provided in the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a key, because the model will not expect it.
By default, when they are not given, SAP HANA APL guesses the variable descriptions by reading the first 100 rows. Sometimes this guess is incorrect; by explicitly providing values for these parameters, the user can override it. For example:
model.set_params(variable_storages={
    'ID': 'integer',
    'sepal length (cm)': 'number'
})
model.set_params(variable_value_types={
    'sepal length (cm)': 'continuous'
})
model.set_params(variable_missing_strings={
    'sepal length (cm)': '-1'
})
model.set_params(extra_applyout_settings={
    'APL/ApplyReasonCode': '3;Mean;Below;False'
})
Examples
>>> from hana_ml.algorithms.apl.classification import AutoClassifier
>>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA
>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')
Creating and fitting the model
>>> model = AutoClassifier(conn_context=CONN, variable_auto_selection=True)
>>> model.fit(hana_df, label='class', key='id')
Making the predictions
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect()  # returns the output as a pandas DataFrame
   id  TRUE_LABEL  PREDICTED  PROBABILITY
0  30           0          0     0.688153
1  63           0          0     0.677693
2  66           0          0     0.700221
Debriefing
>>> model.get_performance_metrics()
OrderedDict([('L1', 0.2522171212463023), ('L2', 0.32254434028379236), ...
>>> model.get_feature_importances()
OrderedDict([('marital-status', 0.2172766583204266), ('capital-gain', 0.19521247617062215), ...
Saving the model
>>> model.save_model(schema_name='MySchema', table_name='MyTable', if_exists='replace')
Reloading model and predicting
>>> model2 = AutoClassifier(conn_context=CONN)
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.collect()
   id  TRUE_LABEL  PREDICTED  PROBABILITY
0  30           0          0     0.688153
1  63           0          0     0.677693
2  66           0          0     0.700221
Attributes: - model_ : hana_ml DataFrame
The trained model content
- summary_ : APLArtifactTable
The reference to the “SUMMARY” table generated by the model training. This table contains the content of the model training summary.
- indicators_ : APLArtifactTable
The reference to the “INDICATORS” table generated by the model training. This table contains various metrics related to model and model variables.
- fit_operation_logs_: APLArtifactTable
The reference to the “OPERATION_LOG” table generated by the model training
- var_desc_ : APLArtifactTable
The reference to the “VARIABLE_DESCRIPTION” table that was built during the model training
- applyout_ : hana_ml DataFrame
The predictions generated the last time the model was applied
- predict_operation_logs_: APLArtifactTable
The reference to the “OPERATION_LOG” table when a prediction was made
Methods
fit(data[, key, features, label])
    Fits the model.
get_feature_importances([method])
    Gets the feature importances (MaximumSmartVariableContribution).
get_fit_operation_log()
    Retrieves the operation log table after the model training.
get_indicators()
    Retrieves the Indicator table after model training.
get_params()
    Retrieves attributes of the current object.
get_performance_metrics()
    Gets the model performance metrics of the last model training.
get_predict_operation_log()
    Retrieves the operation log table after a prediction.
get_summary()
    Retrieves the summary table after model training.
load_model(schema_name, table_name[, oid])
    Loads the model from a table.
predict(data)
    Makes predictions with the fitted model.
save_artifact(artifact_df, schema_name, …)
    Saves an artifact, a temporary table, into a permanent table.
save_model(schema_name, table_name[, …])
    Saves the model into a table.
score(data)
    Returns the mean accuracy on the provided test dataset.
set_params(**parameters)
    Sets attributes of the current model.
-
fit(data, key=None, features=None, label=None)¶
Fits the model.
Parameters: - data : DataFrame
The training dataset
- key : str, optional
The name of the ID column. This column will not be used as a feature in the model. It will be output as the row ID when predictions are made with the model. If key is not provided, an internal key is created, but this usage is not recommended. See the notes below.
- features : list of str, optional
The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.
- label : str, optional
The name of the label column. Default is the last column.
Returns: - self : object
Notes
It is highly recommended to use a dataset with a key in the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a dataset that has a key, because the model will not expect it.
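A short usage sketch, assuming hana_df is a hana_ml DataFrame with an 'id' key column and a 'class' label column, as in the class-level example:
>>> model.fit(hana_df, key='id', label='class')
>>> # Optionally restrict training to an explicit feature list (column names are assumptions):
>>> model.fit(hana_df, key='id', label='class',
...           features=['education', 'capital-gain', 'hours-per-week'])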
-
get_feature_importances(method=None)¶
Gets the feature importances (MaximumSmartVariableContribution).
Parameters: - method : str, optional
The method to be used to measure the feature contributions. It is not used for binary classification and regression; it is only used for the gradient boosting algorithm.
Returns: - feature importances : An OrderedDict mapping each feature name to its importance value
-
get_fit_operation_log()¶
Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
Returns: - The reference to OPERATION_LOG table : hana_ml DataFrame
- This table provides detailed logs of the last model training
-
get_indicators()¶
Retrieves the Indicator table after model training.
Returns: - The reference to INDICATORS table : hana_ml DataFrame
- This table provides the performance metrics of the last model training
-
get_params()¶
Retrieves attributes of the current object. This method is implemented for compatibility with scikit-learn.
Returns: - The attribute-values of the model : dictionary
-
get_performance_metrics()¶
Gets the model performance metrics of the last model training.
Returns: - An OrderedDict with metric name as key and metric value as value.
For example:
OrderedDict([('L1', 8.59885654599923), ('L2', 11.012352163260505), ('LInf', 67.0), ('ErrorMean', 0.33833594458645944), ...
-
get_predict_operation_log()¶
Retrieves the operation log table generated when a prediction was made.
Returns: - The reference to the OPERATION_LOG table : hana_ml DataFrame
- This table provides detailed logs about the last prediction
-
get_summary()¶
Retrieves the summary table after model training.
Returns: - The reference to the SUMMARY table : hana_ml DataFrame
This contains the execution summary of the last model training
-
load_model(schema_name, table_name, oid=None)¶
Loads the model from a table.
Parameters: - schema_name: str
The schema name
- table_name: str
The table name
- oid : str, optional
If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.
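A minimal sketch, assuming the table holds several models saved under distinct OIDs; the schema, table, and OID names are hypothetical:
>>> model = AutoClassifier(conn_context=CONN)
>>> model.load_model(schema_name='MySchema', table_name='MyModels', oid='census_model_v1')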
-
predict(data)¶
Makes predictions with the fitted model. It is possible to add special outputs, such as reason codes, by specifying the extra_applyout_settings parameter in the model (see the class description above and the sketch after the column list below).
Parameters: - data : hana_ml DataFrame
The dataset used for prediction
Returns: - Prediction output: hana_ml DataFrame
The dataframe contains the following columns:
- KEY : the key column if it was provided in the dataset
- TRUE_LABEL : the class label when it was given in the dataset
- PREDICTED : the predicted label
- PROBABILITY : the probability that the current row is predicted as positive
- SCORING_VALUE : the unnormalized scoring value
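A short sketch of requesting reason codes at prediction time via extra_applyout_settings; the alias value is the one shown in the class documentation above, and the DataFrame names are assumptions:
>>> model.set_params(extra_applyout_settings={'APL/ApplyReasonCode': '3;Mean;Below;False'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.head(5).collect()  # pandas DataFrame including the requested reason code columns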
-
save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)¶
Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.
Parameters: - schema_name: str
The schema name
- artifact_df : hana_ml DataFrame
The artifact created after fit or predict methods are called
- table_name: str
The table name
- if_exists: str. {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- The behavior when the table already exists:
- fail: Raises a ValueError
- replace: Drops the table before inserting new values
- append: Inserts new values to the existing table
- new_oid : str, optional
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
Examples
>>> myModel.save_artifact(
...     artifact_df=myModel.indicators_,
...     schema_name='MySchema',
...     table_name='MyModel_Indicators',
...     if_exists='replace'
... )
-
save_model(schema_name, table_name, if_exists='fail', new_oid=None)¶
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute). A usage sketch follows the column list below.
Parameters: - schema_name: str
The schema name
- table_name: str
Table name
- if_exists: str. {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- The behavior when the table already exists:
- fail: Raises an Error
- replace: Drops the table before inserting new values
- append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
Returns: - None
The model is saved into a table with the following columns:
- "OID" NVARCHAR(50), -- serves as the ID
- "FORMAT" NVARCHAR(50), -- APL technical info
- "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model
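A sketch of saving successive models into the same table under distinct OIDs; the schema, table, and OID names, as well as the second model object, are assumptions:
>>> model.save_model(schema_name='MySchema', table_name='MyModels',
...                  if_exists='replace', new_oid='census_model_v1')
>>> other_model.save_model(schema_name='MySchema', table_name='MyModels',
...                        if_exists='append', new_oid='census_model_v2')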
-
score(data)¶
Returns the mean accuracy on the provided test dataset.
Parameters: - data : hana_ml DataFrame
The test dataset used to compute the score. The labels must be provided in the dataset.
Returns: - mean accuracy : float
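A minimal sketch, assuming test_df is a hana_ml DataFrame that contains the label column:
>>> accuracy = model.score(test_df)
>>> print('Mean accuracy: {:.3f}'.format(accuracy))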
-
set_params(**parameters)¶
Sets attributes of the current model. This method is implemented for compatibility with scikit-learn.
Parameters: - params : dictionary
The attribute names and values
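A short sketch; the parameter names come from the class documentation above and the values are assumptions:
>>> model.set_params(polynomial_degree=2)
>>> model.set_params(variable_value_types={'capital-gain': 'continuous'})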
hana_ml.algorithms.apl.regression¶
This module provides the SAP HANA APL regression algorithm.
The following classes are available:
- class hana_ml.algorithms.apl.regression.AutoRegressor(conn_context, variable_auto_selection=True, polynomial_degree=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)¶
Bases: hana_ml.algorithms.apl.robust_regression_base.RobustRegressionBase
SAP HANA APL regression algorithm.
Parameters: - conn_context : ConnectionContext
The connection object to an SAP HANA database
- variable_auto_selection : bool, optional
When set to True, variable auto-selection is activated. Variable auto-selection helps maintain model performance while keeping the number of variables as low as possible.
- polynomial_degree : int, optional
The polynomial degree of the model. Default is 1.
- variable_storages : dict, optional
Specifies the variable data types (string, integer, number). For example, {‘VAR1’: ‘string’, ‘VAR2’: ‘number’}. See notes below for more details.
- variable_value_types : dict, optional
Specifies the variable value type (continuous, nominal, ordinal). For example, {‘VAR1’: ‘continuous’, ‘VAR2’: ‘nominal’}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {‘VAR1’: ‘???’} means anytime the variable value equals to ‘???’, it will be taken as missing.
- extra_applyout_settings : dict, optional
Defines other outputs the model should generate in addition to the predicted values. For example: {‘APL/ApplyReasonCode’:‘3;Mean;Below;False’} will add reason codes in the output when the model is applied. These reason codes provide explanation about the prediction. See OPERATION_CONFIG parameters in APPLY_MODEL function, SAP HANA APL Reference Guide.
- other_params : dict, optional
Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:
- ‘correlations_lower_bound’
- ‘correlations_max_kept’
- ‘cutting_strategy’
- ‘exclude_low_predictive_confidence’
- ‘risk_fitting’
- ‘risk_fitting_min_cumulated_frequency’
- ‘risk_fitting_nb_pdo’
- ‘risk_fitting_use_weights’
- ‘risk_gdo’
- ‘risk_mode’
- ‘risk_pdo’
- ‘risk_score’
- ‘score_bins_count’
- ‘variable_auto_selection’
- ‘variable_selection_best_iteration’
- ‘variable_selection_min_nb_of_final_variables’
- ‘variable_selection_max_nb_of_final_variables’
- ‘variable_selection_mode’
- ‘variable_selection_nb_variables_removed_by_step’
- ‘variable_selection_percentage_of_contribution_kept_by_step’
- ‘variable_selection_quality_bar’
- ‘variable_selection_quality_criteria’
See Common APL Aliases for Model Training in SAP HANA APL Reference Guide.
Notes
It is highly recommended to use a dataset with a key provided in the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a key, because the model will not expect it.
Examples
>>> from hana_ml.algorithms.apl.regression import AutoRegressor
>>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA Database
>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')
Creating and fitting the model
>>> model = AutoRegressor(conn_context=CONN, variable_auto_selection=True)
>>> model.fit(hana_df, label='age',
...           features=['workclass', 'fnlwgt', 'education', 'education-num', 'marital-status'],
...           key='id')
Making a prediction
>>> applyout_df = model.predict(hana_df)
>>> print(applyout_df.head(5).collect())
    id  TRUE_LABEL  PREDICTED
0   30          49         42
1   63          48         42
2   66          36         42
3  110          42         42
4  335          53         42
Debriefing
>>> model.get_performance_metrics()
OrderedDict([('L1', 8.59885654599923), ('L2', 11.012352163260505)...
>>> model.get_feature_importances()
OrderedDict([('marital-status', 0.7916100739306074), ('education-num', 0.13524836400650087)
Saving the model
>>> model.save_model(schema_name='MySchema', table_name='MyTable', if_exists='replace')
Reloading the model and making another prediction
>>> model2 = AutoRegressor(conn_context=CONN)
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.head(5).collect()
    id  TRUE_LABEL  PREDICTED
0   30          49         42
1   63          48         42
2   66          36         42
3  110          42         42
4  335          53         42
Methods
fit(data[, key, features, label])
    Fits the model.
get_feature_importances([method])
    Gets the feature importances (MaximumSmartVariableContribution).
get_fit_operation_log()
    Retrieves the operation log table after the model training.
get_indicators()
    Retrieves the Indicator table after model training.
get_params()
    Retrieves attributes of the current object.
get_performance_metrics()
    Gets the model performance metrics of the last model training.
get_predict_operation_log()
    Retrieves the operation log table after a prediction.
get_summary()
    Retrieves the summary table after model training.
load_model(schema_name, table_name[, oid])
    Loads the model from a table.
predict(data)
    Makes predictions with a fitted model.
save_artifact(artifact_df, schema_name, …)
    Saves an artifact, a temporary table, into a permanent table.
save_model(schema_name, table_name[, …])
    Saves the model into a table.
score(data)
    Returns the coefficient of determination R^2 of the prediction.
set_params(**parameters)
    Sets attributes of the current model.
-
fit(data, key=None, features=None, label=None)¶
Fits the model.
Parameters: - data : DataFrame
The training dataset
- key : str, optional
The name of the ID column. If key is not provided, it is assumed that the input has no ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, all non-ID and non-label columns will be taken.
- label : str, optional
The name of the label column. Default is the last column.
Returns: - self : object
Notes
It is highly recommended to use a dataset with a key provided in the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a dataset that has a key, because the model will not expect it.
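A short usage sketch, assuming hana_df has an 'id' key column and a numeric 'age' label, as in the class-level example:
>>> model.fit(hana_df, key='id', label='age')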
-
get_feature_importances(method=None)¶
Gets the feature importances (MaximumSmartVariableContribution).
Parameters: - method : str, optional
The method to be used to measure the feature contributions. It is not used for binary classification and regression; it is only used for the gradient boosting algorithm.
Returns: - feature importances : An OrderedDict mapping each feature name to its importance value
-
get_fit_operation_log()¶
Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
Returns: - The reference to OPERATION_LOG table : hana_ml DataFrame
- This table provides detailed logs of the last model training
-
get_indicators()¶
Retrieves the Indicator table after model training.
Returns: - The reference to INDICATORS table : hana_ml DataFrame
- This table provides the performance metrics of the last model training
-
get_params()¶
Retrieves attributes of the current object. This method is implemented for compatibility with scikit-learn.
Returns: - The attribute-values of the model : dictionary
-
get_performance_metrics()¶
Gets the model performance metrics of the last model training.
Returns: - An OrderedDict with metric name as key and metric value as value.
For example:
OrderedDict([('L1', 8.59885654599923), ('L2', 11.012352163260505), ('LInf', 67.0), ('ErrorMean', 0.33833594458645944), ...
-
get_predict_operation_log()¶
Retrieves the operation log table generated when a prediction was made.
Returns: - The reference to the OPERATION_LOG table : hana_ml DataFrame
- This table provides detailed logs about the last prediction
-
get_summary()¶
Retrieves the summary table after model training.
Returns: - The reference to the SUMMARY table : hana_ml DataFrame
This contains the execution summary of the last model training
-
load_model(schema_name, table_name, oid=None)¶
Loads the model from a table.
Parameters: - schema_name: str
The schema name
- table_name: str
The table name
- oid : str, optional
If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.
-
predict(data)¶
Makes predictions with a fitted model. It is possible to add special outputs, such as reason codes, by specifying the extra_applyout_settings parameter in the model (see the class description above and the sketch after the column list below).
Parameters: - data : hana_ml DataFrame
The dataset used for prediction
Returns: - Prediction output : hana_ml DataFrame
The dataframe contains the following columns:
- KEY : the key column if it was provided in the dataset
- TRUE_LABEL : the true value if it was provided in the dataset
- PREDICTED : the predicted value
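A minimal sketch, assuming the model has been fitted and hana_df is the dataset to apply it to:
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.head(5).collect()  # TRUE_LABEL and PREDICTED columns as a pandas DataFrame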
-
save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)¶
Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.
Parameters: - schema_name: str
The schema name
- artifact_df : hana_ml DataFrame
The artifact created after fit or predict methods are called
- table_name: str
The table name
- if_exists: str. {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- The behavior when the table already exists:
- fail: Raises a ValueError
- replace: Drops the table before inserting new values
- append: Inserts new values to the existing table
- new_oid : str, optional
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
Examples
>>> myModel.save_artifact(
...     artifact_df=myModel.indicators_,
...     schema_name='MySchema',
...     table_name='MyModel_Indicators',
...     if_exists='replace'
... )
-
save_model(schema_name, table_name, if_exists='fail', new_oid=None)¶
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).
Parameters: - schema_name: str
The schema name
- table_name: str
Table name
- if_exists: str. {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- The behavior when the table already exists:
- fail: Raises an Error
- replace: Drops the table before inserting new values
- append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
Returns: - None
The model is saved into a table with the following columns:
- "OID" NVARCHAR(50), -- serves as the ID
- "FORMAT" NVARCHAR(50), -- APL technical info
- "LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model
-
score(data)¶
Returns the coefficient of determination R^2 of the prediction.
Parameters: - data : hana_ml DataFrame
The dataset used for evaluation. It must contain the true values so that the score can be computed.
Returns: - R^2 : float
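A minimal sketch, assuming test_df is a hana_ml DataFrame that contains the true label column:
>>> r2 = model.score(test_df)
>>> print('R^2: {:.3f}'.format(r2))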
-
set_params(**parameters)¶
Sets attributes of the current model. This method is implemented for compatibility with scikit-learn.
Parameters: - params : dictionary
The attribute names and values