hana_ml.algorithms.apl package
The Algorithms APL Package consists of the following sections:
hana_ml.algorithms.apl.classification
This module provides the SAP HANA APL binary classification algorithm.
The following classes are available:
class hana_ml.algorithms.apl.classification.AutoClassifier(conn_context, variable_auto_selection=True, polynomial_degree=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)
Bases: hana_ml.algorithms.apl.robust_regression_base.RobustRegressionBase
SAP HANA APL Binary Classifier algorithm.
- Parameters
- conn_context: ConnectionContext
The connection object to an SAP HANA database
- variable_auto_selection: bool, optional
When set to True, variable auto-selection is activated. Variable auto-selection maintains the performance of a model while keeping the number of variables as low as possible.
- polynomial_degree: int, optional
The polynomial degree of the model. Default is 1.
- variable_storages: dict, optional
Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.
- variable_value_types: dict, optional
Specifies the variable value types (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means that whenever the value of VAR1 equals '???', it is treated as missing.
- extra_applyout_settings: dict, optional
Defines other outputs the model should generate in addition to the predicted values. For example, {'APL/ApplyReasonCode': '3;Mean;Below;False'} adds reason codes to the output when the model is applied. These reason codes explain the prediction. See the OPERATION_CONFIG parameters of the APPLY_MODEL function in the SAP HANA APL Reference Guide.
- other_params: dict, optional
Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:
‘correlations_lower_bound’
‘correlations_max_kept’
‘cutting_strategy’
‘exclude_low_predictive_confidence’
‘risk_fitting’
‘risk_fitting_min_cumulated_frequency’
‘risk_fitting_nb_pdo’
‘risk_fitting_use_weights’
‘risk_gdo’
‘risk_mode’
‘risk_pdo’
‘risk_score’
‘score_bins_count’
‘target_key’
‘variable_selection_best_iteration’
‘variable_selection_min_nb_of_final_variables’
‘variable_selection_max_nb_of_final_variables’
‘variable_selection_mode’
‘variable_selection_nb_variables_removed_by_step’
‘variable_selection_percentage_of_contribution_kept_by_step’
‘variable_selection_quality_bar’
‘variable_selection_quality_criteria’
See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.
Notes
It is highly recommended to provide a key when calling the fit() method. If the model is trained without a key, the predict() method can no longer be used on a dataset with a key, because the model will not expect one.
By default, when the variable description is not given, SAP HANA APL guesses it by reading the first 100 rows. This guess is sometimes wrong. By explicitly providing values for these parameters, the user can override the default guess. For example:
model.set_params(
    variable_storages={
        'ID': 'integer',
        'sepal length (cm)': 'number'
    })
model.set_params(
    variable_value_types={
        'sepal length (cm)': 'continuous'
    })
model.set_params(
    variable_missing_strings={
        'sepal length (cm)': '-1'
    })
model.set_params(
    extra_applyout_settings={
        'APL/ApplyReasonCode': '3;Mean;Below;False'
    })
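Because a misspelled storage or value type would only surface once the dictionary reaches SAP HANA APL, it can help to check the dictionaries locally first. The following is a minimal sketch, not part of hana_ml: the helper name `check_variable_description` is hypothetical, and the allowed values are simply the ones listed in the parameter descriptions above.

```python
# Allowed values, copied from the parameter descriptions above.
ALLOWED_STORAGES = {"string", "integer", "number"}
ALLOWED_VALUE_TYPES = {"continuous", "nominal", "ordinal"}


def check_variable_description(mapping, allowed, parameter_name):
    """Return `mapping` unchanged, or raise ValueError on an unknown value.

    Hypothetical helper for illustration; hana_ml does not provide it.
    """
    for variable, value in mapping.items():
        if value not in allowed:
            raise ValueError(
                f"{parameter_name}[{variable!r}] = {value!r}; "
                f"expected one of {sorted(allowed)}"
            )
    return mapping


variable_storages = check_variable_description(
    {"ID": "integer", "sepal length (cm)": "number"},
    ALLOWED_STORAGES,
    "variable_storages",
)
variable_value_types = check_variable_description(
    {"sepal length (cm)": "continuous"},
    ALLOWED_VALUE_TYPES,
    "variable_value_types",
)
```

The checked dictionaries can then be passed to set_params() exactly as in the example above.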
Examples
>>> from hana_ml.algorithms.apl.classification import AutoClassifier
>>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA
>>> CONN = ConnectionContext(HDB_HOST, HDB_PORT, HDB_USER, HDB_PASS)
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')
Creating and fitting the model
>>> model = AutoClassifier(conn_context=CONN, variable_auto_selection=True)
>>> model.fit(hana_df, label='class', key='id')
Making the predictions
>>> apply_out = model.predict(hana_df)
>>> print(apply_out.head(3).collect())
   id  TRUE_LABEL  PREDICTED  PROBABILITY
0  30           0          0     0.688153
1  63           0          0     0.677693
2  66           0          0     0.700221
Adding individual contributions to the output of predictions
>>> model.set_params(
...     extra_applyout_settings={
...         'APL/ApplyContribution': 'all'
...     })
>>> apply_out = model.predict(hana_df)
>>> print(apply_out.head(3).collect())
   id  TRUE_LABEL  PREDICTED  PROBABILITY  contrib_age_rr_class  ...
0  30           0          0     0.688153              0.043387  ...
1  63           0          0     0.677693              0.042608  ...
2  66           0          0     0.700221              0.020784  ...
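Once the contribution columns are present, a common next step is to rank them for a single row. The sketch below works on a plain dictionary standing in for one collected row. It assumes that contribution columns follow the pattern contrib_<variable>_rr_<target>, which is inferred from the single column name visible in the output above; the second contribution column and its value are hypothetical.

```python
def top_contributions(row, target, n=3):
    """Return the n variables with the largest absolute contribution in one row."""
    # Assumed naming pattern: contrib_<variable>_rr_<target>
    prefix, suffix = "contrib_", f"_rr_{target}"
    contributions = {
        name[len(prefix):-len(suffix)]: value
        for name, value in row.items()
        if name.startswith(prefix) and name.endswith(suffix)
    }
    return sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)[:n]


row = {
    "id": 30,
    "PREDICTED": 0,
    "PROBABILITY": 0.688153,
    "contrib_age_rr_class": 0.043387,         # value shown in the output above
    "contrib_education-num_rr_class": -0.12,  # hypothetical column and value
}
ranked = top_contributions(row, target="class")
print(ranked)
```

Ranking by absolute value keeps strongly negative contributions visible, which matters when explaining why a row was not assigned to the positive class.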
Adding reason codes to the output of predictions
>>> model.set_params(
...     extra_applyout_settings={
...         'APL/ApplyReasonCode': '3;Mean;Below;False'
...     })
>>> apply_out = model.predict(hana_df)
>>> print(apply_out.head(3).collect())
   id  TRUE_LABEL  PREDICTED  PROBABILITY  RCN_B_Mean_1_rr_class  ...
0  30           0          0     0.688153          education-num  ...
1  63           0          0     0.677693          education-num  ...
2  66           0          0     0.700221          education-num  ...
Debriefing
>>> model.get_performance_metrics()
OrderedDict([('L1', 0.2522171212463023), ('L2', 0.32254434028379236), ...
>>> model.get_feature_importances()
OrderedDict([('marital-status', 0.2172766583204266), ('capital-gain', 0.19521247617062215),...
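The OrderedDict returned by get_feature_importances() can be post-processed with ordinary Python. As a small sketch, the hypothetical helper below returns the most important features whose cumulative importance reaches a threshold. Only the two entries visible in the truncated output above are used, so the numbers are illustrative rather than a full model debrief.

```python
from collections import OrderedDict


def features_covering(importances, threshold):
    """Return the top features whose cumulative importance reaches `threshold`."""
    selected, total = [], 0.0
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    for name, value in ranked:
        selected.append(name)
        total += value
        if total >= threshold:
            break
    return selected


# The two entries shown in the truncated output above.
importances = OrderedDict([
    ("marital-status", 0.2172766583204266),
    ("capital-gain", 0.19521247617062215),
])
print(features_covering(importances, 0.4))
```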
Saving the model in the schema named 'MODEL_STORAGE'. Please see the model_storage class for further features of model storage.
>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My classification model name'
>>> model_storage.save_model(model=model, if_exists='replace')
- Attributes
- model_: hana_ml DataFrame
The trained model content
- summary_: APLArtifactTable
The reference to the “SUMMARY” table generated by the model training. This table contains the content of the model training summary.
- indicators_: APLArtifactTable
The reference to the “INDICATORS” table generated by the model training. This table contains various metrics related to model and model variables.
- fit_operation_logs_: APLArtifactTable
The reference to the “OPERATION_LOG” table generated by the model training
- var_desc_: APLArtifactTable
The reference to the “VARIABLE_DESCRIPTION” table that was built during the model training
- applyout_: hana_ml DataFrame
The predictions generated the last time the model was applied
- predict_operation_logs_: APLArtifactTable
The reference to the “OPERATION_LOG” table when a prediction was made
Methods
fit(self, data[, key, features, label]): Fits the model.
get_feature_importances(self): Returns the feature importances (MaximumSmartVariableContribution).
get_fit_operation_log(self): Retrieves the operation log table after the model training.
get_indicators(self): Retrieves the Indicator table after model training.
get_params(self): Retrieves attributes of the current object.
get_performance_metrics(self): Returns the performance metrics of the last trained model.
get_predict_operation_log(self): Retrieves the operation log table after the last prediction.
get_summary(self): Retrieves the summary table after model training.
is_fitted(self): Checks if the model can be saved.
load_model(self, schema_name, table_name[, oid]): Warning: This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
predict(self, data): Makes predictions with the fitted model.
save_artifact(self, artifact_df, …[, …]): Saves an artifact, a temporary table, into a permanent table.
save_model(self, schema_name, table_name[, …]): Warning: This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
score(self, data): Returns the mean accuracy on the provided test dataset.
set_params(self, **parameters): Sets attributes of the current model.
fit(self, data, key=None, features=None, label=None)
Fits the model.
- Parameters
- data: DataFrame
The training dataset
- key: str, optional
The name of the ID column. This column is not used as a feature in the model. It is output as the row ID when a prediction is made with the model. If key is not provided, an internal key is created, but this is not recommended. See notes below.
- features: list of str, optional
The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns are used.
- label: str, optional
The name of the label column. Default is the last column.
- Returns
- self: object
Notes
It is highly recommended to provide a key when calling the fit() method. If the model is trained without a key, the predict() method can no longer be used on a dataset with a key, because the model will not expect one.
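The defaults described for key, features and label can be mimicked in a few lines of plain Python. This is only a sketch of the documented behavior, written to make the defaults concrete; `resolve_columns` is a hypothetical name and this is not how hana_ml implements fit() internally.

```python
def resolve_columns(columns, key=None, features=None, label=None):
    """Apply the documented defaults of fit() to a list of column names."""
    if label is None:
        # "Default is the last column."
        label = columns[-1]
    if features is None:
        # "All non-ID and non-label columns are used."
        features = [c for c in columns if c != key and c != label]
    return key, features, label


columns = ["id", "age", "workclass", "class"]
print(resolve_columns(columns, key="id"))
```

Without a key, the "id" column would fall into the feature list, which is one practical reason to always pass key explicitly.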
predict(self, data)
Makes predictions with the fitted model. It is possible to add special outputs, such as reason codes, by specifying the extra_applyout_settings parameter in the model. This parameter is explained above in the model class section.
- Parameters
- data: hana_ml DataFrame
The dataset used for prediction
- Returns
- Prediction output: hana_ml DataFrame
- The dataframe contains the following columns:
- - KEY: the key column, if it was provided in the dataset
- - TRUE_LABEL: the class label, if it was given in the dataset
- - PREDICTED: the predicted label
- - PROBABILITY: the probability that the predicted label is correct (confidence)
- - SCORING_VALUE: the unnormalized scoring value
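Since PROBABILITY refers to the predicted label rather than to a fixed class, some post-processing is needed to obtain the probability of one particular class. The sketch below assumes a binary target and a hypothetical helper name; it simply takes the complement when the predicted label is not the class of interest.

```python
def positive_class_probability(predicted, probability, positive_label=1):
    """Convert (PREDICTED, PROBABILITY) into the probability of `positive_label`.

    Assumes a binary target, so the two class probabilities sum to 1.
    """
    if predicted == positive_label:
        return probability
    return 1.0 - probability


# First row of the example output: PREDICTED = 0, PROBABILITY = 0.688153
p_positive = positive_class_probability(predicted=0, probability=0.688153)
print(round(p_positive, 6))
```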
score(self, data)
Returns the mean accuracy on the provided test dataset.
- Parameters
- data: hana_ml DataFrame
The test dataset used to compute the score. The labels must be provided in the dataset.
- Returns
- mean average accuracy: float
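Mean accuracy is the fraction of rows whose predicted label equals the true label. A minimal pure-Python sketch of that definition follows; it is not the library's implementation, and the function name is hypothetical.

```python
def mean_accuracy(true_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one row is required")
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return hits / len(true_labels)


# The three rows of the class example: TRUE_LABEL and PREDICTED are both 0.
accuracy = mean_accuracy([0, 0, 0], [0, 0, 0])
print(accuracy)
```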
get_feature_importances(self)
Returns the feature importances (MaximumSmartVariableContribution).
- Returns
- feature importances: an OrderedDict keyed by feature name
get_fit_operation_log(self)
Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
- This table provides detailed logs of the last model training
get_indicators(self)
Retrieves the Indicator table after model training.
- Returns
- The reference to the INDICATORS table: hana_ml DataFrame
- This table provides the performance metrics of the last model training
get_params(self)
Retrieves attributes of the current object. This method is implemented for compatibility with scikit-learn.
- Returns
- The attribute values of the model: dictionary
get_performance_metrics(self)
Returns the performance metrics of the last trained model.
- Returns
- An OrderedDict with metric name as key and metric value as value.
- For example:
OrderedDict([('L1', 8.59885654599923), ('L2', 11.012352163260505), ('LInf', 67.0), ('ErrorMean', 0.33833594458645944), …
get_predict_operation_log(self)
Retrieves the operation log table after the last prediction.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
- This table provides detailed logs about the last prediction
get_summary(self)
Retrieves the summary table after model training.
- Returns
- The reference to the SUMMARY table: hana_ml DataFrame
- This table contains the execution summary of the last model training
is_fitted(self)
Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.
- Returns
- bool
True if the model is ready to be saved.
load_model(self, schema_name, table_name, oid=None)
Warning: This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
Loads the model from a table.
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- oid: str, optional
If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.
save_artifact(self, artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)
Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.
- Parameters
- artifact_df: hana_ml DataFrame
The artifact created after the fit or predict method is called
- schema_name: str
The schema name
- table_name: str
The table name
- if_exists: str, {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises a ValueError
replace: Drops the table before inserting new values
append: Inserts new values into the existing table
- new_oid: str, optional
If it is given, it is inserted as the new OID value. This is useful when saving data into the same table.
Examples
>>> myModel.save_artifact(
...     artifact_df=myModel.indicators_,
...     schema_name='MySchema',
...     table_name='MyModel_Indicators',
...     if_exists='replace'
... )
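The three if_exists modes are easier to compare side by side. The sketch below simulates them with an in-memory dictionary instead of a database, so it only illustrates the documented semantics; `save_rows` and the `tables` dictionary are hypothetical and unrelated to how hana_ml writes to SAP HANA.

```python
def save_rows(tables, table_name, rows, if_exists="fail"):
    """Store `rows` under `table_name`, following the documented if_exists modes."""
    if if_exists not in ("fail", "replace", "append"):
        raise ValueError(f"unknown if_exists value: {if_exists!r}")
    if table_name in tables:
        if if_exists == "fail":
            raise ValueError(f"table {table_name!r} already exists")
        if if_exists == "replace":
            # Drop the existing table before inserting the new values.
            del tables[table_name]
    tables.setdefault(table_name, []).extend(rows)


tables = {}
save_rows(tables, "MyModel_Indicators", ["run 1"])
save_rows(tables, "MyModel_Indicators", ["run 2"], if_exists="append")
after_append = list(tables["MyModel_Indicators"])
save_rows(tables, "MyModel_Indicators", ["run 3"], if_exists="replace")
print(after_append, tables["MyModel_Indicators"])
```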
save_model(self, schema_name, table_name, if_exists='fail', new_oid=None)
Warning: This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- if_exists: str, {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises an Error
replace: Drops the table before inserting new values
append: Inserts new values into the existing table
- new_oid: str, optional
If it is given, it is inserted as the new OID value. This is useful when saving data into the same table.
- Returns
- None
- The model is saved into a table with the following columns:
"OID" NVARCHAR(50), -- Serves as ID
"FORMAT" NVARCHAR(50), -- APL technical info
"LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model
set_params(self, **parameters)
Sets attributes of the current model. This method is implemented for compatibility with scikit-learn.
- Parameters
- params: dictionary
The attribute names and values
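For readers unfamiliar with the scikit-learn convention that get_params() and set_params() follow, here is a tiny stand-alone class showing the general pattern. It is a sketch of the convention only: the class `TinyEstimator` is hypothetical, and details such as returning self or rejecting unknown names are scikit-learn habits that are assumed here, not confirmed for hana_ml.

```python
class TinyEstimator:
    """Toy object illustrating the get_params/set_params convention."""

    def __init__(self, variable_auto_selection=True, polynomial_degree=None):
        self.variable_auto_selection = variable_auto_selection
        self.polynomial_degree = polynomial_degree

    def get_params(self):
        # Return a copy so callers cannot mutate the object by accident.
        return dict(vars(self))

    def set_params(self, **parameters):
        for name, value in parameters.items():
            if not hasattr(self, name):
                raise ValueError(f"unknown parameter: {name!r}")
            setattr(self, name, value)
        return self  # scikit-learn style chaining (assumed, see lead-in)


estimator = TinyEstimator().set_params(polynomial_degree=2)
print(estimator.get_params())
```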
hana_ml.algorithms.apl.regression
This module contains SAP HANA APL regression algorithm.
The following classes are available:
class hana_ml.algorithms.apl.regression.AutoRegressor(conn_context, variable_auto_selection=True, polynomial_degree=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)
Bases: hana_ml.algorithms.apl.robust_regression_base.RobustRegressionBase
This module provides the SAP HANA APL regression algorithm.
- Parameters
- conn_context: ConnectionContext
The connection object to an SAP HANA database
- variable_auto_selection: bool, optional
When set to True, variable auto-selection is activated. Variable auto-selection maintains the performance of a model while keeping the number of variables as low as possible.
- polynomial_degree: int, optional
The polynomial degree of the model. Default is 1.
- variable_storages: dict, optional
Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.
- variable_value_types: dict, optional
Specifies the variable value types (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means that whenever the value of VAR1 equals '???', it is treated as missing.
- extra_applyout_settings: dict, optional
Defines other outputs the model should generate in addition to the predicted values. For example, {'APL/ApplyReasonCode': '3;Mean;Below;False'} adds reason codes to the output when the model is applied. These reason codes explain the prediction. See the OPERATION_CONFIG parameters of the APPLY_MODEL function in the SAP HANA APL Reference Guide.
- other_params: dict, optional
Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:
‘correlations_lower_bound’
‘correlations_max_kept’
‘cutting_strategy’
‘exclude_low_predictive_confidence’
‘risk_fitting’
‘risk_fitting_min_cumulated_frequency’
‘risk_fitting_nb_pdo’
‘risk_fitting_use_weights’
‘risk_gdo’
‘risk_mode’
‘risk_pdo’
‘risk_score’
‘score_bins_count’
‘variable_auto_selection’
‘variable_selection_best_iteration’
‘variable_selection_min_nb_of_final_variables’
‘variable_selection_max_nb_of_final_variables’
‘variable_selection_mode’
‘variable_selection_nb_variables_removed_by_step’
‘variable_selection_percentage_of_contribution_kept_by_step’
‘variable_selection_quality_bar’
‘variable_selection_quality_criteria’
See Common APL Aliases for Model Training in SAP HANA APL Reference Guide.
Notes
It is highly recommended to provide a key when calling the fit() method. If the model is trained without a key, the predict() method can no longer be used on a dataset with a key, because the model will not expect one.
Examples
>>> from hana_ml.algorithms.apl.regression import AutoRegressor
>>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA Database
>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates Hana DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')
Creating and fitting the model
>>> model = AutoRegressor(conn_context=CONN, variable_auto_selection=True)
>>> model.fit(hana_df, label='age',
...           features=['workclass', 'fnlwgt', 'education', 'education-num', 'marital-status'],
...           key='id')
Making a prediction
>>> applyout_df = model.predict(hana_df)
>>> print(applyout_df.head(5).collect())
    id  TRUE_LABEL  PREDICTED
0   30          49         42
1   63          48         42
2   66          36         42
3  110          42         42
4  335          53         42
Debriefing
>>> model.get_performance_metrics()
OrderedDict([('L1', 8.59885654599923), ('L2', 11.012352163260505)...
>>> model.get_feature_importances()
OrderedDict([('marital-status', 0.7916100739306074), ('education-num', 0.13524836400650087)...
Saving the model in the schema named 'MODEL_STORAGE'. Please see the model_storage class for further features of model storage.
>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My regression model name'
>>> model_storage.save_model(model=model, if_exists='replace')
Reloading the model and making another prediction
>>> model2 = AutoRegressor(conn_context=CONN)
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.head(5).collect()
    id  TRUE_LABEL  PREDICTED
0   30          49         42
1   63          48         42
2   66          36         42
3  110          42         42
4  335          53         42
Methods
fit(self, data[, key, features, label]): Fits the model.
get_feature_importances(self): Returns the feature importances (MaximumSmartVariableContribution).
get_fit_operation_log(self): Retrieves the operation log table after the model training.
get_indicators(self): Retrieves the Indicator table after model training.
get_params(self): Retrieves attributes of the current object.
get_performance_metrics(self): Returns the performance metrics of the last trained model.
get_predict_operation_log(self): Retrieves the operation log table after the last prediction.
get_summary(self): Retrieves the summary table after model training.
is_fitted(self): Checks if the model can be saved.
load_model(self, schema_name, table_name[, oid]): Warning: This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
predict(self, data): Makes predictions with a fitted model.
save_artifact(self, artifact_df, …[, …]): Saves an artifact, a temporary table, into a permanent table.
save_model(self, schema_name, table_name[, …]): Warning: This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
score(self, data): Returns the coefficient of determination R^2 of the prediction.
set_params(self, **parameters): Sets attributes of the current model.
fit(self, data, key=None, features=None, label=None)
Fits the model.
- Parameters
- data: DataFrame
The training dataset
- key: str, optional
The name of the ID column. If key is not provided, it is assumed that the input has no ID column.
- features: list of str, optional
The names of the feature columns. If features is not provided, all non-ID and non-label columns are used.
- label: str, optional
The name of the label column. Default is the last column.
- Returns
- self: object
Notes
It is highly recommended to provide a key when calling the fit() method. If the model is trained without a key, the predict() method can no longer be used on a dataset with a key, because the model will not expect one.
predict(self, data)
Makes predictions with a fitted model. It is possible to add special outputs, such as reason codes, by specifying the extra_applyout_settings parameter in the model. This parameter is explained above in the model class section.
- Parameters
- data: hana_ml DataFrame
The dataset used for prediction
- Returns
- Prediction output: a hana_ml DataFrame.
- The dataframe contains the following columns:
- - KEY: the key column, if it was provided in the dataset
- - TRUE_LABEL: the true value, if it was provided in the dataset
- - PREDICTED: the predicted value
score(self, data)
Returns the coefficient of determination R^2 of the prediction.
- Parameters
- data: hana_ml DataFrame
The dataset used for prediction. It must contain the true values so that the score can be computed.
- Returns
- coefficient of determination R^2: float
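The coefficient of determination is R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below applies this textbook definition to the five rows of the prediction example in the class section; it is not the library's implementation, and the function name is hypothetical.

```python
def coefficient_of_determination(y_true, y_pred):
    """Textbook R^2: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# TRUE_LABEL and PREDICTED from the five example rows of the class section.
y_true = [49, 48, 36, 42, 53]
y_pred = [42, 42, 42, 42, 42]
r2 = coefficient_of_determination(y_true, y_pred)
print(round(r2, 4))
```

On these five rows the result is negative, because a constant prediction that differs from the mean of the true values does worse than predicting that mean. A score computed on the full dataset will differ.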
get_feature_importances(self)
Returns the feature importances (MaximumSmartVariableContribution).
- Returns
- feature importances: an OrderedDict keyed by feature name
get_fit_operation_log(self)
Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
- This table provides detailed logs of the last model training
get_indicators(self)
Retrieves the Indicator table after model training.
- Returns
- The reference to the INDICATORS table: hana_ml DataFrame
- This table provides the performance metrics of the last model training
get_params(self)
Retrieves attributes of the current object. This method is implemented for compatibility with scikit-learn.
- Returns
- The attribute values of the model: dictionary
get_performance_metrics(self)
Returns the performance metrics of the last trained model.
- Returns
- An OrderedDict with metric name as key and metric value as value.
- For example:
OrderedDict([('L1', 8.59885654599923), ('L2', 11.012352163260505), ('LInf', 67.0), ('ErrorMean', 0.33833594458645944), …
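To give a feel for what these metric names usually stand for, the sketch below computes them on a small sample. The definitions used here are assumptions based on the usual meaning of the names (L1 as mean absolute error, L2 as root mean squared error, LInf as maximum absolute error, ErrorMean as mean signed error, taken as predicted minus true); the SAP HANA APL Reference Guide remains the authority on how APL actually defines them.

```python
import math


def error_metrics(y_true, y_pred):
    """Common error norms; the definitions are assumptions, see the lead-in."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    return {
        "L1": sum(abs(e) for e in errors) / n,            # mean absolute error
        "L2": math.sqrt(sum(e * e for e in errors) / n),  # root mean squared error
        "LInf": max(abs(e) for e in errors),              # maximum absolute error
        "ErrorMean": sum(errors) / n,                     # mean signed error
    }


# TRUE_LABEL and PREDICTED from the five example rows of the class section.
metrics = error_metrics([49, 48, 36, 42, 53], [42, 42, 42, 42, 42])
print(metrics)
```

Under these definitions L1 <= L2 <= LInf always holds, which is consistent with the example values shown above.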
get_predict_operation_log(self)
Retrieves the operation log table after the last prediction.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
- This table provides detailed logs about the last prediction
get_summary(self)
Retrieves the summary table after model training.
- Returns
- The reference to the SUMMARY table: hana_ml DataFrame
- This table contains the execution summary of the last model training
is_fitted(self)
Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.
- Returns
- bool
True if the model is ready to be saved.
load_model(self, schema_name, table_name, oid=None)
Warning: This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
Loads the model from a table.
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- oid: str, optional
If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.
save_artifact(self, artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)
Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.
- Parameters
- artifact_df: hana_ml DataFrame
The artifact created after the fit or predict method is called
- schema_name: str
The schema name
- table_name: str
The table name
- if_exists: str, {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises a ValueError
replace: Drops the table before inserting new values
append: Inserts new values into the existing table
- new_oid: str, optional
If it is given, it is inserted as the new OID value. This is useful when saving data into the same table.
Examples
>>> myModel.save_artifact(
...     artifact_df=myModel.indicators_,
...     schema_name='MySchema',
...     table_name='MyModel_Indicators',
...     if_exists='replace'
... )
save_model(self, schema_name, table_name, if_exists='fail', new_oid=None)
Warning: This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- if_exists: str, {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises an Error
replace: Drops the table before inserting new values
append: Inserts new values into the existing table
- new_oid: str, optional
If it is given, it is inserted as the new OID value. This is useful when saving data into the same table.
- Returns
- None
- The model is saved into a table with the following columns:
"OID" NVARCHAR(50), -- Serves as ID
"FORMAT" NVARCHAR(50), -- APL technical info
"LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model
set_params(self, **parameters)
Sets attributes of the current model. This method is implemented for compatibility with scikit-learn.
- Parameters
- params: dictionary
The attribute names and values
hana_ml.algorithms.apl.clustering
This module provides the SAP HANA APL clustering algorithms.
The following classes are available:
class hana_ml.algorithms.apl.clustering.AutoUnsupervisedClustering(conn_context, nb_clusters=None, nb_clusters_min=None, nb_clusters_max=None, distance=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)
Bases: hana_ml.algorithms.apl.clustering._AutoClusteringBase
SAP HANA APL unsupervised clustering algorithm.
- Parameters
- nb_clusters: int, optional, default = 10
The number of clusters to create
- nb_clusters_min: int, optional
The minimum number of clusters to create. This parameter is ignored if nb_clusters is set.
- nb_clusters_max: int, optional
The maximum number of clusters to create. This parameter is ignored if nb_clusters is set.
- distance: str, optional, default = 'SystemDetermined'
The metric used to measure the distance between data points. The possible values are: 'L1', 'L2', 'LInf', 'SystemDetermined'.
- variable_storages: dict, optional
Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.
- variable_value_types: dict, optional
Specifies the variable value types (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means that whenever the value of VAR1 equals '???', it is treated as missing.
- extra_applyout_settings: dict, optional
Defines the output to generate when applying the model. See the documentation of the predict() method for more information.
- other_params: dict, optional
Corresponds to the advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:
calculate_cross_statistics
calculate_sql_expressions
cutting_strategy
encoding_strategy
See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.
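The names 'L1', 'L2' and 'LInf' accepted by the distance parameter correspond to the usual vector norms. The sketch below computes them for two points given as plain tuples. It only illustrates the norms themselves; how SAP HANA APL encodes variables before measuring distances, and what 'SystemDetermined' selects, is not covered here.

```python
import math


def distance(a, b, metric="L2"):
    """Distance between two equally long numeric sequences."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if metric == "L1":    # city-block (Manhattan) distance
        return sum(diffs)
    if metric == "L2":    # Euclidean distance
        return math.sqrt(sum(d * d for d in diffs))
    if metric == "LInf":  # Chebyshev distance
        return max(diffs)
    raise ValueError(f"unsupported metric: {metric!r}")


a, b = (0, 0), (3, 4)
print(distance(a, b, "L1"), distance(a, b, "L2"), distance(a, b, "LInf"))
```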
Notes
The algorithm may detect fewer clusters than requested.
This happens when a cluster detected on the estimation dataset is not found on the validation dataset. In that case, the cluster is considered unstable and is removed from the model. Users can get the number of clusters actually found from the “INDICATORS” table. For example,
# The actual number of clusters found
d = model_u.get_indicators().collect()
d[d.KEY=='FinalClusterCount'][['KEY','VALUE']]
It is highly recommended to provide a key when calling the fit() method.
If the model is trained without a key, the predict() method can no longer be used on a dataset with a key, because the model will not expect one.
By default, when the variable description is not given, SAP HANA APL guesses it by reading the first 100 rows. This guess is sometimes wrong. By explicitly providing values for these parameters, the user can override the default guess. For example:
model.set_params(
    variable_storages={
        'ID': 'integer',
        'sepal length (cm)': 'number'
    })
model.set_params(
    variable_value_types={
        'sepal length (cm)': 'continuous'
    })
model.set_params(
    variable_missing_strings={
        'sepal length (cm)': '-1'
    })
Examples
>>> from hana_ml.algorithms.apl.clustering import AutoUnsupervisedClustering
>>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA
>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates Hana DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')
Creating and fitting the model
>>> model = AutoUnsupervisedClustering(CONN, nb_clusters=5)
>>> model.fit(data=hana_df, key='id')
Debriefing
>>> model.get_metrics()
OrderedDict([('SimplifiedSilhouette', 0.3448029020802121), ('RSS', 4675.706587754118),...
>>> model.get_metrics_by_cluster()
{'Frequency': {1: 0.23053242076908276,
               2: 0.27434649954646656,
               3: 0.09628652318517908,
               4: 0.29919463456199663,
               5: 0.09963992193727494},
 'IntraInertia': {1: 0.6734978174937322,
                  2: 0.7202839995396123,
                  3: 0.5516800856975772,
                  4: 0.6969632183111357,
                  5: 0.5809322138167139},
 'RSS': {1: 5648.626195319932,
         2: 7189.15459940487,
         3: 1932.5353401986129,
         4: 7586.444631316713,
         5: 2105.879275085588},
 'SimplifiedSilhouette': {1: 0.1383827622819234,
                          2: 0.14716862328457128,
                          3: 0.18753797605134545,
                          4: 0.13679980173383793,
                          5: 0.15481377834381388},
 'KL': {1: OrderedDict([('relationship', 0.4951910610641741),
                        ('marital-status', 0.2776259711735807),
                        ('hours-per-week', 0.20990189265572687),
                        ('education-num', 0.1996353893520096),
                        ('education', 0.19963538935200956),
                        ...
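The nested dictionary returned by get_metrics_by_cluster() is keyed first by metric name and then by cluster number, so ordinary dictionary operations are enough to query it. As a small sketch, the snippet below uses the 'Frequency' values printed above to find the largest cluster and to check that the frequencies add up to one; the variable names are illustrative.

```python
# 'Frequency' entry copied from the get_metrics_by_cluster() output above.
frequency = {
    1: 0.23053242076908276,
    2: 0.27434649954646656,
    3: 0.09628652318517908,
    4: 0.29919463456199663,
    5: 0.09963992193727494,
}

# Cluster holding the largest share of the rows.
largest_cluster = max(frequency, key=frequency.get)

# The shares should cover the whole dataset.
total_share = sum(frequency.values())

print(largest_cluster, round(total_share, 6))
```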
Predicting which cluster a data point belongs to
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect()  # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0   30                  3                        0.640378
1   63                  4                        0.611050
2   66                  3                        0.640378
3  110                  4                        0.611050
4  335                  1                        0.851054
Determining the 2 closest clusters
>>> model.set_params(extra_applyout_settings={'mode': 'closest_distances', 'nb_distances': 2})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect()  # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  ...  CLOSEST_CLUSTER_2  DISTANCE_TO_CLOSEST_CENTROID_2
0   30                  3  ...                  4                        0.730330
1   63                  4  ...                  1                        0.851054
2   66                  3  ...                  4                        0.730330
3  110                  4  ...                  1                        0.851054
4  335                  1  ...                  4                        0.906003
Retrieving the distances to all clusters
>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect()  # returns the output as a pandas DataFrame
   id  DISTANCE_TO_CENTROID_1  ...  DISTANCE_TO_CENTROID_5
0  30                       3  ...                1.160697
1  63                       4  ...                1.160697
2  66                       3  ...                1.160697
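With the 'all_distances' output, the closest clusters can also be derived on the client side by sorting the distance columns of a row. The sketch below does this for a plain dictionary standing in for one collected row. The column pattern DISTANCE_TO_CENTROID_<n> comes from the output above; the distances for clusters 3, 4 and 5 are the values printed for id 30 in the examples, while those for clusters 1 and 2 are hypothetical placeholders.

```python
def closest_clusters(row, k=2):
    """Return the k (cluster, distance) pairs with the smallest distance."""
    prefix = "DISTANCE_TO_CENTROID_"
    distances = {
        int(name[len(prefix):]): value
        for name, value in row.items()
        if name.startswith(prefix)
    }
    return sorted(distances.items(), key=lambda item: item[1])[:k]


row = {
    "id": 30,
    "DISTANCE_TO_CENTROID_1": 0.95,      # hypothetical value
    "DISTANCE_TO_CENTROID_2": 1.05,      # hypothetical value
    "DISTANCE_TO_CENTROID_3": 0.640378,  # from the examples above
    "DISTANCE_TO_CENTROID_4": 0.730330,  # from the examples above
    "DISTANCE_TO_CENTROID_5": 1.160697,  # from the examples above
}
nearest = closest_clusters(row, k=2)
print(nearest)
```

The result matches what the 'closest_distances' mode reports for id 30, namely cluster 3 followed by cluster 4.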
Saving the model in the schema named 'MODEL_STORAGE'. Please see the model_storage class for further features of model storage.
>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model)
Reloading the model for further use
>>> model2 = AutoUnsupervisedClustering(conn_context=CONN)
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.head(3).collect()
   id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0  30                  3                        0.640378
1  63                  4                        0.611050
2  66                  3                        0.640378
- Attributes
- model_: hana_ml DataFrame
The trained model content
- summary_: APLArtifactTable
The reference to the “SUMMARY” table generated by the model training. This table contains the summary about the model training.
- indicators_: APLArtifactTable
The reference to the “INDICATORS” table generated by the model training. This table contains the various metrics related to the model and its variables.
- fit_operation_logs_: APLArtifactTable
The reference to the “OPERATION_LOG” table generated by the model training
- var_desc_: APLArtifactTable
The reference to the “VARIABLE_DESCRIPTION” table that was built during the model training
- applyout_: hana_ml DataFrame
The predictions generated the last time the model was applied
- predict_operation_logs_: APLArtifactTable
The reference to the “OPERATION_LOG” table when a prediction was made
Methods
fit
(self, data[, key, features])Fits the model.
fit_predict
(self, data[, key, features])Fits a clustering model and uses it to generate prediction output on the training dataset.
get_fit_operation_log
(self)Retrieves the operation log table after the model training.
get_indicators
(self)Retrieves the Indicator table after model training.
get_metrics
(self)Returns a dictionary containing the metrics about the model.
get_params
(self)Retrieves attributes of the current object.
get_predict_operation_log
(self)Retrieves the operation log table after the last prediction.
get_summary
(self)Retrieves the summary table after model training.
is_fitted
(self)Checks if the model can be saved.
load_model
(self, schema_name, table_name[, oid])Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
predict
(self, data)Predicts which cluster each specified row belongs to.
save_artifact
(self, artifact_df, …[, …])Saves an artifact, a temporary table, into a permanent table.
save_model
(self, schema_name, table_name[, …])Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
set_params
(self, \*\*parameters)Sets attributes of the current model.
-
fit
(self, data, key=None, features=None)¶ Fits the model.
- Parameters
- data: hana_ml DataFrame
The training dataset
- key: str, optional
The name of the ID column. This column will not be used as a feature in the model. It will be output as the row ID when a prediction is made with the model. If key is not provided, an internal key is created, but this is not recommended.
- features: list of str, optional
The names of the features to be used in the model. If features is not provided, all columns will be taken except the ID column.
- Returns
- self: object
-
fit_predict
(self, data, key=None, features=None)¶ Fits a clustering model and uses it to generate prediction output on the training dataset.
- Parameters
- data: hana_ml DataFrame
The input dataset
- key: str, optional
The name of the ID column.
- features: list of str, optional
The names of the feature columns. If features is not provided, all non-ID columns will be taken.
- Returns
- hana_ml DataFrame.
- The output is the same as the predict() method.
Notes
See the predict() method for how to obtain different outputs with the ‘extra_applyout_settings’ parameter.
-
get_metrics
(self)¶ Returns a dictionary containing the metrics about the model.
- Returns
- A dictionary object containing a set of clustering metrics and their values
Examples
>>> model.get_metrics() {'SimplifiedSilhouette': 0.14668968897882997, 'RSS': 24462.640041325714, 'IntraInertia': 3.2233573348587714, 'KL': {1: OrderedDict([('hours-per-week', 0.2971627592049324), ('occupation', 0.11944355994892383), ('relationship', 0.06772624975990414), ('education-num', 0.06377345492340795), ('education', 0.06377345492340793), ...}
-
get_fit_operation_log
(self)¶ Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
- This table provides detailed logs of the last model training
-
get_indicators
(self)¶ Retrieves the Indicator table after model training.
- Returns
- The reference to the INDICATORS table: hana_ml DataFrame
- This table provides the performance metrics of the last model training
-
get_params
(self)¶ Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.
- Returns
- The attribute values of the model: dictionary
-
get_predict_operation_log
(self)¶ Retrieves the operation log table after the last prediction.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
- This table provides detailed logs about the last prediction
-
get_summary
(self)¶ Retrieves the summary table after model training.
- Returns
- The reference to the SUMMARY table: hana_ml DataFrame
- This contains execution summary of the last model training
-
is_fitted
(self)¶ Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.
- Returns
- bool
True if the model is ready to be saved.
-
load_model
(self, schema_name, table_name, oid=None)¶ Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
Loads the model from a table.
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- oid: str, optional
If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.
-
predict
(self, data)¶ Predicts which cluster each specified row belongs to.
- Parameters
- data: hana_ml DataFrame
The set of rows for which to generate cluster predictions. This dataset must have the same structure as the one used in the fit() method.
- Returns
- hana_ml DataFrame
By default, the ID of the closest cluster and the distance to its center are provided. Users can request different outputs by setting the extra_applyout_settings parameter in the model. The extra_applyout_settings parameter is a dictionary with ‘mode’ and ‘nb_distances’ as keys. If mode is set to ‘closest_distances’, cluster IDs and distances to centroids will be provided from the closest to the furthest cluster. The output columns will be:
<The key column name>,
CLOSEST_CLUSTER_1,
DISTANCE_TO_CLOSEST_CENTROID_1,
CLOSEST_CLUSTER_2,
DISTANCE_TO_CLOSEST_CENTROID_2,
…
If mode is set to ‘all_distances’, the distances to the centroids of all clusters will be provided in cluster ID order. The output columns will be:
ID,
DISTANCE_TO_CENTROID_1,
DISTANCE_TO_CENTROID_2,
…
nb_distances limits the output to the closest clusters. It is only valid when mode is ‘closest_distances’ (it is ignored if mode is ‘all_distances’). It can be set to ‘all’ or a positive integer.
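The column layouts described above can be summarized in a small helper. This is only a sketch of the documented behavior, not code from hana_ml: the function name is hypothetical, and it assumes that a missing nb_distances behaves like ‘all’, that nb_distances never yields more columns than there are clusters, and that the first column carries the key column name in both modes.

```python
def expected_applyout_columns(key, n_clusters, mode=None, nb_distances=None):
    """Hypothetical helper: list the predict() output columns described above."""
    if mode == 'all_distances':
        # nb_distances is ignored in this mode.
        return [key] + ['DISTANCE_TO_CENTROID_%d' % i
                        for i in range(1, n_clusters + 1)]
    if mode is None:
        count = 1  # default output: closest cluster only
    elif mode == 'closest_distances':
        # Assumption: a missing nb_distances behaves like 'all'.
        if nb_distances in (None, 'all'):
            count = n_clusters
        else:
            # Assumption: never more columns than clusters.
            count = min(int(nb_distances), n_clusters)
    else:
        raise ValueError('unknown mode: %r' % mode)
    columns = [key]
    for rank in range(1, count + 1):
        columns.append('CLOSEST_CLUSTER_%d' % rank)
        columns.append('DISTANCE_TO_CLOSEST_CENTROID_%d' % rank)
    return columns


print(expected_applyout_columns('id', 5, mode='closest_distances', nb_distances=2))
```

Such a helper can be handy for checking the columns of a collected result before further processing.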
Examples
Retrieves the IDs of the 3 closest clusters and the distances to their centroids:
>>> extra_applyout_settings = {'mode': 'closest_distances', 'nb_distances': 3} >>> model.set_params(extra_applyout_settings=extra_applyout_settings) >>> out = model.predict(hana_df) >>> out.head(3).collect() id CLOSEST_CLUSTER_1 ... CLOSEST_CLUSTER_3 DISTANCE_TO_CLOSEST_CENTROID_3 0 30 3 ... 4 0.730330 1 63 4 ... 1 0.851054 2 66 3 ... 4 0.730330
Retrieves the distances to all clusters:
>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'}) >>> out = model.predict(hana_df) >>> out.head(3).collect() id DISTANCE_TO_CENTROID_1 DISTANCE_TO_CENTROID_2 ... DISTANCE_TO_CENTROID_5 0 30 0.994595 0.877414 ... 0.782949 1 63 0.994595 0.985202 ... 0.782949 2 66 0.994595 0.877414 ... 0.782949
-
save_artifact
(self, artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)¶ Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.
- Parameters
- schema_name: str
The schema name
- artifact_df: hana_ml DataFrame
The artifact created after fit or predict methods are called
- table_name: str
The table name
- if_exists: str. {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- The behavior when the table already exists:
fail: Raises a ValueError
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
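The three if_exists behaviors can be illustrated without a database. The sketch below mimics them on an in-memory dictionary of tables. It is not the hana_ml implementation, and the function and variable names are hypothetical.

```python
def save_rows(tables, table_name, rows, if_exists='fail'):
    """Mimic the documented if_exists behaviors on a dict of in-memory tables."""
    if if_exists not in ('fail', 'replace', 'append'):
        raise ValueError('unsupported if_exists value: %r' % if_exists)
    if table_name in tables:
        if if_exists == 'fail':
            raise ValueError('table %r already exists' % table_name)
        if if_exists == 'replace':
            del tables[table_name]  # drop the table before inserting
    # Create the table if needed, then insert the new rows.
    tables.setdefault(table_name, []).extend(rows)
    return tables


tables = {}
save_rows(tables, 'MyModel_Indicators', ['row 1'])
save_rows(tables, 'MyModel_Indicators', ['row 2'], if_exists='append')
print(tables['MyModel_Indicators'])
```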
Examples
>>> myModel.save_artifact( ... artifact_df=myModel.indicators_, ... schema_name='MySchema', ... table_name='MyModel_Indicators', ... if_exists='replace' ... )
-
save_model
(self, schema_name, table_name, if_exists='fail', new_oid=None)¶ Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists=’replace’), or an existing table (if_exists=’append’). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).
- Parameters
- schema_name: str
The schema name
- table_name: str
Table name
- if_exists: str. {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- The behavior when the table already exists:
fail: Raises an Error
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
- Returns
- None
- The model is saved into a table with the following columns:
“OID” NVARCHAR(50), – Serve as ID
“FORMAT” NVARCHAR(50), – APL technical info
“LOB” CLOB MEMORY THRESHOLD NULL – binary content of the model
-
set_params
(self, **parameters)¶ Sets attributes of the current model.
- Parameters
- parameters: dictionary
The set of parameters with their new values
-
class
hana_ml.algorithms.apl.clustering.
AutoSupervisedClustering
(conn_context, label=None, nb_clusters=None, nb_clusters_min=None, nb_clusters_max=None, distance=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)¶ Bases:
hana_ml.algorithms.apl.clustering._AutoClusteringBase
SAP HANA APL Supervised Clustering algorithm. Clusters are determined with respect to a label variable.
- Parameters
- label: str
The name of the label column
- nb_clusters: int, optional, default = 10
The number of clusters to create
- nb_clusters_min: int, optional
The minimum number of clusters to create. If the nb_clusters parameter is set, it will be ignored.
- nb_clusters_max: int, optional
The maximum number of clusters to create. If the nb_clusters parameter is set, it will be ignored.
- distance: str, optional, default = ‘SystemDetermined’
The metric used to measure the distance between data points. The possible values are: ‘L1’, ‘L2’, ‘LInf’, ‘SystemDetermined’.
- variable_storages: dict, optional
Specifies the variable data types (string, integer, number). For example, {‘VAR1’: ‘string’, ‘VAR2’: ‘number’}. See notes below for more details.
- variable_value_types: dict, optional
Specifies the variable value types (continuous, nominal, ordinal). For example, {‘VAR1’: ‘continuous’, ‘VAR2’: ‘nominal’}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {‘VAR1’: ‘???’} means anytime the variable value equals ‘???’, it will be taken as missing.
- extra_applyout_settings: dict optional
Defines the output to generate when applying the model. See documentation on predict() method for more information.
- other_params: dict optional
Corresponds to the advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:
calculate_cross_statistics
calculate_sql_expressions
cutting_strategy
encoding_strategy
See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.
Notes
The algorithm may detect fewer clusters than requested.
This happens when a cluster detected on the estimation dataset was not found on the validation dataset. In that case, this cluster will be considered unstable and will then be removed from the model. Users can get the number of clusters actually found in the “INDICATORS” table. For example,
# The actual number of clusters found
d = model_u.get_indicators().collect()
d[d.KEY=='FinalClusterCount'][['KEY','VALUE']]
It is highly recommended to use a dataset with a key provided in the fit() method.
If not, once the model is trained, it will not be possible anymore to use the predict() method with a key, because the model will not expect it.
By default, when the variable description is not given, SAP HANA APL guesses it by reading the first 100 rows. This guess is sometimes incorrect. By explicitly providing values for the variable_storages, variable_value_types and variable_missing_strings parameters, the user can override the default guess. For example:
model.set_params( variable_storages = { 'ID': 'integer', 'sepal length (cm)': 'number' }) model.set_params( variable_value_types = { 'sepal length (cm)': 'continuous' }) model.set_params( variable_missing_strings = { 'sepal length (cm)': '-1' })
Examples
>>> from hana_ml.algorithms.apl.clustering import AutoSupervisedClustering >>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA
>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS') >>> # -- Creates Hana DataFrame >>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')
Creating and fitting the model
>>> model = AutoSupervisedClustering(CONN, nb_clusters=5) >>> model.fit(data=hana_df, key='id', label='class')
Debriefing
>>> model.get_metrics() OrderedDict([('SimplifiedSilhouette', 0.3448029020802121), ('RSS', 4675.706587754118),...
>>> model.get_metrics_by_cluster() {'Frequency': {1: 0.15139770759462357, 2: 0.39707539649817214, 3: 0.21549710013468568, 4: 0.12949066820593166, 5: 0.10653912756658696}, 'IntraInertia': {1: 0.1604412809425719, 2: 0.10561882166246073, 3: 0.12004212490063185, 4: 0.21030892961293207, 5: 0.08625667904000194}, 'RSS': {1: 883.710575431686, 2: 1525.7694977359076, 3: 941.1302592209537, 4: 990.765367406523, 5: 334.3308879590475}, 'SimplifiedSilhouette': {1: 0.3355726073943343, 2: 0.4231738907945281, 3: 0.2448648428415369, 4: 0.38136325589137554, 5: 0.22353657540054947}, 'TargetMean': {1: 0.1744734931009441, 2: 0.022912917070469333, 3: 0.3895408163265306, 4: 0.7537677775419231, 5: 0.21207430340557276}, 'TargetStandardDeviation': {1: 0.37951613049526484, 2: 0.14962591788119842, 3: 0.48764615116105525, 4: 0.4308154072006165, 5: 0.40877719266198526}, 'KL': {1: OrderedDict([('relationship', 0.6840012706191696), ('education', 0.675109873839992), ('education-num', 0.6751098738399919), ('marital-status', 0.5806503390741476), ('occupation', 0.46891689485806354), ('sex', 0.08802303491483551), ('capital-gain', 0.08794254258565125), ...
Predicting which cluster a data point belongs to
>>> applyout_df = model.predict(hana_df) >>> applyout_df.collect() # returns the output as a pandas DataFrame id CLOSEST_CLUSTER_1 DISTANCE_TO_CLOSEST_CENTROID_1 0 30 3 0.640378 1 63 4 0.611050 2 66 3 0.640378 3 110 4 0.611050 4 335 1 0.851054
Determining the 2 closest clusters
>>> model.set_params(extra_applyout_settings={'mode':'closest_distances', 'nb_distances': 2}) >>> applyout_df = model.predict(hana_df) >>> applyout_df.collect() # returns the output as a pandas DataFrame id CLOSEST_CLUSTER_1 ... CLOSEST_CLUSTER_2 DISTANCE_TO_CLOSEST_CENTROID_2 0 30 3 ... 4 0.730330 1 63 4 ... 1 0.851054 2 66 3 ... 4 0.730330 3 110 4 ... 1 0.851054 4 335 1 ... 4 0.906003
Retrieving the distances to all clusters
>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'}) >>> applyout_df = model.predict(hana_df) >>> applyout_df.collect() # returns the output as a pandas DataFrame id DISTANCE_TO_CENTROID_1 ... DISTANCE_TO_CENTROID_5 0 30 0.851054 ... 1.160697 1 63 0.751054 ... 1.160697 2 66 0.906003 ... 1.160697
Saving the model in the schema named ‘MODEL_STORAGE’
Please see the model_storage class for further features of model storage.
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE') >>> model.name = 'My model name' >>> model_storage.save_model(model=model, if_exists='replace')
Reloading the model for further use
Please note that the label has to be specified again prior to calling predict().
>>> model2 = AutoSupervisedClustering(conn_context=CONN) >>> model2.set_params(label='class') >>> model2.load_model(schema_name='MySchema', table_name='MyTable') >>> applyout2 = model2.predict(hana_df) >>> applyout2.head(3).collect() id CLOSEST_CLUSTER_1 DISTANCE_TO_CLOSEST_CENTROID_1 0 30 3 0.640378 1 63 4 0.611050 2 66 3 0.640378
- Attributes
- model_: hana_ml DataFrame
The trained model content
- summary_: APLArtifactTable
The reference to the “SUMMARY” table generated by the model training. This table contains the summary about the model training.
- indicators_: APLArtifactTable
The reference to the “INDICATORS” table generated by the model training. This table contains the various metrics related to the model and its variables.
- fit_operation_logs_: APLArtifactTable
The reference to the “OPERATION_LOG” table generated by the model training
- var_desc_: APLArtifactTable
The reference to the “VARIABLE_DESCRIPTION” table that was built during the model training
- applyout_: hana_ml DataFrame
The predictions generated the last time the model was applied
- predict_operation_logs_: APLArtifactTable
The reference to the “OPERATION_LOG” table when a prediction was made
Methods
fit
(self, data[, key, label, features])Fits the model.
fit_predict
(self, data[, key, label, features])Fits a clustering model and uses it to generate prediction output on the training dataset.
get_fit_operation_log
(self)Retrieves the operation log table after the model training.
get_indicators
(self)Retrieves the Indicator table after model training.
get_metrics
(self)Returns a dictionary containing the metrics about the model.
get_params
(self)Retrieves attributes of the current object.
get_predict_operation_log
(self)Retrieves the operation log table after the last prediction.
get_summary
(self)Retrieves the summary table after model training.
is_fitted
(self)Checks if the model can be saved.
load_model
(self, schema_name, table_name[, oid])Loads the model from a table.
predict
(self, data)Predicts which cluster each specified row belongs to.
save_artifact
(self, artifact_df, …[, …])Saves an artifact, a temporary table, into a permanent table.
save_model
(self, schema_name, table_name[, …])Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
set_params
(self, \*\*parameters)Sets attributes of the current model
-
set_params
(self, **parameters)¶ Sets attributes of the current model
- Parameters
- parameters: dictionary
containing attribute names and values
-
fit
(self, data, key=None, label=None, features=None)¶ Fits the model.
- Parameters
- data: hana_ml DataFrame
The training dataset
- key: str, optional
The name of the ID column. This column will not be used as a feature in the model. It will be output as the row ID when a prediction is made with the model. If key is not provided, an internal key is created, but this is not recommended.
- label: str, optional
The name of the label column. If it is not given, the model ‘label’ attribute will be taken. If the latter is not defined, an error will be raised.
- features: list of str, optional
The names of the features to be used in the model. If features is not provided, all columns will be taken except the ID and the label columns.
- Returns
- self: object
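The defaulting rules for label and features described above can be sketched in a few lines. The helper below is hypothetical and is not part of hana_ml. In particular, the documentation only states that "an error" is raised when no label is available, so the use of ValueError here is an assumption.

```python
def resolve_fit_columns(columns, key=None, label=None, model_label=None,
                        features=None):
    """Hypothetical helper: return (label, features) as fit() is described to pick them."""
    # The label passed to fit() takes precedence over the model attribute.
    chosen_label = label if label is not None else model_label
    if chosen_label is None:
        # Assumption: the actual exception type is not documented.
        raise ValueError('no label given and no label set on the model')
    if features is None:
        # Default: every column except the ID and the label columns.
        features = [c for c in columns if c not in (key, chosen_label)]
    return chosen_label, list(features)


print(resolve_fit_columns(['id', 'age', 'education', 'class'],
                          key='id', model_label='class'))
```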
-
predict
(self, data)¶ Predicts which cluster each specified row belongs to.
- Parameters
- data: hana_ml DataFrame
The set of rows for which to generate cluster predictions. This dataset must have the same structure as the one used in the fit() method.
- Returns
- hana_ml DataFrame
By default, the ID of the closest cluster and the distance to its center are provided. Users can request different outputs by setting the extra_applyout_settings parameter in the model. The extra_applyout_settings parameter is a dictionary with ‘mode’ and ‘nb_distances’ as keys. If mode is set to ‘closest_distances’, cluster IDs and distances to centroids will be provided from the closest to the furthest cluster. The output columns will be:
<The key column name>,
CLOSEST_CLUSTER_1,
DISTANCE_TO_CLOSEST_CENTROID_1,
CLOSEST_CLUSTER_2,
DISTANCE_TO_CLOSEST_CENTROID_2,
…
If mode is set to ‘all_distances’, the distances to the centroids of all clusters will be provided in cluster ID order. The output columns will be:
ID,
DISTANCE_TO_CENTROID_1,
DISTANCE_TO_CENTROID_2,
…
nb_distances limits the output to the closest clusters. It is only valid when mode is ‘closest_distances’ (it is ignored if mode is ‘all_distances’). It can be set to ‘all’ or a positive integer.
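The two modes carry the same information in different layouts: sorting the distances of an ‘all_distances’ row gives the closest-to-furthest ranking. The following sketch shows this on a single collected row represented as a dictionary. The helper name and the distance values are made up for illustration.

```python
def rank_clusters(row, prefix='DISTANCE_TO_CENTROID_'):
    """Return [(cluster_id, distance), ...] sorted from closest to furthest."""
    distances = [(int(column[len(prefix):]), value)
                 for column, value in row.items()
                 if column.startswith(prefix)]
    return sorted(distances, key=lambda pair: pair[1])


# Hypothetical 'all_distances' row; the distance values are made up.
row = {
    'id': 30,
    'DISTANCE_TO_CENTROID_1': 0.99,
    'DISTANCE_TO_CENTROID_2': 0.88,
    'DISTANCE_TO_CENTROID_3': 0.64,
}
ranking = rank_clusters(row)
print(ranking)
```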
Examples
Retrieves the IDs of the 3 closest clusters and the distances to their centroids:
>>> extra_applyout_settings = {'mode': 'closest_distances', 'nb_distances': 3} >>> model.set_params(extra_applyout_settings=extra_applyout_settings) >>> out = model.predict(hana_df) >>> out.head(3).collect() id CLOSEST_CLUSTER_1 ... CLOSEST_CLUSTER_3 DISTANCE_TO_CLOSEST_CENTROID_3 0 30 3 ... 4 0.730330 1 63 4 ... 1 0.851054 2 66 3 ... 4 0.730330
Retrieves the distances to all clusters:
>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'}) >>> out = model.predict(hana_df) >>> out.head(3).collect() id DISTANCE_TO_CENTROID_1 DISTANCE_TO_CENTROID_2 ... DISTANCE_TO_CENTROID_5 0 30 0.994595 0.877414 ... 0.782949 1 63 0.994595 0.985202 ... 0.782949 2 66 0.994595 0.877414 ... 0.782949
-
fit_predict
(self, data, key=None, label=None, features=None)¶ Fits a clustering model and uses it to generate prediction output on the training dataset.
- Parameters
- data: hana_ml DataFrame
The input dataset
- key: str, optional
The name of the ID column
- label: str
The name of the label column
- features: list of str, optional
The names of the feature columns. If features is not provided, all non-ID and non-label columns will be taken.
- Returns
- hana_ml DataFrame.
- The output is the same as the predict() method.
Notes
See the predict() method for how to obtain different outputs with the ‘extra_applyout_settings’ parameter.
-
get_metrics
(self)¶ Returns a dictionary containing the metrics about the model.
- Returns
- A dictionary object containing a set of clustering metrics and their values
Examples
>>> model.get_metrics() {'SimplifiedSilhouette': 0.14668968897882997, 'RSS': 24462.640041325714, 'IntraInertia': 3.2233573348587714, 'Frequency': { 1: 0.3167862345729914, 2: 0.35590005772243755, 3: 0.3273137077045711}, 'IntraInertia': {1: 0.7450335510518645, 2: 0.708350629565789, 3: 0.7006679558645009}, 'RSS': {1: 8586.511675872738, 2: 9171.723951617836, 3: 8343.554018434477}, 'SimplifiedSilhouette': {1: 0.13324659043317924, 2: 0.14182734764281074, 3: 0.1311620470933516}, 'TargetMean': {1: 0.1744734931009441, 2: 0.022912917070469333, 3: 0.3895408163265306}, 'TargetStandardDeviation': {1: 0.37951613049526484, 2: 0.14962591788119842, 3: 0.48764615116105525}, 'KL': {1: OrderedDict([('hours-per-week', 0.2971627592049324), ('occupation', 0.11944355994892383), ('relationship', 0.06772624975990414), ('education-num', 0.06377345492340795), ('education', 0.06377345492340793), ...
-
load_model
(self, schema_name, table_name, oid=None)¶ Loads the model from a table.
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- oid: str, optional
If the table contains several models, the OID must be given as an identifier. If it is not provided, the whole table is read.
Notes
Prior to using a reloaded model for a new prediction, it is necessary to re-specify the ‘label’ parameter. Otherwise, the predict() method will fail.
-
get_fit_operation_log
(self)¶ Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
- This table provides detailed logs of the last model training
-
get_indicators
(self)¶ Retrieves the Indicator table after model training.
- Returns
- The reference to the INDICATORS table: hana_ml DataFrame
- This table provides the performance metrics of the last model training
-
get_params
(self)¶ Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.
- Returns
- The attribute values of the model: dictionary
-
get_predict_operation_log
(self)¶ Retrieves the operation log table after the last prediction.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
- This table provides detailed logs about the last prediction
-
get_summary
(self)¶ Retrieves the summary table after model training.
- Returns
- The reference to the SUMMARY table: hana_ml DataFrame
- This contains execution summary of the last model training
-
is_fitted
(self)¶ Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.
- Returns
- bool
True if the model is ready to be saved.
-
save_artifact
(self, artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)¶ Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.
- Parameters
- schema_name: str
The schema name
- artifact_df: hana_ml DataFrame
The artifact created after fit or predict methods are called
- table_name: str
The table name
- if_exists: str. {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- The behavior when the table already exists:
fail: Raises a ValueError
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
Examples
>>> myModel.save_artifact( ... artifact_df=myModel.indicators_, ... schema_name='MySchema', ... table_name='MyModel_Indicators', ... if_exists='replace' ... )
-
save_model
(self, schema_name, table_name, if_exists='fail', new_oid=None)¶ Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists=’replace’), or an existing table (if_exists=’append’). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).
- Parameters
- schema_name: str
The schema name
- table_name: str
Table name
- if_exists: str. {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- The behavior when the table already exists:
fail: Raises an Error
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
- Returns
- None
- The model is saved into a table with the following columns:
“OID” NVARCHAR(50), – Serve as ID
“FORMAT” NVARCHAR(50), – APL technical info
“LOB” CLOB MEMORY THRESHOLD NULL – binary content of the model
hana_ml.algorithms.apl.time_series¶
This module contains the SAP HANA APL Time Series algorithm.
The following class is available:
-
class
hana_ml.algorithms.apl.time_series.
AutoTimeSeries
(conn_context, time_column_name=None, target=None, horizon=1, with_extra_predictable=True, last_training_time_point=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, train_data_=None, **other_params)¶ Bases:
hana_ml.algorithms.apl.apl_base.APLBase
SAP HANA APL Time Series algorithm.
- target: str
The name of the column containing the time series data points.
- time_column_name: str
The name of the column containing the time series time points. The time column is used as table key. It can be overridden by setting the ‘key’ parameter through the fit() method.
- last_training_time_point: str, optional
The last time point used for model training. The training dataset will contain all data points up to this date. By default, this parameter will be set as the last time point until which the target is not null.
- horizon: int, optional
The number of forecasts to be generated by the model upon apply. The time series model will be trained to optimize accuracy on the requested horizon only. The default value is 1.
- with_extra_predictable: bool, optional
If set to true, all input variables will be used by the model to generate forecasts. If set to false, only the time and target columns will be used. All other variables will be ignored. This parameter is set to true by default.
- variable_storages: dict, optional
Specifies the variable data types (string, integer, number). For example, {‘VAR1’: ‘string’, ‘VAR2’: ‘number’}. See notes below for more details.
- variable_value_types: dict, optional
Specifies the variable value types (continuous, nominal, ordinal). For example, {‘VAR1’: ‘continuous’, ‘VAR2’: ‘nominal’}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {‘VAR1’: ‘???’} means anytime the variable value equals ‘???’, it will be taken as missing.
- extra_applyout_settings: dict, optional
Specifies the prediction outputs. See documentation on predict() method for more details.
- other_params: dict, optional
Corresponds to the advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:
force_negative_forecast
force_positive_forecast
forecast_fallback_method
forecast_max_cyclics
forecast_max_lags
forecast_method
smoothing_cycle_length
See Common APL Aliases for Model Training in the SAP HANA APL Reference Guide.
Notes
The input dataset, given as a hana_ml DataFrame, must not be a temporary table because the API tries to create a view sorted by the time column, and SAP HANA does not allow views to be created on temporary tables.
When calling the fit_predict() method, the time series model is generated on the fly and not returned. If a model must be saved, please consider using the fit() method instead.
When extra-predictable variables are involved, it is usual to have a single dataset used both for the model training and the forecasting. In this case, the dataset should contain two successive periods:
The first one is used for the model training, ranging from the beginning to the last date where the target value is not null.
The second one is used for the forecasting, starting from the first date where the target value is null.
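The two periods can be separated by locating the last row whose target is not null. The sketch below does this on a plain list of rows, with no SAP HANA connection. The helper name is hypothetical, the column names follow the examples in this section, and the values are made up.

```python
def split_periods(rows, target):
    """Split time-ordered rows into (training, forecast) periods.

    The training period runs up to the last row whose target is not None;
    the forecast period is everything after it.
    """
    known = [i for i, row in enumerate(rows) if row[target] is not None]
    cut = known[-1] + 1 if known else 0
    return rows[:cut], rows[cut:]


# Hypothetical rows, already sorted by date; the values are made up.
rows = [
    {'Date': '2001-01-01', 'Cash': 10.0},
    {'Date': '2001-01-02', 'Cash': 12.5},
    {'Date': '2001-01-03', 'Cash': None},
    {'Date': '2001-01-04', 'Cash': None},
]
training, forecast = split_periods(rows, 'Cash')
print(len(training), len(forecast))
```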
The content of the output of the get_performance_metrics() method may change depending on the version of SAP HANA APL used with this API. Please refer to the SAP HANA APL documentation to find out which metrics are provided.
Examples
>>> from hana_ml.algorithms.apl.time_series import AutoTimeSeries >>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA
>>> CONN = ConnectionContext(HDB_HOST, HDB_PORT, HDB_USER, HDB_PASS) >>> # -- Creates Hana DataFrame >>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CASHFLOWS_FULL')
Creating and fitting the model
>>> model = AutoTimeSeries(CONN, time_column_name='Date', target='Cash', horizon=3) >>> model.fit(data=hana_df)
Debriefing
>>> model.get_model_components() {'Trend': 'Polynom( Date)', 'Cycles': 'PeriodicExtrasPred_MondayMonthInd', 'Fluctuations': 'AR(46)'}
>>> model.get_performance_metrics() {'MAPE': [0.12853715702893018, 0.12789963348617622, 0.12969031859857874], ...}
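The returned dictionary can be summarized with plain Python. The sketch below reuses the MAPE values printed above and assumes, as the three values obtained with horizon=3 suggest, that the list holds one value per forecast step. That interpretation is an assumption and should be checked against the SAP HANA APL documentation.

```python
# Values copied from the get_performance_metrics() output shown above.
performance = {
    'MAPE': [0.12853715702893018, 0.12789963348617622, 0.12969031859857874],
}

mape = performance['MAPE']
# Average error over all entries of the list.
mean_mape = sum(mape) / len(mape)
# 1-based position of the entry with the largest error.
worst_position = max(range(len(mape)), key=mape.__getitem__) + 1

print(round(mean_mape, 4), worst_position)
```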
Generating forecasts using the forecast() method
This method is used to generate forecasts using a signature similar to the one used in PAL. There are two variants of usage as described below:
1) If the model does not use extra-predictable variables (no exogenous variable), users must simply specify the number of forecasts.
>>> train_df = DataFrame(CONN,
...                      'SELECT "Date" , "Cash" '
...                      'from APL_SAMPLES.CASHFLOWS_FULL ORDER BY 1 LIMIT 100')
>>> model = AutoTimeSeries(CONN, time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(forecast_length=3)
>>> out.collect().tail(5)
           Date                            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
98   2001-05-23  3057.812544999999772699132909775  4593.966530              NaN              NaN
99   2001-05-25  3037.539714999999887176132440567  4307.893346              NaN              NaN
100  2001-05-26                              None  4206.023158     -3609.599872     12021.646187
101  2001-05-27                              None  4575.162651     -3392.283802     12542.609104
102  2001-05-28                              None  4830.352462     -3239.507360     12900.212284
2) If the model uses extra-predictable variables, users must provide the values of all extra-predictable variables for each time point of the forecast period. These values must be provided as a hana_ml dataframe with the same structure as the training dataset.
>>> # Training dataset with extra-predictable variables
>>> train_df = DataFrame(CONN,
...                      'SELECT * '
...                      'from APL_SAMPLES.CASHFLOWS_FULL '
...                      'WHERE "Cash" is not null')
>>> # Extra-predictable variables' values on the forecast period
>>> forecast_df = DataFrame(CONN,
...                         'SELECT * '
...                         'from APL_SAMPLES.CASHFLOWS_FULL '
...                         'WHERE "Cash" is null LIMIT 5')
>>> model = AutoTimeSeries(CONN, time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(data=forecast_df)
>>> out.collect().tail(5)
           Date ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
251  2001-12-29   None  6864.371407      -224.079492     13952.822306
252  2001-12-30   None  6889.515324      -211.264912     13990.295559
253  2001-12-31   None  6914.766513      -187.180923     14016.713949
254  2002-01-01   None  6940.124974              NaN              NaN
255  2002-01-02   None  6965.590706              NaN              NaN
Generating forecasts with the predict() method
The predict() method allows users to apply a fitted model on a dataset different from the training dataset. For example, users can train a model on the first quarter (January to March) and apply it to a dataset covering a different period (March to May).
>>> # Trains the model on the first quarter, from January to March
>>> train_df = DataFrame(CONN,
...                      'SELECT "Date" , "Cash" '
...                      'from APL_SAMPLES.CASHFLOWS_FULL '
...                      "where \"Date\" between '2001-01-01' and '2001-03-31'"
...                      " ORDER BY 1")
>>> model.fit(train_df)
>>> # Forecasts on a shifted period, from March to May
>>> test_df = DataFrame(CONN,
...                     'SELECT "Date", "Cash" '
...                     'from APL_SAMPLES.CASHFLOWS_FULL '
...                     "where \"Date\" between '2001-03-01' and '2001-05-31'"
...                     " ORDER BY 1")
>>> out = model.predict(test_df)
>>> out.collect().tail(5)
          Date                            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
60  2001-05-30  3837.196734000000105879735597214  4630.223083              NaN              NaN
61  2001-05-31  2911.884261000000151398126928726  4635.265982              NaN              NaN
62  2001-06-01                              None  4538.516542     -1087.461104     10164.494188
63  2001-06-02                              None  4848.815364     -5090.167255     14787.797983
64  2001-06-03                              None  4853.858263     -5138.553275     14846.269801
Using the fit_predict() method
This method enables the user to fit a model and generate forecasts in a single call, and thus get results faster. However, the model is created on the fly and deleted after use, so the user will not be able to save the resulting model.
>>> out = model.fit_predict(hana_df)
>>> out.collect().tail(5)
           Date            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999  6055.761105              NaN              NaN
250  2001-12-28  7111.41669699999  6314.336098              NaN              NaN
251  2002-01-03              None  7033.880804      4529.462710      9538.298899
252  2002-01-04              None  6464.557223      3965.343397      8963.771049
253  2002-01-07              None  6469.141663      3961.414900      8976.868427
Breaking down the time series into trend, cycles, fluctuations and residuals components
If the parameter extra_applyout_settings is set to {‘ExtraMode’: True}, any call to a forecast method (predict(), forecast() or fit_predict()) returns an output that also contains the time series components and their corresponding residuals. The prediction columns are suffixed with the horizon number. For instance, ‘Cycles_RESIDUALS_3’ is the residual of the cycles component for the third horizon.
>>> model.fit(train_df)
>>> model.set_params(extra_applyout_settings={'ExtraMode': True})
>>> out = model.predict(hana_df)
>>> out.collect().tail(5)
           Date                  ACTUAL  ...  Cycles_RESIDUALS_3  Fluctuations_RESIDUALS_3
249  2001-12-27  5995.42329499392507553  ...               32.51                  4.48e-13
250  2001-12-28  7111.41669699455205917  ...             -644.77                  1.14e-13
251  2002-01-03                    None  ...                 NaN                       NaN
252  2002-01-04                    None  ...                 NaN                       NaN
253  2002-01-07                    None  ...                 NaN                       NaN
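The horizon suffix convention described above can be handled with a small helper once the output has been collected. The following is a minimal sketch, not part of hana_ml: the function names split_horizon and columns_for_horizon are hypothetical, and the sketch assumes that a horizon suffix is always a trailing underscore followed by an integer.

```python
import re

# Matches names such as 'Cycles_RESIDUALS_3' or 'Trend_1': the part before
# the last underscore is the component, the trailing integer is the horizon.
_SUFFIXED = re.compile(r"^(?P<component>.+)_(?P<horizon>\d+)$")


def split_horizon(column_name):
    """Return (component, horizon), or (column_name, None) if unsuffixed."""
    match = _SUFFIXED.match(column_name)
    if match is None:
        return column_name, None
    return match.group("component"), int(match.group("horizon"))


def columns_for_horizon(column_names, horizon):
    """Keep only the columns that belong to the given horizon."""
    return [c for c in column_names if split_horizon(c)[1] == horizon]


columns = ["Date", "ACTUAL", "PREDICTED_1", "Trend_1", "Cycles_RESIDUALS_3"]
print(columns_for_horizon(columns, 1))
```

Columns without a numeric suffix, such as Date or LOWER_INT_95PCT, are returned unchanged with a horizon of None.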
Saving the model in the schema named ‘MODEL_STORAGE’
Please see model_storage class for further features of model storage.
>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model, if_exists='replace')
Reloading the model
>>> model2 = AutoTimeSeries(conn_context=CONN)
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
Predicting with the reloaded model
>>> # It is required to specify some attributes again
>>> model2.set_params(time_column_name='Date', target='Cash')
>>> hana_df = DataFrame(CONN,
...                     'SELECT "Date" , "Cash" '
...                     'from APL_SAMPLES.CASHFLOWS_FULL '
...                     'ORDER BY 1')
>>> out = model2.predict(hana_df, apply_horizon=3)
>>> out.collect().tail(5)
           Date            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999  6055.761105              NaN              NaN
250  2001-12-28  7111.41669699999  6314.336098              NaN              NaN
251  2002-01-03              None  7033.880804      4529.462710      9538.298899
252  2002-01-04              None  6464.557223      3965.343397      8963.771049
253  2002-01-07              None  6469.141663      3961.414900      8976.868427
Before calling forecast() on a reloaded model, users must set the training dataset again (train_data_ parameter).
>>> model2.set_params(train_data_=hana_df, time_column_name='Date', target='Cash')
>>> out = model2.forecast(forecast_length=3)
>>> out.collect().tail(5)
           Date            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999  6055.761105              NaN              NaN
250  2001-12-28  7111.41669699999  6314.336098              NaN              NaN
251  2002-01-03              None  7033.880804      4529.462710      9538.298899
252  2002-01-04              None  6464.557223      3965.343397      8963.771049
253  2002-01-07              None  6469.141663      3961.414900      8976.868427
- Attributes
- model_: hana_ml DataFrame
The trained model content
- summary_: APLArtifactTable
The reference to the “SUMMARY” table generated by the model training. This table contains the summary about the model training.
- indicators_: APLArtifactTable
The reference to the “INDICATORS” table generated by the model training. This table contains the various metrics related to the model and its variables.
- fit_operation_logs_: APLArtifactTable
The reference to the “OPERATION_LOG” table generated by the model training
- var_desc_: APLArtifactTable
The reference to the “VARIABLE_DESCRIPTION” table that was built during the model training
- applyout_: hana_ml DataFrame
The predictions generated the last time the model was applied
- predict_operation_logs_: APLArtifactTable
The reference to the “OPERATION_LOG” table that is produced when making predictions.
- train_data_: hana_ml DataFrame
The training dataset
Methods
fit
(self, data[, key, features])Fits the model.
fit_predict
(self, data[, key, features, horizon])Fits a model and generates forecasts in a single call to the FORECAST APL function.
forecast
(self[, forecast_length, data])Uses the fitted model to generate out-of-sample forecasts.
get_fit_operation_log
(self)Retrieves the operation log table after the model training.
get_indicators
(self)Retrieves the Indicator table after model training.
get_model_components
(self)Returns a dictionary containing the description of the model components, that is trend, cycles and fluctuations, used by the model to generate forecasts.
get_params
(self)Retrieves attributes of the current object.
get_performance_metrics
(self)Returns a dictionary containing the performance metrics of the model.
get_predict_operation_log
(self)Retrieves the operation log table after the last prediction.
get_summary
(self)Retrieves the summary table after model training.
is_fitted
(self)Checks if the model can be saved.
load_model
(self, schema_name, table_name[, oid])Loads the model from a table.
predict
(self, data[, apply_horizon, …])Uses the fitted model to generate forecasts.
save_artifact
(self, artifact_df, …[, …])Saves an artifact, a temporary table, into a permanent table.
save_model
(self, schema_name, table_name[, …])Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
set_params
(self, **parameters)Sets attributes of the current model.
-
set_params
(self, **parameters)¶ Sets attributes of the current model.
- Parameters
- parameters: dict
Contains attribute names and values in the form of keyword arguments
-
fit
(self, data, key=None, features=None)¶ Fits the model.
- Parameters
- data: hana_ml DataFrame
The training dataset
- key: str, optional
The column used as row identifier of the dataset. This column corresponds to the time column name. As a result, setting this parameter will overwrite the time_column_name model setting.
- features: list of str, optional
The names of the feature columns, meaning the date column and the extra-predictable variables. If features is not provided, it defaults to all columns except the target column.
- Returns
- self: object
-
predict
(self, data, apply_horizon=None, apply_last_time_point=None)¶ Uses the fitted model to generate forecasts.
- Parameters
- data: hana_ml DataFrame
The input dataset used for predictions
- apply_horizon: int, optional
The number of forecasts to generate. By default, the number of forecasts is the horizon on which the model was trained.
- apply_last_time_point: str, optional
The time point corresponding to the start of the forecast period. Forecasts will be generated starting from the next time point after the ‘apply_last_time_point’. By default, this parameter is set to the value of ‘last_training_time_point’ known from the model training.
- Returns
- hana_ml DataFrame
- By default the output contains the following columns:
<the name of the time column>
ACTUAL: the actual value of time series
PREDICTED: the forecast value
LOWER_INT_95PCT: the lower limit of 95% confidence interval
UPPER_INT_95PCT: the upper limit of 95% confidence interval
If ExtraMode is set to True, the output dataframe also contains a breakdown of the time series into trend, cycles, fluctuations and residuals components.
Examples
Default output
>>> out = model.predict(hana_df)
>>> out.collect().tail(5)
           Date            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999  6055.761105              NaN              NaN
250  2001-12-28  7111.41669699999  6314.336098              NaN              NaN
251  2002-01-03              None   7033.88080       4529.46271       9538.29889
252  2002-01-04              None   6464.55722       3965.34339       8963.77104
253  2002-01-07              None   6469.14166       3961.41490       8976.86842
Retrieving forecasts and components (predicted, trend, cycles and fluctuations). The output columns are suffixed with the horizon index. For example, Trend_1 means the trend component of the first horizon.
>>> model.set_params(extra_applyout_settings={'ExtraMode': True})
>>> out = model.predict(hana_df)
>>> out.collect().tail(5)
           Date                               ACTUAL  PREDICTED_1      Trend_1  ...
249  2001-12-27  5995.423294999999598076101392507553  6055.761105  6814.405390  ...
250  2001-12-28  7111.416696999999658146407455205917  6314.336098  6839.334762  ...
251  2002-01-03                                 None  7033.880804  6991.163710  ...
252  2002-01-04                                 None  6464.557223  7016.843985  ...
253  2002-01-07                                 None  6469.141663  7094.528433  ...
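Once the default output has been collected, the forecast rows can be separated from the rows that have an actual value. The sketch below is only illustrative: the rows are written as plain dictionaries instead of a pandas DataFrame, missing interval bounds are represented by None rather than NaN, and the helper names forecast_rows and interval_width are hypothetical. It also assumes, as in the sample outputs above, that forecast rows are the ones without an ACTUAL value.

```python
# Two rows taken from the default output shown above.
rows = [
    {"Date": "2001-12-28", "ACTUAL": 7111.41669699999, "PREDICTED": 6314.336098,
     "LOWER_INT_95PCT": None, "UPPER_INT_95PCT": None},
    {"Date": "2002-01-03", "ACTUAL": None, "PREDICTED": 7033.88080,
     "LOWER_INT_95PCT": 4529.46271, "UPPER_INT_95PCT": 9538.29889},
]


def forecast_rows(rows):
    """Keep the rows that have no actual value, that is the forecasts."""
    return [row for row in rows if row["ACTUAL"] is None]


def interval_width(row):
    """Width of the 95% confidence interval, or None when it is undefined."""
    lower, upper = row["LOWER_INT_95PCT"], row["UPPER_INT_95PCT"]
    if lower is None or upper is None:
        return None
    return upper - lower


for row in forecast_rows(rows):
    print(row["Date"], row["PREDICTED"], interval_width(row))
```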
-
fit_predict
(self, data, key=None, features=None, horizon=None)¶ Fits a model and generates forecasts in a single call to the FORECAST APL function. This method offers a faster way to perform the model training and forecasting.
However, the user will not have access to the model used internally since it is deleted after the computation of the forecasts.
- Parameters
- data: hana_ml DataFrame
The input time series dataset
- key: str, optional
The date column name. By default, it is equal to the model parameter time_column_name. If it is given, the model parameter time_column_name will be overwritten.
- features: list of str, optional
The column names corresponding to the extra-predictable variables (exogenous variables). If features is not provided, it defaults to all columns except the target column.
- horizon: int, optional
The number of forecasts to generate. The default value equals the horizon parameter of the model.
- Returns
- hana_ml DataFrame
The output is the same as the predict() method.
-
forecast
(self, forecast_length=None, data=None)¶ Uses the fitted model to generate out-of-sample forecasts. The model must already be fitted with a given dataset (training dataset). This method forecasts over a number of steps after the end of the training dataset. When there are extra-predictable variables (exogenous variables), the input parameter data is required. It must contain the values of the extra-predictable variables for the forecast period. If there is no extra-predictable variable, only the forecast_length parameter is needed.
- Parameters
- forecast_length: int, optional
The number of forecasts to generate from the end of the training dataset. By default, this parameter equals the horizon specified in the model parameters.
- data: hana_ml DataFrame, optional
The time series with extra-predictable variables used for forecasting. This parameter is required if extra-predictable variables are used in the model. When this parameter is given, the parameter ‘forecast_length’ is ignored.
- Returns
- hana_ml DataFrame
The output is the same as the predict() method.
Examples
Case where there is no extra-predictable variable:
>>> train_df = DataFrame(CONN,
...                      'SELECT "Date" , "Cash" '
...                      'from APL_SAMPLES.CASHFLOWS_FULL '
...                      'where "Cash" is not null '
...                      'ORDER BY 1')
>>> print(train_df.collect().tail(5))
           Date         Cash
246  2001-12-20  6382.441052
247  2001-12-21  5652.882539
248  2001-12-26  5081.372996
249  2001-12-27  5995.423295
250  2001-12-28  7111.416697
>>> model = AutoTimeSeries(CONN, time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(forecast_length=3)
>>> out.collect().tail(5)
           Date                        ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999901392507553  6814.405390              NaN              NaN
250  2001-12-28  7111.41669699999907455205917  6839.334762              NaN              NaN
251  2001-12-29                          None  6864.371407      -224.079492     13952.822306
252  2001-12-30                          None  6889.515324      -211.264912     13990.295559
253  2001-12-31                          None  6914.766513      -187.180923     14016.713949
Case where there are extra-predictable variables:
>>> train_df = DataFrame(CONN,
...                      'SELECT * '
...                      'from APL_SAMPLES.CASHFLOWS_FULL '
...                      'WHERE "Cash" is not null '
...                      'ORDER BY 1')
>>> print(train_df.collect().tail(5))
           Date  WorkingDaysIndices  ...  BeforeLastWMonth         Cash
246  2001-12-20                  13  ...                 1  6382.441052
247  2001-12-21                  14  ...                 1  5652.882539
248  2001-12-26                  15  ...                 0  5081.372996
249  2001-12-27                  16  ...                 0  5995.423295
250  2001-12-28                  17  ...                 0  7111.416697
>>> # Extra-predictable variables to be provided for the forecast period
>>> forecast_df = DataFrame(CONN,
...                         'SELECT * '
...                         'from APL_SAMPLES.CASHFLOWS_FULL '
...                         'WHERE "Cash" is null '
...                         'ORDER BY 1 '
...                         'LIMIT 3')
>>> print(forecast_df.collect())
         Date  WorkingDaysIndices  ...  BeforeLastWMonth  Cash
0  2002-01-03                   0  ...                 0  None
1  2002-01-04                   1  ...                 0  None
2  2002-01-07                   2  ...                 0  None
>>> model = AutoTimeSeries(CONN, time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(data=forecast_df)
>>> out.collect().tail(5)
           Date                          ACTUAL  PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.4232949999996101392507553    6814.41              NaN              NaN
250  2001-12-28  7111.4166969999996407455205917    6839.33              NaN              NaN
251  2001-12-29                            None    6864.37          -224.08         13952.82
252  2001-12-30                            None    6889.52          -211.26         13990.30
253  2001-12-31                            None    6914.77          -187.18         14016.71
-
get_model_components
(self)¶ Returns a dictionary containing the description of the model components, that is trend, cycles and fluctuations, used by the model to generate forecasts.
- Returns
- A dictionary with 3 possible keys: ‘Trend’, ‘Cycles’, ‘Fluctuations’. For example:
>>> model.get_model_components()
{
    "Trend": "Linear(TIME)",
    "Cycles": None,
    "Fluctuations": "AR(36)"
}
-
get_performance_metrics
(self)¶ Returns a dictionary containing the performance metrics of the model. The metrics are provided for each forecast horizon.
- Returns
- Dictionary
The dictionary contains the performance metrics of the current model. Each metric is associated with a list of <horizon> elements, holding the values of the metric measured for horizon 1 to <horizon>.
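Because each metric is a list indexed by horizon, it can be convenient to re-key it by horizon number. The sketch below is only an illustration: it reuses the MAPE values shown in the class examples, the helper names metric_by_horizon and best_horizon are hypothetical, and it assumes a metric for which lower values are better, which is the case for MAPE.

```python
# Shape of the dictionary returned by get_performance_metrics(),
# reduced to the MAPE values shown in the class examples.
metrics = {
    "MAPE": [0.12853715702893018, 0.12789963348617622, 0.12969031859857874],
}


def metric_by_horizon(metrics, name):
    """Map each horizon number (starting at 1) to the metric value."""
    return {h: value for h, value in enumerate(metrics[name], start=1)}


def best_horizon(metrics, name):
    """Horizon with the lowest value, assuming lower is better."""
    by_horizon = metric_by_horizon(metrics, name)
    return min(by_horizon, key=by_horizon.get)


print(metric_by_horizon(metrics, "MAPE"))
print(best_horizon(metrics, "MAPE"))
```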
-
load_model
(self, schema_name, table_name, oid=None)¶ Loads the model from a table.
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- oid: str, optional
If the table contains several models, the OID must be given as an identifier. If it is not provided, the whole table is read.
-
get_fit_operation_log
(self)¶ Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs of the last model training
-
get_indicators
(self)¶ Retrieves the Indicator table after model training.
- Returns
- The reference to the INDICATORS table: hana_ml DataFrame
This table provides the performance metrics of the last model training
-
get_params
(self)¶ Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.
- Returns
- The attribute-values of the model: dictionary
-
get_predict_operation_log
(self)¶ Retrieves the operation log table after the last prediction.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs about the last prediction
-
get_summary
(self)¶ Retrieves the summary table after model training.
- Returns
- The reference to the SUMMARY table: hana_ml DataFrame
This table contains the execution summary of the last model training
-
is_fitted
(self)¶ Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.
- Returns
- bool
True if the model is ready to be saved.
-
save_artifact
(self, artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)¶ Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.
- Parameters
- schema_name: str
The schema name
- artifact_df: hana_ml DataFrame
The artifact created after fit or predict methods are called
- table_name: str
The table name
- if_exists: str. {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- The behavior when the table already exists:
fail: Raises a ValueError
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
Examples
>>> myModel.save_artifact(
...     artifact_df=myModel.indicators_,
...     schema_name='MySchema',
...     table_name='MyModel_Indicators',
...     if_exists='replace'
... )
-
save_model
(self, schema_name, table_name, if_exists='fail', new_oid=None)¶ Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists=’replace’), or an existing table (if_exists=’append’). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).
- Parameters
- schema_name: str
The schema name
- table_name: str
Table name
- if_exists: str. {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- The behavior when the table already exists:
fail: Raises an Error
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
- Returns
- None
- The model is saved into a table with the following columns:
"OID" NVARCHAR(50), -- Serves as ID
"FORMAT" NVARCHAR(50), -- APL technical info
"LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model
hana_ml.algorithms.apl.gradient_boosting_classification¶
This module provides the SAP HANA APL gradient boosting classification algorithm.
The following classes are available:
-
class
hana_ml.algorithms.apl.gradient_boosting_classification.
GradientBoostingClassifier
(conn_context, early_stopping_patience=10, eval_metric='MultiClassLogLoss', learning_rate=0.05, max_depth=4, max_iterations=1000, number_of_jobs=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)¶ Bases:
hana_ml.algorithms.apl.gradient_boosting_classification._GradientBoostingClassifierBase
SAP HANA APL Gradient Boosting Multiclass Classifier algorithm.
- Parameters
- conn_context: ConnectionContext
The connection object to an SAP HANA database
- early_stopping_patience: int, optional
If the performance does not improve after early_stopping_patience iterations, the model training will stop before reaching max_iterations. The default value is 10.
- eval_metric: str, optional
The name of the metric used to evaluate the model performance on validation dataset along the boosting iterations. The possible values are ‘MultiClassClassificationError’ and ‘MultiClassLogLoss’. The default value is ‘MultiClassLogLoss’.
- learning_rate: float, optional
The weight parameter controlling the model regularization to avoid the risk of overfitting. A small value improves the model generalization to unseen data at the expense of the computational cost. The default value is 0.05.
- max_depth: int, optional
The maximum depth of the decision trees added as a base learner to the model at each boosting iteration. The default value is 4.
- max_iterations: int, optional
The maximum number of boosting iterations to fit the model. The default value is 1000.
- number_of_jobs: int, optional
The number of threads allocated to parallelize the model training and the model apply. The default value is 4.
- variable_storages: dict, optional
Specifies the variable data types (string, integer, number). For example, {‘VAR1’: ‘string’, ‘VAR2’: ‘number’}. See notes below for more details.
- variable_value_types: dict, optional
Specifies the variable value type (continuous, nominal, ordinal). For example, {‘VAR1’: ‘continuous’, ‘VAR2’: ‘nominal’}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {‘VAR1’: ‘???’} means that any time the variable value equals ‘???’, it will be taken as missing.
- extra_applyout_settings: dict, optional
Determines the output of the predict() method. The possible values are:
- By default (None value): the default output.
<KEY>: the key column if provided in the dataset
TRUE_LABEL: the class label if provided in the dataset
PREDICTED: the predicted label
PROBABILITY: the probability of the prediction (confidence)
- {‘APL/ApplyExtraMode’: ‘AllProbabilities’}: the probabilities for each class.
<KEY>: the key column if provided in the dataset
TRUE_LABEL: the class label if provided in the dataset
PREDICTED: the predicted label
PROBA_<label_value1>: the probability for the class <label_value1>
…
PROBA_<label_valueN>: the probability for the class <label_valueN>
- {‘APL/ApplyExtraMode’: ‘Individual Contributions’}: the feature importance for every sample.
<KEY>: the key column if provided in the dataset
TRUE_LABEL: the class label if provided in the dataset
PREDICTED: the predicted label
gb_contrib_<VAR1>: the contribution of the variable VAR1 to the score
…
gb_contrib_<VARN>: the contribution of the variable VARN to the score
gb_contrib_constant_bias: the constant bias contribution to the score
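The interplay between early_stopping_patience and max_iterations described in the parameters above can be pictured with a generic patience rule. This is a simplified sketch of the general technique, not SAP HANA APL's actual implementation: the function name stopping_iteration is hypothetical, the metric values are made up, and the sketch assumes a metric where lower is better, such as a log loss.

```python
def stopping_iteration(metric_values, patience, max_iterations):
    """Return (stop_iteration, best_iteration), both counted from 1.

    metric_values holds one validation metric value per boosting
    iteration; lower values are assumed to be better.
    """
    best_value = float("inf")
    best_iteration = 0
    considered = metric_values[:max_iterations]
    for iteration, value in enumerate(considered, start=1):
        if value < best_value:
            best_value, best_iteration = value, iteration
        elif iteration - best_iteration >= patience:
            # No improvement for `patience` iterations in a row: stop early.
            return iteration, best_iteration
    return len(considered), best_iteration


# Made-up validation losses for illustration only.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]
print(stopping_iteration(losses, patience=3, max_iterations=1000))
print(stopping_iteration(losses, patience=10, max_iterations=1000))
```

With a patience of 3, the sketch stops before seeing the later improvement; a larger patience lets the training continue, at a higher computational cost.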
Notes
It is highly recommended to specify a key column in the training dataset. Otherwise, once the model is trained, it will no longer be possible to define a key in any input dataset. The key is particularly useful to join the predictions output to the input dataset.
By default, if the variable description is not provided, SAP HANA APL guesses it by reading the first 100 rows, and the result may be incorrect. The user can overwrite the guessed description by explicitly setting the variable_storages, variable_value_types and variable_missing_strings parameters. For example:
model.set_params(
    variable_storages={
        'ID': 'integer',
        'sepal length (cm)': 'number'
    })
model.set_params(
    variable_value_types={
        'sepal length (cm)': 'continuous'
    })
model.set_params(
    variable_missing_strings={
        'sepal length (cm)': '-1'
    })
Examples
>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingClassifier
>>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA
>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN,
...                     'SELECT "id", "class", "capital-gain", '
...                     '"native-country" from APL_SAMPLES.CENSUS')
Creating and fitting the model
>>> model = GradientBoostingClassifier(conn_context=CONN)
>>> model.fit(hana_df, label='native-country', key='id')
Debriefing
>>> # Global performance metrics of the model
>>> model.get_performance_metrics()
{'BalancedErrorRate': 0.9761904761904762, 'BalancedClassificationRate': 0.023809523809523808, ...
>>> # Performance metrics of the model for each class
>>> model.get_metrics_per_class()
{'Precision': {'Cambodia': 0.0, 'Canada': 0.0, 'China': 0.0, 'Columbia': 0.0...
>>> model.get_feature_importances()
{'Gain': OrderedDict([('class', 0.7713800668716431), ('capital-gain', 0.22861991822719574)])}
Making predictions
>>> # Default output
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
   id     TRUE_LABEL      PREDICTED  PROBABILITY
0  30  United-States  United-States      0.89051
1  63  United-States  United-States      0.89051
2  66  United-States  United-States      0.89051
>>> # All probabilities
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'AllProbabilities'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
          id     TRUE_LABEL      PREDICTED   PROBA_?  PROBA_Cambodia  ...
35194  19272  United-States  United-States  0.016803        0.000595  ...
20186  39624  United-States  United-States  0.017564        0.001063  ...
43892  38759  United-States  United-States  0.019812        0.000353  ...
>>> # Individual Contributions
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
   id     TRUE_LABEL      PREDICTED  gb_contrib_class  gb_contrib_capital-gain  ...
0  30  United-States  United-States         -0.025366                -0.014416  ...
1  63  United-States  United-States         -0.025366                -0.014416  ...
2  66  United-States  United-States         -0.025366                -0.014416  ...
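With the 'AllProbabilities' output, the per-class probabilities are spread over the PROBA_<label_value> columns. The sketch below shows one way to gather them back into a dictionary for a single collected row. It is only an illustration: the row, its class labels and its probability values are hypothetical, and the helper names class_probabilities and most_probable are not part of hana_ml.

```python
PREFIX = "PROBA_"

# Hypothetical collected row with made-up class labels and probabilities.
row = {
    "id": 1,
    "TRUE_LABEL": "B",
    "PREDICTED": "B",
    "PROBA_A": 0.2,
    "PROBA_B": 0.7,
    "PROBA_C": 0.1,
}


def class_probabilities(row):
    """Collect the PROBA_<label> columns of a row as {label: probability}."""
    return {
        name[len(PREFIX):]: value
        for name, value in row.items()
        if name.startswith(PREFIX)
    }


def most_probable(row):
    """Label with the highest probability among the PROBA_ columns."""
    probabilities = class_probabilities(row)
    return max(probabilities, key=probabilities.get)


print(class_probabilities(row))
print(most_probable(row) == row["PREDICTED"])
```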
Saving the model in the schema named ‘MODEL_STORAGE’
>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model, if_exists='replace')
Reloading the model for new predictions
>>> model2 = model_storage.load_model(name='My model name')
>>> out2 = model2.predict(data=hana_df)
Please see model_storage class for further features of model storage
- Attributes
- label: str
The target column name. This attribute is set when the fit() method is called.
- model_: hana_ml DataFrame
The trained model content
- summary_: APLArtifactTable
The reference to the “SUMMARY” table generated by the model training. This table contains the content of the model training summary.
- indicators_: APLArtifactTable
The reference to the “INDICATORS” table generated by the model training. This table contains various metrics related to the model and model variables.
- fit_operation_logs_: APLArtifactTable
The reference to the “OPERATION_LOG” table generated by the model training
- var_desc_: APLArtifactTable
The reference to the “VARIABLE_DESCRIPTION” table that was built during the model training
- applyout_: hana_ml DataFrame
The predictions generated the last time the model was applied
- predict_operation_logs_: APLArtifactTable
The reference to the “OPERATION_LOG” table when a prediction was made
Methods
fit
(self, data[, key, features, label])Fits the model.
get_best_iteration
(self)Returns the iteration that has provided the best performance on the validation dataset during the model training.
get_evalmetrics
(self)Returns the values of the evaluation metric at each iteration.
get_feature_importances
(self)Returns the feature importances.
get_fit_operation_log
(self)Retrieves the operation log table after the model training.
get_indicators
(self)Retrieves the Indicator table after model training.
get_metrics_per_class
(self)Returns the performance for each class.
get_params
(self)Retrieves attributes of the current object.
get_performance_metrics
(self)Returns the performance metrics of the last trained model.
get_predict_operation_log
(self)Retrieves the operation log table after the last prediction.
get_summary
(self)Retrieves the summary table after model training.
is_fitted
(self)Checks if the model can be saved.
load_model
(self, schema_name, table_name[, oid])Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
predict
(self, data)Makes predictions with the fitted model.
save_artifact
(self, artifact_df, …[, …])Saves an artifact, a temporary table, into a permanent table.
save_model
(self, schema_name, table_name[, …])Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
score
(self, data)Returns the mean accuracy on the provided test dataset.
set_params
(self, **parameters)Sets attributes of the current model.
-
set_params
(self, **parameters)¶ Sets attributes of the current model.
- Parameters
- parameters: dict
The names and values of the attributes to change
-
get_metrics_per_class
(self)¶ Returns the performance for each class.
- Returns
- A dictionary.
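Judging from the class example, the returned dictionary is keyed by metric name first and by class label second. The sketch below regroups such a dictionary by class, which can be easier to read when there are many classes. It is only an illustration: the helper name metrics_by_class is hypothetical, and the 'Recall' entry is a made-up second metric added to show the regrouping.

```python
# Shape seen in the class example; 'Recall' is a hypothetical second metric.
per_metric = {
    "Precision": {"Cambodia": 0.0, "Canada": 0.0},
    "Recall": {"Cambodia": 0.0, "Canada": 0.0},
}


def metrics_by_class(per_metric):
    """Turn {metric: {class: value}} into {class: {metric: value}}."""
    per_class = {}
    for metric_name, class_values in per_metric.items():
        for class_label, value in class_values.items():
            per_class.setdefault(class_label, {})[metric_name] = value
    return per_class


print(metrics_by_class(per_metric)["Canada"])
```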
-
fit
(self, data, key=None, features=None, label=None)¶ Fits the model.
- Parameters
- data: DataFrame
The training dataset
- key: str, optional
The name of the ID column. This column will not be used as a feature in the model. It will be output as the row ID when predictions are made with the model. If key is not provided, an internal key is created, but this is not the recommended usage. See notes below.
- features: list of str, optional
The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.
- label: str, optional
The name of the label column. Default is the last column.
- Returns
- self: object
Notes
It is highly recommended to specify a key column in the training dataset. Otherwise, once the model is trained, it will no longer be possible to define a key in any input dataset. This makes it particularly inconvenient to join the predictions output to the input dataset.
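The benefit of a key column can be shown with a small join between input rows and prediction rows. This is a minimal pure-Python sketch, not part of hana_ml: the helper name join_on_key is hypothetical, the capital-gain values are made up, and the prediction rows mimic the default output shown in the class examples.

```python
# Hypothetical input rows; the capital-gain values are made up.
input_rows = [
    {"id": 30, "capital-gain": 0},
    {"id": 63, "capital-gain": 2174},
    {"id": 99, "capital-gain": 0},
]

# Prediction rows shaped like the default predict() output.
prediction_rows = [
    {"id": 30, "PREDICTED": "United-States", "PROBABILITY": 0.89051},
    {"id": 63, "PREDICTED": "United-States", "PROBABILITY": 0.89051},
]


def join_on_key(input_rows, prediction_rows, key):
    """Inner join of input rows and prediction rows on the key column."""
    predictions = {row[key]: row for row in prediction_rows}
    joined = []
    for row in input_rows:
        match = predictions.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


for row in join_on_key(input_rows, prediction_rows, "id"):
    print(row)
```

Without a key in the predictions output, there is no reliable column to match on, which is why specifying the key at training time is recommended.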
-
get_best_iteration
(self)¶ Returns the iteration that has provided the best performance on the validation dataset during the model training.
- Returns
- The best iteration: int
-
get_evalmetrics
(self)¶ Returns the values of the evaluation metric at each iteration. These values are based on the estimation dataset.
- Returns
- A dictionary:
{‘<MetricName>’: <List of values>}
-
get_feature_importances
(self)¶ Returns the feature importances.
- Returns
- feature importances: dict
{ <importance_metric>: OrderedDict({ <feature_name>: <value> }) }
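A minimal sketch of how the returned structure might be consumed, assuming a dictionary that maps an importance metric to an ordered mapping of feature names to values. The sample values and the helper name top_features are hypothetical.

```python
from collections import OrderedDict

# Hypothetical return value shaped like the structure documented above.
importances = {
    "Gain": OrderedDict([
        ("relationship", 0.39),
        ("education-num", 0.15),
        ("age", 0.21),
    ])
}


def top_features(importances, metric, n):
    """Return the n (feature, value) pairs with the largest importance."""
    ranked = sorted(importances[metric].items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


top_two = top_features(importances, "Gain", 2)
print(top_two)
```

Sorting explicitly avoids relying on the order in which the features happen to be stored.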
-
get_fit_operation_log
(self)¶ Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
- This table provides detailed logs of the last model training
-
get_indicators
(self)¶ Retrieves the Indicator table after model training.
- Returns
- The reference to the INDICATORS table: hana_ml DataFrame
- This table provides the performance metrics of the last model training
-
get_params
(self)¶ Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.
- Returns
- The attribute values of the model: dict
-
get_performance_metrics
(self)¶ Returns the performance metrics of the last trained model.
- Returns
- A dictionary with metric name as key and metric value as value.
-
get_predict_operation_log
(self)¶ Retrieves the operation log table after the prediction.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
- This table provides detailed logs about the last prediction
-
get_summary
(self)¶ Retrieves the summary table after model training.
- Returns
- The reference to the SUMMARY table: hana_ml DataFrame
- This table contains the execution summary of the last model training
-
is_fitted
(self)¶ Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.
- Returns
- bool
True if the model is ready to be saved.
-
load_model
(self, schema_name, table_name, oid=None)¶ Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
Loads the model from a table.
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- oid: str, optional
If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.
-
predict
(self, data)¶ Makes predictions with the fitted model. It is possible to add special outputs, such as variable individual contributions, through the ‘extra_applyout_settings’ parameter in the model. This parameter is described with examples in the class section.
- Parameters
- data: hana_ml DataFrame
The input dataset used for prediction
- Returns
- Prediction output: hana_ml DataFrame
- The default output is (if the model ‘extra_applyout_settings’ parameter is unset):
ID: the key column
TRUE_LABEL: the true label if it is given in the input dataset
PREDICTED: the predicted label
PROBABILITY: the probability of the predicted label
- In multinomial classification, users can request the probabilities of all classes by setting the parameter ‘extra_applyout_settings’ to {‘APL/ApplyExtraMode’: ‘AllProbabilities’}. The output will be:
ID: the key column
TRUE_LABEL: the true label if it is given in the input dataset
PREDICTED: the predicted label
PROBA_<class_1>: the probability of the class <class_1>
…
PROBA_<class_n>: the probability of the class <class_n>
- To get the individual contributions of each variable for each individual sample, the ‘extra_applyout_settings’ parameter must be set to {‘APL/ApplyExtraMode’: ‘Individual Contributions’}. The output will contain the following columns:
ID: the key column
TRUE_LABEL: the actual label
PREDICTED: the predicted label
gb_contrib_<VAR1>: the contribution of the variable <VAR1> to the score
…
gb_contrib_<VARN>: the contribution of the variable <VARN> to the score
gb_contrib_constant_bias: the constant bias contribution to the score
- Users can also set APL/ApplyExtraMode to other values, for instance: ‘extra_applyout_settings’ = {‘APL/ApplyExtraMode’: ‘BestProbabilityAndDecision’}.
- New SAP HANA APL settings may be provided over time, so please check the SAP HANA APL documentation to know which settings are available. See Function Reference > Predictive Model Services > APPLY_MODEL > OPERATION_CONFIG Parameters in the SAP HANA APL Reference Guide.
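One way to read an ‘Individual Contributions’ output is to rank the gb_contrib_ columns of a single row by absolute value. The sketch below works on a plain dictionary standing in for one collected row; the values and the helper name rank_contributions are hypothetical, and only the column naming follows the description above.

```python
# Hypothetical row from an 'Individual Contributions' output, as a plain dict.
row = {
    "ID": 18448,
    "TRUE_LABEL": 0,
    "PREDICTED": 0,
    "gb_contrib_age": -1.10,
    "gb_contrib_workclass": -0.001,
    "gb_contrib_fnlwgt": 0.06,
    "gb_contrib_constant_bias": -0.25,
}

PREFIX = "gb_contrib_"
BIAS = "gb_contrib_constant_bias"


def rank_contributions(row):
    """Return (variable, contribution) pairs sorted by absolute contribution."""
    pairs = [
        (name[len(PREFIX):], value)
        for name, value in row.items()
        if name.startswith(PREFIX) and name != BIAS
    ]
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)


ranking = rank_contributions(row)
print(ranking)
```

The constant bias is excluded from the ranking because it is not tied to any variable.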
-
save_artifact
(self, artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)¶ Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.
- Parameters
- schema_name: str
The schema name
- artifact_df: hana_ml DataFrame
The artifact created after fit or predict methods are called
- table_name: str
The table name
- if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- The behavior when the table already exists:
fail: Raises a ValueError
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str, optional
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
Examples
>>> myModel.save_artifact(
...     artifact_df=myModel.indicators_,
...     schema_name='MySchema',
...     table_name='MyModel_Indicators',
...     if_exists='replace'
... )
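The three if_exists modes can be pictured with a small in-memory stand-in. This is only a toy model of the documented behavior, not how hana_ml implements it; the function save_rows and the tables dictionary are hypothetical.

```python
def save_rows(tables, table_name, rows, if_exists="fail"):
    """Toy model of the if_exists modes using a dict of lists as the 'database'."""
    if if_exists not in ("fail", "replace", "append"):
        raise ValueError(f"unknown if_exists value: {if_exists!r}")
    if table_name in tables:
        if if_exists == "fail":
            raise ValueError(f"table {table_name!r} already exists")
        if if_exists == "replace":
            tables[table_name] = []  # drop existing content before inserting
    else:
        tables[table_name] = []
    tables[table_name].extend(rows)


tables = {}
save_rows(tables, "MyModel_Indicators", ["row1"])  # table does not exist yet
save_rows(tables, "MyModel_Indicators", ["row2"], if_exists="append")
appended = list(tables["MyModel_Indicators"])  # row1 is kept
save_rows(tables, "MyModel_Indicators", ["row3"], if_exists="replace")
replaced = list(tables["MyModel_Indicators"])  # row1 and row2 are gone

try:
    save_rows(tables, "MyModel_Indicators", ["row4"])  # default mode is 'fail'
    failed = False
except ValueError:
    failed = True
```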
-
save_model
(self, schema_name, table_name, if_exists='fail', new_oid=None)¶ Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists=’replace’), or an existing table (if_exists=’append’). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).
- Parameters
- schema_name: str
The schema name
- table_name: str
Table name
- if_exists: str. {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- The behavior when the table already exists:
fail: Raises an Error
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
- Returns
- None
- The model is saved into a table with the following columns:
“OID” NVARCHAR(50), – Serve as ID
“FORMAT” NVARCHAR(50), – APL technical info
“LOB” CLOB MEMORY THRESHOLD NULL – binary content of the model
-
score
(self, data)¶ Returns the mean accuracy on the provided test dataset.
- Parameters
- data: hana_ml DataFrame
The test dataset used to compute the score. The labels must be provided in the dataset.
- Returns
- mean accuracy: float
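For reference, mean accuracy is the fraction of rows whose predicted label equals the true label. The sketch below computes it on hypothetical label lists in plain Python; it illustrates the metric and is not the code that score() runs.

```python
def mean_accuracy(true_labels, predicted_labels):
    """Fraction of rows whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one row is required")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Hypothetical labels, e.g. the TRUE_LABEL and PREDICTED columns of a prediction output.
true_labels = [0, 1, 0, 1]
predicted = [0, 1, 1, 1]
accuracy = mean_accuracy(true_labels, predicted)
print(accuracy)
```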
-
class
hana_ml.algorithms.apl.gradient_boosting_classification.
GradientBoostingBinaryClassifier
(conn_context, early_stopping_patience=10, eval_metric='LogLoss', learning_rate=0.05, max_depth=4, max_iterations=1000, number_of_jobs=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)¶ Bases:
hana_ml.algorithms.apl.gradient_boosting_classification._GradientBoostingClassifierBase
SAP HANA APL Gradient Boosting Binary Classifier algorithm. It is very similar to GradientBoostingClassifier, the multiclass classifier. Its particularity lies in the provided metrics which are specific to binary classification.
- Parameters
- conn_context: ConnectionContext
The connection object to an SAP HANA database
- early_stopping_patience: int, optional
If the performance does not improve after early_stopping_patience iterations, the model training will stop before reaching max_iterations. The default value is 10.
- eval_metric: str, optional
The name of the metric used to evaluate the model performance on validation dataset along the boosting iterations. The possible values are ‘LogLoss’,’AUC’ and ‘ClassificationError’. The default value is ‘LogLoss’.
- learning_rate: float, optional
The weight parameter controlling the model regularization to reduce the risk of overfitting. A small value improves the model's generalization to unseen data at the expense of computational cost. The default value is 0.05.
- max_depth: int, optional
The maximum depth of the decision trees added as a base learner to the model at each boosting iteration. The default value is 4.
- max_iterations: int, optional
The maximum number of boosting iterations to fit the model. The default value is 1000.
- number_of_jobs: int, optional
The number of threads allocated to the model training and apply parallelization. The default value is 4.
- variable_storages: dict, optional
Specifies the variable data types (string, integer, number). For example, {‘VAR1’: ‘string’, ‘VAR2’: ‘number’}. See notes below for more details.
- variable_value_types: dict, optional
Specifies the variable value type (continuous, nominal, ordinal). For example, {‘VAR1’: ‘continuous’, ‘VAR2’: ‘nominal’}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {‘VAR1’: ‘???’} means that any time the variable value equals ‘???’, it will be taken as missing.
- extra_applyout_settings: dict, optional
Determines the output of the predict() method. The possible values are:
- By default (None value): the default output.
<KEY>: the key column if provided in the dataset
TRUE_LABEL: the class label if provided in the dataset
PREDICTED: the predicted label
PROBABILITY: the probability of the prediction (confidence)
- {‘APL/ApplyExtraMode’: ‘Individual Contributions’}: the individual contributions of each variable to the score. The output is:
<KEY>: the key column if provided in the dataset
TRUE_LABEL: the class label if provided in the dataset
gb_contrib_<VAR1>: the contribution of the variable VAR1 to the score
…
gb_contrib_<VARN>: the contribution of the variable VARN to the score
gb_contrib_constant_bias: the constant bias contribution to the score
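The interaction between early_stopping_patience and max_iterations can be sketched with a generic early-stopping loop. This is an illustration of the general technique under stated assumptions (a lower-is-better validation metric, hypothetical metric values, a hypothetical function name); it is not SAP HANA APL's actual training code.

```python
def run_with_early_stopping(validation_metric_per_iteration, patience, max_iterations):
    """Return (best_iteration, iterations_run) for a lower-is-better metric.

    Generic early-stopping loop; not SAP HANA APL's actual implementation.
    """
    best_value = float("inf")
    best_iteration = -1
    iterations_run = 0
    for iteration, value in enumerate(validation_metric_per_iteration[:max_iterations]):
        iterations_run = iteration + 1
        if value < best_value:
            best_value, best_iteration = value, iteration
        elif iteration - best_iteration >= patience:
            break  # no improvement for `patience` consecutive iterations
    return best_iteration, iterations_run


# Hypothetical validation LogLoss values, one per boosting iteration.
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.44, 0.48, 0.49, 0.50]
best_iteration, iterations_run = run_with_early_stopping(
    losses, patience=2, max_iterations=1000
)
print(best_iteration, iterations_run)
```

With a patience of 2 the loop stops before it sees the later, lower value, which is the trade-off a small patience makes: less computation, at the risk of stopping too early.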
Examples
>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA
>>> CONN = ConnectionContext(HDB_HOST, HDB_PORT, HDB_USER, HDB_PASS)
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN, 'SELECT * from APL_SAMPLES.CENSUS')
Creating and fitting the model
>>> model = GradientBoostingBinaryClassifier(conn_context=CONN)
>>> model.fit(hana_df, label='class', key='id')
Debriefing
>>> # Global performance metrics of the model
>>> model.get_performance_metrics()
{'LogLoss': 0.2567069689038737, 'PredictivePower': 0.8529, 'PredictionConfidence': 0.9759, ...}
>>> model.get_feature_importances()
{'Gain': OrderedDict([('relationship', 0.3866586685180664), ('education-num', 0.1502334326505661)...
Making predictions
>>> # Default output
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().sample(3) # returns the output as a pandas DataFrame
          id  TRUE_LABEL  PREDICTED  PROBABILITY
44903  41211           0          0     0.871326
47878  36020           1          1     0.993455
17549   6601           0          1     0.673872
>>> # Individual Contributions
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().sample(3) # returns the output as a pandas DataFrame
      id  TRUE_LABEL  gb_contrib_age  gb_contrib_workclass  gb_contrib_fnlwgt  ...
0  18448           0       -1.098452             -0.001238           0.060850  ...
1  18457           0       -0.731512             -0.000448           0.020060  ...
2  18540           0       -0.024523              0.027065           0.158083  ...
Saving the model in the schema named ‘MODEL_STORAGE’. Please see model_storage class for further features of model storage.
>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model, if_exists='replace')
Reloading the model for new predictions
>>> model2 = model_storage.load_model(name='My model name')
>>> out2 = model2.predict(data=hana_df)
- Attributes
- label: str
The target column name. This attribute is set when the fit() method is called.
- model_: hana_ml DataFrame
The trained model content
- summary_: APLArtifactTable
The reference to the “SUMMARY” table generated by the model training. This table contains the content of the model training summary.
- indicators_: APLArtifactTable
The reference to the “INDICATORS” table generated by the model training. This table contains various metrics related to the model and model variables.
- fit_operation_logs_: APLArtifactTable
The reference to the “OPERATION_LOG” table generated by the model training
- var_desc_: APLArtifactTable
The reference to the “VARIABLE_DESCRIPTION” table that was built during the model training
- applyout_: hana_ml DataFrame
The predictions generated the last time the model was applied
- predict_operation_logs_: APLArtifactTable
The reference to the “OPERATION_LOG” table when a prediction was made
Methods
fit
(self, data[, key, features, label])Fits the model.
get_best_iteration
(self)Returns the iteration that has provided the best performance on the validation dataset during the model training.
get_evalmetrics
(self)Returns the values of the evaluation metric at each iteration.
get_feature_importances
(self)Returns the feature importances.
get_fit_operation_log
(self)Retrieves the operation log table after the model training.
get_indicators
(self)Retrieves the Indicator table after model training.
get_params
(self)Retrieves attributes of the current object.
get_performance_metrics
(self)Returns the performance metrics of the last trained model.
get_predict_operation_log
(self)Retrieves the operation log table after the prediction.
get_summary
(self)Retrieves the summary table after model training.
is_fitted
(self)Checks if the model can be saved.
load_model
(self, schema_name, table_name[, oid])Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
predict
(self, data)Makes predictions with the fitted model.
save_artifact
(self, artifact_df, …[, …])Saves an artifact, a temporary table, into a permanent table.
save_model
(self, schema_name, table_name[, …])Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
score
(self, data)Returns the mean accuracy on the provided test dataset.
set_params
(self, **parameters)Sets attributes of the current model.
-
set_params
(self, **parameters)¶ Sets attributes of the current model.
- Parameters
- parameters: dict
The attribute names and values
-
get_performance_metrics
(self)¶ Returns the performance metrics of the last trained model.
- Returns
- A dictionary with metric name as key and metric value as value.
-
fit
(self, data, key=None, features=None, label=None)¶ Fits the model.
- Parameters
- data: DataFrame
The training dataset
- key: str, optional
The name of the ID column. This column is not used as a feature in the model; it is output as the row ID when predictions are made with the model. If key is not provided, an internal key is created, but this is not recommended. See the notes below.
- features: list of str, optional
The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns are used.
- label: str, optional
The name of the label column. Default is the last column.
- Returns
- self: object
Notes
It is highly recommended to specify a key column in the training dataset. Otherwise, once the model is trained, it is no longer possible to define a key in any input dataset, which makes it difficult to join the prediction output back to the input dataset.
-
get_best_iteration
(self)¶ Returns the iteration that has provided the best performance on the validation dataset during the model training.
- Returns
- The best iteration: int
-
get_evalmetrics
(self)¶ Returns the values of the evaluation metric at each iteration. These values are based on the estimation dataset.
- Returns
- A dictionary:
{‘<MetricName>’: <List of values>}
-
get_feature_importances
(self)¶ Returns the feature importances.
- Returns
- feature importances: dict
{ <importance_metric>: OrderedDict({ <feature_name>: <value> }) }
-
get_fit_operation_log
(self)¶ Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
- This table provides detailed logs of the last model training
-
get_indicators
(self)¶ Retrieves the Indicator table after model training.
- Returns
- The reference to the INDICATORS table: hana_ml DataFrame
- This table provides the performance metrics of the last model training
-
get_params
(self)¶ Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.
- Returns
- The attribute values of the model: dict
-
get_predict_operation_log
(self)¶ Retrieves the operation log table after the prediction.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
- This table provides detailed logs about the last prediction
-
get_summary
(self)¶ Retrieves the summary table after model training.
- Returns
- The reference to the SUMMARY table: hana_ml DataFrame
- This table contains the execution summary of the last model training
-
is_fitted
(self)¶ Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.
- Returns
- bool
True if the model is ready to be saved.
-
load_model
(self, schema_name, table_name, oid=None)¶ Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
Loads the model from a table.
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- oid: str, optional
If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.
-
predict
(self, data)¶ Makes predictions with the fitted model. It is possible to add special outputs, such as variable individual contributions, through the ‘extra_applyout_settings’ parameter in the model. This parameter is described with examples in the class section.
- Parameters
- data: hana_ml DataFrame
The input dataset used for prediction
- Returns
- Prediction output: hana_ml DataFrame
- The default output is (if the model ‘extra_applyout_settings’ parameter is unset):
ID: the key column
TRUE_LABEL: the true label if it is given in the input dataset
PREDICTED: the predicted label
PROBABILITY: the probability of the predicted label
- In multinomial classification, users can request the probabilities of all classes by setting the parameter ‘extra_applyout_settings’ to {‘APL/ApplyExtraMode’: ‘AllProbabilities’}. The output will be:
ID: the key column
TRUE_LABEL: the true label if it is given in the input dataset
PREDICTED: the predicted label
PROBA_<class_1>: the probability of the class <class_1>
…
PROBA_<class_n>: the probability of the class <class_n>
- To get the individual contributions of each variable for each individual sample, the ‘extra_applyout_settings’ parameter must be set to {‘APL/ApplyExtraMode’: ‘Individual Contributions’}. The output will contain the following columns:
ID: the key column
TRUE_LABEL: the actual label
PREDICTED: the predicted label
gb_contrib_<VAR1>: the contribution of the variable <VAR1> to the score
…
gb_contrib_<VARN>: the contribution of the variable <VARN> to the score
gb_contrib_constant_bias: the constant bias contribution to the score
- Users can also set APL/ApplyExtraMode to other values, for instance: ‘extra_applyout_settings’ = {‘APL/ApplyExtraMode’: ‘BestProbabilityAndDecision’}.
- New SAP HANA APL settings may be provided over time, so please check the SAP HANA APL documentation to know which settings are available. See Function Reference > Predictive Model Services > APPLY_MODEL > OPERATION_CONFIG Parameters in the SAP HANA APL Reference Guide.
-
save_artifact
(self, artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)¶ Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.
- Parameters
- schema_name: str
The schema name
- artifact_df: hana_ml DataFrame
The artifact created after fit or predict methods are called
- table_name: str
The table name
- if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- The behavior when the table already exists:
fail: Raises a ValueError
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str, optional
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
Examples
>>> myModel.save_artifact(
...     artifact_df=myModel.indicators_,
...     schema_name='MySchema',
...     table_name='MyModel_Indicators',
...     if_exists='replace'
... )
-
save_model
(self, schema_name, table_name, if_exists='fail', new_oid=None)¶ Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists=’replace’), or an existing table (if_exists=’append’). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).
- Parameters
- schema_name: str
The schema name
- table_name: str
Table name
- if_exists: str. {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- The behavior when the table already exists:
fail: Raises an Error
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
- Returns
- None
- The model is saved into a table with the following columns:
“OID” NVARCHAR(50), – Serve as ID
“FORMAT” NVARCHAR(50), – APL technical info
“LOB” CLOB MEMORY THRESHOLD NULL – binary content of the model
-
score
(self, data)¶ Returns the mean accuracy on the provided test dataset.
- Parameters
- data: hana_ml DataFrame
The test dataset used to compute the score. The labels must be provided in the dataset.
- Returns
- mean accuracy: float
hana_ml.algorithms.apl.gradient_boosting_regression¶
This module provides the SAP HANA APL gradient boosting regression algorithm.
The following classes are available:
-
class
hana_ml.algorithms.apl.gradient_boosting_regression.
GradientBoostingRegressor
(conn_context, early_stopping_patience=10, eval_metric='RMSE', learning_rate=0.05, max_depth=4, max_iterations=1000, number_of_jobs=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)¶ Bases:
hana_ml.algorithms.apl.gradient_boosting_base.GradientBoostingBase
SAP HANA APL Gradient Boosting Regression algorithm.
- Parameters
- conn_context: ConnectionContext
The connection object to an SAP HANA database
- early_stopping_patience: int, optional
If the performance does not improve after early_stopping_patience iterations, the model training will stop before reaching max_iterations. The default value is 10.
- eval_metric: str, optional
The name of the metric used to evaluate the model performance on validation dataset along the boosting iterations. The possible values are ‘MAE’ and ‘RMSE’. The default value is ‘RMSE’.
- learning_rate: float, optional
The weight parameter controlling the model regularization to reduce the risk of overfitting. A small value improves the model's generalization to unseen data at the expense of computational cost. The default value is 0.05.
- max_depth: int, optional
The maximum depth of the decision trees added as a base learner to the model at each boosting iteration. The default value is 4.
- max_iterations: int, optional
The maximum number of boosting iterations to fit the model. The default value is 1000.
- number_of_jobs: int, optional
The number of threads allocated to the model training and apply parallelization. The default value is 4.
- variable_storages: dict, optional
Specifies the variable data types (string, integer, number). For example, {‘VAR1’: ‘string’, ‘VAR2’: ‘number’}. See notes below for more details.
- variable_value_types: dict, optional
Specifies the variable value types (continuous, nominal, ordinal). For example, {‘VAR1’: ‘continuous’, ‘VAR2’: ‘nominal’}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {‘VAR1’: ‘???’} means that any time the variable value equals ‘???’, it will be taken as missing.
- extra_applyout_settings: dict, optional
Determines the output of the predict() method. The possible values are:
- By default (None value): the default output.
<KEY>: the key column if provided in the dataset
TRUE_LABEL: the actual value if provided
PREDICTED: the predicted value
- {‘APL/ApplyExtraMode’: ‘Individual Contributions’}: the feature importance for every sample. The output is:
<KEY>: the key column if provided
TRUE_LABEL: the actual value if provided
PREDICTED: the predicted value
gb_contrib_<VAR1>: the contribution of the VAR1 variable to the score
…
gb_contrib_<VARN>: the contribution of the VARN variable to the score
gb_contrib_constant_bias: the constant bias contribution
Notes
It is highly recommended to specify a key column in the training dataset. Otherwise, once the model is trained, it is no longer possible to define a key in any input dataset. The key is particularly useful for joining the prediction output to the input dataset.
By default, if no variable description is provided, SAP HANA APL guesses it by reading the first 100 rows, and the result may be incorrect. The user can override the guessed description by explicitly setting the variable_storages, variable_value_types and variable_missing_strings parameters. For example:
model.set_params(
    variable_storages = {
        'ID': 'integer',
        'sepal length (cm)': 'number'
    })
model.set_params(
    variable_value_types = {
        'sepal length (cm)': 'continuous'
    })
model.set_params(
    variable_missing_strings = {
        'sepal length (cm)': '-1'
    })
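Before passing such dictionaries to the model, it can help to check them for typos. The sketch below is a hypothetical helper, not part of hana_ml. It takes the values listed in the parameter descriptions above (string, integer, number; continuous, nominal, ordinal) as the allowed sets, which is an assumption: SAP HANA APL may accept other values.

```python
# Allowed values as listed in the parameter descriptions (assumed to be complete).
ALLOWED_STORAGES = {"string", "integer", "number"}
ALLOWED_VALUE_TYPES = {"continuous", "nominal", "ordinal"}


def invalid_entries(mapping, allowed):
    """Return the {variable: value} pairs whose value is not in `allowed`."""
    return {name: value for name, value in mapping.items() if value not in allowed}


variable_storages = {"ID": "integer", "sepal length (cm)": "number"}
# 'category' is a deliberate mistake to show what the check reports.
variable_value_types = {"sepal length (cm)": "continuous", "species": "category"}

bad_storages = invalid_entries(variable_storages, ALLOWED_STORAGES)
bad_value_types = invalid_entries(variable_value_types, ALLOWED_VALUE_TYPES)
print(bad_storages, bad_value_types)
```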
Examples
>>> from hana_ml.algorithms.apl.gradient_boosting_regression import GradientBoostingRegressor
>>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA
>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN,
...                     'SELECT "id", "class", "capital-gain", '
...                     '"native-country", "age" from APL_SAMPLES.CENSUS')
Creating and fitting the model
>>> model = GradientBoostingRegressor(conn_context=CONN)
>>> model.fit(hana_df, label='age', key='id')
Debriefing
>>> # Global performance metrics of the model
>>> model.get_performance_metrics()
{'L1': 7.31774, 'MeanAbsoluteError': 7.31774, 'L2': 9.42497, 'RootMeanSquareError': 9.42497, ...
>>> model.get_feature_importances()
{'Gain': OrderedDict([('class', 0.8728259801864624), ('capital-gain', 0.10493823140859604), ...
Making predictions
>>> # Default output
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
          id  TRUE_LABEL  PREDICTED
39184  21772          27         25
16537   7331          33         43
7908   35226          65         42
>>> # Individual Contributions
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3) # returns the output as a pandas DataFrame
     id  TRUE_LABEL  gb_contrib_workclass  gb_contrib_fnlwgt  gb_contrib_education  ...
0  6241          21             -1.330736          -0.385088              0.373539  ...
1  6248          18             -0.784536          -2.191791             -1.788672  ...
2  6253          26             -0.773891           0.358133             -0.185864  ...
Saving the model in the schema named ‘MODEL_STORAGE’
>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model, if_exists='replace')
Reloading the model for new predictions
>>> model2 = model_storage.load_model(name='My model name')
>>> out2 = model2.predict(data=hana_df)
Please see model_storage class for further features of model storage.
- Attributes
- label: str
The target column name. This attribute is set when the fit() method is called. Users don’t need to set it explicitly, except if the model is loaded from a table. In this case, this attribute must be set before calling predict().
- model_: hana_ml DataFrame
The trained model content
- summary_: APLArtifactTable
The reference to the “SUMMARY” table generated by the model training. This table contains the content of the model training summary.
- indicators_: APLArtifactTable
The reference to the “INDICATORS” table generated by the model training. This table contains various metrics related to the model and model variables.
- fit_operation_logs_: APLArtifactTable
The reference to the “OPERATION_LOG” table generated by the model training
- var_desc_: APLArtifactTable
The reference to the “VARIABLE_DESCRIPTION” table that was built during the model training
- applyout_: hana_ml DataFrame
The predictions generated the last time the model was applied
- predict_operation_logs_: APLArtifactTable
The reference to the “OPERATION_LOG” table when a prediction was made
Methods
fit
(self, data[, key, features, label])Fits the model.
get_best_iteration
(self)Returns the iteration that has provided the best performance on the validation dataset during the model training.
get_evalmetrics
(self)Returns the values of the evaluation metric at each iteration.
get_feature_importances
(self)Returns the feature importances.
get_fit_operation_log
(self)Retrieves the operation log table after the model training.
get_indicators
(self)Retrieves the Indicator table after model training.
get_params
(self)Retrieves attributes of the current object.
get_performance_metrics
(self)Returns the performance metrics of the last trained model.
get_predict_operation_log
(self)Retrieves the operation log table after the prediction.
get_summary
(self)Retrieves the summary table after model training.
is_fitted
(self)Checks if the model can be saved.
load_model
(self, schema_name, table_name[, oid])Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
predict
(self, data)Generates predictions with the fitted model.
save_artifact
(self, artifact_df, …[, …])Saves an artifact, a temporary table, into a permanent table.
save_model
(self, schema_name, table_name[, …])Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
score
(self, data)Computes the R^2 (Coefficient of determination) indicator on the predictions of the provided dataset.
set_params
(self, **parameters)Sets attributes of the current model.
-
set_params
(self, **parameters)¶ Sets attributes of the current model.
- Parameters
- parameters: dict
The attribute names and values
-
get_performance_metrics
(self)¶ Returns the performance metrics of the last trained model.
- Returns
- A dictionary with metric name as key and metric value as value.
-
predict
(self, data)¶ Generates predictions with the fitted model. It is possible to add special outputs, such as variable individual contributions, through the ‘extra_applyout_settings’ parameter in the model. This parameter is described with examples in the class section.
- Parameters
- data: hana_ml DataFrame
The input dataset used for prediction
- Returns
- Prediction output: hana_ml DataFrame
-
score
(self, data)¶ Computes the R^2 (Coefficient of determination) indicator on the predictions of the provided dataset.
- Parameters
- data: hana_ml DataFrame
The dataset used for prediction. It must contain the actual target values so that the score can be computed.
- Returns
- R2 indicator: float
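For reference, the coefficient of determination is defined as R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below computes it on hypothetical values in plain Python; it illustrates the indicator and is not the code that score() runs.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted target values.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 2.0]
r2 = r2_score(actual, predicted)
print(r2)
```

A value of 1 means the predictions match the actual values exactly; a value near 0 means the model does no better than always predicting the mean.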
-
fit
(self, data, key=None, features=None, label=None)¶ Fits the model.
- Parameters
- dataDataFrame
The training dataset
- keystr, optional
The name of the ID column. This column will not be used as feature in the model. It will be output as row-id when prediction is made with the model. If key is not provided, an internal key is created. But this is not the recommended usage. See notes below.
- featureslist of str, optional
The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.
- labelstr, optional
The name of the label column. Default is the last column.
- Returns
- selfobject
Notes
It is highly recommended to specify a key column in the training dataset. Otherwise, once the model is trained, a key can no longer be defined in any input dataset, which makes it particularly inconvenient to join the prediction output to the input dataset.
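Training with an explicit key can be sketched as follows. The helper name fit_and_predict and the column names in the usage comment are hypothetical; model is assumed to be an instance of this class, and train_df and apply_df are assumed to be hana_ml DataFrames that both contain the key column. Only fit() and predict() as documented in this section are called.

```python
def fit_and_predict(model, train_df, apply_df, key, label, features=None):
    """Train with an explicit key column, then apply the model."""
    # Passing `key` keeps the ID column out of the features and lets it be
    # output as the row ID in the predictions.
    model.fit(train_df, key=key, features=features, label=label)
    return model.predict(apply_df)


# Hypothetical usage with real objects:
# predictions = fit_and_predict(model, train_df, apply_df, key='ID', label='TARGET')
```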
-
get_best_iteration
(self)¶ Returns the iteration that has provided the best performance on the validation dataset during the model training.
- Returns
- The best iteration: int
-
get_evalmetrics
(self)¶ Returns the values of the evaluation metric at each iteration. These values are based on the estimation dataset.
- Returns
- A dictionary:
{‘<MetricName>’: <List of values>}
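The returned dictionary can be inspected with plain Python. The sketch below locates, for each metric, the list position of its best value. The helper name, the 'RMSE' key in the usage comment, and the default assumption that lower values are better are all illustrative. The result is a zero-based list index and is not guaranteed to match get_best_iteration(), which is based on the validation dataset.

```python
def best_position_per_metric(evalmetrics, higher_is_better=False):
    """Map each metric name to the zero-based index of its best value."""
    pick = max if higher_is_better else min
    best = {}
    for name, values in evalmetrics.items():
        if values:  # skip metrics with no recorded iterations
            best[name] = pick(range(len(values)), key=lambda i: values[i])
    return best


# Hypothetical usage with a fitted model:
# best_position_per_metric(model.get_evalmetrics())
# A dictionary such as {'RMSE': [0.9, 0.5, 0.6]} would yield {'RMSE': 1}.
```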
-
get_feature_importances
(self)¶ Returns the feature importances.
- Returns
- feature importancesdict
{ <importance_metric> : OrderedDictionary({ <feature_name> : <value> }) }
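Given the nested structure described above, the most important features for one metric can be extracted with a small helper. This is a sketch: the helper name top_features, the metric name 'SomeMetric', and the feature names and values are made up for illustration and do not come from a real model.

```python
from collections import OrderedDict


def top_features(importances, metric, n=5):
    """Return the n (feature_name, value) pairs with the largest values."""
    ranked = sorted(importances[metric].items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Made-up values that only mimic the documented structure
example = {'SomeMetric': OrderedDict([('CITY', 0.2), ('AGE', 0.5), ('INCOME', 0.3)])}
top_two = top_features(example, 'SomeMetric', n=2)
```

With a fitted model, the first argument would be the result of get_feature_importances().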
-
get_fit_operation_log
(self)¶ Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
- Returns
- The reference to OPERATION_LOG tablehana_ml DataFrame
- This table provides detailed logs of the last model training
-
get_indicators
(self)¶ Retrieves the Indicator table after model training.
- Returns
- The reference to INDICATORS tablehana_ml DataFrame
- This table provides the performance metrics of the last model training
-
get_params
(self)¶ Retrieves attributes of the current object. This method is implemented for compatibility with scikit-learn.
- Returns
- The attribute-values of the modeldictionary
-
get_predict_operation_log
(self)¶ Retrieves the operation log table after the prediction.
- Returns
- The reference to the OPERATION_LOG tablehana_ml DataFrame
- This table provides detailed logs about the last prediction
-
get_summary
(self)¶ Retrieves the summary table after model training.
- Returns
- The reference to the SUMMARY tablehana_ml DataFrame
- This table contains the execution summary of the last model training
-
is_fitted
(self)¶ Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.
- Returns
- bool
True if the model is ready to be saved.
-
load_model
(self, schema_name, table_name, oid=None)¶ Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
Loads the model from a table.
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- oidstr, optional
If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.
-
save_artifact
(self, artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)¶ Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.
- Parameters
- schema_name: str
The schema name
- artifact_dfhana_ml DataFrame
The artifact created after fit or predict methods are called
- table_name: str
The table name
- if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- The behavior when the table already exists:
fail: Raises a ValueError
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str, optional
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
Examples
>>> myModel.save_artifact(
...     artifact_df=myModel.indicators_,
...     schema_name='MySchema',
...     table_name='MyModel_Indicators',
...     if_exists='replace'
... )
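The documented if_exists values can be summarized as a small decision function. This is only an illustration of the semantics listed above, not the library's implementation; the function name, the returned action labels, and the behavior when the table does not exist yet (assumed here to be a plain creation) are assumptions.

```python
def resolve_if_exists(table_exists, if_exists='fail'):
    """Return the action implied by the documented if_exists values."""
    if if_exists not in ('fail', 'replace', 'append'):
        raise ValueError("if_exists must be 'fail', 'replace' or 'append'")
    if not table_exists:
        # Assumption: with no existing table, every option creates it
        return 'create'
    if if_exists == 'fail':
        raise ValueError('The table already exists')
    if if_exists == 'replace':
        return 'drop_and_create'  # drop the table before inserting new values
    return 'insert'  # append: insert new values into the existing table
```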
-
save_model
(self, schema_name, table_name, if_exists='fail', new_oid=None)¶ Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists=’replace’), or an existing table (if_exists=’append’). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- if_exists: str, {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- The behavior when the table already exists:
fail: Raises an Error
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str, optional
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
- Returns
- None
- The model is saved into a table with the following columns:
“OID” NVARCHAR(50), -- Serves as ID
“FORMAT” NVARCHAR(50), -- APL technical info
“LOB” CLOB MEMORY THRESHOLD NULL -- Binary content of the model
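Because save_model and load_model are deprecated, the same round trip can be sketched with hana_ml.model_storage.ModelStorage. The helper below rests on assumptions that should be verified against the ModelStorage reference for the installed hana_ml version: storage is assumed to be a ModelStorage instance created from a connection context, the model is assumed to be identified by its name and version attributes, and the save_model(model=..., if_exists=...) and load_model(name=..., version=...) calls are assumed to accept these arguments. The helper name and the model name 'MyModel' are hypothetical.

```python
def save_and_reload(storage, model, name, version=1):
    """Persist a fitted model through a ModelStorage-like object and load it back."""
    # Assumption: ModelStorage identifies a model by these two attributes
    model.name = name
    model.version = version
    storage.save_model(model=model, if_exists='replace')
    return storage.load_model(name=name, version=version)


# Hypothetical usage with real objects:
# from hana_ml.model_storage import ModelStorage
# storage = ModelStorage(connection_context=conn)
# reloaded = save_and_reload(storage, model, name='MyModel')
```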