hana_ml.algorithms.apl package
The APL package consists of the following sections:
hana_ml.algorithms.apl.gradient_boosting_classification
This module provides the SAP HANA APL gradient boosting classification algorithm.
The following classes are available:
- class hana_ml.algorithms.apl.gradient_boosting_classification.GradientBoostingClassifier(conn_context=None, early_stopping_patience=None, eval_metric=None, learning_rate=None, max_depth=None, max_iterations=None, number_of_jobs=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)
Bases:
_GradientBoostingClassifierBase
SAP HANA APL Gradient Boosting Multiclass Classifier algorithm.
- Parameters
- conn_context: ConnectionContext, optional
The connection object to an SAP HANA database. This parameter is not needed anymore. It will be set automatically when a dataset is used in fit() or predict().
- early_stopping_patience: int, optional
If the performance does not improve after early_stopping_patience iterations, the model training will stop before reaching max_iterations. Please refer to APL documentation for default value.
- eval_metric: str, optional
The name of the metric used to evaluate the model performance on the validation dataset along the boosting iterations. The possible values are 'MultiClassClassificationError' and 'MultiClassLogLoss'. Please refer to APL documentation for default value.
- learning_rate: float, optional
The weight parameter controlling the model regularization to avoid overfitting risk. A small value improves the model generalization to unseen dataset at the expense of the computational cost. Please refer to APL documentation for default value.
- max_depth: int, optional
The maximum depth of the decision trees added as a base learner to the model at each boosting iteration. Please refer to APL documentation for default value.
- max_iterations: int, optional
The maximum number of boosting iterations to fit the model. The default value is 1000.
- number_of_jobs: int, optional
Deprecated.
- variable_storages: dict, optional
Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.
- variable_value_types: dict, optional
Specifies the variable value type (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means that whenever the variable value equals '???', it is treated as missing.
- extra_applyout_settings: dict, optional
Determines the output of the predict() method (see the example after this parameter list). The possible values are:
By default (None value): the default output.
- <KEY>: the key column if provided in the dataset
- TRUE_LABEL: the class label if provided in the dataset
- PREDICTED: the predicted label
- PROBABILITY: the probability of the prediction (confidence)
{'APL/ApplyExtraMode': 'AllProbabilities'}: the probabilities for each class.
- <KEY>: the key column if provided in the dataset
- TRUE_LABEL: the class label if provided in the dataset
- PREDICTED: the predicted label
- PROBA_<label_value1>: the probability for the class <label_value1>
- ...
- PROBA_<label_valueN>: the probability for the class <label_valueN>
{'APL/ApplyExtraMode': 'Individual Contributions'}: the feature importance for every sample.
- <KEY>: the key column if provided in the dataset
- TRUE_LABEL: the class label if provided in the dataset
- PREDICTED: the predicted label
- gb_contrib_<VAR1>: the contribution of the variable VAR1 to the score
- ...
- gb_contrib_<VARN>: the contribution of the variable VARN to the score
- gb_contrib_constant_bias: the constant bias contribution to the score
- other_params: dict, optional
Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:
'max_tasks'
'segment_column_name'
'cutting_strategy'
'interactions'
'interactions_max_kept'
'variable_auto_selection'
'variable_selection_max_nb_of_final_variables'
'variable_selection_max_iterations'
'variable_selection_percentage_of_contribution_kept_by_step'
'variable_selection_quality_bar'
See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.
For 'max_tasks', see FUNC_HEADER.
- other_train_apl_aliases: dict, optional
Users can provide APL aliases as advanced settings to the model. The list of possible aliases depends on the APL version.
See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.
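A minimal sketch of how these parameters can be combined (the specific values are illustrative assumptions, not defaults):

>>> model = GradientBoostingClassifier(eval_metric='MultiClassLogLoss',
...                                    max_iterations=200,
...                                    early_stopping_patience=10)
>>> # Switch the predict() output to per-class probabilities
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'AllProbabilities'})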
Notes
It is highly recommended to specify a key column in the training dataset. If not, once the model is trained, it won't be possible anymore to have a key defined in any input dataset. The key is particularly useful to join the predictions output to the input dataset.
By default, if variable descriptions are not provided, SAP HANA APL guesses them by reading the first 100 rows. However, the results may be incorrect. The user can overwrite the guessed description by explicitly setting the variable_storages, variable_value_types and variable_missing_strings parameters. For example:

model.set_params(variable_storages={
    'ID': 'integer',
    'sepal length (cm)': 'number'
})
model.set_params(variable_value_types={
    'sepal length (cm)': 'continuous'
})
model.set_params(variable_missing_strings={
    'sepal length (cm)': '-1'
})
Examples
>>> from hana_ml.algorithms.apl.gradient_boosting_classification \
...     import GradientBoostingClassifier
>>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA
>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN,
...                     'SELECT "id", "class", "capital-gain", '
...                     '"native-country" from APL_SAMPLES.CENSUS')
Creating and fitting the model
>>> model = GradientBoostingClassifier()
>>> model.fit(hana_df, label='native-country', key='id')
Getting variable interactions
>>> model.set_params(other_train_apl_aliases={
...     'APL/Interactions': 'true',
...     'APL/InteractionsMaxKept': '3'
... })
>>> model.fit(data=self._df_train, key=self._key, label=self._label)
>>> # Checks interaction info in INDICATORS table
>>> output = model.get_indicators().filter("KEY LIKE 'Interaction%'").collect()
Debriefing
>>> # Global performance metrics of the model
>>> model.get_performance_metrics()
{'BalancedErrorRate': 0.9761904761904762, 'BalancedClassificationRate': 0.023809523809523808, ...
>>> # Performance metrics of the model for each class
>>> model.get_metrics_per_class()
{'Precision': {'Cambodia': 0.0, 'Canada': 0.0, 'China': 0.0, 'Columbia': 0.0...
>>> model.get_feature_importances()
{'Gain': OrderedDict([('class', 0.7713800668716431), ('capital-gain', 0.22861991822719574)])}
Generating the model report
>>> from hana_ml.visualizers.unified_report import UnifiedReport
>>> UnifiedReport(model).build().display()
Making predictions
>>> # Default output
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3)  # returns the output as a pandas DataFrame
   id     TRUE_LABEL      PREDICTED  PROBABILITY
0  30  United-States  United-States      0.89051
1  63  United-States  United-States      0.89051
2  66  United-States  United-States      0.89051
>>> # All probabilities
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'AllProbabilities'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3)  # returns the output as a pandas DataFrame
          id     TRUE_LABEL      PREDICTED   PROBA_?  PROBA_Cambodia ...
35194  19272  United-States  United-States  0.016803        0.000595 ...
20186  39624  United-States  United-States  0.017564        0.001063 ...
43892  38759  United-States  United-States  0.019812        0.000353 ...
>>> # Individual Contributions
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3)  # returns the output as a pandas DataFrame
   id     TRUE_LABEL      PREDICTED  gb_contrib_class  gb_contrib_capital-gain ...
0  30  United-States  United-States         -0.025366                -0.014416 ...
1  63  United-States  United-States         -0.025366                -0.014416 ...
2  66  United-States  United-States         -0.025366                -0.014416 ...
Saving the model in the schema named 'MODEL_STORAGE'
>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model, if_exists='replace')
Reloading the model for new predictions
>>> model2 = model_storage.load_model(name='My model name')
>>> out2 = model2.predict(data=hana_df)
Please see the model_storage class for further features of model storage.
Exporting the model in JSON format
>>> json_export = model.export_apply_code('JSON')
APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.
- Attributes
- label: str
The target column name. This attribute is set when the fit() method is called.
- model_: hana_ml DataFrame
The trained model content
- summary_: APLArtifactTable
The reference to the "SUMMARY" table generated by the model training. This table contains the content of the model training summary.
- indicators_: APLArtifactTable
The reference to the "INDICATORS" table generated by the model training. This table contains various metrics related to the model and model variables.
- fit_operation_logs_: APLArtifactTable
The reference to the "OPERATION_LOG" table generated by the model training
- var_desc_: APLArtifactTable
The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training
- applyout_: hana_ml DataFrame
The predictions generated the last time the model was applied
- predict_operation_logs_: APLArtifactTable
The reference to the "OPERATION_LOG" table when a prediction was made
Methods
build_report([max_local_explanations]): Build model report.
disable_hana_execution(): HANA execution will be disabled and only SQL script will be generated.
enable_hana_execution(): HANA execution will be enabled.
export_apply_code(code_type[, key, label, ...]): Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL.
fit(data[, key, features, label, weight, ...]): Fits the model.
generate_html_report(filename): Save the model report as an HTML file.
generate_notebook_iframe_report(): Render the model report as a notebook iframe.
get_apl_version(): Gets the version and configuration information about the installation of SAP HANA APL.
get_artifacts_recorder(): Return the object recorder (for Design-time artifacts generation).
get_best_iteration(): Returns the iteration that has provided the best performance on the validation dataset during the model training.
get_debrief_report(report_name): Retrieves a standard statistical report.
get_evalmetrics(): Returns the values of the evaluation metric at each iteration.
get_feature_importances(): Returns the feature importances.
get_fit_operation_log(): Retrieves the operation log table after the model training.
get_indicators(): Retrieves the Indicator table after model training.
get_metrics_per_class(): Returns the performance for each class.
get_model_info(): Get information about an existing model.
get_params(): Retrieves attributes of the current object.
get_performance_metrics(): Returns the performance metrics of the last trained model.
get_predict_operation_log(): Retrieves the operation log table after a prediction.
get_summary(): Retrieves the summary table after model training.
is_fitted(): Checks if the model can be saved.
load_model(schema_name, table_name[, oid]): Loads the model from a table.
predict(data[, prediction_type]): Makes predictions with the fitted model.
save_artifact(artifact_df, schema_name, ...): Saves an artifact, a temporary table, into a permanent table.
save_model(schema_name, table_name[, ...]): Saves the model into a table.
schedule_fit(output_table_name_model, ...): Creates a HANA scheduler job for the model fitting.
schedule_predict(output_table_name_applyout, ...): Creates a HANA scheduler job for the model prediction.
score(data): Returns the accuracy score on the provided test dataset.
set_framework_version(framework_version): Switch v1/v2 version of report.
set_metric_samplings([roc_sampling, ...]): Set metric samplings to report builder.
set_params(**parameters): Sets attributes of the current model.
set_scale_out([route_to, no_route_to, ...]): Specifies hints for scaling-out environment.
set_shapley_explainer_of_predict_phase(...): Use the reason code generated during the prediction phase to build a ShapleyExplainer instance.
set_shapley_explainer_of_score_phase(...): Use the reason code generated during the scoring phase to build a ShapleyExplainer instance.
- set_params(**parameters)
Sets attributes of the current model.
- Parameters
- parameters: dict
The names and values of the attributes to change
- fit(data, key=None, features=None, label=None, weight=None, build_report=False)
Fits the model.
- Parameters
- data: DataFrame
The training dataset
- key: str, optional
The name of the ID column. This column will not be used as a feature in the model. It will be output as the row ID when predictions are made with the model. If key is not provided, an internal key is created, but this is not the recommended usage. See the notes below.
- features: list of str, optional
The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.
- label: str, optional
The name of the label column. Default is the last column.
- weight: str, optional
The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.
- build_report: bool, optional
Whether to build a report or not. Defaults to False.
- Returns
- self: object
Notes
It is highly recommended to specify a key column in the training dataset. If not, once the model is trained, it won't be possible anymore to have a key defined in any input dataset. That makes it particularly inconvenient to join the prediction output to the input dataset.
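A minimal sketch of the recommended usage, assuming hana_df has an 'id' key column and a 'class' label column:

>>> model.fit(hana_df, key='id', label='class')
>>> predictions = model.predict(hana_df).collect()   # pandas DataFrame with an 'id' column
>>> joined = hana_df.collect().merge(predictions[['id', 'PREDICTED']], on='id')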
- score(data)
Returns the accuracy score on the provided test dataset.
- Parameters
- data: hana_ml DataFrame
The test dataset used to compute the score. The labels must be provided in the dataset.
- Returns
- Float or pandas DataFrame
If no segment column is given, the accuracy score.
If a segment column is given, a pandas DataFrame which contains the accuracy score for each segment.
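For instance, reusing the fitted model and dataset from the examples above (a sketch; the labels must be present in hana_df):

>>> accuracy = model.score(hana_df)   # a float when no segment column is set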
- get_metrics_per_class()
Returns the performance for each class.
- Returns
- Dictionary or pandas DataFrame
If no segment column is given, a dictionary.
If a segment column is given, a pandas DataFrame.
Examples
>>> data = DataFrame(conn, 'SELECT * from IRIS_MULTICLASSES')
>>> model = GradientBoostingClassifier(conn)
>>> model.fit(data=data, key='ID', label='LABEL')
>>> model.get_metrics_per_class()
{
 'Precision': {
     'setosa': 1.0,
     'versicolor': 1.0,
     'virginica': 0.9743589743589743
 },
 'Recall': {
     'setosa': 1.0,
     'versicolor': 0.9714285714285714,
     'virginica': 1.0
 },
 'F1Score': {
     'setosa': 1.0,
     'versicolor': 0.9855072463768115,
     'virginica': 0.9870129870129869
 }
}
- build_report(max_local_explanations=100)
Build model report.
- Parameters
- max_local_explanations: int, optional
The maximum number of local explanations displayed in the report.
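Typical usage (a sketch; the built report can then be rendered with the methods listed above):

>>> model.build_report()
>>> model.generate_notebook_iframe_report()   # render the report in a notebook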
- set_metric_samplings(roc_sampling=None, other_samplings: Optional[dict] = None)
Set metric samplings to report builder.
- Parameters
- roc_sampling: Sampling, optional
ROC sampling.
- other_samplings: dict, optional
Key is the column name of the metric table:
CUMGAINS
RANDOM_CUMGAINS
PERF_CUMGAINS
LIFT
RANDOM_LIFT
PERF_LIFT
CUMLIFT
RANDOM_CUMLIFT
PERF_CUMLIFT
Value is the sampling.
Examples
Creating the metric samplings:
>>> roc_sampling = Sampling(method='every_nth', interval=2)
>>> other_samplings = dict(CUMGAINS=Sampling(method='every_nth', interval=2),
...                        LIFT=Sampling(method='every_nth', interval=2),
...                        CUMLIFT=Sampling(method='every_nth', interval=2))
>>> model.set_metric_samplings(roc_sampling, other_samplings)
- disable_hana_execution()
HANA execution will be disabled and only SQL script will be generated.
- enable_hana_execution()
HANA execution will be enabled.
- export_apply_code(code_type, key=None, label=None, schema_name=None, table_name=None, other_params=None)
Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL. See Function Reference > Predictive Model Services > EXPORT_APPLY_CODE in the SAP HANA APL Developer Guide.
- Parameters
- code_type: str
The type of code exported. The supported code types for each type of model are given in the developer guide (see APL/CodeType).
- key: str, optional
The name of the primary key column. Required for some code types.
- label: str, optional
The name of the label (target) column. Used only when the model supports multiple targets. When left empty (""), all targets are generated.
- schema_name: str, optional
The schema name of the apply-in table. Required for some code types.
- table_name: str, optional
The apply-in table name. Required for some code types.
- other_params: dict, optional
The additional parameters to be included in the configuration. The available parameters are given in the developer guide.
- Returns
- The exported code: str
Examples
Exporting a model in JSON format (available for Gradient Boosting and Robust Regression)
>>> json_export = model.export_apply_code('JSON')
APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.
Exporting SQL apply code (available for Robust Regression and Clustering)
>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS')
Exporting SQL apply code (probability generated in the output)
>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS',
...     other_params={'APL/ApplyExtraMode': 'Advanced Apply Settings',
...                   'APL/ApplyProba': 'true'})
- generate_html_report(filename)
Save the model report as an HTML file.
- Parameters
- filename: str
The HTML file name.
- generate_notebook_iframe_report()
Render model report as a notebook iframe.
- get_apl_version()
Gets the version and configuration information about the installation of SAP HANA APL.
- Returns
- A pandas DataFrame with detailed information about the current version.
Notes
Error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.
- get_artifacts_recorder()
Return the object recorder (for Design-time artifacts generation)
- get_best_iteration()
Returns the iteration that has provided the best performance on the validation dataset during the model training.
- Returns
- int or pandas DataFrame
If no segment column is given, the best iteration.
If a segment column is given, a pandas DataFrame which contains the best iteration for each segment.
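For example (a sketch, assuming no segment column is set):

>>> best_iteration = model.get_best_iteration()   # an int when no segment column is set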
- get_debrief_report(report_name)
Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Developer Guide.
- Parameters
- report_name: str
- Returns
- Statistical report: hana_ml DataFrame
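A usage sketch; '<ReportName>' is a placeholder for one of the report names listed under Statistical Reports in the SAP HANA APL Developer Guide:

>>> report = model.get_debrief_report('<ReportName>')
>>> report.collect()   # fetch the report as a pandas DataFrame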
- get_evalmetrics()
Returns the values of the evaluation metric at each iteration. These values are based on the validation dataset.
- Returns
- Dictionary or pandas DataFrame
If no segment column is given, a dictionary: {'<MetricName>': <List of values>}.
If a segment column is given, a pandas DataFrame which contains the evaluation metrics for each segment.
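These values can be used to visualize the training convergence, for instance as below (a sketch assuming no segment column, so a dictionary is returned; matplotlib is an assumption, not a hana_ml dependency):

>>> import matplotlib.pyplot as plt
>>> metrics = model.get_evalmetrics()   # e.g. {'MultiClassLogLoss': [...]}
>>> for name, values in metrics.items():
...     plt.plot(range(1, len(values) + 1), [float(v) for v in values], label=name)
>>> plt.xlabel('Boosting iteration')
>>> plt.legend()
>>> plt.show()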
- get_feature_importances()
Returns the feature importances.
- Returns
- Dictionary or pandas DataFrame
If no segment column is given, a dictionary: { <importance_metric> : OrderedDictionary({ <feature_name> : <value> }) }.
If a segment column is given, a pandas DataFrame which contains the feature importances for each segment.
- get_fit_operation_log()
Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs of the last model training.
- get_indicators()
Retrieves the Indicator table after model training.
- Returns
- The reference to the INDICATORS table: hana_ml DataFrame
This table provides the performance metrics of the last model training.
- get_model_info()
Get information about an existing model. This method is especially useful when a trained model was saved and reloaded. After calling this method, the model can provide its summary and metrics again, as they were at the last fit.
- Returns
- list
List of HANA DataFrames respectively corresponding to the following tables:
Summary table
Variable roles table
Variable description table
Indicators table
Profit curves table
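A usage sketch, assuming the returned DataFrames follow the order listed above:

>>> summary, var_roles, var_desc, indicators, profit_curves = model.get_model_info()
>>> summary.collect()   # fetch the summary table as a pandas DataFrame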
- get_params()
Retrieves attributes of the current object. This method is implemented for compatibility with scikit-learn.
- Returns
- The attribute-values of the model: dictionary
- get_performance_metrics()
Returns the performance metrics of the last trained model.
- Returns
- Dictionary or pandas DataFrame
If no segment column is given, a dictionary with metric name as key and metric value as value.
If a segment column is given, a pandas DataFrame which contains the performance metrics for each segment.
Examples
>>> data = DataFrame(conn, 'SELECT * from APL_SAMPLES.CENSUS')
>>> model = GradientBoostingBinaryClassifier(conn)
>>> model.fit(data=data, key='id', label='class')
>>> model.get_performance_metrics()
{'AUC': 0.9385, 'PredictivePower': 0.8529, 'PredictionConfidence': 0.9759,...}
- get_predict_operation_log()
Retrieves the operation log table after a prediction.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs about the last prediction.
- get_summary()
Retrieves the summary table after model training.
- Returns
- The reference to the SUMMARY table: hana_ml DataFrame
This contains the execution summary of the last model training.
- is_fitted()
Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.
- Returns
- bool
True if the model is ready to be saved.
- load_model(schema_name, table_name, oid=None)
Loads the model from a table.
Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- oid: str, optional
If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.
- predict(data, prediction_type=None)
Makes predictions with the fitted model. It is possible to add special outputs, such as variable individual contributions, through the 'prediction_type' parameter.
- Parameters
- data: hana_ml DataFrame
The input dataset used for prediction
- prediction_type: string, optional
Can be:
- 'BestProbabilityAndDecision': return the probability value associated with the classification decision (default)
- 'Decision': return the classification decision
- 'Probability': return the probability that the row is a positive target (in binary classification) or the probabilities of all classes (in multiclass classification)
- 'Score': return raw prediction scores
- 'Individual Contributions': return SHAP values
- 'Explanations': return strength indicators based on SHAP values
- Returns
- Prediction output: hana_ml DataFrame
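For example, to retrieve the individual contributions (a sketch reusing hana_df from the examples above):

>>> contrib_df = model.predict(hana_df, prediction_type='Individual Contributions')
>>> contrib_df.collect().head(3)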
- save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)
Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.
- Parameters
- schema_name: str
The schema name
- artifact_df: hana_ml DataFrame
The artifact created after the fit or predict methods are called
- table_name: str
The table name
- if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises a ValueError
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one would like to save data into the same table.
Examples
>>> myModel.save_artifact(artifact_df=myModel.indicators_,
...                       schema_name='MySchema',
...                       table_name='MyModel_Indicators',
...                       if_exists='replace')
- save_model(schema_name, table_name, if_exists='fail', new_oid=None)
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).
Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
- Parameters
- schema_name: str
The schema name
- table_name: str
Table name
- if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises an Error
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
- Returns
- None
The model is saved into a table with the following columns:
"OID" NVARCHAR(50), -- Serve as ID
"FORMAT" NVARCHAR(50), -- APL technical info
"LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model
- schedule_fit(output_table_name_model, output_table_name_log, output_table_name_summary, output_table_name_indicators, **schedule_kwargs)
Creates a HANA scheduler job for the model fitting. It is a wrapper around the HANAScheduler.create_training_schedule() method and lets users explicitly specify arguments such as the output table names.
- Parameters
- output_table_name_model: str
The output table name for the model binary.
- output_table_name_log: str
The output table name for the log data.
- output_table_name_summary: str
The output table name for the model summary.
- output_table_name_indicators: str
The output table name for the model indicators.
- schedule_kwargs: kwargs dictionary
Arguments forwarded to the HANAScheduler.create_training_schedule method; please refer to its documentation.
Examples
>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.schedule_fit(
...     output_table_name_model='OUTPUT_MODEL_BINARY',
...     output_table_name_log='OUTPUT_FIT_LOG',
...     output_table_name_summary='OUTPUT_SUMMARY_LOG',
...     output_table_name_indicators='OUTPUT_FIT_INDICATORS',
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)
- schedule_predict(output_table_name_applyout, output_table_name_log, output_table_name_summary, **schedule_kwargs)
Creates a HANA scheduler job for the model prediction. It is a wrapper around the HANAScheduler.create_applying_schedule() method and lets users explicitly specify arguments such as the output table names.
- Parameters
- input_table_name_model: str
The input table name for the model binary.
- output_table_name_applyout: str
The output table name for the prediction data.
- output_table_name_log: str
The output table name for the log data.
- output_table_name_summary: str
The output table name for the model summary.
- schedule_kwargs: kwargs dictionary
Arguments forwarded to the HANAScheduler.create_applying_schedule method; please refer to its documentation.
Examples
>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.predict(data)
>>> model.schedule_predict(
...     output_table_name_applyout="OUTPUT_PREDICT_APPLYOUT",
...     output_table_name_log="OUTPUT_PREDICT_LOG",
...     output_table_name_summary="OUTPUT_PREDICT_SUMMARY",
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)
- set_framework_version(framework_version)
Switch v1/v2 version of report.
- Parameters
- framework_version: {'v2', 'v1'}, optional
v2: using the report builder framework. v1: using the pure HTML template.
Defaults to 'v2'.
- set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)
Specifies hints for scaling-out environment. The execution of APL functions can then be routed to a specific computation node. If all the parameters are none, i.e. not given, all the existing hints are cleared.
- Parameters
- route_to: str, optional
Routes the query to the specified volume ID or service type.
Defaults to None.
- no_route_to: str or list of str, optional
Avoids query routing to a specified volume ID or service type.
Defaults to None.
- route_by: str, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s).
Defaults to None.
- route_by_cardinality: str or list of str, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.
- data_transfer_cost: int, optional
Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.
Defaults to None.
- route_optimization_level: {'minimal', 'all'}, optional
Guides the optimizer to compile with the 'minimal' route optimization level or to default to 'all'. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.
- workload_class: str, optional
Routes the query via workload class. The route_to statement hint has higher precedence than the workload_class statement hint.
Defaults to None.
Examples
>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> model = GradientBoostingBinaryClassifier()
>>> # Routes the execution to a specific volume ID.
>>> model.set_scale_out(route_to=1025)
>>> # Routes the execution to a specific service type.
>>> # model.set_scale_out(route_to='computeserver')
>>> # Maps the execution to a specific workload class.
>>> # model.set_scale_out(workload_class="WC4")
>>> # Activates the SQL trace.
>>> # connection_context.sql_tracer.enable_sql_trace(True)
>>> model.fit(data=hdb_df, key='KEY', label='Y')
>>> # You can check whether the queries were effectively routed by querying:
>>> # select HOST, VOLUME_ID, APPLICATION_NAME, STATEMENT_STRING from M_SQL_PLAN_CACHE.
- set_shapley_explainer_of_predict_phase(shapley_explainer, display_force_plot=True)
Use the reason code generated during the prediction phase to build a ShapleyExplainer instance.
When this instance is passed in, its execution results will be included in the v2 report.
- Parameters
- shapley_explainer: ShapleyExplainer
A ShapleyExplainer instance.
- display_force_plot: bool, optional
Whether to display the force plot.
Defaults to True.
- set_shapley_explainer_of_score_phase(shapley_explainer, display_force_plot=True)
Use the reason code generated during the scoring phase to build a ShapleyExplainer instance.
When this instance is passed in, its execution results will be included in the v2 report.
- Parameters
- shapley_explainer: ShapleyExplainer
A ShapleyExplainer instance.
- display_force_plot: bool, optional
Whether to display the force plot.
Defaults to True.
- class hana_ml.algorithms.apl.gradient_boosting_classification.GradientBoostingBinaryClassifier(conn_context=None, early_stopping_patience=None, eval_metric=None, learning_rate=None, max_depth=None, max_iterations=None, number_of_jobs=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)
Bases:
_GradientBoostingClassifierBase
SAP HANA APL Gradient Boosting Binary Classifier algorithm. It is very similar to GradientBoostingClassifier, the multiclass classifier. Its particularity lies in the provided metrics which are specific to binary classification.
- Parameters
- conn_context: ConnectionContext, optional
The connection object to an SAP HANA database. This parameter is not needed anymore. It will be set automatically when a dataset is used in fit() or predict().
- early_stopping_patience: int, optional
If the performance does not improve after early_stopping_patience iterations, the model training will stop before reaching max_iterations. Please refer to APL documentation for default value.
- eval_metric: str, optional
The name of the metric used to evaluate the model performance on the validation dataset along the boosting iterations. The possible values are 'LogLoss', 'AUC' and 'ClassificationError'. Please refer to APL documentation for default value.
- learning_rate: float, optional
The weight parameter controlling the model regularization to avoid overfitting risk. A small value improves the model generalization to unseen dataset at the expense of the computational cost. Please refer to APL documentation for default value.
- max_depth: int, optional
The maximum depth of the decision trees added as a base learner to the model at each boosting iteration. The default value is 4.
- max_iterations: int, optional
The maximum number of boosting iterations to fit the model. Please refer to APL documentation for default value.
- number_of_jobs: int, optional
Deprecated.
- variable_storages: dict, optional
Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.
- variable_value_types: dict, optional
Specifies the variable value type (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means that whenever the variable value equals '???', it is treated as missing.
- extra_applyout_settings: dict, optional
Determines the output of the predict() method. The possible values are:
By default (None value): the default output.
- <KEY>: the key column if provided in the dataset
- TRUE_LABEL: the class label if provided in the dataset
- PREDICTED: the predicted label
- PROBABILITY: the probability of the prediction (confidence)
{'APL/ApplyExtraMode': 'Individual Contributions'}: the individual contributions of each variable to the score. The output is:
- <KEY>: the key column if provided in the dataset
- TRUE_LABEL: the class label if provided in the dataset
- gb_contrib_<VAR1>: the contribution of the variable VAR1 to the score
- ...
- gb_contrib_<VARN>: the contribution of the variable VARN to the score
- gb_contrib_constant_bias: the constant bias contribution to the score
- other_params: dict, optional
Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:
'max_tasks'
'segment_column_name'
'correlations_lower_bound'
'correlations_max_kept'
'cutting_strategy'
'target_key'
'interactions'
'interactions_max_kept'
'variable_auto_selection'
'variable_selection_max_nb_of_final_variables'
'variable_selection_max_iterations'
'variable_selection_percentage_of_contribution_kept_by_step'
'variable_selection_quality_bar'
See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.
For 'max_tasks', see FUNC_HEADER.
- other_train_apl_aliases: dict, optional
Contains the APL alias for model training. The list of possible aliases depends on the APL version.
See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.
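A minimal sketch of a binary classifier created with explicit settings (the specific values are illustrative assumptions, not defaults):

>>> model = GradientBoostingBinaryClassifier(eval_metric='AUC',
...                                          early_stopping_patience=10)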
Examples
>>> from hana_ml.algorithms.apl.gradient_boosting_classification \
...     import GradientBoostingBinaryClassifier
>>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA
>>> CONN = ConnectionContext(HDB_HOST, HDB_PORT, HDB_USER, HDB_PASS)
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN, 'SELECT * from APL_SAMPLES.CENSUS')
Creating and fitting the model
>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(hana_df, label='class', key='id')
Getting variable interactions
>>> model.set_params(other_train_apl_aliases={
...     'APL/Interactions': 'true',
...     'APL/InteractionsMaxKept': '3'
... })
>>> model.fit(data=self._df_train, key=self._key, label=self._label)
>>> # Checks interaction info in INDICATORS table
>>> output = model.get_indicators().filter("KEY LIKE 'Interaction%'").collect()
Debriefing
>>> # Global performance metrics of the model
>>> model.get_performance_metrics()
{'LogLoss': 0.2567069689038737, 'PredictivePower': 0.8529, 'PredictionConfidence': 0.9759, ...}
>>> model.get_feature_importances()
{'Gain': OrderedDict([('relationship', 0.3866586685180664), ('education-num', 0.1502334326505661)...
Generating the model report
>>> from hana_ml.visualizers.unified_report import UnifiedReport
>>> UnifiedReport(model).build().display()
Making predictions
>>> # Default output
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().sample(3)  # returns the output as a pandas DataFrame
          id  TRUE_LABEL  PREDICTED  PROBABILITY
44903  41211           0          0     0.871326
47878  36020           1          1     0.993455
17549   6601           0          1     0.673872
>>> # Individual Contributions
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().sample(3)  # returns the output as a pandas DataFrame
      id  TRUE_LABEL  gb_contrib_age  gb_contrib_workclass  gb_contrib_fnlwgt ...
0  18448           0       -1.098452             -0.001238           0.060850 ...
1  18457           0       -0.731512             -0.000448           0.020060 ...
2  18540           0       -0.024523              0.027065           0.158083 ...
Saving the model in the schema named 'MODEL_STORAGE'. Please see the model_storage class for further features of model storage.
>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model, if_exists='replace')
Reloading the model for new predictions
>>> model2 = model_storage.load_model(name='My model name')
>>> out2 = model2.predict(data=hana_df)
Exporting the model in JSON format
>>> json_export = model.export_apply_code('JSON')
APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.
- Attributes
- label: str
The target column name. This attribute is set when the fit() method is called.
- model_: hana_ml DataFrame
The trained model content
- summary_: APLArtifactTable
The reference to the "SUMMARY" table generated by the model training. This table contains the content of the model training summary.
- indicators_: APLArtifactTable
The reference to the "INDICATORS" table generated by the model training. This table contains various metrics related to the model and model variables.
- fit_operation_logs_: APLArtifactTable
The reference to the "OPERATION_LOG" table generated by the model training
- var_desc_: APLArtifactTable
The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training
- applyout_: hana_ml DataFrame
The predictions generated the last time the model was applied
- predict_operation_logs_: APLArtifactTable
The reference to the "OPERATION_LOG" table when a prediction was made
Methods
build_report([max_local_explanations]): Build model report.
disable_hana_execution(): HANA execution will be disabled and only SQL script will be generated.
enable_hana_execution(): HANA execution will be enabled.
export_apply_code(code_type[, key, label, ...]): Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL.
fit(data[, key, features, label, weight, ...]): Fits the model.
generate_html_report(filename): Save the model report as an HTML file.
generate_notebook_iframe_report(): Render the model report as a notebook iframe.
get_apl_version(): Gets the version and configuration information about the installation of SAP HANA APL.
get_artifacts_recorder(): Return the object recorder (for Design-time artifacts generation).
get_best_iteration(): Returns the iteration that has provided the best performance on the validation dataset during the model training.
get_debrief_report(report_name): Retrieves a standard statistical report.
get_evalmetrics(): Returns the values of the evaluation metric at each iteration.
get_feature_importances(): Returns the feature importances.
get_fit_operation_log(): Retrieves the operation log table after the model training.
get_indicators(): Retrieves the Indicator table after model training.
get_model_info(): Get information about an existing model.
get_params(): Retrieves attributes of the current object.
get_performance_metrics(): Returns the performance metrics of the last trained model.
get_predict_operation_log(): Retrieves the operation log table after a prediction.
get_summary(): Retrieves the summary table after model training.
is_fitted(): Checks if the model can be saved.
load_model(schema_name, table_name[, oid]): Loads the model from a table.
predict(data[, prediction_type]): Makes predictions with the fitted model.
save_artifact(artifact_df, schema_name, ...): Saves an artifact, a temporary table, into a permanent table.
save_model(schema_name, table_name[, ...]): Saves the model into a table.
schedule_fit(output_table_name_model, ...): Creates a HANA scheduler job for the model fitting.
schedule_predict(output_table_name_applyout, ...): Creates a HANA scheduler job for the model prediction.
score(data): Returns the accuracy score on the provided test dataset.
set_framework_version(framework_version): Switch v1/v2 version of report.
set_metric_samplings([roc_sampling, ...]): Set metric samplings to report builder.
set_params(**parameters): Sets attributes of the current model.
set_scale_out([route_to, no_route_to, ...]): Specifies hints for scaling-out environment.
set_shapley_explainer_of_predict_phase(...): Use the reason code generated during the prediction phase to build a ShapleyExplainer instance.
set_shapley_explainer_of_score_phase(...): Use the reason code generated during the scoring phase to build a ShapleyExplainer instance.
- set_params(**parameters)
Sets attributes of the current model.
- Parameters
- parameters: dict
The attribute names and values
- score(data)
Returns the accuracy score on the provided test dataset.
- Parameters
- data: hana_ml DataFrame
The test dataset used to compute the score. The labels must be provided in the dataset.
- Returns
- Float or pandas DataFrame
If no segment column is given, the accuracy score.
If a segment column is given, a pandas DataFrame which contains the accuracy score for each segment.
- build_report(max_local_explanations=100)
Build model report.
- Parameters
- max_local_explanations: int, optional
The maximum number of local explanations displayed in the report.
- disable_hana_execution()
HANA execution will be disabled and only SQL script will be generated.
- enable_hana_execution()
HANA execution will be enabled.
- export_apply_code(code_type, key=None, label=None, schema_name=None, table_name=None, other_params=None)
Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL. See Function Reference > Predictive Model Services > EXPORT_APPLY_CODE in the SAP HANA APL Developer Guide.
- Parameters
- code_type: str
The type of code exported. The supported code types for each type of model are given in the developer guide (see APL/CodeType).
- key: str, optional
The name of the primary key column. Required for some code types.
- label: str, optional
The name of the label (target) column. Used only when the model supports multiple targets. When left empty (""), all targets are generated.
- schema_name: str, optional
The schema name of the apply-in table. Required for some code types.
- table_name: str, optional
The apply-in table name. Required for some code types.
- other_params: dict, optional
The additional parameters to be included in the configuration. The available parameters are given in the developer guide.
- Returns
- The exported code: str
Examples
Exporting a model in JSON format (available for Gradient Boosting and Robust Regression)
>>> json_export = model.export_apply_code('JSON')
APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.
Exporting SQL apply code (available for Robust Regression and Clustering)
>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS')
Exporting SQL apply code (probability generated in the output)
>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS',
...     other_params={'APL/ApplyExtraMode': 'Advanced Apply Settings',
...                   'APL/ApplyProba': 'true'})
- fit(data, key=None, features=None, label=None, weight=None, build_report=False)
Fits the model.
- Parameters
- data: DataFrame
The training dataset
- key: str, optional
The name of the ID column. This column will not be used as a feature in the model. It will be output as the row ID when predictions are made with the model. If key is not provided, an internal key is created, but this is not the recommended usage. See the notes below.
- features: list of str, optional
The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.
- label: str, optional
The name of the label column. Default is the last column.
- weight: str, optional
The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.
- build_report: bool, optional
Whether to build a report or not. Defaults to False.
- Returns
- self: object
Notes
It is highly recommended to specify a key column in the training dataset. If not, once the model is trained, it won't be possible anymore to have a key defined in any input dataset. That makes it particularly inconvenient to join the prediction output to the input dataset.
- generate_html_report(filename)
Save the model report as an HTML file.
- Parameters
- filename: str
The HTML file name.
- generate_notebook_iframe_report()
Render model report as a notebook iframe.
- get_apl_version()
Gets the version and configuration information about the installation of SAP HANA APL.
- Returns
- A pandas DataFrame with detailed information about the current version.
Notes
Error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.
- get_artifacts_recorder()
Return the object recorder (for Design-time artifacts generation)
- get_best_iteration()
Returns the iteration that has provided the best performance on the validation dataset during the model training.
- Returns
- int or pandas DataFrame
If no segment column is given, the best iteration.
If a segment column is given, a pandas DataFrame which contains the best iteration for each segment.
- get_debrief_report(report_name)
Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Developer Guide.
- Parameters
- report_name: str
- Returns
- Statistical report: hana_ml DataFrame
- get_evalmetrics()
Returns the values of the evaluation metric at each iteration. These values are based on the validation dataset.
- Returns
- Dictionary or pandas DataFrame
If no segment column is given, a dictionary: {'<MetricName>': <List of values>}.
If a segment column is given, a pandas DataFrame which contains the evaluation metrics for each segment.
- get_feature_importances()
Returns the feature importances.
- Returns
- Dictionary or pandas DataFrame
If no segment column is given, a dictionary: { <importance_metric> : OrderedDictionary({ <feature_name> : <value> }) }.
If a segment column is given, a pandas DataFrame which contains the feature importances for each segment.
- get_fit_operation_log()
Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs of the last model training.
- get_indicators()
Retrieves the Indicator table after model training.
- Returns
- The reference to the INDICATORS table: hana_ml DataFrame
This table provides the performance metrics of the last model training.
- get_model_info()
Get information about an existing model. This method is especially useful when a trained model was saved and reloaded. After calling this method, the model can provide its summary and metrics again, as they were at the last fit.
- Returns
- list
List of HANA DataFrames respectively corresponding to the following tables:
Summary table
Variable roles table
Variable description table
Indicators table
Profit curves table
- get_params()
Retrieves attributes of the current object. This method is implemented for compatibility with scikit-learn.
- Returns
- The attribute-values of the model: dictionary
- get_performance_metrics()
Returns the performance metrics of the last trained model.
- Returns
- Dictionary or pandas DataFrame
If no segment column is given, a dictionary with metric name as key and metric value as value.
If a segment column is given, a pandas DataFrame which contains the performance metrics for each segment.
Examples
>>> data = DataFrame(conn, 'SELECT * from APL_SAMPLES.CENSUS')
>>> model = GradientBoostingBinaryClassifier(conn)
>>> model.fit(data=data, key='id', label='class')
>>> model.get_performance_metrics()
{'AUC': 0.9385, 'PredictivePower': 0.8529, 'PredictionConfidence': 0.9759,...}
- get_predict_operation_log()
Retrieves the operation log table after a prediction.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs about the last prediction.
- get_summary()
Retrieves the summary table after model training.
- Returns
- The reference to the SUMMARY table: hana_ml DataFrame
This contains the execution summary of the last model training.
- is_fitted()
Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.
- Returns
- bool
True if the model is ready to be saved.
- load_model(schema_name, table_name, oid=None)
Loads the model from a table.
Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- oid: str, optional
If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.
- predict(data, prediction_type=None)
Makes predictions with the fitted model. It is possible to add special outputs, such as variable individual contributions, through the 'prediction_type' parameter.
- Parameters
- data: hana_ml DataFrame
The input dataset used for prediction
- prediction_type: string, optional
Can be:
- 'BestProbabilityAndDecision': return the probability value associated with the classification decision (default)
- 'Decision': return the classification decision
- 'Probability': return the probability that the row is a positive target (in binary classification) or the probabilities of all classes (in multiclass classification)
- 'Score': return raw prediction scores
- 'Individual Contributions': return SHAP values
- 'Explanations': return strength indicators based on SHAP values
- Returns
- Prediction output: hana_ml DataFrame
- save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)
Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.
- Parameters
- schema_name: str
The schema name
- artifact_df: hana_ml DataFrame
The artifact created after the fit or predict methods are called
- table_name: str
The table name
- if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises a ValueError
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one would like to save data into the same table.
Examples
>>> myModel.save_artifact(artifact_df=myModel.indicators_,
...                       schema_name='MySchema',
...                       table_name='MyModel_Indicators',
...                       if_exists='replace')
- save_model(schema_name, table_name, if_exists='fail', new_oid=None)
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).
Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
- Parameters
- schema_name: str
The schema name
- table_name: str
Table name
- if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises an Error
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
- Returns
- None
The model is saved into a table with the following columns:
"OID" NVARCHAR(50), -- Serve as ID
"FORMAT" NVARCHAR(50), -- APL technical info
"LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model
- schedule_fit(output_table_name_model, output_table_name_log, output_table_name_summary, output_table_name_indicators, **schedule_kwargs)
Creates a HANA scheduler job for the model fitting. It is a wrapper around the HANAScheduler.create_training_schedule() method and lets users explicitly specify arguments such as the output table names.
- Parameters
- output_table_name_model: str
The output table name for the model binary.
- output_table_name_log: str
The output table name for the log data.
- output_table_name_summary: str
The output table name for the model summary.
- output_table_name_indicators: str
The output table name for the model indicators.
- schedule_kwargs: kwargs dictionary
Arguments forwarded to the HANAScheduler.create_training_schedule method; please refer to its documentation.
Examples
>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.schedule_fit(
...     output_table_name_model='OUTPUT_MODEL_BINARY',
...     output_table_name_log='OUTPUT_FIT_LOG',
...     output_table_name_summary='OUTPUT_SUMMARY_LOG',
...     output_table_name_indicators='OUTPUT_FIT_INDICATORS',
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)
- schedule_predict(output_table_name_applyout, output_table_name_log, output_table_name_summary, **schedule_kwargs)
Creates a HANA scheduler job for the model prediction. It is a wrapper around the HANAScheduler.create_applying_schedule() method and lets users explicitly specify arguments such as the output table names.
- Parameters
- input_table_name_model: str
The input table name for the model binary.
- output_table_name_applyout: str
The output table name for the prediction data.
- output_table_name_log: str
The output table name for the log data.
- output_table_name_summary: str
The output table name for the model summary.
- schedule_kwargs: kwargs dictionary
Arguments forwarded to the HANAScheduler.create_applying_schedule method; please refer to its documentation.
Examples
>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.predict(data)
>>> model.schedule_predict(
...     output_table_name_applyout="OUTPUT_PREDICT_APPLYOUT",
...     output_table_name_log="OUTPUT_PREDICT_LOG",
...     output_table_name_summary="OUTPUT_PREDICT_SUMMARY",
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)
- set_framework_version(framework_version)
Switch v1/v2 version of report.
- Parameters
- framework_version: {'v2', 'v1'}, optional
v2: using the report builder framework. v1: using the pure HTML template.
Defaults to 'v2'.
- set_metric_samplings(roc_sampling: Optional[Sampling] = None, other_samplings: Optional[dict] = None)
Set metric samplings to report builder.
- Parameters
- roc_sampling: Sampling, optional
ROC sampling.
- other_samplings: dict, optional
Key is the column name of the metric table:
CUMGAINS
RANDOM_CUMGAINS
PERF_CUMGAINS
LIFT
RANDOM_LIFT
PERF_LIFT
CUMLIFT
RANDOM_CUMLIFT
PERF_CUMLIFT
Value is the sampling.
Examples
Creating the metric samplings:
>>> roc_sampling = Sampling(method='every_nth', interval=2)
>>> other_samplings = dict(CUMGAINS=Sampling(method='every_nth', interval=2),
...                        LIFT=Sampling(method='every_nth', interval=2),
...                        CUMLIFT=Sampling(method='every_nth', interval=2))
>>> model.set_metric_samplings(roc_sampling, other_samplings)
- set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)
Specifies hints for a scaling-out environment. The execution of APL functions can then be routed to a specific computation node. If all the parameters are None, i.e. not given, all the existing hints are cleared.
- Parameters
- route_to: str, optional
Routes the query to the specified volume ID or service type.
Defaults to None.
- no_route_to: str or list of str, optional
Avoids query routing to a specified volume ID or service type.
Defaults to None.
- route_by: str, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s).
Defaults to None.
- route_by_cardinality: str or list of str, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.
- data_transfer_cost: int, optional
Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.
Defaults to None.
- route_optimization_level: {'minimal', 'all'}, optional
Guides the optimizer to compile with the 'minimal' route optimization level or to default to 'all'. If the 'minimal' compiled plan is cached, it is compiled once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.
- workload_class: str, optional
Routes the query via the workload class. The route_to statement hint has higher precedence than the workload_class statement hint.
Defaults to None.
Examples
>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> model = GradientBoostingBinaryClassifier()
>>> # Routes the execution to a specific volume ID.
>>> model.set_scale_out(route_to=1025)
>>> # Routes the execution to a specific service type.
>>> # model.set_scale_out(route_to='computeserver')
>>> # Maps the execution to a specific workload class.
>>> # model.set_scale_out(workload_class="WC4")
>>> # Activates the SQL trace.
>>> # connection_context.sql_tracer.enable_sql_trace(True)
>>> model.fit(data=hdb_df, key='KEY', label='Y')
>>> # You can check whether the queries were effectively routed by querying:
>>> # select HOST, VOLUME_ID, APPLICATION_NAME, STATEMENT_STRING from M_SQL_PLAN_CACHE.
- set_shapley_explainer_of_predict_phase(shapley_explainer, display_force_plot=True)
Use the reason code generated during the prediction phase to build a ShapleyExplainer instance.
When this instance is passed in, its execution results will be included in the v2 report.
- Parameters
- shapley_explainer: ShapleyExplainer
A ShapleyExplainer instance.
- display_force_plot: bool, optional
Whether to display the force plot.
Defaults to True.
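Examples
A minimal sketch, assuming a ShapleyExplainer instance named shapley_explainer was built beforehand from the reason codes returned by predict(); see the hana_ml.visualizers documentation for how to create one:
>>> # shapley_explainer is assumed to be a pre-built ShapleyExplainer instance
>>> model.set_shapley_explainer_of_predict_phase(shapley_explainer,
...                                              display_force_plot=True)
>>> model.build_report()  # the explainer results are included in the v2 report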
- set_shapley_explainer_of_score_phase(shapley_explainer, display_force_plot=True)
Use the reason code generated during the scoring phase to build a ShapleyExplainer instance.
When this instance is passed in, its execution results will be included in the v2 report.
- Parameters
- shapley_explainer: ShapleyExplainer
A ShapleyExplainer instance.
- display_force_plot: bool, optional
Whether to display the force plot.
Defaults to True.
hana_ml.algorithms.apl.gradient_boosting_regression
This module provides the SAP HANA APL gradient boosting regression algorithm.
The following classes are available:
- class hana_ml.algorithms.apl.gradient_boosting_regression.GradientBoostingRegressor(conn_context=None, early_stopping_patience=None, eval_metric=None, learning_rate=None, max_depth=None, max_iterations=None, number_of_jobs=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)
Bases:
GradientBoostingBase, _UnifiedRegressionReportBuilder
SAP HANA APL Gradient Boosting Regression algorithm.
- Parameters
- conn_context: ConnectionContext, optional
The connection object to an SAP HANA database. This parameter is not needed anymore. It will be set automatically when a dataset is used in fit() or predict().
- early_stopping_patience: int, optional
If the performance does not improve after early_stopping_patience iterations, the model training will stop before reaching max_iterations. Please refer to APL documentation for default value.
- eval_metric: str, optional
The name of the metric used to evaluate the model performance on validation dataset along the boosting iterations. The possible values are 'MAE' and 'RMSE'. Please refer to APL documentation for default value.
- learning_rate: float, optional
The weight parameter controlling the model regularization to avoid overfitting risk. A small value improves the model generalization to unseen dataset at the expense of the computational cost. Please refer to APL documentation for default value.
- max_depth: int, optional
The maximum depth of the decision trees added as a base learner to the model at each boosting iteration. Please refer to APL documentation for default value.
- max_iterations: int, optional
The maximum number of boosting iterations to fit the model. Please refer to APL documentation for default value.
- number_of_jobs: int, optional
Deprecated.
- variable_storages: dict, optional
Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.
- variable_value_types: dict, optional
Specifies the variable value types (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals to '???', it will be taken as missing.
- extra_applyout_settings: dict, optional
Determines the output of the predict() method. The possible values are:
By default (None value): the default output.
<KEY>: the key column if provided in the dataset
TRUE_LABEL: the actual value if provided
PREDICTED: the predicted value
{'APL/ApplyExtraMode': 'Individual Contributions'}: the feature importance for every sample
<KEY>: the key column if provided
TRUE_LABEL: the actual value if provided
PREDICTED: the predicted value
gb_contrib_<VAR1>: the contribution of the VAR1 variable to the score
...
gb_contrib_<VARN>: the contribution of the VARN variable to the score
gb_contrib_constant_bias: the constant bias contribution
- other_params: dict, optional
Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:
'max_tasks'
'segment_column_name'
'correlations_lower_bound'
'correlations_max_kept'
'cutting_strategy'
'interactions'
'interactions_max_kept'
'variable_auto_selection'
'variable_selection_max_nb_of_final_variables'
'variable_selection_max_iterations'
'variable_selection_percentage_of_contribution_kept_by_step'
'variable_selection_quality_bar'
See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.
For 'max_tasks', see FUNC_HEADER.
- other_train_apl_aliases: dict, optional
Users can provide APL aliases as advanced settings to the model. Users are free to input any possible value.
See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.
Notes
It is highly recommended to specify a key column in the training dataset. If not, once the model is trained, it won't be possible anymore to have a key defined in any input dataset. The key is particularly useful to join the predictions output to the input dataset.
By default, if not provided, SAP HANA APL guesses the variable description by reading the first 100 rows. But, the results may be incorrect. The user can overwrite the guessed description by explicitly setting the variable_storages, variable_value_types and variable_missing_strings parameters. For example:
>>> model.set_params(variable_storages={
...     'ID': 'integer',
...     'sepal length (cm)': 'number'})
>>> model.set_params(variable_value_types={
...     'sepal length (cm)': 'continuous'})
>>> model.set_params(variable_missing_strings={
...     'sepal length (cm)': '-1'})
Examples
>>> from hana_ml.algorithms.apl.gradient_boosting_regression import GradientBoostingRegressor
>>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA
>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN,
...                     'SELECT "id", "class", "capital-gain", '
...                     '"native-country", "age" from APL_SAMPLES.CENSUS')
Creating and fitting the model
>>> model = GradientBoostingRegressor()
>>> model.fit(hana_df, label='age', key='id')
Getting variable interactions
>>> model.set_params(other_train_apl_aliases={
...     'APL/Interactions': 'true',
...     'APL/InteractionsMaxKept': '3'
... })
>>> model.fit(data=self._df_train, key=self._key, label=self._label)
>>> # Checks interaction info in INDICATORS table
>>> output = model.get_indicators().filter("KEY LIKE 'Interaction%'").collect()
Debriefing
>>> # Global performance metrics of the model
>>> model.get_performance_metrics()
{'L1': 7.31774, 'MeanAbsoluteError': 7.31774, 'L2': 9.42497, 'RootMeanSquareError': 9.42497, ...
>>> model.get_feature_importances()
{'Gain': OrderedDict([('class', 0.8728259801864624), ('capital-gain', 0.10493823140859604), ...
Generating the model report
>>> from hana_ml.visualizers.unified_report import UnifiedReport
>>> UnifiedReport(model).build().display()
Making predictions
>>> # Default output
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3)  # returns the output as a pandas DataFrame
          id  TRUE_LABEL  PREDICTED
39184  21772          27         25
16537   7331          33         43
7908   35226          65         42
>>> # Individual Contributions
>>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect().head(3)  # returns the output as a pandas DataFrame
     id  TRUE_LABEL  gb_contrib_workclass  gb_contrib_fnlwgt  gb_contrib_education  ...
0  6241          21             -1.330736          -0.385088              0.373539  ...
1  6248          18             -0.784536          -2.191791             -1.788672  ...
2  6253          26             -0.773891           0.358133             -0.185864  ...
Saving the model in the schema named 'MODEL_STORAGE'
>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model, if_exists='replace')
Reloading the model for new predictions
>>> model2 = model_storage.load_model(name='My model name')
>>> out2 = model2.predict(data=hana_df)
Please see model_storage class for further features of model storage
Exporting the model in JSON format
>>> json_export = model.export_apply_code('JSON')
APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.
- Attributes
- label: str
The target column name. This attribute is set when the fit() method is called. Users don't need to set it explicitly, except if the model is loaded from a table. In this case, this attribute must be set before calling predict().
- model_: hana_ml DataFrame
The trained model content
- summary_: APLArtifactTable
The reference to the "SUMMARY" table generated by the model training. This table contains the content of the model training summary.
- indicators_: APLArtifactTable
The reference to the "INDICATORS" table generated by the model training. This table contains various metrics related to the model and model variables.
- fit_operation_logs_: APLArtifactTable
The reference to the "OPERATION_LOG" table generated by the model training
- var_desc_: APLArtifactTable
The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training
- applyout_: hana_ml DataFrame
The predictions generated the last time the model was applied
- predict_operation_logs_: APLArtifactTable
The reference to the "OPERATION_LOG" table when a prediction was made
Methods
build_report([max_local_explanations]): Build model report.
disable_hana_execution(): HANA execution will be disabled and only SQL script will be generated.
enable_hana_execution(): HANA execution will be enabled.
export_apply_code(code_type[, key, label, ...]): Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL.
fit(data[, key, features, label, weight, ...]): Fits the model.
generate_html_report(filename): Save model report as a html file.
generate_notebook_iframe_report(): Render model report as a notebook iframe.
get_apl_version(): Gets the version and configuration information about the installation of SAP HANA APL.
get_artifacts_recorder(): Return the object recorder (for Design-time artifacts generation).
get_best_iteration(): Returns the iteration that has provided the best performance on the validation dataset during the model training.
get_debrief_report(report_name): Retrieves a standard statistical report.
get_evalmetrics(): Returns the values of the evaluation metric at each iteration.
get_feature_importances(): Returns the feature importances.
get_fit_operation_log(): Retrieves the operation log table after the model training.
get_indicators(): Retrieves the Indicator table after model training.
get_model_info(): Get information about an existing model.
get_params(): Retrieves attributes of the current object.
get_performance_metrics(): Returns the performance metrics of the last trained model.
get_predict_operation_log(): Retrieves the operation log table after a prediction.
get_summary(): Retrieves the summary table after model training.
is_fitted(): Checks if the model can be saved.
load_model(schema_name, table_name[, oid]): Loads the model from a table.
predict(data[, prediction_type]): Generates predictions with the fitted model.
save_artifact(artifact_df, schema_name, ...): Saves an artifact, a temporary table, into a permanent table.
save_model(schema_name, table_name[, ...]): Saves the model into a table.
schedule_fit(output_table_name_model, ...): Creates a HANA scheduler job for the model fitting.
schedule_predict(output_table_name_applyout, ...): Creates a HANA scheduler job for the model prediction.
score(data): Returns the R2 score (coefficient of determination) on the provided test dataset.
set_framework_version(framework_version): Switch v1/v2 version of report.
set_params(**parameters): Sets attributes of the current model.
set_scale_out([route_to, no_route_to, ...]): Specifies hints for scaling-out environment.
set_shapley_explainer_of_predict_phase(...): Use the reason code generated during the prediction phase to build a ShapleyExplainer instance.
set_shapley_explainer_of_score_phase(...): Use the reason code generated during the scoring phase to build a ShapleyExplainer instance.
- set_params(**parameters)
Sets attributes of the current model.
- Parameters
- parameters: dict
The attribute names and values
- predict(data, prediction_type=None)
Generates predictions with the fitted model. It is possible to add special outputs, such as variable individual contributions, through the 'prediction_type' parameter.
- Parameters
- data: hana_ml DataFrame
The input dataset used for prediction
- prediction_type: string, optional
Possible values:
'Score': returns the predicted value (default)
'Individual Contributions': returns SHAP values
'Explanations': returns strength indicators based on SHAP values
- Returns
- Prediction output: hana_ml DataFrame
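Examples
A minimal usage sketch, assuming a fitted model and an input hana_ml DataFrame named hana_df with the same structure as the training dataset:
>>> applyout_df = model.predict(hana_df, prediction_type='Individual Contributions')
>>> applyout_df.collect().head(3)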
- score(data)
Returns the R2 score (coefficient of determination) on the provided test dataset.
- Parameters
- data: hana_ml DataFrame
The test dataset used to compute the score. The labels must be provided in the dataset.
- Returns
- Float or pandas DataFrame
If no segment column is given, the R2 score.
If a segment column is given, a pandas DataFrame which contains the R2 score for each segment.
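Examples
A minimal usage sketch, assuming a test hana_ml DataFrame named test_df that contains the label column:
>>> model.score(data=test_df)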
- build_report(max_local_explanations=100)
Build model report.
- Parameters
- max_local_explanations: int, optional
The maximum number of local explanations displayed in the report.
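Examples
A minimal usage sketch; the value 50 is only an illustration:
>>> model.build_report(max_local_explanations=50)
>>> model.generate_notebook_iframe_report()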
- disable_hana_execution()
HANA execution will be disabled and only SQL script will be generated.
- enable_hana_execution()
HANA execution will be enabled.
- export_apply_code(code_type, key=None, label=None, schema_name=None, table_name=None, other_params=None)
Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL. See Function Reference > Predictive Model Services > EXPORT_APPLY_CODE in the SAP HANA APL Developer Guide.
- Parameters
- code_type: str
The type of code exported. The supported code types for each type of model are given in the developer guide (see APL/CodeType).
- key: str, optional
The name of the primary key column. Required for some code types.
- label: str, optional
The name of the label (target) column. Used only when the model supports multiple targets. When left empty (""), all targets are generated.
- schema_name: str, optional
The schema name of the apply-in table. Required for some code types.
- table_name: str, optional
The apply-in table name. Required for some code types.
- other_params: dict, optional
The additional parameters to be included in the configuration. The available parameters are given in the developer guide.
- Returns
- The exported code: str
Examples
Exporting a model in JSON format (available for Gradient Boosting and Robust Regression)
>>> json_export = model.export_apply_code('JSON')
APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.
Exporting SQL apply code (available for Robust Regression and Clustering)
>>> sql = model.export_apply_code( ... code_type='HANA', ... key='id', ... schema_name='APL_SAMPLES', ... table_name='CENSUS')
Exporting SQL apply code (probability generated in the output)
>>> sql = model.export_apply_code( ... code_type='HANA', ... key='id', ... schema_name='APL_SAMPLES', ... table_name='CENSUS', ... other_params={'APL/ApplyExtraMode': 'Advanced Apply Settings', ... 'APL/ApplyProba': 'true'})
- fit(data, key=None, features=None, label=None, weight=None, build_report=False)
Fits the model.
- Parameters
- data: DataFrame
The training dataset
- key: str, optional
The name of the ID column. This column will not be used as feature in the model. It will be output as row-id when prediction is made with the model. If key is not provided, an internal key is created. But this is not the recommended usage. See notes below.
- features: list of str, optional
The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.
- label: str, optional
The name of the label column. Default is the last column.
- weight: str, optional
The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.
- build_report: bool, optional
Whether to build report or not. Defaults to False.
- Returns
- self: object
Notes
It is highly recommended to specify a key column in the training dataset. If not, once the model is trained, it won't be possible anymore to have a key defined in any input dataset. That is particularly inconvenient to join the predictions output to the input dataset.
- generate_html_report(filename)
Save model report as a html file.
- Parameters
- filename: str
Html file name.
- generate_notebook_iframe_report()
Render model report as a notebook iframe.
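Examples
A minimal usage sketch; the file name is only an illustration:
>>> model.build_report()
>>> model.generate_html_report('gb_regressor_report.html')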
- get_apl_version()
Gets the version and configuration information about the installation of SAP HANA APL.
- Returns
- A pandas DataFrame with detailed information about the current version.
Notes
Error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.
- get_artifacts_recorder()
Return the object recorder (for Design-time artifacts generation)
- get_best_iteration()
Returns the iteration that has provided the best performance on the validation dataset during the model training.
- Returns
- int or pandas DataFrame
If no segment column is given, the best iteration.
If a segment column is given, a pandas DataFrame which contains the best iteration for each segment.
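Examples
A minimal sketch, assuming early stopping is enabled at fit time (the parameter values are only illustrative):
>>> model.set_params(early_stopping_patience=10, max_iterations=500)
>>> model.fit(data=hana_df, key='id', label='age')
>>> model.get_best_iteration()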
- get_debrief_report(report_name)
Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Developer Guide.
- Parameters
- report_name: str
- Returns
- Statistical report: hana_ml DataFrame
- get_evalmetrics()
Returns the values of the evaluation metric at each iteration. These values are based on the validation dataset.
- Returns
- Dictionary or pandas DataFrame
If no segment column is given, a dictionary: {'<MetricName>': <List of values>}.
If a segment column is given, a pandas DataFrame which contains the evaluation metrics for each segment.
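Examples
A minimal usage sketch; the metric name depends on the eval_metric setting and the values shown are purely illustrative:
>>> model.get_evalmetrics()
{'RMSE': [9.8, 9.6, 9.5, ...]}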
- get_feature_importances()
Returns the feature importances.
- Returns
- Dictionary or pandas DataFrame
If no segment column is given, a dictionary: { <importance_metric> : OrderedDictionary({ <feature_name> : <value> }) }.
If a segment column is given, a pandas DataFrame which contains the feature importances for each segment.
- get_fit_operation_log()
Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs of the last model training
- get_indicators()
Retrieves the Indicator table after model training.
- Returns
- The reference to the INDICATORS table: hana_ml DataFrame
This table provides the performance metrics of the last model training
- get_model_info()
Get information about an existing model. This method is especially useful when a trained model was saved and reloaded. After having called this method, the model can provide summary and metrics again as there were in the last fit.
- Returns
- list
List of HANA DataFrames respectively corresponding to the following tables:
Summary table
Variable roles table
Variable description table
Indicators table
Profit curves table
- get_params()
Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.
- Returns
- The attribute-values of the model: dictionary
- get_performance_metrics()
Returns the performance metrics of the last trained model.
- Returns
- Dictionary or pandas DataFrame
If no segment column is given, a dictionary with metric name as key and metric value as value.
If a segment column is given, a pandas DataFrame which contains the performance metrics for each segment.
Examples
>>> data = DataFrame(conn, 'SELECT * from APL_SAMPLES.CENSUS')
>>> model = GradientBoostingBinaryClassifier(conn)
>>> model.fit(data=data, key='id', label='class')
>>> model.get_performance_metrics()
{'AUC': 0.9385, 'PredictivePower': 0.8529, 'PredictionConfidence': 0.9759, ...}
- get_predict_operation_log()
Retrieves the operation log table after a prediction.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs about the last prediction
- get_summary()
Retrieves the summary table after model training.
- Returns
- The reference to the SUMMARY table: hana_ml DataFrame
This contains execution summary of the last model training
- is_fitted()
Checks if the model can be saved. To be overridden if the model is not stored in model_ attribute.
- Returns
- bool
True if the model is ready to be saved.
- load_model(schema_name, table_name, oid=None)
Loads the model from a table.
Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- oid: str, optional
If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.
- save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)
Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.
- Parameters
- schema_name: str
The schema name
- artifact_df: hana_ml DataFrame
The artifact created after fit or predict methods are called
- table_name: str
The table name
- if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises a ValueError
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one would like to save data into the same table.
Examples
>>> myModel.save_artifact(artifact_df=myModel.indicators_,
...                       schema_name='MySchema',
...                       table_name='MyModel_Indicators',
...                       if_exists='replace')
- save_model(schema_name, table_name, if_exists='fail', new_oid=None)
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).
Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
- Parameters
- schema_name: str
The schema name
- table_name: str
Table name
- if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises an Error
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
- Returns
- None
The model is saved into a table with the following columns:
"OID" NVARCHAR(50), -- Serve as ID
"FORMAT" NVARCHAR(50), -- APL technical info
"LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model
- schedule_fit(output_table_name_model, output_table_name_log, output_table_name_summary, output_table_name_indicators, **schedule_kwargs)
Creates a HANA scheduler job for the model fitting. It is a wrapper function of the HANAScheduler.create_training_schedule() method, and it lets users specify arguments such as the output table names explicitly.
- Parameters
- output_table_name_model: str
The output table name for the model binary.
- output_table_name_log: str
The output table name for the log data.
- output_table_name_summary: str
The output table name for the model summary.
- output_table_name_indicators: str
The output table name for the model indicators.
- schedule_kwargs: kwargs dictionary
Arguments forwarded to the HANAScheduler.create_training_schedule method. Please refer to the documentation of the HANAScheduler.create_training_schedule method.
Examples
>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.schedule_fit(
...     output_table_name_model='OUTPUT_MODEL_BINARY',
...     output_table_name_log='OUTPUT_FIT_LOG',
...     output_table_name_summary='OUTPUT_SUMMARY_LOG',
...     output_table_name_indicators='OUTPUT_FIT_INDICATORS',
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)
- schedule_predict(output_table_name_applyout, output_table_name_log, output_table_name_summary, **schedule_kwargs)
Creates a HANA scheduler job for the model prediction. It is a wrapper function of the HANAScheduler.create_applying_schedule() method, and it lets users specify arguments such as the output table names explicitly.
- Parameters
- input_table_name_model: str
The input table name for the model binary.
- output_table_name_applyout: str
The output table name for the prediction data.
- output_table_name_log: str
The output table name for the log data.
- output_table_name_summary: str
The output table name for the model summary.
- schedule_kwargs: kwargs dictionary
Arguments forwarded to the HANAScheduler.create_applying_schedule method. Please refer to the documentation of the HANAScheduler.create_applying_schedule method.
Examples
>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.predict(data)
>>> model.schedule_predict(
...     output_table_name_applyout="OUTPUT_PREDICT_APPLYOUT",
...     output_table_name_log="OUTPUT_PREDICT_LOG",
...     output_table_name_summary="OUTPUT_PREDICT_SUMMARY",
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)
- set_framework_version(framework_version)
Switch v1/v2 version of report.
- Parameters
- framework_version: {'v2', 'v1'}, optional
v2: using report builder framework. v1: using pure html template.
Defaults to 'v2'.
- set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)
Specifies hints for a scaling-out environment. The execution of APL functions can then be routed to a specific computation node. If all the parameters are None, i.e. not given, all the existing hints are cleared.
- Parameters
- route_to: str, optional
Routes the query to the specified volume ID or service type.
Defaults to None.
- no_route_to: str or list of str, optional
Avoids query routing to a specified volume ID or service type.
Defaults to None.
- route_by: str, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s).
Defaults to None.
- route_by_cardinality: str or list of str, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.
- data_transfer_cost: int, optional
Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.
Defaults to None.
- route_optimization_level: {'minimal', 'all'}, optional
Guides the optimizer to compile with the 'minimal' route optimization level or to default to 'all'. If the 'minimal' compiled plan is cached, it is compiled once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.
- workload_class: str, optional
Routes the query via the workload class. The route_to statement hint has higher precedence than the workload_class statement hint.
Defaults to None.
Examples
>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> model = GradientBoostingBinaryClassifier()
>>> # Routes the execution to a specific volume ID.
>>> model.set_scale_out(route_to=1025)
>>> # Routes the execution to a specific service type.
>>> # model.set_scale_out(route_to='computeserver')
>>> # Maps the execution to a specific workload class.
>>> # model.set_scale_out(workload_class="WC4")
>>> # Activates the SQL trace.
>>> # connection_context.sql_tracer.enable_sql_trace(True)
>>> model.fit(data=hdb_df, key='KEY', label='Y')
>>> # You can check whether the queries were effectively routed by querying:
>>> # select HOST, VOLUME_ID, APPLICATION_NAME, STATEMENT_STRING from M_SQL_PLAN_CACHE.
- set_shapley_explainer_of_predict_phase(shapley_explainer, display_force_plot=True)
Use the reason code generated during the prediction phase to build a ShapleyExplainer instance.
When this instance is passed in, its execution results will be included in the v2 report.
- Parameters
- shapley_explainer: ShapleyExplainer
A ShapleyExplainer instance.
- display_force_plot: bool, optional
Whether to display the force plot.
Defaults to True.
- set_shapley_explainer_of_score_phase(shapley_explainer, display_force_plot=True)
Use the reason code generated during the scoring phase to build a ShapleyExplainer instance.
When this instance is passed in, its execution results will be included in the v2 report.
- Parameters
- shapley_explainer: ShapleyExplainer
A ShapleyExplainer instance.
- display_force_plot: bool, optional
Whether to display the force plot.
Defaults to True.
hana_ml.algorithms.apl.time_series
This module contains the SAP HANA APL Time Series algorithm.
The following class is available:
- class hana_ml.algorithms.apl.time_series.AutoTimeSeries(conn_context=None, time_column_name=None, target=None, horizon=1, with_extra_predictable=True, last_training_time_point=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, train_data_=None, sort_data=True, **other_params)
Bases:
APLBase
SAP HANA APL Time Series algorithm.
- Parameters
- target: str
The name of the column containing the time series data points.
- time_column_name: str
The name of the column containing the time series time points. The time column is used as table key. It can be overridden by setting the 'key' parameter through the fit() method.
- last_training_time_point: str, optional
The last time point used for model training. The training dataset will contain all data points up to this date. By default, this parameter will be set as the last time point until which the target is not null.
- horizon: int, optional
The number of forecasts to be generated by the model upon apply. The time series model will be trained to optimize accuracy on the requested horizon only. The default value is 1.
- with_extra_predictable: bool, optional
If set to true, all input variables will be used by the model to generate forecasts. If set to false, only the time and target columns will be used. All other variables will be ignored. This parameter is set to true by default.
- variable_storages: dict, optional
Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.
- variable_value_types: dict, optional
Specifies the variable value types (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals '???', it will be taken as missing.
- extra_applyout_settings: dict, optional
Specifies the prediction outputs. See documentation on predict() method for more details.
- sort_data: bool
If True, a temporary view is created on the dataset to sort data by time. However, users can directly provide a view with sorted dates. In this case, they must set sort_data to False to avoid creating a new view. The default value is True. WARNING: it is recommended to leave this parameter at its default value so that the data is guaranteed to be read in sorted order. If the data is not sorted, the model will fail.
- other_params: dict, optional
Corresponds to the advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:
'max_tasks'
'segment_column_name'
'force_negative_forecast'
'force_positive_forecast'
'forecast_fallback_method'
'forecast_max_cyclics'
'forecast_max_lags'
'forecast_method'
'smoothing_cycle_length'
See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.
For 'max_tasks', see FUNC_HEADER.
- other_train_apl_aliases: dict, optional
Users can provide APL aliases as advanced settings to the model. Unlike 'other_params' described above, users are free to input any possible value. There is no control in Python.
Notes
The input dataset, given as a hana_ml DataFrame, must not be a temporary table because the API tries to create a view sorted by the time column. SAP HANA does not allow users to create a view on a temporary table. However, even though it is not recommended, the user can set the sort_data parameter to False to avoid creating the view.
When calling the fit_predict() method, the time series model is generated on the fly and not returned. If a model must be saved, please consider using the fit() method instead.
When extra-predictable variables are involved, it is usual to have a single dataset used both for the model training and the forecasting. In this case, the dataset should contain two successive periods:
The first one is used for the model training, ranging from the beginning to the last date where the target value is not null.
The second one is used for the forecasting, ranging from the first date where the target value is null.
The content of the output of the get_performance_metrics() method may change depending on the version of SAP HANA APL used with this API. Please refer to the SAP HANA APL documentation to know which metrics will be provided.
Examples
>>> from hana_ml.algorithms.apl.time_series import AutoTimeSeries
>>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA
>>> CONN = ConnectionContext(HDB_HOST, HDB_PORT, HDB_USER, HDB_PASS)
>>> # -- Creates Hana DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CASHFLOWS_FULL')
Creating and fitting the model
>>> model = AutoTimeSeries(time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(data=hana_df)
Debriefing
>>> model.get_model_components()
{'Trend': 'Polynom( Date)', 'Cycles': 'PeriodicExtrasPred_MondayMonthInd', 'Fluctuations': 'AR(46)'}
>>> model.get_performance_metrics()
{'MAPE': [0.12853715702893018, 0.12789963348617622, 0.12969031859857874], ...}
Generating forecasts using the forecast() method
This method is used to generate forecasts using a signature similar to the one used in PAL. There are two variants of usage as described below:
1) If the model does not use extra-predictable variables (no exogenous variable), users must simply specify the number of forecasts.
>>> train_df = DataFrame(CONN,
...                      'SELECT "Date" , "Cash" '
...                      'from APL_SAMPLES.CASHFLOWS_FULL ORDER BY 1 LIMIT 100')
>>> model = AutoTimeSeries(time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(forecast_length=3)
>>> out.collect().tail(5)
           Date                             ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
98   2001-05-23   3057.812544999999772699132909775  4593.966530              NaN              NaN
99   2001-05-25   3037.539714999999887176132440567  4307.893346              NaN              NaN
100  2001-05-26                               None  4206.023158     -3609.599872     12021.646187
101  2001-05-27                               None  4575.162651     -3392.283802     12542.609104
102  2001-05-28                               None  4830.352462     -3239.507360     12900.212284
2) If the model uses extra-predictable variables, users must provide the values of all extra-predictable variables for each time point of the forecast period. These values must be provided as a hana_ml dataframe with the same structure as the training dataset.
>>> # Trains the dataset with extra-predictable variables
>>> train_df = DataFrame(CONN,
...                      'SELECT * '
...                      'from APL_SAMPLES.CASHFLOWS_FULL '
...                      'WHERE "Cash" is not null')
>>> # Extra-predictable variables' values on the forecast period
>>> forecast_df = DataFrame(CONN,
...                         'SELECT * '
...                         'from APL_SAMPLES.CASHFLOWS_FULL '
...                         'WHERE "Cash" is null LIMIT 5')
>>> model = AutoTimeSeries(time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(data=forecast_df)
>>> out.collect().tail(5)
           Date ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
251  2001-12-29   None  6864.371407      -224.079492     13952.822306
252  2001-12-30   None  6889.515324      -211.264912     13990.295559
253  2001-12-31   None  6914.766513      -187.180923     14016.713949
254  2002-01-01   None  6940.124974              NaN              NaN
255  2002-01-02   None  6965.590706              NaN              NaN
Generating forecasts with the predict() method.
The predict() method allows users to apply a fitted model on a dataset different from the training dataset. For example, users can train a dataset on the first quarter (January to March) and apply the model on a dataset of different period (March to May).
>>> # Trains the model on the first quarter, from January to March
>>> train_df = DataFrame(CONN,
...                      'SELECT "Date", "Cash" '
...                      'from APL_SAMPLES.CASHFLOWS_FULL '
...                      'where "Date" between \'2001-01-01\' and \'2001-03-31\' '
...                      'ORDER BY 1')
>>> model.fit(train_df)
>>> # Forecasts on a shifted period, from March to May
>>> test_df = DataFrame(CONN,
...                     'SELECT "Date", "Cash" '
...                     'from APL_SAMPLES.CASHFLOWS_FULL '
...                     'where "Date" between \'2001-03-01\' and \'2001-05-31\' '
...                     'ORDER BY 1')
>>> out = model.predict(test_df)
>>> out.collect().tail(5)
          Date                             ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
60  2001-05-30   3837.196734000000105879735597214  4630.223083              NaN              NaN
61  2001-05-31   2911.884261000000151398126928726  4635.265982              NaN              NaN
62  2001-06-01                               None  4538.516542     -1087.461104     10164.494188
63  2001-06-02                               None  4848.815364     -5090.167255     14787.797983
64  2001-06-03                               None  4853.858263     -5138.553275     14846.269801
Using the fit_predict() method
This method enables the user to fit a model and generate forecasts on a single call, and thus get results faster. However, the model is created on the fly and deleted after use, so the user will not be able to save the resulting model.
>>> out = model.fit_predict(hana_df)
>>> out.collect().tail(5)
           Date            ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999  6055.761105              NaN              NaN
250  2001-12-28  7111.41669699999  6314.336098              NaN              NaN
251  2002-01-03              None  7033.880804      4529.462710      9538.298899
252  2002-01-04              None  6464.557223      3965.343397      8963.771049
253  2002-01-07              None  6469.141663      3961.414900      8976.868427
Breaking down the time series into trend, cycles, fluctuations and residuals components.
If the parameter extra_applyout_settings is set to {'ExtraMode': True}, anytime a forecast method is called, predict(), forecast() or fit_predict(), the output will contain time series components and their corresponding residuals. The prediction columns are suffixed by the horizon number. For instance, 'Cycles_RESIDUALS_3' means the residual of the cycle component in the third horizon.
>>> model.fit(train_df)
>>> model.set_params(extra_applyout_settings={'ExtraMode': True})
>>> out = model.predict(hana_df)
>>> out.collect().tail(5)
           Date                  ACTUAL  ...  Cycles_RESIDUALS_3  Fluctuations_RESIDUALS_3
249  2001-12-27  5995.42329499392507553  ...               32.51                  4.48e-13
250  2001-12-28  7111.41669699455205917  ...             -644.77                  1.14e-13
251  2002-01-03                    None  ...                 NaN                       NaN
252  2002-01-04                    None  ...                 NaN                       NaN
253  2002-01-07                    None  ...                 NaN                       NaN
Users can change the fields that are included in the output by using the APL/ApplyExtraMode alias in extra_applyout_settings, for instance: {'APL/ApplyExtraMode': 'First Forecast with Stable Components and Residues and Error Bars'}. Please check the SAP HANA APL documentation to know which values are available for APL/ApplyExtraMode. See Function Reference > Predictive Model Services > APPLY_MODEL > Advanced Apply Settings in the SAP HANA APL Developer Guide.
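For instance, a minimal sketch using the alias value mentioned above:
>>> model.set_params(extra_applyout_settings={
...     'APL/ApplyExtraMode':
...         'First Forecast with Stable Components and Residues and Error Bars'})
>>> out = model.predict(hana_df)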
- Attributes
- model_: hana_ml DataFrame
The trained model content
- summary_: APLArtifactTable
The reference to the "SUMMARY" table generated by the model training. This table contains the summary about the model training.
- indicators_: APLArtifactTable
The reference to the "INDICATORS" table generated by the model training. This table contains the various metrics related to the model and its variables.
- fit_operation_logs_: APLArtifactTable
The reference to the "OPERATION_LOG" table generated by the model training
- var_desc_: APLArtifactTable
The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training
- applyout_: hana_ml DataFrame
The predictions generated the last time the model was applied
- predict_operation_logs_: APLArtifactTable
The reference to the "OPERATION_LOG" table that is produced when making predictions.
- train_data_: hana_ml DataFrame
The train dataset
Methods
build_report([segment_name]): Build model report.
disable_hana_execution(): HANA execution will be disabled and only SQL script will be generated.
enable_hana_execution(): HANA execution will be enabled.
export_apply_code(code_type[, key, label, ...]): Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL.
fit(data[, key, features, build_report]): Fits the model.
fit_predict(data[, key, features, horizon, ...]): Fits a model and generates forecasts in a single call to the FORECAST APL function.
forecast([forecast_length, data, build_report]): Uses the fitted model to generate out-of-sample forecasts.
generate_html_report(filename): Save model report as a html file.
generate_notebook_iframe_report(): Render model report as a notebook iframe.
get_apl_version(): Gets the version and configuration information about the installation of SAP HANA APL.
get_artifacts_recorder(): Return the object recorder (for Design-time artifacts generation).
get_debrief_report(report_name): Retrieves a standard statistical report.
get_fit_operation_log(): Retrieves the operation log table after the model training.
get_horizon_wide_metric([metric_name]): Returns value of performance metric (MAPE, sMAPE, ...) averaged on the forecast horizon.
get_indicators(): Retrieves the Indicator table after model training.
get_model_components(): Returns the description of the model components, that is trend, cycles and fluctuations, used by the model to generate forecasts.
get_model_info(): Get information about an existing model.
get_params(): Retrieves attributes of the current object.
get_performance_metrics(): Returns the performance metrics of the model.
get_predict_operation_log(): Retrieves the operation log table after a prediction.
get_summary(): Retrieves the summary table after model training.
is_fitted(): Checks if the model can be saved.
load_model(schema_name, table_name[, oid]): Loads the model from a table.
predict(data[, apply_horizon, ...]): Uses the fitted model to generate forecasts.
save_artifact(artifact_df, schema_name, ...): Saves an artifact, a temporary table, into a permanent table.
save_model(schema_name, table_name[, ...]): Saves the model into a table.
schedule_fit(output_table_name_model, ...): Creates a HANA scheduler job for the model fitting.
schedule_predict(output_table_name_applyout, ...): Creates a HANA scheduler job for the model prediction.
set_params(**parameters): Sets attributes of the current model.
set_scale_out([route_to, no_route_to, ...]): Specifies hints for scaling-out environment.
- set_params(**parameters)
Sets attributes of the current model.
- Parameters
- parameters: dict
Contains attribute names and values in the form of keyword arguments
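Examples
A minimal usage sketch, using parameters documented above:
>>> model.set_params(horizon=5)
>>> model.set_params(extra_applyout_settings={'ExtraMode': True})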
- fit(data, key=None, features=None, build_report=False)
Fits the model.
- Parameters
- data: hana_ml DataFrame
The training dataset
- key: str, optional
The column used as row identifier of the dataset. This column corresponds to the time column name. As a result, setting this parameter will overwrite the time_column_name model setting.
- features: list of str, optional
The names of the feature columns, meaning the date column and the extra-predictive variables. If features is not provided, it defaults to all columns except the target column.
- build_report: bool, optional
Whether to build report or not. Defaults to False.
- Returns
- self: object
- predict(data, apply_horizon=None, apply_last_time_point=None, build_report=False)
Uses the fitted model to generate forecasts.
- Parameters
- data: hana_ml DataFrame
The input dataset used for predictions
- apply_horizon: int, optional
The number of forecasts to generate. By default, the number of forecasts is the horizon on which the model was trained.
- apply_last_time_point: str, optional
The time point corresponding to the start of the forecast period. Forecasts will be generated starting from the next time point after the 'apply_last_time_point'. By default, this parameter is set to the value of 'last_training_time_point' known from the model training.
- build_report: bool, optional
Whether to build report or not. Defaults to False.
- Returns
- hana_ml DataFrame
By default the output contains the following columns:
<the name of the time column>
ACTUAL: the actual value of time series
PREDICTED: the forecast value
LOWER_INT_95PCT: the lower limit of 95% confidence interval
UPPER_INT_95PCT: the upper limit of 95% confidence interval
If ExtraMode is set to true, the output dataframe will also contain the breakdown of the time series into trend, cycles, fluctuations and residuals components.
Examples
Default output
>>> out = model.predict(hana_df)
>>> out.collect().tail(5)
           Date            ACTUAL   PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999  6055.761105             NaN              NaN
250  2001-12-28  7111.41669699999  6314.336098             NaN              NaN
251  2002-01-03              None   7033.88080      4529.46271       9538.29889
252  2002-01-04              None   6464.55722      3965.34339       8963.77104
253  2002-01-07              None   6469.14166      3961.41490       8976.86842
Retrieving forecasts and components (predicted, trend, cycles and fluctuations).
The output columns are suffixed with the horizon index. For example, Trend_1 means the trend component of the first horizon.
>>> model.set_params(extra_applyout_settings={'ExtraMode': True})
>>> out = model.predict(hana_df)
>>> out.collect().tail(5)
           Date                                ACTUAL  PREDICTED_1      Trend_1
249  2001-12-27  5995.423294999999598076101392507553  6055.761105  6814.405390  ...
250  2001-12-28  7111.416696999999658146407455205917  6314.336098  6839.334762  ...
251  2002-01-03                                 None  7033.880804  6991.163710  ...
252  2002-01-04                                 None  6464.557223  7016.843985  ...
253  2002-01-07                                 None  6469.141663  7094.528433  ...
Users can change the fields that are included in the output by using the APL/ApplyExtraMode alias in extra_applyout_settings, for instance: {'APL/ApplyExtraMode': 'First Forecast with Stable Components and Residues and Error Bars'}. Please check the SAP HANA APL documentation to know which values are available for APL/ApplyExtraMode. See Function Reference > Predictive Model Services > APPLY_MODEL > Advanced Apply Settings in the SAP HANA APL Developer Guide.
- fit_predict(data, key=None, features=None, horizon=None, build_report=False)
Fits a model and generates forecasts in a single call to the FORECAST APL function. This method offers a faster way to perform the model training and forecasting.
However, the user will not have access to the model used internally since it is deleted after the computation of the forecasts.
- Parameters
- data: hana_ml DataFrame
The input time series dataset
- key: str, optional
The date column name. By default, it is equal to the model parameter time_column_name. If it is given, the model parameter time_column_name will be overwritten.
- features: list of str, optional
The column names corresponding to the extra-predictable variables (exogenous variables). If features is not provided, it is equal to all columns except the target column.
- horizon: int, optional
The number of forecasts to generate. The default value equals to the horizon parameter of the model.
- build_report: bool, optional
Whether to build report or not. Defaults to False.
- Returns
- hana_ml DataFrame
The output is the same as the predict() method.
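Examples
A minimal usage sketch, assuming a time series hana_ml DataFrame named hana_df:
>>> out = model.fit_predict(data=hana_df, horizon=5)
>>> out.collect().tail(5)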
- forecast(forecast_length=None, data=None, build_report=False)
Uses the fitted model to generate out-of-sample forecasts. The model is supposed to be already fitted with a given dataset (training dataset). This method forecasts over a number of steps after the end of the training dataset. When there are extra-predictive variables (exogenous variables), the input parameter data is required. It must contain the values of the extra-predictable variables for the forecast period. If there is no extra-predictive variable, only the forecast_length parameter is needed.
- Parameters
- forecast_length: int, optional
The number of forecasts to generate from the end of the train dataset. This parameter is by default the horizon specified in the model parameter.
- data: hana_ml DataFrame, optional
The time series with extra-predictable variables used for forecasting. This parameter is required if extra-predictive variables are used in the model. When this parameter is given, the parameter 'forecast_length' is ignored.
- build_report: bool, optional
Whether to build report or not. Defaults to False.
- Returns
- hana_ml DataFrame
The output is the same as the predict() method.
Examples
Case where there is no extra-predictable variable:
>>> train_df = DataFrame(CONN,
...                      'SELECT "Date" , "Cash" '
...                      'from APL_SAMPLES.CASHFLOWS_FULL '
...                      'where "Cash" is not null '
...                      'ORDER BY 1')
>>> print(train_df.collect().tail(5))
           Date         Cash
246  2001-12-20  6382.441052
247  2001-12-21  5652.882539
248  2001-12-26  5081.372996
249  2001-12-27  5995.423295
250  2001-12-28  7111.416697
>>> model = AutoTimeSeries(CONN, time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(forecast_length=3)
>>> out.collect().tail(5)
           Date                        ACTUAL    PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.42329499999901392507553  6814.405390              NaN              NaN
250  2001-12-28  7111.41669699999907455205917  6839.334762              NaN              NaN
251  2001-12-29                          None  6864.371407      -224.079492     13952.822306
252  2001-12-30                          None  6889.515324      -211.264912     13990.295559
253  2001-12-31                          None  6914.766513      -187.180923     14016.713949
Case where there are extra-predictable variables:
>>> train_df = DataFrame(CONN,
...                      'SELECT * '
...                      'from APL_SAMPLES.CASHFLOWS_FULL '
...                      'WHERE "Cash" is not null '
...                      'ORDER BY 1')
>>> print(train_df.collect().tail(5))
           Date  WorkingDaysIndices  ...  BeforeLastWMonth         Cash
246  2001-12-20                  13  ...                 1  6382.441052
247  2001-12-21                  14  ...                 1  5652.882539
248  2001-12-26                  15  ...                 0  5081.372996
249  2001-12-27                  16  ...                 0  5995.423295
250  2001-12-28                  17  ...                 0  7111.416697
>>> # Extra-predictable variables to be provided as the forecast period
>>> forecast_df = DataFrame(CONN,
...                         'SELECT * '
...                         'from APL_SAMPLES.CASHFLOWS_FULL '
...                         'WHERE "Cash" is null '
...                         'ORDER BY 1 '
...                         'LIMIT 3')
>>> print(forecast_df.collect())
         Date  WorkingDaysIndices  ...  BeforeLastWMonth  Cash
0  2002-01-03                   0  ...                 0  None
1  2002-01-04                   1  ...                 0  None
2  2002-01-07                   2  ...                 0  None
>>> model = AutoTimeSeries(CONN, time_column_name='Date', target='Cash', horizon=3)
>>> model.fit(train_df)
>>> out = model.forecast(data=forecast_df)
>>> out.collect().tail(5)
           Date                          ACTUAL  PREDICTED  LOWER_INT_95PCT  UPPER_INT_95PCT
249  2001-12-27  5995.4232949999996101392507553    6814.41              NaN              NaN
250  2001-12-28  7111.4166969999996407455205917    6839.33              NaN              NaN
251  2001-12-29                            None    6864.37          -224.08         13952.82
252  2001-12-30                            None    6889.52          -211.26         13990.30
253  2001-12-31                            None    6914.77          -187.18         14016.71
- get_model_components()
Returns the description of the model components, that is trend, cycles and fluctuations, used by the model to generate forecasts.
- Returns
- Dictionary or pandas DataFrame
If no segment column is given, a dictionary with 3 possible keys: 'Trend', 'Cycles', 'Fluctuations'.
If a segment column is given, a pandas DataFrame which contains the model components for each segment.
Examples
>>> model.get_model_components()
{"Trend": "Linear(TIME)", "Cycles": None, "Fluctuations": "AR(36)"}
- get_performance_metrics()
Returns the performance metrics of the model. The metrics are provided for each forecast horizon.
- Returns
- Dictionary or pandas DataFrame
If no segment column is given, a dictionary in which each metric is associated with a list containing <horizon> elements.
If a segment column is given, a pandas DataFrame which contains the metric values for each segment.
Examples
A model is trained with 4 horizons. The returned value will be:
>>> model.get_performance_metrics()
{'MAPE': [0.1529961017445385, 0.1538823292343699, 0.1564376267423695, 0.15170398377407046]}
- get_horizon_wide_metric(metric_name='MAPE')
Returns value of performance metric (MAPE, sMAPE, ...) averaged on the forecast horizon.
- Parameters
- metric_name: str
Default value equals 'MAPE'. Possible values: 'MAPE', 'MPE', 'MeanAbsoluteError', 'RootMeanSquareError', 'SMAPE', 'L1', 'L2', 'P2', 'R2', 'U2'
- Returns
- Float or pandas DataFrame
If no segment column is given, the average metric value on the forecast horizon. It is based on validation partition.
If a segment column is given, a pandas DataFrame which contains the average metric value on the forecast horizon for each segment.
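Examples
A minimal usage sketch, using one of the metric names listed above:
>>> model.get_horizon_wide_metric(metric_name='SMAPE')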
- load_model(schema_name, table_name, oid=None)
Loads the model from a table.
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- oid: str, optional
If the table contains several models, the OID must be given as an identifier. If it is not provided, the whole table is read.
Notes
Before using a reloaded model for a new prediction, set the following parameters again: 'time_column_name', 'target'. The SAP HANA ML library needs these parameters to prepare the dataset view. Otherwise, methods such as forecast() and predict() will fail.
Examples
>>> # Sets time_column_name and target again
>>> model = AutoTimeSeries(conn_context=CONN, time_column_name='Date', target='Cash')
>>> model.load_model(schema_name='MY_SCHEMA', table_name='MY_MODEL_TABLE')
>>> model.predict(hana_df,
...               apply_horizon=(NB_HORIZON_TRAIN + 5),
...               apply_last_time_point=LAST_TRAIN_DATE)
- export_apply_code(code_type, key=None, label=None, schema_name=None, table_name=None, other_params=None)
Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL. See Function Reference > Predictive Model Services > EXPORT_APPLY_CODE in the SAP HANA APL Developer Guide.
- Parameters
- code_type: str
The type of code exported. The supported code types for each type of model are given in the developer guide (see APL/CodeType).
- key: str, optional
The name of the primary key column. Required for some code types.
- label: str, optional
The name of the label (target) column. Used only when the model supports multiple targets. When left empty (""), all targets are generated.
- schema_name: str, optional
The schema name of the apply-in table. Required for some code types.
- table_name: str, optional
The apply-in table name. Required for some code types.
- other_params: dict, optional
The additional parameters to be included in the configuration. The available parameters are given in the developer guide.
- Returns
- The exported code: str
Examples
Exporting a model in JSON format (available for Gradient Boosting and Robust Regression)
>>> json_export = model.export_apply_code('JSON')
APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.
Exporting SQL apply code (available for Robust Regression and Clustering)
>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS')
Exporting SQL apply code (probability generated in the output)
>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS',
...     other_params={'APL/ApplyExtraMode': 'Advanced Apply Settings',
...                   'APL/ApplyProba': 'true'})
- build_report(segment_name=None)
Builds the model report.
- Parameters
- segment_name: str, optional
If the model is segmented, the segment name for which the report will be built.
- generate_html_report(filename)
Saves the model report as an HTML file.
- Parameters
- filename: str
The HTML file name.
- generate_notebook_iframe_report()
Renders the model report as a notebook iframe.
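Examples
A typical reporting sequence, assuming a fitted model; the file name is illustrative:
>>> model.build_report()
>>> model.generate_html_report(filename='apl_model_report')
>>> model.generate_notebook_iframe_report()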
- disable_hana_execution()
HANA execution will be disabled and only SQL script will be generated.
- enable_hana_execution()
HANA execution will be enabled.
- get_apl_version()
Gets the version and configuration information about the installation of SAP HANA APL.
- Returns
- A pandas DataFrame with detailed information about the current version.
Notes
Error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.
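Examples
A quick check sketch; the result is a pandas DataFrame as described above:
>>> version_info = model.get_apl_version()
>>> print(version_info)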
- get_artifacts_recorder()
Returns the object recorder (for design-time artifact generation).
- get_debrief_report(report_name)
Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Developer Guide.
- Parameters
- report_name: str
- Returns
- Statistical report: hana_ml DataFrame
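Examples
A usage sketch; the report name below is an assumption based on the developer guide's naming and may differ in your APL version:
>>> report = model.get_debrief_report('TimeSeries_Performance')
>>> report.collect()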
- get_fit_operation_log()
Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs of the last model training
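Examples
A usage sketch; collect() materializes the log locally as a pandas DataFrame:
>>> fit_log = model.get_fit_operation_log()
>>> fit_log.collect()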
- get_indicators()
Retrieves the Indicator table after model training.
- Returns
- The reference to the INDICATORS table: hana_ml DataFrame
This table provides the performance metrics of the last model training
- get_model_info()
Get information about an existing model. This method is especially useful when a trained model was saved and reloaded. After calling this method, the model can provide its summary and metrics again, as they were after the last fit.
- Returns
- list
List of HANA DataFrames respectively corresponding to the following tables:
Summary table
Variable roles table
Variable description table
Indicators table
Profit curves table
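Examples
A sketch of unpacking the returned list, in the table order documented above:
>>> summary, var_roles, var_desc, indicators, profit_curves = model.get_model_info()
>>> summary.collect()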
- get_params()
Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.
- Returns
- The attribute values of the model: dictionary
- get_predict_operation_log()
Retrieves the operation log table after a prediction.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs about the last prediction
- get_summary()
Retrieves the summary table after model training.
- Returns
- The reference to the SUMMARY table: hana_ml DataFrame
This contains execution summary of the last model training
- is_fitted()
Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.
- Returns
- bool
True if the model is ready to be saved.
- save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)
Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.
- Parameters
- artifact_df: hana_ml DataFrame
The artifact created after the fit or predict methods are called
- schema_name: str
The schema name
- table_name: str
The table name
- if_exists: str, {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises a ValueError
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str, optional
If it is given, it will be inserted as a new OID value. It is useful when one would like to save data into the same table.
Examples
>>> myModel.save_artifact(artifact_df=myModel.indicators_,
...                       schema_name='MySchema',
...                       table_name='MyModel_Indicators',
...                       if_exists='replace')
- save_model(schema_name, table_name, if_exists='fail', new_oid=None)
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).
Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
- Parameters
- schema_name: str
The schema name
- table_name: str
Table name
- if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises an Error
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
- Returns
- None
The model is saved into a table with the following columns:
"OID" NVARCHAR(50), -- Serve as ID
"FORMAT" NVARCHAR(50), -- APL technical info
"LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model
- schedule_fit(output_table_name_model, output_table_name_log, output_table_name_summary, output_table_name_indicators, **schedule_kwargs)
Creates a HANA scheduler job for the model fitting. It is a wrapper around the HANAScheduler.create_training_schedule() method that lets users specify arguments, such as the output table names, more conveniently.
- Parameters
- output_table_name_model: str
The output table name for the model binary.
- output_table_name_log: str
The output table name for the log data.
- output_table_name_summary: str
The output table name for the model summary.
- output_table_name_indicators: str
The output table name for the model indicators.
- schedule_kwargs: kwargs dictionary
Arguments forwarded to the HANAScheduler.create_training_schedule() method; please refer to its documentation.
Examples
>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.schedule_fit(
...     output_table_name_model='OUTPUT_MODEL_BINARY',
...     output_table_name_log='OUTPUT_FIT_LOG',
...     output_table_name_summary='OUTPUT_SUMMARY_LOG',
...     output_table_name_indicators='OUTPUT_FIT_INDICATORS',
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)
- schedule_predict(output_table_name_applyout, output_table_name_log, output_table_name_summary, **schedule_kwargs)
Creates a HANA scheduler job for the model prediction. It is a wrapper around the HANAScheduler.create_applying_schedule() method that lets users specify arguments, such as the output table names, more conveniently.
- Parameters
- input_table_name_model: str
The input table name for the model binary.
- output_table_name_applyout: str
The output table name for the prediction data.
- output_table_name_log: str
The output table name for the log data.
- output_table_name_summary: str
The output table name for the model summary.
- schedule_kwargs: kwargs dictionary
Arguments forwarded to the HANAScheduler.create_applying_schedule() method; please refer to its documentation.
Examples
>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.predict(data)
>>> model.schedule_predict(
...     output_table_name_applyout="OUTPUT_PREDICT_APPLYOUT",
...     output_table_name_log="OUTPUT_PREDICT_LOG",
...     output_table_name_summary="OUTPUT_PREDICT_SUMMARY",
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)
- set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)
Specifies hints for a scale-out environment. The execution of APL functions can then be routed to a specific computation node. If all the parameters are None (i.e., not given), all existing hints are cleared.
- Parameters
- route_to: str, optional
Routes the query to the specified volume ID or service type.
Defaults to None.
- no_route_to: str or list of str, optional
Avoids query routing to a specified volume ID or service type.
Defaults to None.
- route_by: str, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s).
Defaults to None.
- route_by_cardinality: str or list of str, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.
- data_transfer_cost: int, optional
Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.
Defaults to None.
- route_optimization_level: {'minimal', 'all'}, optional
Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level 'all'. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.
Defaults to None.
- workload_class: str, optional
Routes the query via workload class. The route_to statement hint has higher precedence than the workload_class statement hint.
Defaults to None.
Examples
>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> model = GradientBoostingBinaryClassifier()
>>> # Routes the execution to a specific volume ID.
>>> model.set_scale_out(route_to=1025)
>>> # Routes the execution to a specific service type.
>>> # model.set_scale_out(route_to='computeserver')
>>> # Maps the execution to a specific workload class.
>>> # model.set_scale_out(workload_class="WC4")
>>> # Activates the SQL trace.
>>> # connection_context.sql_tracer.enable_sql_trace(True)
>>> model.fit(data=hdb_df, key='KEY', label='Y')
>>> # You can check whether the queries were effectively routed by querying:
>>> # select HOST, VOLUME_ID, APPLICATION_NAME, STATEMENT_STRING from M_SQL_PLAN_CACHE.
hana_ml.algorithms.apl.classification
Deprecated, use hana_ml.algorithms.apl.gradient_boosting_classification instead.
This module provides the SAP HANA APL binary classification algorithm.
The following classes are available:
- class hana_ml.algorithms.apl.classification.AutoClassifier(conn_context=None, variable_auto_selection=True, polynomial_degree=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)
Bases: RobustRegressionBase
Deprecated, use GradientBoostingBinaryClassifier instead.
Legacy SAP HANA APL Binary Classifier algorithm.
- Parameters
- conn_context: ConnectionContext, optional
The connection object to an SAP HANA database. This parameter is not needed anymore. It will be set automatically when a dataset is used in fit() or predict().
- variable_auto_selection: bool, optional
When set to True, variable auto-selection is activated. Variable auto-selection enables you to maintain the performance of a model while keeping the lowest number of variables.
- polynomial_degree: int, optional
The polynomial degree of the model. Default is 1.
- variable_storages: dict, optional
Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.
- variable_value_types: dict, optional
Specifies the variable value type (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals to '???', it will be taken as missing.
- extra_applyout_settings: dict, optional
Defines other outputs the model should generate in addition to the predicted values. For example: {'APL/ApplyReasonCode':'3;Mean;Below;False'} will add reason codes in the output when the model is applied. These reason codes provide explanations about the prediction. See OPERATION_CONFIG parameters in the APPLY_MODEL function, SAP HANA APL Developer Guide.
- other_params: dict, optional
Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:
'max_tasks'
'segment_column_name'
'correlations_lower_bound'
'correlations_max_kept'
'cutting_strategy'
'exclude_low_predictive_confidence'
'risk_fitting'
'risk_fitting_min_cumulated_frequency'
'risk_fitting_nb_pdo'
'risk_fitting_use_weights'
'risk_gdo'
'risk_mode'
'risk_pdo'
'risk_score'
'score_bins_count'
'target_key'
'variable_selection_best_iteration'
'variable_selection_min_nb_of_final_variables'
'variable_selection_max_nb_of_final_variables'
'variable_selection_mode'
'variable_selection_nb_variables_removed_by_step'
'variable_selection_percentage_of_contribution_kept_by_step'
'variable_selection_quality_bar'
'variable_selection_quality_criteria'
See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.
For 'max_tasks', see FUNC_HEADER.
- other_train_apl_aliases: dict, optional
Users can provide APL aliases as advanced settings to the model. Unlike 'other_params' described above, users are free to input any possible value. There is no validation in Python.
Notes
It is highly recommended to provide a key in the dataset used by the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a key, because the model will not expect it.
By default, when it is not given, SAP HANA APL guesses the variable description by reading the first 100 rows. But, sometimes, it does not provide the correct result. By specifically providing values in these parameters, the user can override the default guess. For example:
model.set_params(variable_storages={
    'ID': 'integer',
    'sepal length (cm)': 'number'
})
model.set_params(variable_value_types={
    'sepal length (cm)': 'continuous'
})
model.set_params(variable_missing_strings={
    'sepal length (cm)': '-1'
})
model.set_params(extra_applyout_settings={
    'APL/ApplyReasonCode': '3;Mean;Below;False'
})
Examples
>>> from hana_ml.algorithms.apl.classification import AutoClassifier
>>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA
>>> CONN = ConnectionContext(HDB_HOST, HDB_PORT, HDB_USER, HDB_PASS)
>>> # -- Creates hana_ml DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')
Creating and fitting the model
>>> model = AutoClassifier(variable_auto_selection=True)
>>> model.fit(hana_df, label='class', key='id')
Making the predictions
>>> apply_out = model.predict(hana_df)
>>> print(apply_out.head(3).collect())
   id  TRUE_LABEL  PREDICTED  PROBABILITY
0  30           0          0     0.688153
1  63           0          0     0.677693
2  66           0          0     0.700221
Adding individual contributions to the output of predictions
>>> model.set_params(extra_applyout_settings={'APL/ApplyContribution': 'all'})
>>> apply_out = model.predict(hana_df)
>>> print(apply_out.head(3).collect())
   id  TRUE_LABEL  PREDICTED  PROBABILITY  contrib_age_rr_class  ...
0  30           0          0     0.688153              0.043387  ...
1  63           0          0     0.677693              0.042608  ...
2  66           0          0     0.700221              0.020784  ...
Adding reason codes to the output of predictions
>>> model.set_params(extra_applyout_settings={'APL/ApplyReasonCode': '3;Mean;Below;False'})
>>> apply_out = model.predict(hana_df)
>>> print(apply_out.head(3).collect())
   id  TRUE_LABEL  PREDICTED  PROBABILITY  RCN_B_Mean_1_rr_class  ...
0  30           0          0     0.688153          education-num  ...
1  63           0          0     0.677693          education-num  ...
2  66           0          0     0.700221          education-num  ...
Debriefing
>>> model.get_performance_metrics()
OrderedDict([('L1', 0.2522171212463023), ('L2', 0.32254434028379236), ...
>>> model.get_feature_importances()
OrderedDict([('marital-status', 0.2172766583204266), ('capital-gain', 0.19521247617062215), ...
Saving the model in the schema named 'MODEL_STORAGE'. Please see model_storage class for further features of model storage.
>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My classification model name'
>>> model_storage.save_model(model=model, if_exists='replace')
Exporting the SQL apply code
>>> sql = model.export_apply_code(code_type='HANA',
...                               key='id',
...                               schema_name='APL_SAMPLES',
...                               table_name='CENSUS')
- Attributes
- model_: hana_ml DataFrame
The trained model content
- summary_: APLArtifactTable
The reference to the "SUMMARY" table generated by the model training. This table contains the content of the model training summary.
- indicators_: APLArtifactTable
The reference to the "INDICATORS" table generated by the model training. This table contains various metrics related to the model and model variables.
- fit_operation_logs_: APLArtifactTable
The reference to the "OPERATION_LOG" table generated by the model training
- var_desc_: APLArtifactTable
The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training
- applyout_: hana_ml DataFrame
The predictions generated the last time the model was applied
- predict_operation_logs_: APLArtifactTable
The reference to the "OPERATION_LOG" table when a prediction was made
Methods
- disable_hana_execution(): HANA execution will be disabled and only SQL script will be generated.
- enable_hana_execution(): HANA execution will be enabled.
- export_apply_code(code_type[, key, label, ...]): Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL.
- fit(data[, key, features, label, weight]): Fits the model.
- get_apl_version(): Gets the version and configuration information about the installation of SAP HANA APL.
- get_artifacts_recorder(): Returns the object recorder (for design-time artifact generation).
- get_debrief_report(report_name): Retrieves a standard statistical report.
- get_feature_importances(): Returns the feature importances (MaximumSmartVariableContribution).
- get_fit_operation_log(): Retrieves the operation log table after the model training.
- get_indicators(): Retrieves the Indicator table after model training.
- get_model_info(): Gets information about an existing model.
- get_params(): Retrieves attributes of the current object.
- get_performance_metrics(): Returns the performance metrics of the last trained model.
- get_predict_operation_log(): Retrieves the operation log table after a prediction.
- get_summary(): Retrieves the summary table after model training.
- is_fitted(): Checks if the model can be saved.
- load_model(schema_name, table_name[, oid]): Loads the model from a table.
- predict(data): Makes predictions with the fitted model.
- save_artifact(artifact_df, schema_name, ...): Saves an artifact, a temporary table, into a permanent table.
- save_model(schema_name, table_name[, ...]): Saves the model into a table.
- schedule_fit(output_table_name_model, ...): Creates a HANA scheduler job for the model fitting.
- schedule_predict(output_table_name_applyout, ...): Creates a HANA scheduler job for the model prediction.
- score(data): Returns the accuracy score on the provided test dataset.
- set_params(**parameters): Sets attributes of the current model.
- set_scale_out([route_to, no_route_to, ...]): Specifies hints for scaling-out environment.
- fit(data, key=None, features=None, label=None, weight=None)
Fits the model.
- Parameters
- data: DataFrame
The training dataset
- key: str, optional
The name of the ID column. This column will not be used as a feature in the model. It will be output as the row-id when prediction is made with the model. If key is not provided, an internal key is created, but this is not recommended usage. See notes below.
- features: list of str, optional
The names of the features to be used in the model. If features is not provided, all non-ID and non-label columns will be taken.
- label: str, optional
The name of the label column. Default is the last column.
- weight: str, optional
The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.
- Returns
- self: object
Notes
It is highly recommended to use a dataset with a key in the fit() method. If not, once the model is trained, it will no longer be possible to use the predict() method with a keyed dataset, because the model will not expect it.
- predict(data)
Makes predictions with the fitted model. It is possible to add special outputs, such as reason codes, by specifying the extra_applyout_settings parameter in the model. This parameter is explained above in the model class section.
- Parameters
- data: hana_ml DataFrame
The dataset used for prediction
- Returns
- Prediction output: hana_ml DataFrame
The dataframe contains the following columns:
KEY : the key column if it was provided in the dataset
TRUE_LABEL : the class label when it was given in the dataset
PREDICTED : the predicted label
PROBABILITY : the probability of the predicted label to be correct (confidence)
SCORING_VALUE : the unnormalized scoring value
- score(data)
Returns the accuracy score on the provided test dataset.
- Parameters
- data: hana_ml DataFrame
The test dataset used to compute the score. The labels must be provided in the dataset.
- Returns
- Float or pandas DataFrame
If no segment column is given, the accuracy score.
If a segment column is given, a pandas DataFrame which contains the accuracy score for each segment.
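Examples
A minimal sketch, reusing the CENSUS dataset from the examples above; the printed value is illustrative:
>>> accuracy = model.score(hana_df)
>>> print(accuracy)
0.81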
- disable_hana_execution()
HANA execution will be disabled and only SQL script will be generated.
- enable_hana_execution()
HANA execution will be enabled.
- export_apply_code(code_type, key=None, label=None, schema_name=None, table_name=None, other_params=None)
Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL. See Function Reference > Predictive Model Services > EXPORT_APPLY_CODE in the SAP HANA APL Developer Guide.
- Parameters
- code_type: str
The type of code exported. The supported code types for each type of model are given in the developer guide (see APL/CodeType).
- key: str, optional
The name of the primary key column. Required for some code types.
- label: str, optional
The name of the label (target) column. Used only when the model supports multiple targets. When left empty (""), all targets are generated.
- schema_name: str, optional
The schema name of the apply-in table. Required for some code types.
- table_name: str, optional
The apply-in table name. Required for some code types.
- other_params: dict, optional
The additional parameters to be included in the configuration. The available parameters are given in the developer guide.
- Returns
- The exported code: str
Examples
Exporting a model in JSON format (available for Gradient Boosting and Robust Regression)
>>> json_export = model.export_apply_code('JSON')
APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.
Exporting SQL apply code (available for Robust Regression and Clustering)
>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS')
Exporting SQL apply code (probability generated in the output)
>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS',
...     other_params={'APL/ApplyExtraMode': 'Advanced Apply Settings',
...                   'APL/ApplyProba': 'true'})
- get_apl_version()
Gets the version and configuration information about the installation of SAP HANA APL.
- Returns
- A pandas DataFrame with detailed information about the current version.
Notes
Error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.
- get_artifacts_recorder()
Returns the object recorder (for design-time artifact generation).
- get_debrief_report(report_name)
Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Developer Guide.
- Parameters
- report_name: str
- Returns
- Statistical report: hana_ml DataFrame
- get_feature_importances()
Returns the feature importances (MaximumSmartVariableContribution).
- Returns
- OrderedDict or pandas DataFrame
If no segment column is given, an OrderedDict: { feature_name : value }.
If a segment column is given, a pandas DataFrame which contains the feature importances for each segment.
- get_fit_operation_log()
Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs of the last model training
- get_indicators()
Retrieves the Indicator table after model training.
- Returns
- The reference to the INDICATORS table: hana_ml DataFrame
This table provides the performance metrics of the last model training
- get_model_info()
Get information about an existing model. This method is especially useful when a trained model was saved and reloaded. After calling this method, the model can provide its summary and metrics again, as they were after the last fit.
- Returns
- list
List of HANA DataFrames respectively corresponding to the following tables:
Summary table
Variable roles table
Variable description table
Indicators table
Profit curves table
- get_params()
Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.
- Returns
- The attribute values of the model: dictionary
- get_performance_metrics()
Returns the performance metrics of the last trained model.
- Returns
- OrderedDict or pandas DataFrame
If no segment column is given, an OrderedDict with metric name as key and metric value as value.
If a segment column is given, a pandas DataFrame which contains the performance metrics for each segment.
- get_predict_operation_log()
Retrieves the operation log table after a prediction.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs about the last prediction
- get_summary()
Retrieves the summary table after model training.
- Returns
- The reference to the SUMMARY table: hana_ml DataFrame
This contains execution summary of the last model training
- is_fitted()
Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.
- Returns
- bool
True if the model is ready to be saved.
- load_model(schema_name, table_name, oid=None)
Loads the model from a table.
Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- oid: str, optional
If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.
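Examples
Since this method is deprecated, here is a sketch of reloading through ModelStorage; the model name and version are illustrative:
>>> from hana_ml.model_storage import ModelStorage
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model = model_storage.load_model(name='My classification model name', version=1)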
- save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)
Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.
- Parameters
- artifact_df: hana_ml DataFrame
The artifact created after the fit or predict methods are called
- schema_name: str
The schema name
- table_name: str
The table name
- if_exists: str, {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises a ValueError
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str, optional
If it is given, it will be inserted as a new OID value. It is useful when one would like to save data into the same table.
Examples
>>> myModel.save_artifact(artifact_df=myModel.indicators_,
...                       schema_name='MySchema',
...                       table_name='MyModel_Indicators',
...                       if_exists='replace')
- save_model(schema_name, table_name, if_exists='fail', new_oid=None)
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).
Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
- Parameters
- schema_name: str
The schema name
- table_name: str
Table name
- if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises an Error
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
- Returns
- None
The model is saved into a table with the following columns:
"OID" NVARCHAR(50), -- Serve as ID
"FORMAT" NVARCHAR(50), -- APL technical info
"LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model
- schedule_fit(output_table_name_model, output_table_name_log, output_table_name_summary, output_table_name_indicators, **schedule_kwargs)
Creates a HANA scheduler job for the model fitting. It is a wrapper around the HANAScheduler.create_training_schedule() method that lets users specify arguments, such as the output table names, more conveniently.
- Parameters
- output_table_name_model: str
The output table name for the model binary.
- output_table_name_log: str
The output table name for the log data.
- output_table_name_summary: str
The output table name for the model summary.
- output_table_name_indicators: str
The output table name for the model indicators.
- schedule_kwargs: kwargs dictionary
Arguments forwarded to the HANAScheduler.create_training_schedule() method; please refer to its documentation.
Examples
>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.schedule_fit(
...     output_table_name_model='OUTPUT_MODEL_BINARY',
...     output_table_name_log='OUTPUT_FIT_LOG',
...     output_table_name_summary='OUTPUT_SUMMARY_LOG',
...     output_table_name_indicators='OUTPUT_FIT_INDICATORS',
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)
- schedule_predict(output_table_name_applyout, output_table_name_log, output_table_name_summary, **schedule_kwargs)
Creates a HANA scheduler job for the model prediction. It is a wrapper around the HANAScheduler.create_applying_schedule() method that lets users specify arguments, such as the output table names, more conveniently.
- Parameters
- input_table_name_model: str
The input table name for the model binary.
- output_table_name_applyout: str
The output table name for the prediction data.
- output_table_name_log: str
The output table name for the log data.
- output_table_name_summary: str
The output table name for the model summary.
- schedule_kwargs: kwargs dictionary
Arguments forwarded to the HANAScheduler.create_applying_schedule() method; please refer to its documentation.
Examples
>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.predict(data)
>>> model.schedule_predict(
...     output_table_name_applyout="OUTPUT_PREDICT_APPLYOUT",
...     output_table_name_log="OUTPUT_PREDICT_LOG",
...     output_table_name_summary="OUTPUT_PREDICT_SUMMARY",
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)
- set_params(**parameters)
Sets attributes of the current model. This method is implemented for compatibility with Scikit Learn.
- Parameters
- parameters: dictionary
The attribute names and values
- set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)
Specifies hints for a scale-out environment. The execution of APL functions can then be routed to a specific computation node. If all the parameters are None (i.e., not given), all existing hints are cleared.
- Parameters
- route_to: str, optional
Routes the query to the specified volume ID or service type.
Defaults to None.
- no_route_to: str or list of str, optional
Avoids query routing to a specified volume ID or service type.
Defaults to None.
- route_by: str, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s).
Defaults to None.
- route_by_cardinality: str or list of str, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.
- data_transfer_cost: int, optional
Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.
Defaults to None.
- route_optimization_level: {'minimal', 'all'}, optional
Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level 'all'. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.
Defaults to None.
- workload_class: str, optional
Routes the query via workload class. The route_to statement hint has higher precedence than the workload_class statement hint.
Defaults to None.
Examples
>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> model = GradientBoostingBinaryClassifier()
>>> # Routes the execution to a specific volume ID.
>>> model.set_scale_out(route_to=1025)
>>> # Routes the execution to a specific service type.
>>> # model.set_scale_out(route_to='computeserver')
>>> # Maps the execution to a specific workload class.
>>> # model.set_scale_out(workload_class="WC4")
>>> # Activates the SQL trace.
>>> # connection_context.sql_tracer.enable_sql_trace(True)
>>> model.fit(data=hdb_df, key='KEY', label='Y')
>>> # You can check whether the queries were effectively routed by querying:
>>> # select HOST, VOLUME_ID, APPLICATION_NAME, STATEMENT_STRING from M_SQL_PLAN_CACHE.
hana_ml.algorithms.apl.regression
Deprecated, use hana_ml.algorithms.apl.gradient_boosting_regression instead.
This module contains the SAP HANA APL regression algorithm.
The following classes are available:
- class hana_ml.algorithms.apl.regression.AutoRegressor(conn_context=None, variable_auto_selection=True, polynomial_degree=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)
Bases: RobustRegressionBase
Deprecated, use GradientBoostingRegressor instead.
Legacy SAP HANA APL regression algorithm.
- Parameters
- conn_context: ConnectionContext, optional
The connection object to an SAP HANA database. This parameter is not needed anymore. It will be set automatically when a dataset is used in fit() or predict().
- variable_auto_selection: bool, optional
When set to True, variable auto-selection is activated. Variable auto-selection enables you to maintain the performance of a model while keeping the lowest number of variables.
- polynomial_degree: int, optional
The polynomial degree of the model. Default is 1.
- variable_storages: dict, optional
Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.
- variable_value_types: dict, optional
Specifies the variable value type (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals to '???', it will be taken as missing.
- extra_applyout_settings: dict, optional
Defines other outputs the model should generate in addition to the predicted values. For example: {'APL/ApplyReasonCode':'3;Mean;Below;False'} will add reason codes in the output when the model is applied. These reason codes provide explanations about the prediction. See OPERATION_CONFIG parameters in the APPLY_MODEL function, SAP HANA APL Developer Guide.
- other_params: dict, optional
Corresponds to advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:
'max_tasks'
'segment_column_name'
'correlations_lower_bound'
'correlations_max_kept'
'cutting_strategy'
'exclude_low_predictive_confidence'
'risk_fitting'
'risk_fitting_min_cumulated_frequency'
'risk_fitting_nb_pdo'
'risk_fitting_use_weights'
'risk_gdo'
'risk_mode'
'risk_pdo'
'risk_score'
'score_bins_count'
'variable_auto_selection'
'variable_selection_best_iteration'
'variable_selection_min_nb_of_final_variables'
'variable_selection_max_nb_of_final_variables'
'variable_selection_mode'
'variable_selection_nb_variables_removed_by_step'
'variable_selection_percentage_of_contribution_kept_by_step'
'variable_selection_quality_bar'
'variable_selection_quality_criteria'
See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.
For 'max_tasks', see FUNC_HEADER.
- other_train_apl_aliases: dict, optional
Users can provide APL aliases as advanced settings to the model. Unlike 'other_params' described above, users are free to input any possible value. There is no validation in Python.
Notes
It is highly recommended to provide a key in the dataset used by the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a key, because the model will not expect it.
Examples
>>> from hana_ml.algorithms.apl.regression import AutoRegressor
>>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA Database
>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates Hana DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')
Creating and fitting the model
>>> model = AutoRegressor(variable_auto_selection=True)
>>> model.fit(hana_df, label='age', key='id',
...           features=['workclass',
...                     'fnlwgt',
...                     'education',
...                     'education-num',
...                     'marital-status'])
Making a prediction
>>> applyout_df = model.predict(hana_df)
>>> print(applyout_df.head(5).collect())
    id  TRUE_LABEL  PREDICTED
0   30          49         42
1   63          48         42
2   66          36         42
3  110          42         42
4  335          53         42
Debriefing
>>> model.get_performance_metrics()
OrderedDict([('L1', 8.59885654599923), ('L2', 11.012352163260505)...
>>> model.get_feature_importances()
OrderedDict([('marital-status', 0.7916100739306074), ('education-num', 0.13524836400650087)
Saving the model in the schema named 'MODEL_STORAGE'. Please see the model_storage class for further features of model storage.
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My regression model name'
>>> model_storage.save_model(model=model, if_exists='replace')
Reloading the model and making another prediction
>>> model2 = AutoRegressor(conn_context=CONN)
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.head(5).collect()
    id  TRUE_LABEL  PREDICTED
0   30          49         42
1   63          48         42
2   66          36         42
3  110          42         42
4  335          53         42
Exporting the SQL apply code
>>> sql = model.export_apply_code(code_type='HANA',
...                               key='id',
...                               schema_name='APL_SAMPLES',
...                               table_name='CENSUS')
Methods
- disable_hana_execution(): HANA execution will be disabled and only SQL script will be generated.
- enable_hana_execution(): HANA execution will be enabled.
- export_apply_code(code_type[, key, label, ...]): Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL.
- fit(data[, key, features, label, weight]): Fits the model.
- get_apl_version(): Gets the version and configuration information about the installation of SAP HANA APL.
- get_artifacts_recorder(): Returns the object recorder (for design-time artifact generation).
- get_debrief_report(report_name): Retrieves a standard statistical report.
- get_feature_importances(): Returns the feature importances (MaximumSmartVariableContribution).
- get_fit_operation_log(): Retrieves the operation log table after the model training.
- get_indicators(): Retrieves the Indicator table after model training.
- get_model_info(): Gets information about an existing model.
- get_params(): Retrieves attributes of the current object.
- get_performance_metrics(): Returns the performance metrics of the last trained model.
- get_predict_operation_log(): Retrieves the operation log table after a prediction.
- get_summary(): Retrieves the summary table after model training.
- is_fitted(): Checks if the model can be saved.
- load_model(schema_name, table_name[, oid]): Loads the model from a table.
- predict(data): Makes predictions with a fitted model.
- save_artifact(artifact_df, schema_name, ...): Saves an artifact, a temporary table, into a permanent table.
- save_model(schema_name, table_name[, ...]): Saves the model into a table.
- schedule_fit(output_table_name_model, ...): Creates a HANA scheduler job for the model fitting.
- schedule_predict(output_table_name_applyout, ...): Creates a HANA scheduler job for the model prediction.
- score(data): Returns the R2 score (coefficient of determination) on the provided test dataset.
- set_params(**parameters): Sets attributes of the current model.
- set_scale_out([route_to, no_route_to, ...]): Specifies hints for scaling-out environment.
- fit(data, key=None, features=None, label=None, weight=None)
Fits the model.
- Parameters
- data: DataFrame
The training dataset
- key: str, optional
The name of the ID column. If key is not provided, it is assumed that the input has no ID column.
- features: list of str, optional
Names of the feature columns. If features is not provided, all non-ID and non-label columns will be taken by default.
- label: str, optional
The name of the label column. Default is the last column.
- weight: str, optional
The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.
- Returns
- self: object
Notes
It is highly recommended to use a dataset with a key provided in the fit() method. If not, once the model is trained, it will no longer be possible to use the predict() method with a keyed dataset, because the model will not expect it.
- predict(data)
Makes predictions with a fitted model. It is possible to add special outputs, such as reason codes, by specifying the extra_applyout_settings parameter in the model. This parameter is explained above in the model class section.
- Parameters
- data: hana_ml DataFrame
The dataset used for prediction
- Returns
- Prediction output: a hana_ml DataFrame.
The dataframe contains the following columns:
KEY : the key column if it was provided in the dataset
TRUE_LABEL : the true value if it was provided in the dataset
PREDICTED : the predicted value
- score(data)
Returns the R2 score (coefficient of determination) on the provided test dataset.
- Parameters
- data: hana_ml DataFrame
The test dataset used to compute the score. The labels must be provided in the dataset.
- Returns
- Float or pandas DataFrame
If no segment column is given, the R2 score.
If a segment column is given, a pandas DataFrame which contains the R2 score for each segment.
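Examples
A minimal sketch; the returned R2 value is illustrative:
>>> r2 = model.score(hana_df)
>>> print(r2)
0.72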
- disable_hana_execution()
HANA execution will be disabled and only SQL script will be generated.
- enable_hana_execution()
HANA execution will be enabled.
- export_apply_code(code_type, key=None, label=None, schema_name=None, table_name=None, other_params=None)
Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL. See Function Reference > Predictive Model Services > EXPORT_APPLY_CODE in the SAP HANA APL Developer Guide.
- Parameters
- code_type: str
The type of code exported. The supported code types for each type of model are given in the developer guide (see APL/CodeType).
- key: str, optional
The name of the primary key column. Required for some code types.
- label: str, optional
The name of the label (target) column. Used only when the model supports multiple targets. When left empty (""), all targets are generated.
- schema_name: str, optional
The schema name of the apply-in table. Required for some code types.
- table_name: str, optional
The apply-in table name. Required for some code types.
- other_params: dict, optional
The additional parameters to be included in the configuration. The available parameters are given in the developer guide.
- Returns
- The exported code: str
Examples
Exporting a model in JSON format (available for Gradient Boosting and Robust Regression)
>>> json_export = model.export_apply_code('JSON')
APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.
Exporting SQL apply code (available for Robust Regression and Clustering)
>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS')
Exporting SQL apply code (probability generated in the output)
>>> sql = model.export_apply_code(
...     code_type='HANA',
...     key='id',
...     schema_name='APL_SAMPLES',
...     table_name='CENSUS',
...     other_params={'APL/ApplyExtraMode': 'Advanced Apply Settings',
...                   'APL/ApplyProba': 'true'})
- get_apl_version()
Gets the version and configuration information about the installation of SAP HANA APL.
- Returns
- A pandas DataFrame with detailed information about the current version.
Notes
Error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.
- get_artifacts_recorder()
Returns the object recorder (for design-time artifact generation).
- get_debrief_report(report_name)
Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Developer Guide.
- Parameters
- report_name: str
- Returns
- Statistical report: hana_ml DataFrame
- get_feature_importances()
Returns the feature importances (MaximumSmartVariableContribution).
- Returns
- OrderedDict or pandas DataFrame
If no segment column is given, an OrderedDict: { feature_name : value }.
If a segment column is given, a pandas DataFrame which contains the feature importances for each segment.
- get_fit_operation_log()
Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs of the last model training
- get_indicators()
Retrieves the Indicator table after model training.
- Returns
- The reference to the INDICATORS table: hana_ml DataFrame
This table provides the performance metrics of the last model training
- get_model_info()
Get information about an existing model. This method is especially useful when a trained model was saved and reloaded. After calling this method, the model can provide its summary and metrics again, as they were after the last fit.
- Returns
- list
List of HANA DataFrames respectively corresponding to the following tables:
Summary table
Variable roles table
Variable description table
Indicators table
Profit curves table
- get_params()
Retrieves attributes of the current object. This method is implemented for compatibility with Scikit Learn.
- Returns
- The attribute values of the model: dictionary
- get_performance_metrics()
Returns the performance metrics of the last trained model.
- Returns
- OrderedDict or pandas DataFrame
If no segment column is given, an OrderedDict with metric name as key and metric value as value.
If a segment column is given, a pandas DataFrame which contains the performance metrics for each segment.
- get_predict_operation_log()
Retrieves the operation log table after a prediction.
- Returns
- The reference to the OPERATION_LOG table: hana_ml DataFrame
This table provides detailed logs about the last prediction
- get_summary()
Retrieves the summary table after model training.
- Returns
- The reference to the SUMMARY table: hana_ml DataFrame
This contains execution summary of the last model training
- is_fitted()
Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.
- Returns
- bool
True if the model is ready to be saved.
- load_model(schema_name, table_name, oid=None)
Loads the model from a table.
Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- oid: str, optional
If the table contains several models, the oid must be given as an identifier. If it is not provided, the whole table is read.
- save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)
Saves an artifact, a temporary table, into a permanent table. The model has to be trained or fitted beforehand.
- Parameters
- artifact_df: hana_ml DataFrame
The artifact created after the fit or predict methods are called
- schema_name: str
The schema name
- table_name: str
The table name
- if_exists: str, {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises a ValueError
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str, optional
If it is given, it will be inserted as a new OID value. It is useful when one would like to save data into the same table.
Examples
>>> myModel.save_artifact(artifact_df=myModel.indicators_,
...                       schema_name='MySchema',
...                       table_name='MyModel_Indicators',
...                       if_exists='replace')
- save_model(schema_name, table_name, if_exists='fail', new_oid=None)
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).
Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
- Parameters
- schema_name: str
The schema name
- table_name: str
Table name
- if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises an Error
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
- Returns
- None
The model is saved into a table with the following columns:
"OID" NVARCHAR(50), -- Serve as ID
"FORMAT" NVARCHAR(50), -- APL technical info
"LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model
- schedule_fit(output_table_name_model, output_table_name_log, output_table_name_summary, output_table_name_indicators, **schedule_kwargs)
Creates a HANA scheduler job for the model fitting. It is a wrapper around the HANAScheduler.create_training_schedule() method that lets users specify arguments, such as the output table names, more conveniently.
- Parameters
- output_table_name_model: str
The output table name for the model binary.
- output_table_name_log: str
The output table name for the log data.
- output_table_name_summary: str
The output table name for the model summary.
- output_table_name_indicators: str
The output table name for the model indicators.
- schedule_kwargs: kwargs dictionary
Arguments forwarded to the HANAScheduler.create_training_schedule() method; please refer to its documentation.
Examples
>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.schedule_fit(
...     output_table_name_model='OUTPUT_MODEL_BINARY',
...     output_table_name_log='OUTPUT_FIT_LOG',
...     output_table_name_summary='OUTPUT_SUMMARY_LOG',
...     output_table_name_indicators='OUTPUT_FIT_INDICATORS',
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)
- schedule_predict(output_table_name_applyout, output_table_name_log, output_table_name_summary, **schedule_kwargs)
Creates a HANA scheduler job for the model prediction. It is a wrapper around the HANAScheduler.create_applying_schedule() method that lets users specify arguments, such as the output table names, more conveniently.
- Parameters
- input_table_name_model: str
The input table name for the model binary.
- output_table_name_applyout: str
The output table name for the prediction data.
- output_table_name_log: str
The output table name for the log data.
- output_table_name_summary: str
The output table name for the model summary.
- schedule_kwargs: kwargs dictionary
Arguments forwarded to the HANAScheduler.create_applying_schedule() method; please refer to its documentation.
Examples
>>> model = GradientBoostingBinaryClassifier()
>>> model.fit(data=data, key='id', label='class')
>>> model.predict(data)
>>> model.schedule_predict(
...     output_table_name_applyout="OUTPUT_PREDICT_APPLYOUT",
...     output_table_name_log="OUTPUT_PREDICT_LOG",
...     output_table_name_summary="OUTPUT_PREDICT_SUMMARY",
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)
- set_params(**parameters)
Sets attributes of the current model. This method is implemented for compatibility with Scikit Learn.
- Parameters
- parameters: dictionary
The attribute names and values
- set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)
Specifies hints for a scale-out environment. The execution of APL functions can then be routed to a specific computation node. If all the parameters are None (i.e., not given), all existing hints are cleared.
- Parameters
- route_to: str, optional
Routes the query to the specified volume ID or service type.
Defaults to None.
- no_route_to: str or list of str, optional
Avoids query routing to a specified volume ID or service type.
Defaults to None.
- route_by: str, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s).
Defaults to None.
- route_by_cardinality: str or list of str, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.
- data_transfer_cost: int, optional
Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.
Defaults to None.
- route_optimization_level: {'minimal', 'all'}, optional
Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level 'all'. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.
Defaults to None.
- workload_class: str, optional
Routes the query via workload class. The route_to statement hint has higher precedence than the workload_class statement hint.
Defaults to None.
Examples
>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> model = GradientBoostingBinaryClassifier()
>>> # Routes the execution to a specific volume ID.
>>> model.set_scale_out(route_to=1025)
>>> # Routes the execution to a specific service type.
>>> # model.set_scale_out(route_to='computeserver')
>>> # Maps the execution to a specific workload class.
>>> # model.set_scale_out(workload_class="WC4")
>>> # Activates the SQL trace.
>>> # connection_context.sql_tracer.enable_sql_trace(True)
>>> model.fit(data=hdb_df, key='KEY', label='Y')
>>> # You can check whether the queries were effectively routed by querying:
>>> # select HOST, VOLUME_ID, APPLICATION_NAME, STATEMENT_STRING from M_SQL_PLAN_CACHE.
hana_ml.algorithms.apl.clustering
This module provides the SAP HANA APL clustering algorithms.
The following classes are available:
- class hana_ml.algorithms.apl.clustering.AutoUnsupervisedClustering(conn_context=None, nb_clusters=None, nb_clusters_min=None, nb_clusters_max=None, distance=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)
Bases:
_AutoClusteringBase
SAP HANA APL unsupervised clustering algorithm.
- Parameters
- nb_clustersint, optional, default = 10
The number of clusters to create
- nb_clusters_min: int, optional
The minimum number of clusters to create. If the nb_clusters parameter is set, this parameter is ignored.
- nb_clusters_max: int, optional
The maximum number of clusters to create. If the nb_clusters parameter is set, this parameter is ignored.
- distance: str, optional, default = 'SystemDetermined'
The metric used to measure the distance between data points. The possible values are: 'L1', 'L2', 'LInf', 'SystemDetermined'.
- variable_storages: dict, optional
Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.
- variable_value_types: dict, optional
Specifies the variable value types (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals '???', it will be taken as missing.
- extra_applyout_settings: dict, optional
Defines the output to generate when applying the model. See documentation on predict() method for more information.
- other_params: dict, optional
Corresponds to the advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:
'max_tasks'
'segment_column_name'
'calculate_cross_statistics'
'calculate_sql_expressions'
'cutting_strategy'
'encoding_strategy'
See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.
For 'max_tasks', see FUNC_HEADER.
- other_train_apl_aliases: dict, optional
Users can provide APL aliases as advanced settings to the model. Unlike 'other_params' described above, users are free to input any possible value in 'other_train_apl_aliases'. There is no validation on the Python side.
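For instance, an advanced setting can be passed directly to the constructor (a sketch; the values are illustrative, and 'calculate_sql_expressions' is the same alias used in the SQL-export example below):
>>> model = AutoUnsupervisedClustering(nb_clusters_min=3, nb_clusters_max=8,
...                                    calculate_sql_expressions='enabled')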
Notes
The algorithm may detect fewer clusters than requested. This happens when a cluster detected on the estimation dataset was not found on the validation dataset. In that case, this cluster is considered unstable and is removed from the model. Users can get the number of clusters actually found in the "INDICATORS" table. For example,
# The actual number of clusters found
d = model_u.get_indicators().collect()
d[d.KEY=='FinalClusterCount'][['KEY','VALUE']]
It is highly recommended to provide a dataset with a key in the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a key, because the model will not expect one.
By default, when the variable description is not given, SAP HANA APL guesses it by reading the first 100 rows. Sometimes this guess is incorrect. By explicitly providing values for these parameters, the user can overwrite the default guess. For example:
model.set_params(variable_storages={
    'ID': 'integer',
    'sepal length (cm)': 'number'})
model.set_params(variable_value_types={
    'sepal length (cm)': 'continuous'})
model.set_params(variable_missing_strings={
    'sepal length (cm)': '-1'})
Examples
>>> from hana_ml.algorithms.apl.clustering import AutoUnsupervisedClustering
>>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA
>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates Hana DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')
Creating and fitting the model
>>> model = AutoUnsupervisedClustering(CONN, nb_clusters=5)
>>> model.fit(data=hana_df, key='id')
Debriefing
>>> model.get_metrics()
OrderedDict([('SimplifiedSilhouette', 0.3448029020802121), ('RSS', 4675.706587754118),...
>>> model.get_metrics_by_cluster()
{'Frequency': {1: 0.23053242076908276, 2: 0.27434649954646656, 3: 0.09628652318517908, 4: 0.29919463456199663, 5: 0.09963992193727494},
 'IntraInertia': {1: 0.6734978174937322, 2: 0.7202839995396123, 3: 0.5516800856975772, 4: 0.6969632183111357, 5: 0.5809322138167139},
 'RSS': {1: 5648.626195319932, 2: 7189.15459940487, 3: 1932.5353401986129, 4: 7586.444631316713, 5: 2105.879275085588},
 'SimplifiedSilhouette': {1: 0.1383827622819234, 2: 0.14716862328457128, 3: 0.18753797605134545, 4: 0.13679980173383793, 5: 0.15481377834381388},
 'KL': {1: OrderedDict([('relationship', 0.4951910610641741), ('marital-status', 0.2776259711735807), ('hours-per-week', 0.20990189265572687), ('education-num', 0.1996353893520096), ('education', 0.19963538935200956), ...
Predicting which cluster a data point belongs to
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect()  # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0   30                  3                        0.640378
1   63                  4                        0.611050
2   66                  3                        0.640378
3  110                  4                        0.611050
4  335                  1                        0.851054
Determining the 2 closest clusters
>>> model.set_params(extra_applyout_settings={'mode': 'closest_distances', 'nb_distances': 2})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect()  # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  ...  CLOSEST_CLUSTER_2  DISTANCE_TO_CLOSEST_CENTROID_2
0   30                  3  ...                  4                        0.730330
1   63                  4  ...                  1                        0.851054
2   66                  3  ...                  4                        0.730330
3  110                  4  ...                  1                        0.851054
4  335                  1  ...                  4                        0.906003
Retrieving the distances to all clusters
>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect()  # returns the output as a pandas DataFrame
    id  DISTANCE_TO_CENTROID_1  ...  DISTANCE_TO_CENTROID_5
0   30                       3  ...                 1.160697
1   63                       4  ...                 1.160697
2   66                       3  ...                 1.160697
Saving the model in the schema named 'MODEL_STORAGE'. Please see the model_storage class for further features of model storage.
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model)
Reloading the model for further use
>>> model2 = AutoUnsupervisedClustering(conn_context=CONN)
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.head(3).collect()
    id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0   30                  3                        0.640378
1   63                  4                        0.611050
2   66                  3                        0.640378
Exporting the SQL apply code
>>> model = AutoUnsupervisedClustering(CONN, nb_clusters=5, calculate_sql_expressions='enabled')
>>> model.fit(data=hana_df, key='id')
>>> sql = model.export_apply_code(code_type='HANA',
...                               key='id',
...                               schema_name='APL_SAMPLES',
...                               table_name='CENSUS')
- Attributes
- model_hana_ml DataFrame
The trained model content
- summary_APLArtifactTable
The reference to the "SUMMARY" table generated by the model training. This table contains the summary about the model training.
- indicators_APLArtifactTable
The reference to the "INDICATORS" table generated by the model training. This table contains the various metrics related to the model and its variables.
- fit_operation_logs_: APLArtifactTable
The reference to the "OPERATION_LOG" table generated by the model training
- var_desc_APLArtifactTable
The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training
- applyout_hana_ml DataFrame
The predictions generated the last time the model was applied
- predict_operation_logs_: APLArtifactTable
The reference to the "OPERATION_LOG" table when a prediction was made
Methods
disable_hana_execution
()HANA execution will be disabled and only SQL script will be generated.
enable_hana_execution
()HANA execution will be enabled.
export_apply_code
(code_type[, key, label, ...])Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL.
fit
(data[, key, features, weight])Fits the model.
fit_predict
(data[, key, features, weight])Fits a clustering model and uses it to generate prediction output on the training dataset.
get_apl_version
()Gets the version and configuration information about the installation of SAP HANA APL.
get_artifacts_recorder
()Returns the object recorder (for design-time artifacts generation).
get_debrief_report
(report_name)Retrieves a standard statistical report.
get_fit_operation_log
()Retrieves the operation log table after the model training.
get_indicators
()Retrieves the Indicator table after model training.
get_metrics
()Returns metrics about the model.
get_model_info
()Gets information about an existing model.
get_params
()Retrieves attributes of the current object.
get_predict_operation_log
()Retrieves the operation log table after a prediction.
get_summary
()Retrieves the summary table after model training.
is_fitted
()Checks if the model can be saved.
load_model
(schema_name, table_name[, oid])Loads the model from a table.
predict
(data)Predicts which cluster each specified row belongs to.
save_artifact
(artifact_df, schema_name, ...)Saves an artifact, a temporary table, into a permanent table.
save_model
(schema_name, table_name[, ...])Saves the model into a table.
schedule_fit
(output_table_name_model, ...)Creates a HANA scheduler job for the model fitting.
schedule_predict
(output_table_name_applyout, ...)Creates a HANA scheduler job for the model prediction.
set_params
(**parameters)Sets attributes of the current model.
set_scale_out
([route_to, no_route_to, ...])Specifies hints for scaling-out environment.
- fit(data, key=None, features=None, weight=None)
Fits the model.
- Parameters
- datahana_ml DataFrame
The training dataset
- keystr, optional
The name of the ID column. This column will not be used as a feature in the model. It will be output as the row ID when a prediction is made with the model. If key is not provided, an internal key is created, but this is not recommended.
- featureslist of str, optional
The names of the features to be used in the model. If features is not provided, all columns will be taken except the ID column.
- weightstr, optional
The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.
- Returns
- selfobject
- fit_predict(data, key=None, features=None, weight=None)
Fits a clustering model and uses it to generate prediction output on the training dataset.
- Parameters
- datahana_ml DataFrame
The input dataset
- keystr, optional
The name of the ID column.
- featureslist of str, optional.
The names of the feature columns. If features is not provided, all non-ID columns will be taken.
- weightstr, optional
The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.
- Returns
- hana_ml DataFrame.
The output is the same as the predict() method.
Notes
Please see the predict() method for how to get different outputs with the 'extra_applyout_settings' parameter.
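Examples
A usage sketch, reusing hana_df and the 'id' key column from the class-level example above:
>>> out = model.fit_predict(data=hana_df, key='id')
>>> out.head(5).collect()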
- get_metrics()
Returns metrics about the model.
- Returns
- Dictionary or pandas DataFrame
If no segment column is given, a dictionary object containing a set of clustering metrics and their values.
If a segment column is given, a pandas DataFrame which contains the metrics for each segment.
Examples
>>> model.get_metrics()
{'SimplifiedSilhouette': 0.14668968897882997,
 'RSS': 24462.640041325714,
 'IntraInertia': 3.2233573348587714,
 'KL': {1: OrderedDict([('hours-per-week', 0.2971627592049324), ('occupation', 0.11944355994892383), ('relationship', 0.06772624975990414), ('education-num', 0.06377345492340795), ('education', 0.06377345492340793), ...}
- disable_hana_execution()
HANA execution will be disabled and only SQL script will be generated.
- enable_hana_execution()
HANA execution will be enabled.
- export_apply_code(code_type, key=None, label=None, schema_name=None, table_name=None, other_params=None)
Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL. See Function Reference > Predictive Model Services > EXPORT_APPLY_CODE in the SAP HANA APL Developer Guide.
- Parameters
- code_type: str
The type of code exported. The supported code types for each type of model are given in the developer guide (see APL/CodeType).
- key: str, optional
The name of the primary key column. Required for some code types.
- label: str, optional
The name of the label (target) column. Used only when the model supports multiple targets. When set to the empty string (""), all targets are generated.
- schema_name: str, optional
The schema name of the apply-in table. Required for some code types.
- table_name: str, optional
The apply-in table name. Required for some code types.
- other_params: dict, optional
The additional parameters to be included in the configuration. The available parameters are given in the developer guide.
- Returns
- The exported code: str
Examples
Exporting a model in JSON format (available for Gradient Boosting and Robust Regression)
>>> json_export = model.export_apply_code('JSON')
APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.
Exporting SQL apply code (available for Robust Regression and Clustering)
>>> sql = model.export_apply_code(code_type='HANA',
...                               key='id',
...                               schema_name='APL_SAMPLES',
...                               table_name='CENSUS')
Exporting SQL apply code (probability generated in the output)
>>> sql = model.export_apply_code(code_type='HANA',
...                               key='id',
...                               schema_name='APL_SAMPLES',
...                               table_name='CENSUS',
...                               other_params={'APL/ApplyExtraMode': 'Advanced Apply Settings',
...                                             'APL/ApplyProba': 'true'})
- get_apl_version()
Gets the version and configuration information about the installation of SAP HANA APL.
- Returns
- A pandas DataFrame with detailed information about the current version.
Notes
Error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.
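Examples
A minimal usage sketch:
>>> version_df = model.get_apl_version()
>>> print(version_df)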
- get_artifacts_recorder()
Return the object recorder (for Design-time artifacts generation)
- get_debrief_report(report_name)
Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Developer Guide.
- Parameters
- report_name: str
The name of the statistical report to retrieve.
- Returns
- Statistical report: hana_ml DataFrame
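Examples
A usage sketch; the report name below is a placeholder, the valid names are listed under Statistical Reports in the SAP HANA APL Developer Guide:
>>> report = model.get_debrief_report('MyReportName')  # placeholder report name
>>> report.collect()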
- get_fit_operation_log()
Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
- Returns
- The reference to OPERATION_LOG tablehana_ml DataFrame
This table provides detailed logs of the last model training
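Examples
A minimal usage sketch:
>>> fit_log = model.get_fit_operation_log()
>>> fit_log.collect()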
- get_indicators()
Retrieves the Indicator table after model training.
- Returns
- The reference to INDICATORS tablehana_ml DataFrame
This table provides the performance metrics of the last model training
- get_model_info()
Gets information about an existing model. This method is especially useful when a trained model was saved and reloaded. After this method has been called, the model can provide summary and metrics again, as they were at the last fit.
- Returns
- list
List of HANA DataFrames respectively corresponding to the following tables:
Summary table
Variable roles table
Variable description table
Indicators table
Profit curves table
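Examples
A sketch assuming the five tables are returned in the order listed above:
>>> summary_df, roles_df, var_desc_df, indicators_df, profit_df = model.get_model_info()
>>> summary_df.collect()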
- get_params()
Retrieves attributes of the current object. This method is implemented for compatibility with scikit-learn.
- Returns
- The attribute-values of the modeldictionary
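Examples
A minimal sketch; nb_clusters is one of the constructor attributes listed above:
>>> params = model.get_params()
>>> params.get('nb_clusters')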
- get_predict_operation_log()
Retrieves the operation log table after a prediction.
- Returns
- The reference to the OPERATION_LOG tablehana_ml DataFrame
This table provides detailed logs about the last prediction
- get_summary()
Retrieves the summary table after model training.
- Returns
- The reference to the SUMMARY tablehana_ml DataFrame
This contains execution summary of the last model training
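Examples
A minimal usage sketch:
>>> summary = model.get_summary()
>>> summary.collect()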
- is_fitted()
Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.
- Returns
- bool
True if the model is ready to be saved.
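Examples
A sketch of a typical guard before saving, reusing the model_storage object from the class-level example:
>>> if model.is_fitted():
...     model_storage.save_model(model=model)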
- load_model(schema_name, table_name, oid=None)
Loads the model from a table.
Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- oidstr, optional
If the table contains several models, the OID must be given as an identifier. If it is not provided, the whole table is read.
- predict(data)
Predicts which cluster each specified row belongs to.
- Parameters
- datahana_ml DataFrame
The set of rows for which to generate cluster predictions. This dataset must have the same structure as the one used in the fit() method.
- Returns
- hana_ml DataFrame
By default, the ID of the closest cluster and the distance to its center are provided. Users can request different outputs by setting the extra_applyout_settings parameter in the model. The extra_applyout_settings parameter is a dictionary with 'mode' and 'nb_distances' as keys. If mode is set to 'closest_distances', cluster IDs and distances to centroids will be provided from the closest to the furthest cluster. The output columns will be:
<The key column name>,
CLOSEST_CLUSTER_1,
DISTANCE_TO_CLOSEST_CENTROID_1,
CLOSEST_CLUSTER_2,
DISTANCE_TO_CLOSEST_CENTROID_2,
...
If mode is set to 'all_distances', the distances to the centroids of all clusters will be provided in cluster ID order. The output columns will be:
ID,
DISTANCE_TO_CENTROID_1,
DISTANCE_TO_CENTROID_2,
...
nb_distances limits the output to the closest clusters. It is only valid when mode is 'closest_distances' (it is ignored if mode is 'all_distances'). It can be set to 'all' or a positive integer.
Examples
Retrieves the IDs of the 3 closest clusters and the distances to their centroids:
>>> extra_applyout_settings = {'mode': 'closest_distances', 'nb_distances': '3'}
>>> model.set_params(extra_applyout_settings=extra_applyout_settings)
>>> out = model.predict(hana_df)
>>> out.head(3).collect()
    id  CLOSEST_CLUSTER_1  ...  CLOSEST_CLUSTER_3  DISTANCE_TO_CLOSEST_CENTROID_3
0   30                  3  ...                  4                        0.730330
1   63                  4  ...                  1                        0.851054
2   66                  3  ...                  4                        0.730330
Retrieves the distances to all clusters:
>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'})
>>> out = model.predict(hana_df)
>>> out.head(3).collect()
    id  DISTANCE_TO_CENTROID_1  DISTANCE_TO_CENTROID_2  ...  DISTANCE_TO_CENTROID_5
0   30                0.994595                0.877414  ...                0.782949
1   63                0.994595                0.985202  ...                0.782949
2   66                0.994595                0.877414  ...                0.782949
- save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)
Saves an artifact, a temporary table, into a permanent table. The model has to be fitted beforehand.
- Parameters
- schema_name: str
The schema name
- artifact_dfhana_ml DataFrame
The artifact created after fit or predict methods are called
- table_name: str
The table name
- if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises a ValueError
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
Examples
>>> myModel.save_artifact(artifact_df=myModel.indicators_,
...                       schema_name='MySchema',
...                       table_name='MyModel_Indicators',
...                       if_exists='replace')
- save_model(schema_name, table_name, if_exists='fail', new_oid=None)
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).
Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
- Parameters
- schema_name: str
The schema name
- table_name: str
Table name
- if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises an Error
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
- Returns
- None
The model is saved into a table with the following columns:
"OID" NVARCHAR(50), -- Serve as ID
"FORMAT" NVARCHAR(50), -- APL technical info
"LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model
- schedule_fit(output_table_name_model, output_table_name_log, output_table_name_summary, output_table_name_indicators, **schedule_kwargs)
Creates a HANA scheduler job for the model fitting. It is a wrapper function of the HANAScheduler.create_training_schedule() method. It allows users to explicitly specify arguments such as the output table names.
- Parameters
- output_table_name_model: str
The output table name for the model binary.
- output_table_name_log: str
The output table name for the log data.
- output_table_name_summary: str
The output table name for the model summary.
- output_table_name_indicators: str
The output table name for the model indicators.
- schedule_kwargs: kwargs dictionary
Arguments forwarded to the HANAScheduler.create_training_schedule method; please refer to its documentation for details.
Examples
>>> model = AutoUnsupervisedClustering(nb_clusters=5)
>>> model.fit(data=data, key='id')
>>> model.schedule_fit(
...     output_table_name_model='OUTPUT_MODEL_BINARY',
...     output_table_name_log='OUTPUT_FIT_LOG',
...     output_table_name_summary='OUTPUT_SUMMARY_LOG',
...     output_table_name_indicators='OUTPUT_FIT_INDICATORS',
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)
- schedule_predict(output_table_name_applyout, output_table_name_log, output_table_name_summary, **schedule_kwargs)
Creates a HANA scheduler job for the model prediction. It is a wrapper function of the HANAScheduler.create_applying_schedule() method. It allows users to explicitly specify arguments such as the output table names.
- Parameters
- input_table_name_model: str
The input table name for the model binary.
- output_table_name_applyout: str
The output table name for the prediction data.
- output_table_name_log: str
The output table name for the log data.
- output_table_name_summary: str
The output table name for the model summary.
- schedule_kwargs: kwargs dictionary
Arguments forwarded to the HANAScheduler.create_applying_schedule method; please refer to its documentation for details.
Examples
>>> model = AutoUnsupervisedClustering(nb_clusters=5)
>>> model.fit(data=data, key='id')
>>> model.predict(data)
>>> model.schedule_predict(
...     output_table_name_applyout="OUTPUT_PREDICT_APPLYOUT",
...     output_table_name_log="OUTPUT_PREDICT_LOG",
...     output_table_name_summary="OUTPUT_PREDICT_SUMMARY",
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)
- set_params(**parameters)
Sets attributes of the current model.
- Parameters
- paramsdictionary
The set of parameters with their new values
- set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)
Specifies hints for a scale-out environment. The execution of APL functions can then be routed to a specific computation node. If all the parameters are None, i.e. not given, all the existing hints are cleared.
- Parameters
- route_tostr, optional
Routes the query to the specified volume ID or service type.
Defaults to None.
- no_route_tostr or list of str, optional
Avoids query routing to a specified volume ID or service type.
Defaults to None.
- route_bystr, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s).
Defaults to None.
- route_by_cardinalitystr or list of str, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.
- data_transfer_costint, optional
Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.
Defaults to None.
- route_optimization_level{'minimal', 'all'}, optional
Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level 'all'. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.
Defaults to None.
- workload_classstr, optional
Routes the query via workload class. The route_to statement hint has higher precedence than the workload_class statement hint.
Defaults to None.
Examples
>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> model = GradientBoostingBinaryClassifier()
>>> # Routes the execution to a specific volume ID.
>>> model.set_scale_out(route_to=1025)
>>> # Routes the execution to a specific service type.
>>> # model.set_scale_out(route_to='computeserver')
>>> # Maps the execution to a specific workload class.
>>> # model.set_scale_out(workload_class="WC4")
>>> # Activates the SQL trace.
>>> # connection_context.sql_tracer.enable_sql_trace(True)
>>> model.fit(data=hdb_df, key='KEY', label='Y')
>>> # You can check whether the queries were effectively routed by querying:
>>> # select HOST, VOLUME_ID, APPLICATION_NAME, STATEMENT_STRING from M_SQL_PLAN_CACHE.
- class hana_ml.algorithms.apl.clustering.AutoSupervisedClustering(conn_context=None, label=None, nb_clusters=None, nb_clusters_min=None, nb_clusters_max=None, distance=None, variable_storages=None, variable_value_types=None, variable_missing_strings=None, extra_applyout_settings=None, **other_params)
Bases:
_AutoClusteringBase
SAP HANA APL Supervised Clustering algorithm. Clusters are determined with respect to a label variable.
- Parameters
- label: str,
The name of the label column
- nb_clustersint, optional, default = 10
The number of clusters to create
- nb_clusters_min: int, optional
The minimum number of clusters to create. If the nb_clusters parameter is set, this parameter is ignored.
- nb_clusters_max: int, optional
The maximum number of clusters to create. If the nb_clusters parameter is set, this parameter is ignored.
- distance: str, optional, default = 'SystemDetermined'
The metric used to measure the distance between data points. The possible values are: 'L1', 'L2', 'LInf', 'SystemDetermined'.
- variable_storages: dict, optional
Specifies the variable data types (string, integer, number). For example, {'VAR1': 'string', 'VAR2': 'number'}. See notes below for more details.
- variable_value_types: dict, optional
Specifies the variable value types (continuous, nominal, ordinal). For example, {'VAR1': 'continuous', 'VAR2': 'nominal'}. See notes below for more details.
- variable_missing_strings: dict, optional
Specifies the variable values that will be taken as missing. For example, {'VAR1': '???'} means anytime the variable value equals '???', it will be taken as missing.
- extra_applyout_settings: dict, optional
Defines the output to generate when applying the model. See documentation on predict() method for more information.
- other_params: dict, optional
Corresponds to the advanced settings. The dictionary contains {<parameter_name>: <parameter_value>}. The possible parameters are:
'max_tasks'
'segment_column_name'
'calculate_cross_statistics'
'calculate_sql_expressions'
'cutting_strategy'
'encoding_strategy'
See Common APL Aliases for Model Training in the SAP HANA APL Developer Guide.
For 'max_tasks', see FUNC_HEADER.
- other_train_apl_aliases: dict, optional
Users can provide APL aliases as advanced settings to the model. Unlike 'other_params' described above, users are free to input any possible value. There is no validation on the Python side.
Notes
The algorithm may detect fewer clusters than requested. This happens when a cluster detected on the estimation dataset was not found on the validation dataset. In that case, this cluster is considered unstable and is removed from the model. Users can get the number of clusters actually found in the "INDICATORS" table. For example,
# The actual number of clusters found
d = model_u.get_indicators().collect()
d[d.KEY=='FinalClusterCount'][['KEY','VALUE']]
It is highly recommended to provide a dataset with a key in the fit() method. Otherwise, once the model is trained, it will no longer be possible to use the predict() method with a key, because the model will not expect one.
By default, when the variable description is not given, SAP HANA APL guesses it by reading the first 100 rows. Sometimes this guess is incorrect. By explicitly providing values for these parameters, the user can overwrite the default guess. For example:
model.set_params(variable_storages={
    'ID': 'integer',
    'sepal length (cm)': 'number'})
model.set_params(variable_value_types={
    'sepal length (cm)': 'continuous'})
model.set_params(variable_missing_strings={
    'sepal length (cm)': '-1'})
Examples
>>> from hana_ml.algorithms.apl.clustering import AutoSupervisedClustering
>>> from hana_ml.dataframe import ConnectionContext, DataFrame
Connecting to SAP HANA
>>> CONN = ConnectionContext('HDB_HOST', HDB_PORT, 'HDB_USER', 'HDB_PASS')
>>> # -- Creates Hana DataFrame
>>> hana_df = DataFrame(CONN, 'select * from APL_SAMPLES.CENSUS')
Creating and fitting the model
>>> model = AutoSupervisedClustering(nb_clusters=5)
>>> model.fit(data=hana_df, key='id', label='class')
Debriefing
>>> model.get_metrics()
OrderedDict([('SimplifiedSilhouette', 0.3448029020802121), ('RSS', 4675.706587754118),...
Predicting which cluster a data point belongs to
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect()  # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0   30                  3                        0.640378
1   63                  4                        0.611050
2   66                  3                        0.640378
3  110                  4                        0.611050
4  335                  1                        0.851054
Determining the 2 closest clusters
>>> model.set_params(extra_applyout_settings={'mode': 'closest_distances', 'nb_distances': 2})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect()  # returns the output as a pandas DataFrame
    id  CLOSEST_CLUSTER_1  ...  CLOSEST_CLUSTER_2  DISTANCE_TO_CLOSEST_CENTROID_2
0   30                  3  ...                  4                        0.730330
1   63                  4  ...                  1                        0.851054
2   66                  3  ...                  4                        0.730330
3  110                  4  ...                  1                        0.851054
4  335                  1  ...                  4                        0.906003
Retrieving the distances to all clusters
>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'})
>>> applyout_df = model.predict(hana_df)
>>> applyout_df.collect()  # returns the output as a pandas DataFrame
    id  DISTANCE_TO_CENTROID_1  ...  DISTANCE_TO_CENTROID_5
0   30                0.851054  ...                1.160697
1   63                0.751054  ...                1.160697
2   66                0.906003  ...                1.160697
Saving the model in the schema named 'MODEL_STORAGE'. Please see the model_storage class for further features of model storage.
>>> model_storage = ModelStorage(connection_context=CONN, schema='MODEL_STORAGE')
>>> model.name = 'My model name'
>>> model_storage.save_model(model=model, if_exists='replace')
Reloading the model for further use. Please note that the label has to be specified again prior to calling predict().
>>> model2 = AutoSupervisedClustering()
>>> model2.set_params(label='class')
>>> model2.load_model(schema_name='MySchema', table_name='MyTable')
>>> applyout2 = model2.predict(hana_df)
>>> applyout2.head(3).collect()
    id  CLOSEST_CLUSTER_1  DISTANCE_TO_CLOSEST_CENTROID_1
0   30                  3                        0.640378
1   63                  4                        0.611050
2   66                  3                        0.640378
Exporting the SQL apply code
>>> model = AutoSupervisedClustering(CONN, nb_clusters=5, calculate_sql_expressions='enabled')
>>> model.fit(data=hana_df, key='id', label='class')
>>> sql = model.export_apply_code(code_type='HANA',
...                               key='id',
...                               schema_name='APL_SAMPLES',
...                               table_name='CENSUS')
- Attributes
- model_hana_ml DataFrame
The trained model content
- summary_APLArtifactTable
The reference to the "SUMMARY" table generated by the model training. This table contains the summary about the model training.
- indicators_APLArtifactTable
The reference to the "INDICATORS" table generated by the model training. This table contains the various metrics related to the model and its variables.
- fit_operation_logs_: APLArtifactTable
The reference to the "OPERATION_LOG" table generated by the model training
- var_desc_APLArtifactTable
The reference to the "VARIABLE_DESCRIPTION" table that was built during the model training
- applyout_hana_ml DataFrame
The predictions generated the last time the model was applied
- predict_operation_logs_: APLArtifactTable
The reference to the "OPERATION_LOG" table when a prediction was made
Methods
disable_hana_execution
()HANA execution will be disabled and only SQL script will be generated.
enable_hana_execution
()HANA execution will be enabled.
export_apply_code
(code_type[, key, label, ...])Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL.
fit
(data[, key, label, features, weight])Fits the model.
fit_predict
(data[, key, label, features, weight])Fits a clustering model and uses it to generate prediction output on the training dataset.
get_apl_version
()Gets the version and configuration information about the installation of SAP HANA APL.
get_artifacts_recorder
()Returns the object recorder (for design-time artifacts generation).
get_debrief_report
(report_name)Retrieves a standard statistical report.
get_fit_operation_log
()Retrieves the operation log table after the model training.
get_indicators
()Retrieves the Indicator table after model training.
get_metrics
()Returns metrics about the model.
get_model_info
()Gets information about an existing model.
get_params
()Retrieves attributes of the current object.
get_predict_operation_log
()Retrieves the operation log table after a prediction.
get_summary
()Retrieves the summary table after model training.
is_fitted
()Checks if the model can be saved.
load_model
(schema_name, table_name[, oid])Loads the model from a table.
predict
(data)Predicts which cluster each specified row belongs to.
save_artifact
(artifact_df, schema_name, ...)Saves an artifact, a temporary table, into a permanent table.
save_model
(schema_name, table_name[, ...])Saves the model into a table.
schedule_fit
(output_table_name_model, ...)Creates a HANA scheduler job for the model fitting.
schedule_predict
(output_table_name_applyout, ...)Creates a HANA scheduler job for the model prediction.
set_params
(**parameters)Sets attributes of the current model.
set_scale_out
([route_to, no_route_to, ...])Specifies hints for scaling-out environment.
- set_params(**parameters)
Sets attributes of the current model.
- Parameters
- paramsdictionary
The attribute names and values
- fit(data, key=None, label=None, features=None, weight=None)
Fits the model.
- Parameters
- datahana_ml DataFrame
The training dataset
- keystr, optional
The name of the ID column. This column will not be used as a feature in the model. It will be output as the row ID when a prediction is made with the model. If key is not provided, an internal key is created, but this is not recommended.
- labelstr, optional
The name of the label column. If it is not given, the model 'label' attribute will be taken. If the latter is not defined, an error will be raised.
- featureslist of str, optional
The names of the features to be used in the model. If features is not provided, all columns will be taken except the ID and the label columns.
- weightstr, optional
The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.
- Returns
- selfobject
- predict(data)
Predicts which cluster each specified row belongs to.
- Parameters
- datahana_ml DataFrame
The set of rows for which to generate cluster predictions. This dataset must have the same structure as the one used in the fit() method.
- Returns
- hana_ml DataFrame
By default, the ID of the closest cluster and the distance to its center are provided. Users can request different outputs by setting the extra_applyout_settings parameter in the model. The extra_applyout_settings parameter is a dictionary with 'mode' and 'nb_distances' as keys. If mode is set to 'closest_distances', cluster IDs and distances to centroids will be provided from the closest to the furthest cluster. The output columns will be:
<The key column name>,
CLOSEST_CLUSTER_1,
DISTANCE_TO_CLOSEST_CENTROID_1,
CLOSEST_CLUSTER_2,
DISTANCE_TO_CLOSEST_CENTROID_2,
...
If mode is set to 'all_distances', the distances to the centroids of all clusters will be provided in cluster ID order. The output columns will be:
ID,
DISTANCE_TO_CENTROID_1,
DISTANCE_TO_CENTROID_2,
...
nb_distances limits the output to the closest clusters. It is only valid when mode is 'closest_distances' (it is ignored if mode is 'all_distances'). It can be set to 'all' or a positive integer.
Examples
Retrieves the IDs of the 3 closest clusters and the distances to their centroids:
>>> extra_applyout_settings = {'mode': 'closest_distances', 'nb_distances': 3}
>>> model.set_params(extra_applyout_settings=extra_applyout_settings)
>>> out = model.predict(hana_df)
>>> out.head(3).collect()
    id  CLOSEST_CLUSTER_1  ...  CLOSEST_CLUSTER_3  DISTANCE_TO_CLOSEST_CENTROID_3
0   30                  3  ...                  4                        0.730330
1   63                  4  ...                  1                        0.851054
2   66                  3  ...                  4                        0.730330
Retrieves the distances to all clusters:
>>> model.set_params(extra_applyout_settings={'mode': 'all_distances'})
>>> out = model.predict(hana_df)
>>> out.head(3).collect()
    id  DISTANCE_TO_CENTROID_1  DISTANCE_TO_CENTROID_2  ...  DISTANCE_TO_CENTROID_5
0   30                0.994595                0.877414  ...                0.782949
1   63                0.994595                0.985202  ...                0.782949
2   66                0.994595                0.877414  ...                0.782949
- fit_predict(data, key=None, label=None, features=None, weight=None)
Fits a clustering model and uses it to generate prediction output on the training dataset.
- Parameters
- datahana_ml DataFrame
The input dataset
- keystr, optional
The name of the ID column
- labelstr
The name of the label column
- featureslist of str, optional.
The names of the feature columns. If features is not provided, all non-ID and non-label columns will be taken.
- weightstr, optional
The name of the weight variable. A weight variable allows one to assign a relative weight to each of the observations.
- Returns
- hana_ml DataFrame.
The output is the same as the predict() method.
Notes
Please see the predict() method for how to get different outputs with the 'extra_applyout_settings' parameter.
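Examples
A usage sketch, reusing hana_df and the column names from the class-level example above:
>>> out = model.fit_predict(data=hana_df, key='id', label='class')
>>> out.head(5).collect()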
- get_metrics()
Returns metrics about the model.
- Returns
- Dictionary or pandas DataFrame
If no segment column is given, a dictionary object containing a set of clustering metrics and their values.
If a segment column is given, a pandas DataFrame which contains the metrics for each segment.
Examples
>>> model.get_metrics()
{'SimplifiedSilhouette': 0.14668968897882997,
 'RSS': 24462.640041325714,
 'IntraInertia': 3.2233573348587714,
 'Frequency': {1: 0.3167862345729914, 2: 0.35590005772243755, 3: 0.3273137077045711},
 'IntraInertia': {1: 0.7450335510518645, 2: 0.708350629565789, 3: 0.7006679558645009},
 'RSS': {1: 8586.511675872738, 2: 9171.723951617836, 3: 8343.554018434477},
 'SimplifiedSilhouette': {1: 0.13324659043317924, 2: 0.14182734764281074, 3: 0.1311620470933516},
 'TargetMean': {1: 0.1744734931009441, 2: 0.022912917070469333, 3: 0.3895408163265306},
 'TargetStandardDeviation': {1: 0.37951613049526484, 2: 0.14962591788119842, 3: 0.48764615116105525},
 'KL': {1: OrderedDict([('hours-per-week', 0.2971627592049324), ('occupation', 0.11944355994892383), ('relationship', 0.06772624975990414), ('education-num', 0.06377345492340795), ('education', 0.06377345492340793), ...
- load_model(schema_name, table_name, oid=None)
Loads the model from a table.
- Parameters
- schema_name: str
The schema name
- table_name: str
The table name
- oidstr, optional
If the table contains several models, the OID must be given as an identifier. If it is not provided, the whole table is read.
Notes
Prior to using a reloaded model for a new prediction, it is necessary to re-specify the 'label' parameter. Otherwise, the predict() method will fail.
Examples
>>> # The label must be specified again before prediction
>>> model = AutoSupervisedClustering(label='class')
>>> model.load_model(schema_name='MY_SCHEMA', table_name='MY_MODEL_TABLE')
>>> model.predict(hana_df)
- disable_hana_execution()
HANA execution will be disabled and only SQL script will be generated.
- enable_hana_execution()
HANA execution will be enabled.
- export_apply_code(code_type, key=None, label=None, schema_name=None, table_name=None, other_params=None)
Exports code (SQL, JSON, etc.) so that you can apply a trained model outside SAP HANA APL. See Function Reference > Predictive Model Services > EXPORT_APPLY_CODE in the SAP HANA APL Developer Guide.
- Parameters
- code_type: str
The type of code exported. The supported code types for each type of model are given in the developer guide (see APL/CodeType).
- key: str, optional
The name of the primary key column. Required for some code types.
- label: str, optional
The name of the label (target) column. Used only when the model supports multiple targets. When set to the empty string (""), all targets are generated.
- schema_name: str, optional
The schema name of the apply-in table. Required for some code types.
- table_name: str, optional
The apply-in table name. Required for some code types.
- other_params: dict, optional
The additional parameters to be included in the configuration. The available parameters are given in the developer guide.
- Returns
- The exported code: str
Examples
Exporting a model in JSON format (available for Gradient Boosting and Robust Regression)
>>> json_export = model.export_apply_code('JSON')
APL provides a JavaScript runtime in which you can make predictions based on any model that has been exported in JSON format. See JavaScript Runtime in the SAP HANA APL Developer Guide.
Exporting SQL apply code (available for Robust Regression and Clustering)
>>> sql = model.export_apply_code(code_type='HANA',
...                               key='id',
...                               schema_name='APL_SAMPLES',
...                               table_name='CENSUS')
Exporting SQL apply code (probability generated in the output)
>>> sql = model.export_apply_code(code_type='HANA',
...                               key='id',
...                               schema_name='APL_SAMPLES',
...                               table_name='CENSUS',
...                               other_params={'APL/ApplyExtraMode': 'Advanced Apply Settings',
...                                             'APL/ApplyProba': 'true'})
- get_apl_version()
Gets the version and configuration information about the installation of SAP HANA APL.
- Returns
- A pandas DataFrame with detailed information about the current version.
Notes
Error is raised when the call fails. The cause can be that either SAP HANA APL is not installed or the current user does not have the appropriate rights.
- get_artifacts_recorder()
Return the object recorder (for Design-time artifacts generation)
- get_debrief_report(report_name)
Retrieves a standard statistical report. See Statistical Reports in the SAP HANA APL Developer Guide.
- Parameters
- report_name: str
The name of the statistical report to retrieve.
- Returns
- Statistical report: hana_ml DataFrame
- get_fit_operation_log()
Retrieves the operation log table after the model training. This table contains the log given by SAP HANA APL during the last fit operation.
- Returns
- The reference to OPERATION_LOG tablehana_ml DataFrame
This table provides detailed logs of the last model training
- get_indicators()
Retrieves the Indicator table after model training.
- Returns
- The reference to INDICATORS tablehana_ml DataFrame
This table provides the performance metrics of the last model training
- get_model_info()
Gets information about an existing model. This method is especially useful when a trained model was saved and reloaded. After this method has been called, the model can provide summary and metrics again, as they were at the last fit.
- Returns
- list
List of HANA DataFrames respectively corresponding to the following tables:
Summary table
Variable roles table
Variable description table
Indicators table
Profit curves table
- get_params()
Retrieves attributes of the current object. This method is implemented for compatibility with scikit-learn.
- Returns
- The attribute-values of the modeldictionary
- get_predict_operation_log()
Retrieves the operation log table after a prediction.
- Returns
- The reference to the OPERATION_LOG tablehana_ml DataFrame
This table provides detailed logs about the last prediction
- get_summary()
Retrieves the summary table after model training.
- Returns
- The reference to the SUMMARY tablehana_ml DataFrame
This contains execution summary of the last model training
- is_fitted()
Checks if the model can be saved. To be overridden if the model is not stored in the model_ attribute.
- Returns
- bool
True if the model is ready to be saved.
- save_artifact(artifact_df, schema_name, table_name, if_exists='fail', new_oid=None)
Saves an artifact, a temporary table, into a permanent table. The model has to be fitted beforehand.
- Parameters
- schema_name: str
The schema name
- artifact_dfhana_ml DataFrame
The artifact created after fit or predict methods are called
- table_name: str
The table name
- if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises a ValueError
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
Examples
>>> myModel.save_artifact(artifact_df=myModel.indicators_,
...                       schema_name='MySchema',
...                       table_name='MyModel_Indicators',
...                       if_exists='replace')
- save_model(schema_name, table_name, if_exists='fail', new_oid=None)
Saves the model into a table. The model has to be trained beforehand. The model can be saved either into a new table (if_exists='replace'), or an existing table (if_exists='append'). In the latter case, the user can provide an identifier value (new_oid). The oid must be unique. By default, this oid is set when the model is created in Python (model.id attribute).
Warning
This method is deprecated. Please use hana_ml.model_storage.ModelStorage.
- Parameters
- schema_name: str
The schema name
- table_name: str
Table name
- if_exists: str. {'fail', 'replace', 'append'}, default 'fail'
The behavior when the table already exists:
fail: Raises an Error
replace: Drops the table before inserting new values
append: Inserts new values to the existing table
- new_oid: str. Optional.
If it is given, it will be inserted as a new OID value. It is useful when one wants to save data into the same table.
- Returns
- None
The model is saved into a table with the following columns:
"OID" NVARCHAR(50), -- Serve as ID
"FORMAT" NVARCHAR(50), -- APL technical info
"LOB" CLOB MEMORY THRESHOLD NULL -- binary content of the model
- schedule_fit(output_table_name_model, output_table_name_log, output_table_name_summary, output_table_name_indicators, **schedule_kwargs)
Creates a HANA scheduler job for the model fitting. It is a wrapper function of the HANAScheduler.create_training_schedule() method. It allows users to explicitly specify arguments such as the output table names.
- Parameters
- output_table_name_model: str
The output table name for the model binary.
- output_table_name_log: str
The output table name for the log data.
- output_table_name_summary: str
The output table name for the model summary.
- output_table_name_indicators: str
The output table name for the model indicators.
- schedule_kwargs: kwargs dictionary
Arguments forwarded to the HANAScheduler.create_training_schedule method; please refer to its documentation for details.
Examples
>>> model = AutoSupervisedClustering(nb_clusters=5)
>>> model.fit(data=data, key='id', label='class')
>>> model.schedule_fit(
...     output_table_name_model='OUTPUT_MODEL_BINARY',
...     output_table_name_log='OUTPUT_FIT_LOG',
...     output_table_name_summary='OUTPUT_SUMMARY_LOG',
...     output_table_name_indicators='OUTPUT_FIT_INDICATORS',
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)
- schedule_predict(output_table_name_applyout, output_table_name_log, output_table_name_summary, **schedule_kwargs)
Creates a HANA scheduler job for the model prediction. It is a wrapper function of the HANAScheduler.create_applying_schedule() method. It allows users to explicitly specify arguments such as the output table names.
- Parameters
- input_table_name_model: str
The input table name for the model binary.
- output_table_name_applyout: str
The output table name for the prediction data.
- output_table_name_log: str
The output table name for the log data.
- output_table_name_summary: str
The output table name for the model summary.
- schedule_kwargs: kwargs dictionary
Arguments forwarded to the HANAScheduler.create_applying_schedule method; please refer to its documentation for details.
Examples
>>> model = AutoSupervisedClustering(nb_clusters=5)
>>> model.fit(data=data, key='id', label='class')
>>> model.predict(data)
>>> model.schedule_predict(
...     output_table_name_applyout="OUTPUT_PREDICT_APPLYOUT",
...     output_table_name_log="OUTPUT_PREDICT_LOG",
...     output_table_name_summary="OUTPUT_PREDICT_SUMMARY",
...     job_name=job_name,
...     obj=model,
...     cron="* * * mon,tue,wed,thu,fri 1 23 45",
...     procedure_name=procedure_name,
...     force=True)
- set_scale_out(route_to=None, no_route_to=None, route_by=None, route_by_cardinality=None, data_transfer_cost=None, route_optimization_level=None, workload_class=None)
Specifies hints for a scale-out environment. The execution of APL functions can then be routed to a specific computation node. If all the parameters are None, i.e. not given, all the existing hints are cleared.
- Parameters
- route_tostr, optional
Routes the query to the specified volume ID or service type.
Defaults to None.
- no_route_tostr or list of str, optional
Avoids query routing to a specified volume ID or service type.
Defaults to None.
- route_bystr, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s).
Defaults to None.
- route_by_cardinalitystr or list of str, optional
Routes the query to the hosts related to the base table(s) of the specified projection view(s) with the highest cardinality from the input list.
- data_transfer_costint, optional
Guides the optimizer to use the weighting factor for the data transfer cost. The value 0 ignores the data transfer cost.
Defaults to None.
- route_optimization_level{'minimal', 'all'}, optional
Guides the optimizer to compile with route_optimization_level 'minimal' or to default to route_optimization_level 'all'. If the 'minimal' compiled plan is cached, then it compiles once more using the default optimization level during the first execution. This hint is primarily used to shorten statement routing decisions during the initial compilation.
Defaults to None.
- workload_classstr, optional
Routes the query via workload class. The route_to statement hint has higher precedence than the workload_class statement hint.
Defaults to None.
Examples
>>> from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
>>> model = GradientBoostingBinaryClassifier()
>>> # Routes the execution to a specific volume ID.
>>> model.set_scale_out(route_to=1025)
>>> # Routes the execution to a specific service type.
>>> # model.set_scale_out(route_to='computeserver')
>>> # Maps the execution to a specific workload class.
>>> # model.set_scale_out(workload_class="WC4")
>>> # Activates the SQL trace.
>>> # connection_context.sql_tracer.enable_sql_trace(True)
>>> model.fit(data=hdb_df, key='KEY', label='Y')
>>> # You can check whether the queries were effectively routed by querying:
>>> # select HOST, VOLUME_ID, APPLICATION_NAME, STATEMENT_STRING from M_SQL_PLAN_CACHE.