UnifiedRegression

class hana_ml.algorithms.pal.unified_regression.UnifiedRegression(func, massive=False, group_params=None, **kwargs)

The Python wrapper for SAP HANA PAL unified-regression function.

Compared with the original regression interfaces, new features supported are listed below:

  • Easy switching between regression algorithms

  • Automatic dataset partitioning

  • A built-in model evaluation procedure

  • Support for more metrics

Parameters
func: str

The name of a specified regression algorithm.

The following algorithms (case-insensitive) are supported:

  • 'DecisionTree'

  • 'HybridGradientBoostingTree'

  • 'LinearRegression'

  • 'RandomDecisionTree'

  • 'MLP'

  • 'SVM'

  • 'GLM'

  • 'GeometricRegression'

  • 'PolynomialRegression'

  • 'ExponentialRegression'

  • 'LogarithmicRegression'

massive: bool, optional

Specifies whether or not to use the massive mode of unified regression.

In massive mode, parameters can be set either through group_params or through the original parameters. The original parameters apply to all groups. However, if any parameter is defined for a group in group_params, none of the original parameter settings apply to that group.

An example is as follows:

In this example, as 'percentage' is set in group_params for Group_1, the parameter setting of 'thread_ratio' is not applicable to Group_1.

Defaults to False.
group_params: dict, optional

If massive mode is activated (massive is True), the input data for regression is divided into groups, and different regression parameters can be applied to each group. This parameter specifies the parameter values of the chosen regression algorithm func for the different groups in dict format, where each key is a value of group_key and each value is a dict of regression algorithm parameter assignments.

An example is as follows:

Valid only when massive is True.

Defaults to None.
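The override rule for massive mode can be sketched in plain Python. The helper `effective_params` is hypothetical (it is not part of hana_ml) and only mirrors the rule described above: a group listed in group_params uses its own settings exclusively, while every other group falls back to the original keyword parameters. The group names and values are illustrative.

```python
def effective_params(group, common, group_params=None):
    """Return the parameter dict that would apply to `group`.

    Hypothetical helper, not a hana_ml function: it mirrors the documented
    rule that group-specific settings replace, rather than merge with,
    the original (common) parameters.
    """
    group_params = group_params or {}
    if group in group_params:
        return dict(group_params[group])
    return dict(common)


common = {'thread_ratio': 0.5}                    # original parameters
group_params = {'Group_1': {'percentage': 0.6}}   # keys are group_key values

print(effective_params('Group_1', common, group_params))
print(effective_params('Group_2', common, group_params))

# With a HANA connection, the same dicts would be passed along the lines of
# UnifiedRegression(func=..., massive=True, group_params=group_params, **common)
```

Here Group_1 ends up without 'thread_ratio', which matches the behaviour described for the massive parameter.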

**kwargs: keyword arguments

Arbitrary keyword arguments. Please refer to the corresponding algorithm for the parameters' key-value pairs.

Note that some parameters of the individual regression algorithms are disabled here:

  • 'DecisionTree' : DecisionTreeRegressor

    • Disabled parameters: output_rules

    • Parameters removed from initialization but can be specified in fit(): categorical_variable

  • 'HybridGradientBoostingTree' : HybridGradientBoostingRegressor

    • Disabled parameters: calculate_importance

    • Parameters removed from initialization but can be specified in fit(): categorical_variable

  • 'LinearRegression' : LinearRegression

    • Disabled parameters: pmml_export

    • Parameters removed from initialization but can be specified in fit(): categorical_variable

    • Parameters with changed meaning: json_export, where a False value now means 'Exports multiple linear regression model in PMML'.

  • 'RandomDecisionTree' : RDTRegressor

    • Disabled parameters: calculate_oob

    • Parameters removed from initialization but can be specified in fit(): categorical_variable

  • 'MLP' : MLPRegressor

    • Disabled parameters: functionality

    • Parameters removed from initialization but can be specified in fit(): categorical_variable

  • 'SVM' : SVR

    • Parameters removed from initialization but can be specified in fit(): categorical_variable

  • 'GLM' : GLM

    • Disabled parameters: output_fitted

    • Parameters removed from initialization but can be specified in fit(): categorical_variable

  • 'GeometricRegression' : BiVariateGeometricRegression

    • Disabled parameters: pmml_export

  • 'PolynomialRegression' : PolynomialRegression

    • Disabled parameters: pmml_export

  • 'ExponentialRegression' : ExponentialRegression

    • Disabled parameters: pmml_export

  • 'LogarithmicRegression' : BiVariateNaturalLogarithmicRegression

    • Disabled parameters: pmml_export

For more parameter mappings of hana_ml and HANA PAL, please refer to the doc page: Parameter Mappings.
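As a rough illustration, the disabled-parameter list above can be turned into a pre-flight check that runs before any call to HANA. The `DISABLED` table is copied from the list above; the function `disabled_arguments` is a hypothetical helper and not part of hana_ml.

```python
# Disabled initialization parameters per algorithm, copied from the list above.
DISABLED = {
    'decisiontree': {'output_rules'},
    'hybridgradientboostingtree': {'calculate_importance'},
    'linearregression': {'pmml_export'},
    'randomdecisiontree': {'calculate_oob'},
    'mlp': {'functionality'},
    'svm': set(),
    'glm': {'output_fitted'},
    'geometricregression': {'pmml_export'},
    'polynomialregression': {'pmml_export'},
    'exponentialregression': {'pmml_export'},
    'logarithmicregression': {'pmml_export'},
}


def disabled_arguments(func, **kwargs):
    """Return the sorted names of keyword arguments that are disabled for func."""
    key = func.lower()  # algorithm names are case-insensitive
    if key not in DISABLED:
        raise ValueError(f"unsupported func: {func!r}")
    return sorted(DISABLED[key].intersection(kwargs))


print(disabled_arguments('LinearRegression', solver='qr', pmml_export='no'))
```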

Examples

Case 1: Training data for regression:

>>> data_tbl.collect()
  ID    X1 X2  X3       Y
0  0  0.00  A   1  -6.879
1  1  0.50  A   1  -3.449
2  2  0.54  B   1   6.635
3  3  1.04  B   1  11.844
4  4  1.50  A   1   2.786
5  5  0.04  B   2   2.389
6  6  2.00  A   2  -0.011
7  7  2.04  B   2   8.839
8  8  1.54  B   1   4.689
9  9  1.00  A   2  -5.507

Create a UnifiedRegression instance for a linear regression problem:

>>> mlr_params = dict(solver='qr',
...                   adjusted_r2=False,
...                   thread_ratio=0.5)
>>> umlr = UnifiedRegression(func='LinearRegression', **mlr_params)

Fit the UnifiedRegression instance with the aforementioned training data:

>>> par_params = dict(partition_method='random',
...                   training_percent=0.7,
...                   partition_random_state=2,
...                   output_partition_result=True)
>>> umlr.fit(data=data_tbl,
...          key='ID',
...          label='Y',
...          **par_params)

Check the resulting statistics on testing data:

>>> umlr.statistics_.collect()
        STAT_NAME          STAT_VALUE
0       TEST_EVAR   0.871459247598903
1        TEST_MAE  2.0088082000000003
2       TEST_MAPE  12.260003987804756
3  TEST_MAX_ERROR   5.329849599999999
4        TEST_MSE   9.551661310681718
5         TEST_R2  0.7774293644548433
6       TEST_RMSE    3.09057621013974
7      TEST_WMAPE  0.7188006440839695

Data for prediction:

>>> data_pred.collect()
  ID       X1 X2  X3
0  0    1.690  B   1
1  1    0.054  B   2
2  2  980.123  A   2
3  3    1.000  A   1
4  4    0.563  A   1

Perform prediction:

>>> pred_res = umlr.predict(data=data_pred, key='ID')
>>> pred_res.collect()
   ID        SCORE UPPER_BOUND LOWER_BOUND REASON
0   0     8.719607        None        None   None
1   1     1.416343        None        None   None
2   2  3318.371440        None        None   None
3   3    -2.050390        None        None   None
4   4    -3.533135        None        None   None

Data for scoring:

>>> data_score.collect()
   ID       X1 X2  X3    Y
0   0    1.690  B   1  1.2
1   1    0.054  B   2  2.1
2   2  980.123  A   2  2.4
3   3    1.000  A   1  1.8
4   4    0.563  A   1  1.0

Perform scoring:

>>> score_res = umlr.score(data = data_score, key = "ID", label = 'Y')

Check the statistics on scoring data:

>>> score_res[1].collect()
   STAT_NAME         STAT_VALUE
0       EVAR  -6284768.906191169
1        MAE   666.5116459919999
2       MAPE   278.9837795885635
3  MAX_ERROR  3315.9714402299996
4        MSE   2199151.795823181
5         R2   -7854112.55651136
6       RMSE  1482.9537402842952
7      WMAPE   392.0656741129411
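The statistics above can be cross-checked by hand. The sketch below uses the conventional definitions of these metrics (an assumption: PAL's exact formulas are not stated here) and applies them to the SCORE and Y columns shown above. With these definitions the results agree with the table up to the rounding of the displayed predictions; note that MAPE and WMAPE come out as ratios, not percentages.

```python
import math

y_true = [1.2, 2.1, 2.4, 1.8, 1.0]                                # Y column of data_score
y_pred = [8.719607, 1.416343, 3318.371440, -2.050390, -3.533135]  # SCORE column of pred_res


def regression_stats(y_true, y_pred):
    """Conventional regression metrics, keyed by the STAT_NAME values above."""
    n = len(y_true)
    err = [p - t for t, p in zip(y_true, y_pred)]
    abs_err = [abs(e) for e in err]
    mean_y = sum(y_true) / n
    mean_e = sum(err) / n
    sse = sum(e * e for e in err)                   # sum of squared errors
    sst = sum((t - mean_y) ** 2 for t in y_true)    # total sum of squares
    sse_centered = sum((e - mean_e) ** 2 for e in err)
    mse = sse / n
    return {
        'EVAR': 1.0 - sse_centered / sst,
        'MAE': sum(abs_err) / n,
        'MAPE': sum(a / abs(t) for a, t in zip(abs_err, y_true)) / n,
        'MAX_ERROR': max(abs_err),
        'MSE': mse,
        'R2': 1.0 - sse / sst,
        'RMSE': math.sqrt(mse),
        'WMAPE': sum(abs_err) / sum(abs(t) for t in y_true),
    }


stats = regression_stats(y_true, y_pred)
for name, value in stats.items():
    print(f"{name:<10} {value:.6f}")
```

The huge negative R2 and EVAR are driven almost entirely by the row with X1 = 980.123, which lies far outside the range of the training data.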

Case 2: UnifiedReport for UnifiedRegression is shown as follows:

>>> hgr = UnifiedRegression(func='HybridGradientBoostingTree')
>>> gscv = GridSearchCV(estimator=hgr,
...                     param_grid={'learning_rate': [0.1, 0.4, 0.7, 1],
...                                 'n_estimators': [4, 6, 8, 10],
...                                 'split_threshold': [0.1, 0.4, 0.7, 1]},
...                     train_control=dict(fold_num=5,
...                                        resampling_method='cv',
...                                        random_state=1),
...                     scoring='rmse')
>>> gscv.fit(data=diabetes_train, key='ID',
...          label='CLASS',
...          partition_method='random',
...          partition_random_state=1,
...          build_report=True)

To see the model report:

>>> UnifiedReport(gscv.estimator).display()
[Figure: unified report, model report for regression]

Case 3: Local interpretability of models - linear SHAP

>>> umlr = UnifiedRegression(func='LinearRegression')
>>> umlr.fit(data=df_train, background_size=4)  # specify a positive background data size to activate local interpretability
>>> res = umlr.predict(data=df_predict,
...                    ...,
...                    top_k_attributions=5,
...                    sample_size=0,
...                    random_state=2022,
...                    ignore_correlation=False)  # consider correlations between features, only for linear SHAP

Case 4: Local interpretability of models - tree SHAP for tree model

>>> udtr = UnifiedRegression(func='DecisionTree')
>>> udtr.fit(data=df_train)
>>> res = udtr.predict(data=df_predict,
...                    ...,
...                    top_k_attributions=8,
...                    attribution_method='tree-shap',  # specify the attribution method to activate local interpretability
...                    random_state=2022)

Case 5: Local interpretability of models - kernel SHAP for non-linear/non-tree models

>>> usvr = UnifiedRegression(func='SVM')  # SVM model
>>> usvr.fit(data=df_train, background_size=8)  # specify a positive background data size to activate local interpretability
>>> res = usvr.predict(data=df_predict,
...                    ...,
...                    top_k_attributions=6,
...                    sample_size=6,
...                    random_state=2022)
Attributes
model_: DataFrame

Model content.

statistics_: DataFrame

Names and values of statistics.

optimal_param_: DataFrame

Provides optimal parameters selected.

Available only when parameter selection is triggered.

partition_: DataFrame

Partition result of training data.

Available only when training data has an ID column and random partition is applied.

error_msg_: DataFrame

Error message. Available only when massive is True.

Methods

build_report()

Build model report.

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

disable_mlflow_autologging()

It will disable mlflow autologging.

enable_mlflow_autologging([schema, meta, ...])

It will enable mlflow autologging.

fit(data[, key, features, label, purpose, ...])

Fit function for unified regression.

generate_html_report(filename)

Save model report as an HTML file.

generate_notebook_iframe_report()

Render model report as a notebook iframe.

get_feature_importances()

Return the feature importances.

get_optimal_parameters()

Return the optimal parameters.

get_performance_metrics()

Return the performance metrics.

predict(data[, key, features, model, ...])

Predict with the regression model.

score(data[, key, features, label, model, ...])

Users can use the score function to evaluate the model quality.

set_framework_version(framework_version)

Switch v1/v2 version of report.

set_model_state(state)

Set the model state by state information.

update_cv_params(name, value, typ)

Update parameters for model-evaluation/parameter-selection.

set_shapley_explainer_of_predict_phase

set_shapley_explainer_of_score_phase

disable_mlflow_autologging()

It will disable mlflow autologging.

enable_mlflow_autologging(schema=None, meta=None, is_exported=False, registered_model_name=None)

It will enable mlflow autologging.

Parameters
schema: str, optional

Defines the model storage schema for mlflow autologging.

Defaults to the current schema.

meta: str, optional

Defines the model storage meta table for mlflow autologging.

Defaults to 'HANAML_MLFLOW_MODEL_STORAGE'.

is_exported: bool, optional

Determines whether to export the HANA model to mlflow.

Defaults to False.

registered_model_name: str, optional

MLflow registered_model_name.

update_cv_params(name, value, typ)

Update parameters for model-evaluation/parameter-selection.

fit(data, key=None, features=None, label=None, purpose=None, partition_method=None, partition_random_state=None, training_percent=None, output_partition_result=None, categorical_variable=None, background_size=None, background_random_state=None, build_report=False, impute=False, strategy=None, strategy_by_col=None, als_factors=None, als_lambda=None, als_maxit=None, als_randomstate=None, als_exit_threshold=None, als_exit_interval=None, als_linsolver=None, als_cg_maxit=None, als_centering=None, als_scaling=None, group_key=None, group_params=None, output_coefcov=None)

Fit function for unified regression.

Parameters
data: DataFrame

Training data.

key: str, optional

Name of the ID column.

In single mode, if key is not provided and data is indexed by a single column, key defaults to that index column; otherwise, data is assumed to contain no ID column.

In massive mode, key defaults to the first non-group-key column of data if data has no index columns. Otherwise, it defaults to the second index column of data, the first index column being group_key.

features: list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key, non-label columns.

label: str or list of str, optional

Name of the dependent variable.

Should be a list of two strings for GLM models with family being 'binomial'.

If label is not provided, it defaults to:

  • the first non-key column of data, when the func parameter of the initialization function takes one of the following values: 'GeometricRegression', 'PolynomialRegression', 'LinearRegression', 'ExponentialRegression', 'GLM' (except when family is 'binomial')

  • the first two non-key columns of data, when the func parameter of the initialization function takes the value 'GLM' and family is specified as 'binomial'.

purpose: str, optional

Indicates the name of the purpose column, which is used for predefined data partition.

The meaning of the value in the column for each data instance is shown below:

  • 1 : training

  • 2 : testing

Mandatory and valid only when partition_method is 'predefined'.

partition_method: {'no', 'predefined', 'random'}, optional

Defines the way to divide the dataset.

  • 'no' : no partition.

  • 'predefined' : predefined partition.

  • 'random' : random partition.

Defaults to 'no'.

partition_random_state: int, optional

Indicates the seed used to initialize the random number generator for data partition.

Valid only when partition_method is set to 'random'.

  • 0 : Uses the system time.

  • Not 0 : Uses the specified seed.

Defaults to 0.

training_percent: float, optional

The percentage of data used for training.

Value range: 0 <= value <= 1.

Defaults to 0.8.

output_partition_result: bool, optional

Specifies whether or not to output the partition result of data in data partition table.

Valid only when key is provided and partition_method is set to 'random'.

Defaults to False.
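The interplay of partition_method='random', training_percent and partition_random_state can be pictured with a small stand-alone sketch. It is an illustration only: `random_partition` is a hypothetical helper, PAL performs the partition inside the database, and its random number generator is not Python's, so the actual row assignment will differ. The rounding of the training set size is also an assumption.

```python
import random
import time


def random_partition(ids, training_percent=0.8, partition_random_state=0):
    """Split `ids` into (training, testing) lists; illustrative only."""
    if not 0 <= training_percent <= 1:
        raise ValueError("training_percent must be within [0, 1]")
    # 0 means "seed from the system time"; any other value is used as given.
    seed = partition_random_state or time.time_ns()
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(training_percent * len(shuffled))  # rounding is an assumption
    return sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


train_ids, test_ids = random_partition(range(10),
                                       training_percent=0.7,
                                       partition_random_state=2)
print("training:", train_ids)
print("testing: ", test_ids)
```

A non-zero seed makes the split reproducible, which is why the example for Case 1 fixes partition_random_state.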

categorical_variable: str or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

No default value.

background_size: int, optional

Specifies the size of background data used for Shapley Additive Explanations (SHAP) values calculation.

Should not be larger than the size of training data.

Valid only for Exponential Regression, Generalized Linear Models (GLM), Linear Regression, Multi-layer Perceptron and Support Vector Regression.

Defaults to 0 (no background data, in which case the calculation of SHAP values is disabled).

background_random_state: int, optional

Specifies the seed for random number generator in the background data sampling.

  • 0 : Uses current time as seed

  • Others : The specified seed value

Valid only for Exponential Regression, Generalized Linear Models (GLM), Linear Regression, Multi-layer Perceptron and Support Vector Regression (SVR).

Defaults to 0.

build_report: bool, optional

Whether to build report or not.

Defaults to False.

impute: bool, optional

Specifies whether or not to impute missing values in the training data.

Defaults to False.

strategy: {'non', 'most_frequent-mean', 'most_frequent-median', 'most_frequent-zero', 'most_frequent-als', 'delete'}, optional

Specifies the overall imputation strategy for the input training data.

  • 'non' : No imputation for all columns.

  • 'most_frequent-mean' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its mean.

  • 'most_frequent-median' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its median.

  • 'most_frequent-zero' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in all numerical columns by zeros.

  • 'most_frequent-als' : Replacing missing values in any categorical column by its most frequently observed value, and filling the missing values in all numerical columns via a matrix completion technique called alternating least squares.

  • 'delete' : Delete all rows with missing values.

Valid only when impute is True.

Defaults to 'most_frequent-mean'.

strategy_by_col: ListOfTuples, optional

Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation.

Each tuple in the list should contain at least two elements, such that:

  • the 1st element is the name of a column;

  • the 2nd element is the imputation strategy of that column, valid strategies include: 'non', 'delete', 'most_frequent', 'categorical_const', 'mean', 'median', 'numerical_const', 'als'.

  • If the imputation strategy is 'categorical_const' or 'numerical_const', then a 3rd element must be included in the tuple, which specifies the constant value to be used to substitute the detected missing values in the column.

An example for illustration:

[('V1', 'categorical_const', '0'), ('V5','median')]

Valid only when impute is True.

No default value.
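The tuple layout described for strategy_by_col can be checked locally before calling fit(). The sketch below is a hypothetical validator, not a hana_ml function; the set of strategies is copied from the description above.

```python
COLUMN_STRATEGIES = {'non', 'delete', 'most_frequent', 'categorical_const',
                     'mean', 'median', 'numerical_const', 'als'}
CONST_STRATEGIES = {'categorical_const', 'numerical_const'}


def check_strategy_by_col(strategy_by_col):
    """Raise ValueError if an entry violates the documented tuple layout."""
    for entry in strategy_by_col:
        if len(entry) < 2:
            raise ValueError(f"{entry!r}: expected (column, strategy[, constant])")
        column, strategy = entry[0], entry[1]
        if strategy not in COLUMN_STRATEGIES:
            raise ValueError(f"{column}: unknown strategy {strategy!r}")
        if strategy in CONST_STRATEGIES and len(entry) < 3:
            raise ValueError(f"{column}: {strategy!r} needs a constant value")
    return True


# The illustration from the text passes the check.
print(check_strategy_by_col([('V1', 'categorical_const', '0'), ('V5', 'median')]))
```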

Note

The following parameters all have the prefix 'als_', and are invoked only when 'als' is selected as a valid imputation strategy in either strategy or strategy_by_col. These parameters set up the alternating-least-squares (ALS) model for data imputation.

als_factors: int, optional

Length of factor vectors in the ALS model.

It should be less than the number of numerical columns, so that the imputation results would be meaningful.

Defaults to 3.

als_lambda: float, optional

L2 regularization applied to the factors in the ALS model.

Should be non-negative.

Defaults to 0.01.

als_maxit: int, optional

Maximum number of iterations for solving the ALS model.

Defaults to 20.

als_randomstate: int, optional

Specifies the seed of the random number generator used in the training of ALS model:

  • 0: Uses the current time as the seed,

  • Others: Uses the specified value as the seed.

Defaults to 0.

als_exit_threshold: float, optional

Specifies a value for stopping the training of the ALS model. If the improvement of the cost function of the ALS model is less than this value between consecutive checks, the training process exits.

0 means that the objective value is not checked while the algorithm runs, and training stops only when the maximum number of iterations has been reached.

Defaults to 0.

als_exit_interval: int, optional

Specifies the number of iterations between consecutive checks of the cost function for the ALS model, so that one can see whether the pre-specified als_exit_threshold is reached.

Defaults to 5.
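How als_maxit, als_exit_threshold and als_exit_interval interact can be sketched with a toy loop. This is one plausible reading of the descriptions above, not PAL's implementation: `als_iterations_run` and `toy_cost` are hypothetical, and the cost function stands in for the real ALS objective.

```python
def als_iterations_run(cost_at, als_maxit=20, als_exit_threshold=0.0,
                       als_exit_interval=5):
    """Return the iteration at which training would stop (illustrative)."""
    previous = None
    for iteration in range(1, als_maxit + 1):
        # A threshold of 0 disables the check, so the loop runs to als_maxit.
        if als_exit_threshold > 0 and iteration % als_exit_interval == 0:
            cost = cost_at(iteration)
            if previous is not None and previous - cost < als_exit_threshold:
                return iteration
            previous = cost
    return als_maxit


def toy_cost(iteration):
    """Stand-in for the ALS cost: decreases as 1 / iteration."""
    return 1.0 / iteration


print(als_iterations_run(toy_cost, als_exit_threshold=0.05))  # stops early
print(als_iterations_run(toy_cost, als_exit_threshold=0.0))   # runs to als_maxit
```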

als_linsolver: {'cholesky', 'cg'}, optional

Linear system solver for the ALS model.

  • 'cholesky' is usually much faster.

  • 'cg' is recommended when als_factors is large.

Defaults to 'cholesky'.

als_cg_maxit: int, optional

Specifies the maximum number of iterations for the cg algorithm.

Invoked only when 'cg' is the chosen linear system solver for ALS.

Defaults to 3.

als_centering: bool, optional

Whether to center the data by column before training the ALS model.

Defaults to True.

als_scaling: bool, optional

Whether to scale the data by column before training the ALS model.

Defaults to True.

group_key: str, optional

The column of group_key. Data type can be INT or NVARCHAR/VARCHAR. If the data type is INT, only parameters set in group_params are valid.

This parameter is only valid when massive is set as True in class instance initialization.

Defaults to the first column of data if data has no index columns. Otherwise, defaults to the first index column.

group_params: dict, optional

If massive mode is activated (massive is set as True in class instance initialization), the input data for regression is divided into groups, and different regression parameters can be applied to each group. This parameter specifies the parameter values of the chosen regression algorithm func for the different groups in dict format, where each key is a value of group_key and each value is a dict of regression algorithm parameter assignments.

An example is as follows:

Valid only when massive is set as True in class instance initialization.

Defaults to None.

output_coefcov: bool, optional

Specifies whether or not to output coefficient covariance information for Linear Regression.

Valid only if func is specified as 'LinearRegression' and json_export as True.

Defaults to False.

Note

To enable the output of confidence/prediction intervals for a Linear Regression model in UnifiedRegression during the predicting/scoring phase, output_coefcov must be set to True.

Returns
A fitted object.
get_optimal_parameters()

Return the optimal parameters.

get_performance_metrics()

Return the performance metrics.

get_feature_importances()

Return the feature importances.

predict(data, key=None, features=None, model=None, thread_ratio=None, prediction_type=None, significance_level=None, handle_missing=None, block_size=None, top_k_attributions=None, attribution_method=None, sample_size=None, random_state=None, ignore_correlation=None, impute=False, strategy=None, strategy_by_col=None, als_factors=None, als_lambda=None, als_maxit=None, als_randomstate=None, als_exit_threshold=None, als_exit_interval=None, als_linsolver=None, als_cg_maxit=None, als_centering=None, als_scaling=None, group_key=None, group_params=None, interval_type=None)

Predict with the regression model.

Parameters
data: DataFrame

Data to be predicted.

key: str, optional

Name of the ID column.

In single mode, mandatory if data is not indexed, or the index of data contains multiple columns. Defaults to the single index column of data if not provided.

In massive mode, key defaults to the first non-group-key column of data if data has no index columns. Otherwise, it defaults to the second index column of data, the first index column being group_key.

features: ListOfStrings, optional

Names of feature columns in data for prediction.

Defaults to all non-ID columns in data if not provided.

model: DataFrame, optional

Fitted regression model.

Defaults to self.model_.

thread_ratio: float, optional

Controls the proportion of available threads to use for prediction.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to PAL's default value.
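The thread_ratio rules above can be summarised in a few lines. `describe_thread_ratio` is a hypothetical helper for illustration; how PAL rounds fractional thread counts is not stated here, so the truncation below is an assumption.

```python
def describe_thread_ratio(thread_ratio, available_threads):
    """Return the thread count implied by thread_ratio, or None for 'heuristic'."""
    if not 0 <= thread_ratio <= 1:
        return None  # outside [0, 1]: PAL decides heuristically
    if thread_ratio == 0:
        return 1     # 0 indicates a single thread
    # Truncation is an assumption; at least one thread is always used.
    return max(1, int(thread_ratio * available_threads))


for ratio in (0, 0.5, 1, 1.5):
    print(ratio, describe_thread_ratio(ratio, available_threads=8))
```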

prediction_type: str, optional

Specifies the type of prediction. Valid options include:

  • 'response' : direct response (with link)

  • 'link' : linear response (without link)

Valid only for GLM models.

Defaults to 'response'.

significance_level: float, optional

Specifies significance level for the confidence interval and prediction interval.

Valid only for the following 2 cases:

  • GLM model with IRLS solver applied (i.e. func is specified as 'GLM' and solver as 'irls' during class instance initialization).

  • Linear Regression model with json model imported (i.e. func is specified as 'LinearRegression' and json_export as True during class instance initialization).

Defaults to 0.05.

handle_missing: str, optional

Specifies the way to handle missing values. Valid options include:

  • 'skip' : skip (i.e. remove) rows with missing values

  • 'fill_zero' : replace missing values with 0.

Valid only for GLM models.

Defaults to 'fill_zero'.

block_size: int, optional

Specifies the number of data records loaded at a time during prediction.

  • 0: load all data at once

  • Others: the specified number

This parameter is for reducing memory consumption, especially when the data to predict is huge or contains a large number of missing independent variables. However, some efficiency might be lost.

Valid only for RandomDecisionTree (RDT) models.

Defaults to 0.

top_k_attributions: int, optional

Specifies the number of features with highest attributions to output.

Defaults to 10.

attribution_method: {'no', 'saabas', 'tree-shap'}, optional

Specifies which method to use for model reasoning.

  • 'no' : No reasoning

  • 'saabas' : Saabas method

  • 'tree-shap' : Tree SHAP method

Valid only for tree-based models, i.e. DecisionTree, RandomDecisionTree and HybridGradientBoostingTree models.

Defaults to 'tree-shap'.

sample_size: int, optional

Specifies the number of sampled combinations of features.

  • 0 : Heuristically determined by algorithm

  • Others : The specified sample size

Valid only for Exponential Regression, GLM, Linear Regression, MLP and Support Vector Regression.

Defaults to 0.

random_state: int, optional

Specifies the seed for random number generator when sampling the combination of features.

  • 0 : Uses the current time as seed

  • Others : The actual seed

Valid only for Exponential Regression, GLM, Linear Regression, MLP and Support Vector Regression.

Defaults to 0.

ignore_correlation: bool, optional

Specifies whether or not to ignore the correlation between the features.

Valid only for Exponential Regression, GLM and Linear Regression that adopt linear SHAP for local interpretability of models.

Defaults to False.
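Taken together, the validity notes for background_size, attribution_method and ignore_correlation suggest which explanation method applies to which model. The sketch below encodes that reading; `local_explainer` is a hypothetical helper, the returned labels 'linear-shap' and 'kernel-shap' are descriptive names rather than PAL parameter values, and treating MLP like SVM (kernel SHAP) is an assumption based on the "non-linear/non-tree" wording of Case 5.

```python
TREE_MODELS = {'decisiontree', 'randomdecisiontree', 'hybridgradientboostingtree'}
LINEAR_SHAP_MODELS = {'exponentialregression', 'glm', 'linearregression'}
KERNEL_SHAP_MODELS = {'mlp', 'svm'}  # MLP placement is an assumption


def local_explainer(func, background_size=0, attribution_method='tree-shap'):
    """Return a label for the local-interpretability method, or None if inactive."""
    name = func.lower()
    if name in TREE_MODELS:
        return None if attribution_method == 'no' else attribution_method
    if background_size <= 0:
        return None  # no background data: SHAP values are not calculated
    if name in LINEAR_SHAP_MODELS:
        return 'linear-shap'
    if name in KERNEL_SHAP_MODELS:
        return 'kernel-shap'
    return None


print(local_explainer('LinearRegression', background_size=4))
print(local_explainer('DecisionTree'))
print(local_explainer('SVM', background_size=8))
```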

impute: bool, optional

Specifies whether or not to impute missing values in data.

Defaults to False.

strategy: {'non', 'most_frequent-mean', 'most_frequent-median', 'most_frequent-zero', 'most_frequent-als', 'delete'}, optional

Specifies the overall imputation strategy for the input data.

  • 'non' : No imputation for all columns.

  • 'most_frequent-mean' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its mean.

  • 'most_frequent-median' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its median.

  • 'most_frequent-zero' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in all numerical columns by zeros.

  • 'most_frequent-als' : Replacing missing values in any categorical column by its most frequently observed value, and filling the missing values in all numerical columns via a matrix completion technique called alternating least squares.

  • 'delete' : Delete all rows with missing values.

Valid only when impute is True.

Defaults to 'most_frequent-mean'.

strategy_by_col: ListOfTuples, optional

Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation.

Each tuple in the list should contain at least two elements, such that:

  • the 1st element is the name of a column;

  • the 2nd element is the imputation strategy of that column, valid strategies include: 'non', 'delete', 'most_frequent', 'categorical_const', 'mean', 'median', 'numerical_const', 'als'.

  • If the imputation strategy is 'categorical_const' or 'numerical_const', then a 3rd element must be included in the tuple, which specifies the constant value to be used to substitute the detected missing values in the column.

An example for illustration:

[('V1', 'categorical_const', '0'), ('V5','median')]

Valid only when impute is True.

No default value.

Note

The following parameters all have the prefix 'als_', and are invoked only when 'als' is selected as a valid imputation strategy in either strategy or strategy_by_col. These parameters set up the alternating-least-squares (ALS) model for data imputation.

als_factors: int, optional

Length of factor vectors in the ALS model.

It should be less than the number of numerical columns, so that the imputation results would be meaningful.

Defaults to 3.

als_lambda: float, optional

L2 regularization applied to the factors in the ALS model.

Should be non-negative.

Defaults to 0.01.

als_maxit: int, optional

Maximum number of iterations for solving the ALS model.

Defaults to 20.

als_randomstate: int, optional

Specifies the seed of the random number generator used in the training of ALS model:

  • 0: Uses the current time as the seed,

  • Others: Uses the specified value as the seed.

Defaults to 0.

als_exit_threshold: float, optional

Specifies a value for stopping the training of the ALS model. If the improvement of the cost function of the ALS model is less than this value between consecutive checks, the training process exits.

0 means that the objective value is not checked while the algorithm runs, and training stops only when the maximum number of iterations has been reached.

Defaults to 0.

als_exit_interval: int, optional

Specifies the number of iterations between consecutive checks of the cost function for the ALS model, so that one can see whether the pre-specified als_exit_threshold is reached.

Defaults to 5.

als_linsolver: {'cholesky', 'cg'}, optional

Linear system solver for the ALS model.

  • 'cholesky' is usually much faster.

  • 'cg' is recommended when als_factors is large.

Defaults to 'cholesky'.

als_cg_maxit: int, optional

Specifies the maximum number of iterations for the cg algorithm.

Invoked only when 'cg' is the chosen linear system solver for ALS.

Defaults to 3.

als_centering: bool, optional

Whether to center the data by column before training the ALS model.

Defaults to True.

als_scaling: bool, optional

Whether to scale the data by column before training the ALS model.

Defaults to True.

group_key: str, optional

The column of group_key. Data type can be INT or NVARCHAR/VARCHAR. If the data type is INT, only parameters set in group_params are valid.

This parameter is only valid when massive is set as True in class instance initialization.

Defaults to the first column of data if data has no index columns. Otherwise, defaults to the first index column.

group_params: dict, optional

If massive mode is activated (massive is set as True in class instance initialization), the input data for regression is divided into groups, and different regression parameters can be applied to each group. This parameter specifies the parameter values of the chosen regression algorithm func for the different groups in dict format, where each key is a value of group_key and each value is a dict of regression algorithm parameter assignments.

An example is as follows:

Valid only when massive is set as True in class instance initialization.

Defaults to None.

interval_type: {'no', 'confidence', 'prediction'}, optional

Specifies the type of interval to output:

  • 'no': do not calculate and output any interval

  • 'confidence': calculate and output the confidence interval

  • 'prediction': calculate and output the prediction interval

Valid only for either of the following 2 cases:

  • GLM model with IRLS solver applied (i.e. func is specified as 'GLM' and solver as 'irls' during class instance initialization).

  • Linear Regression model with json model imported and coefficient covariance information computed (i.e. func is specified as 'LinearRegression', json_export specified as True during class instance initialization, and output_coefcov specified as True during the training phase).

Defaults to 'no'.
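The two cases in which interval_type takes effect can be written as a small predicate. `interval_supported` is a hypothetical helper, not part of hana_ml; it only restates the conditions above and expects solver to be passed explicitly (it makes no assumption about the default GLM solver).

```python
def interval_supported(func, solver=None, json_export=False, output_coefcov=False):
    """Return True if confidence/prediction intervals can be requested."""
    name = func.lower()
    if name == 'glm':
        # Case 1: GLM model with the IRLS solver (solver must be given explicitly here).
        return solver == 'irls'
    if name == 'linearregression':
        # Case 2: JSON model plus coefficient covariance computed in fit().
        return bool(json_export and output_coefcov)
    return False


print(interval_supported('LinearRegression', json_export=True, output_coefcov=True))
print(interval_supported('LinearRegression', json_export=True))
print(interval_supported('GLM', solver='irls'))
```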

Returns
DataFrame

A collection of DataFrames listed as follows:

  • Prediction result by ignoring the true labels of the input data, structured the same as the result table of predict() function.

  • Error messages, if massive is True.

Examples

Example 1 - Linear Regression predict with confidence interval:

>>> bsh_train = conn.table('BOSTON_HOUSING_TRAIN_DATA')
>>> bsh_test = conn.table('BOSTON_HOUSING_TEST_DATA')
>>> ulr = UnifiedRegression(func='LinearRegression',
...                         json_export=True)#prediction/confidence interval only available for json model
>>> ulr.fit(data=bsh_train,
...         key='ID',
...         label='MEDV',
...         output_coefcov=True)#Set as True to output coefficient interval
>>> ulr.predict(data=bsh_test.deselect('MEDV'),
...             key='ID',
...             significance_level=0.05,
...             interval_type='confidence')#Specifies the interval type as confidence

Example 2 - GLM model predict of response with prediction interval:

>>> bsh_train = conn.table('BOSTON_HOUSING_TRAIN_DATA')
>>> bsh_test = conn.table('BOSTON_HOUSING_TEST_DATA')
>>> uglm = UnifiedRegression(func='GLM', family='gaussian', link='identity')
>>> uglm.fit(data=bsh_train, key='ID', label='MEDV')
>>> uglm.predict(data=bsh_test.deselect('MEDV'),
...              key='ID',
...              significance_level=0.05,
...              prediction_type='response',#set to 'response' for direct response
...              interval_type='prediction')#Specifies the interval type as prediction
score(data, key=None, features=None, label=None, model=None, prediction_type=None, significance_level=None, handle_missing=None, thread_ratio=None, block_size=None, top_k_attributions=None, attribution_method=None, sample_size=None, random_state=None, ignore_correlation=None, impute=False, strategy=None, strategy_by_col=None, als_factors=None, als_lambda=None, als_maxit=None, als_randomstate=None, als_exit_threshold=None, als_exit_interval=None, als_linsolver=None, als_cg_maxit=None, als_centering=None, als_scaling=None, group_key=None, group_params=None, interval_type=None)

Users can use the score function to evaluate model quality. Unified regression provides statistics and metrics that describe the quality of the model.
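As a rough illustration of what regression quality metrics measure, the sketch below computes two familiar ones in plain Python. The exact statistics returned by PAL are not listed in this section, so RMSE and R2 are used here only as well-known examples, and the function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true labels and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical labels and predictions.
y_true = [3.0, 5.0, 7.0]
y_pred = [5.0, 5.0, 5.0]
print(rmse(y_true, y_pred), r2(y_true, y_pred))
```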

Parameters
dataDataFrame

Data for scoring.

keystr, optional

Name of the ID column.

In single mode, mandatory if data is not indexed, or the index of data contains multiple columns. Defaults to the single index column of data if not provided.

In massive mode, defaults to the first non-group-key column of data if data has no index columns. Otherwise, defaults to the second index column of data, with the first index column being group_key.

featuresListOfString or str, optional

Names of feature columns.

Defaults to all non-ID, non-label columns if not provided.

labelstr, optional

Name of the label column.

Defaults to the last non-ID column if not provided.

modelDataFrame

Fitted regression model.

Defaults to self.model_.

thread_ratiofloat, optional

Controls the proportion of available threads to use for prediction.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to the PAL's default value.
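The rule above can be sketched as a small function. This is only an illustration of the described mapping, not PAL code: the function name is hypothetical, and rounding down with a floor of one thread is an assumption.

```python
def threads_to_use(thread_ratio, available_threads):
    """Map thread_ratio to a thread count; None means PAL decides heuristically."""
    if not 0 <= thread_ratio <= 1:
        return None  # outside [0, 1]: left to PAL's heuristic
    if thread_ratio == 0:
        return 1     # 0 indicates a single thread
    # Rounding down with a floor of one thread is an assumption of this sketch.
    return max(1, int(thread_ratio * available_threads))


for ratio in (0, 0.5, 1, -1):
    print(ratio, threads_to_use(ratio, available_threads=8))
```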

prediction_typestr, optional

Specifies the type of prediction. Valid options include:

  • 'response' : direct response (with link)

  • 'link' : linear response (without link)

Valid only for GLM models.

Defaults to 'response'.
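The two prediction types can be illustrated with a GLM that uses a log link: 'link' returns the linear predictor, and 'response' applies the inverse link to it. The sketch below is a plain-Python illustration with hypothetical coefficients, and the choice of a log link is an assumption made for the example.

```python
import math


def glm_predict(x, intercept, slope, prediction_type="response"):
    """Prediction of a one-feature GLM with a log link (assumed for this sketch)."""
    eta = intercept + slope * x  # linear predictor
    if prediction_type == "link":
        return eta
    if prediction_type == "response":
        return math.exp(eta)     # inverse of the log link
    raise ValueError("prediction_type must be 'response' or 'link'")


print(glm_predict(2.0, intercept=0.5, slope=0.25, prediction_type="link"))
print(glm_predict(2.0, intercept=0.5, slope=0.25, prediction_type="response"))
```

With an identity link, as in the GLM examples of this class, the two prediction types coincide.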

significance_levelfloat, optional

Specifies significance level for the confidence interval and prediction interval.

Valid only for the following 2 cases:

  • GLM model with IRLS solver applied (i.e. func is specified as 'GLM' and solver as 'irls' during class instance initialization).

  • Linear Regression model with json model imported (i.e. func is specified as 'LinearRegression' and json_export as True during class instance initialization).

Defaults to 0.05.

handle_missingstr, optional

Specifies the way to handle missing values. Valid options include:

  • 'skip' : skip rows with missing values

  • 'fill_zero' : replace missing values with 0.

Valid only for GLM models.

Defaults to 'fill_zero'.
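The effect of the two options can be sketched on a small list of rows, with None standing in for a missing value. The helper name and the data are hypothetical; the sketch only mirrors the behaviour described above.

```python
def handle_missing_rows(rows, handle_missing="fill_zero"):
    """Apply the 'skip' or 'fill_zero' policy to a list of row dicts."""
    if handle_missing == "skip":
        # Drop every row that contains at least one missing value.
        return [row for row in rows if None not in row.values()]
    if handle_missing == "fill_zero":
        # Keep all rows and replace missing values with 0.
        return [{k: (0 if v is None else v) for k, v in row.items()} for row in rows]
    raise ValueError("handle_missing must be 'skip' or 'fill_zero'")


rows = [{"ID": 1, "X1": 0.7}, {"ID": 2, "X1": None}]
print(handle_missing_rows(rows, "skip"))
print(handle_missing_rows(rows, "fill_zero"))
```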

block_sizeint, optional

Specifies the amount of data loaded at a time during scoring.

  • 0: load all data at once

  • Others: the specified number

This parameter is for reducing memory consumption, especially when the data to predict is huge or contains a large number of missing independent variables. However, it may cost some efficiency.

Valid only for RandomDecisionTree models.

Defaults to 0.
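The trade-off can be pictured as processing the data in blocks instead of all at once. The generator below is a hypothetical illustration of that idea in plain Python, not the way PAL loads data internally.

```python
def iter_blocks(rows, block_size=0):
    """Yield the rows in blocks; block_size 0 yields everything in one block."""
    if block_size < 0:
        raise ValueError("block_size must be non-negative")
    if block_size == 0:
        yield list(rows)
        return
    for start in range(0, len(rows), block_size):
        yield rows[start:start + block_size]


rows = list(range(5))
print(list(iter_blocks(rows, block_size=0)))
print(list(iter_blocks(rows, block_size=2)))
```

Smaller blocks keep less data in memory at any one time, at the price of more loading steps.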

top_k_attributionsint, optional

Specifies the number of features with highest attributions to output.

Defaults to 10.
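Conceptually, this parameter keeps only the k features that contribute most to a prediction. The sketch below assumes that "highest" means largest absolute attribution, which is not confirmed by this section; the function name, feature names and values are hypothetical.

```python
def top_k_features(attributions, k=10):
    """Return the k (feature, attribution) pairs with the largest absolute value."""
    ranked = sorted(attributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


# Hypothetical attributions for a single prediction.
attributions = {"RM": 2.5, "LSTAT": -3.1, "AGE": 0.2}
print(top_k_features(attributions, k=2))
```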

attribution_method{'no', 'saabas', 'tree-shap'}, optional

Specifies which method to use for model reasoning.

  • 'no' : No reasoning

  • 'saabas' : Saabas method

  • 'tree-shap' : Tree SHAP method

Valid only for tree-based models, i.e. DecisionTree, RandomDecisionTree and HybridGradientBoostingTree models.

Defaults to 'tree-shap'.

sample_sizeint, optional

Specifies the number of sampled combinations of features.

  • 0 : Heuristically determined by algorithm

  • Others : The specified sample size

Valid only for Exponential Regression, GLM, Linear Regression, MLP and Support Vector Regression.

Defaults to 0.

random_stateint, optional

Specifies the seed for random number generator when sampling the combination of features.

  • 0 : Use the current time as seed

  • Others : The actual seed

Valid only for Exponential Regression, GLM, Linear Regression, MLP and Support Vector Regression.

Defaults to 0.

ignore_correlationbool, optional

Specifies whether or not to ignore the correlation between the features.

Valid only for Exponential Regression, GLM and Linear Regression.

Defaults to False.

imputebool, optional

Specifies whether or not to impute missing values in data.

Defaults to False.

strategy{'non', 'most_frequent-mean', 'most_frequent-median', 'most_frequent-zero', 'most_frequent-als', 'delete'}, optional

Specifies the overall imputation strategy for data.

  • 'non' : No imputation for all columns.

  • 'most_frequent-mean' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its mean.

  • 'most_frequent-median' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its median.

  • 'most_frequent-zero' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in all numerical columns by zeros.

  • 'most_frequent-als' : Replacing missing values in any categorical column by its most frequently observed value, and filling the missing values in all numerical columns via a matrix completion technique called alternating least squares.

  • 'delete' : Delete all rows with missing values.

Valid only when impute is True.

Defaults to 'most_frequent-mean'.
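As an illustration of the default strategy, the following sketch imputes a single column the way 'most_frequent-mean' is described: the most frequent value for a categorical column, the mean for a numerical one. It is a simplified plain-Python stand-in, not PAL's implementation; the function name is hypothetical, and treating a column as numerical when all its values are int or float is an assumption.

```python
from collections import Counter
from statistics import mean


def impute_most_frequent_mean(column):
    """Fill None entries with the mean (numerical) or most frequent value (categorical)."""
    present = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)                          # numerical column
    else:
        fill = Counter(present).most_common(1)[0][0]  # categorical column
    return [fill if v is None else v for v in column]


print(impute_most_frequent_mean([1.0, None, 3.0]))
print(impute_most_frequent_mean(["a", "b", None, "a"]))
```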

strategy_by_colListOfTuples, optional

Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation.

Each tuple in the list should contain at least two elements, such that:

  • the 1st element is the name of a column;

  • the 2nd element is the imputation strategy of that column, valid strategies include: 'non', 'delete', 'most_frequent', 'categorical_const', 'mean', 'median', 'numerical_const', 'als'.

  • If the imputation strategy is 'categorical_const' or 'numerical_const', then a 3rd element must be included in the tuple, which specifies the constant value to be used to substitute the detected missing values in the column.

An example for illustration:

[('V1', 'categorical_const', '0'), ('V5','median')]

Valid only when impute is True.

No default value.
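The tuple rules above can be checked with a few lines of Python before the list is passed on. The helper check_strategy_by_col is hypothetical and not part of hana_ml; it only encodes the constraints stated in this entry.

```python
VALID_STRATEGIES = {
    "non", "delete", "most_frequent", "categorical_const",
    "mean", "median", "numerical_const", "als",
}
CONST_STRATEGIES = {"categorical_const", "numerical_const"}


def check_strategy_by_col(strategy_by_col):
    """Raise ValueError if any tuple violates the documented rules."""
    for entry in strategy_by_col:
        if len(entry) < 2:
            raise ValueError(f"{entry!r}: need a column name and a strategy")
        column, strategy = entry[0], entry[1]
        if strategy not in VALID_STRATEGIES:
            raise ValueError(f"{column}: unknown strategy {strategy!r}")
        if strategy in CONST_STRATEGIES and len(entry) < 3:
            raise ValueError(f"{column}: {strategy!r} needs a constant as 3rd element")
    return True


print(check_strategy_by_col([("V1", "categorical_const", "0"), ("V5", "median")]))
```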

Note

The following parameters all have the prefix 'als_', and are invoked only when 'als' is selected as a valid imputation strategy in either strategy or strategy_by_col. These parameters are for setting up the alternating least squares (ALS) model for data imputation.

als_factorsint, optional

Length of factor vectors in the ALS model.

It should be less than the number of numerical columns, so that the imputation results would be meaningful.

Defaults to 3.

als_lambdafloat, optional

L2 regularization applied to the factors in the ALS model.

Should be non-negative.

Defaults to 0.01.

als_maxitint, optional

Maximum number of iterations for solving the ALS model.

Defaults to 20.

als_randomstateint, optional

Specifies the seed of the random number generator used in the training of ALS model:

  • 0: Uses the current time as the seed,

  • Others: Uses the specified value as the seed.

Defaults to 0.

als_exit_thresholdfloat, optional

Specifies a value for stopping the training of the ALS model. If the improvement of the cost function of the ALS model is less than this value between consecutive checks, then the training process will exit.

0 means the objective value is not checked while the algorithm runs, so training stops only when the maximum number of iterations has been reached.

Defaults to 0.

als_exit_intervalint, optional

Specifies the number of iterations between consecutive checks of the cost function for the ALS model, which determine whether the pre-specified als_exit_threshold has been reached.

Defaults to 5.
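The interplay of als_maxit, als_exit_threshold and als_exit_interval can be sketched as a stopping rule over a sequence of cost values. The function and the cost sequence are hypothetical, and the precise check performed by PAL may differ; the sketch only follows the descriptions above.

```python
def iterations_until_exit(costs, als_maxit=20, als_exit_threshold=0.0, als_exit_interval=5):
    """Return the iteration at which training stops.

    costs[i] is the (hypothetical) cost after iteration i + 1.
    """
    last_checked_cost = None
    for iteration in range(1, als_maxit + 1):
        # A threshold of 0 disables checking, so training runs to als_maxit.
        if als_exit_threshold > 0 and iteration % als_exit_interval == 0:
            cost = costs[iteration - 1]
            improvement_too_small = (
                last_checked_cost is not None
                and last_checked_cost - cost < als_exit_threshold
            )
            if improvement_too_small:
                return iteration
            last_checked_cost = cost
    return als_maxit


# Hypothetical, slowly flattening cost curve.
costs = [1.0 / i for i in range(1, 21)]
print(iterations_until_exit(costs, als_exit_threshold=0.05))
print(iterations_until_exit(costs, als_exit_threshold=0.0))
```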

als_linsolver{'cholesky', 'cg'}, optional

Linear system solver for the ALS model.

  • 'cholesky' is usually much faster.

  • 'cg' is recommended when als_factors is large.

Defaults to 'cholesky'.

als_cg_maxitint, optional

Specifies the maximum number of iterations for the cg algorithm.

Invoked only when 'cg' is the chosen linear system solver for ALS.

Defaults to 3.

als_centeringbool, optional

Whether to center the data by column before training the ALS model.

Defaults to True.

als_scalingbool, optional

Whether to scale the data by column before training the ALS model.

Defaults to True.

group_keystr, optional

Name of the group key column. Data type can be INT or NVARCHAR/VARCHAR. If data type is INT, only parameters set in the group_params are valid.

This parameter is only valid when massive is set as True in class instance initialization.

Defaults to the first column of data if data has no index columns. Otherwise, defaults to the first index column.

group_paramsdict, optional

If massive mode is activated (massive is set as True in class instance initialization), the input data for regression is divided into groups, and different regression parameters can be applied to each group. This parameter specifies the parameter values of the chosen regression algorithm func for the different groups in a dict format, where each key is a value of group_key and each value is a dict of regression algorithm parameter value assignments.

An example is as follows:

Valid only when massive is set as True in class instance initialization.

Defaults to None.

interval_type{'no', 'confidence', 'prediction'}, optional

Specifies the type of interval to output:

  • 'no': do not calculate and output any interval

  • 'confidence': calculate and output the confidence interval

  • 'prediction': calculate and output the prediction interval

Valid only for either of the following 2 cases:

  • GLM model with IRLS solver applied (i.e. func is specified as 'GLM' and solver as 'irls' during class instance initialization).

  • Linear Regression model with json model imported and coefficient covariance information computed (i.e. func is specified as 'LinearRegression', json_export specified as True during class instance initialization, and output_coefcov specified as True during the training phase).

Returns
DataFrame

A collection of DataFrames listed as follows:

  • Prediction result by ignoring the true labels of the input data, structured the same as the result table of predict() function.

  • Statistics results

  • Error message, if massive is True.

Examples

Example 1 - Linear Regression scoring with prediction interval:

>>> bsh_train = conn.table('BOSTON_HOUSING_TRAIN_DATA')
>>> bsh_test = conn.table('BOSTON_HOUSING_TEST_DATA')
>>> ulr = UnifiedRegression(func='LinearRegression',
...                         json_export=True)#prediction/confidence interval only available for json model
>>> ulr.fit(data=bsh_train,
...         key='ID',
...         label='MEDV',
...         output_coefcov=True)#Set as True to output interval
>>> ulr.score(data=bsh_test,
...           key='ID',
...           label='MEDV',
...           significance_level=0.05,
...           interval_type='prediction')#Specifies the interval type as prediction

Example 2 - GLM model predict of linear response with confidence interval:

>>> bsh_train = conn.table('BOSTON_HOUSING_TRAIN_DATA')
>>> bsh_test = conn.table('BOSTON_HOUSING_TEST_DATA')
>>> uglm = UnifiedRegression(func='GLM', family='gaussian', link='identity')
>>> uglm.fit(data=bsh_train, key='ID', label='MEDV')
>>> uglm.predict(data=bsh_test.deselect('MEDV'),
...              key='ID',
...              significance_level=0.05,
...              prediction_type='link',#set as 'link' for linear response
...              interval_type='confidence')#Specifies the interval type as confidence
build_report()

Build model report.

create_model_state(model=None, function=None, pal_funcname='PAL_UNIFIED_REGRESSION', state_description=None, force=False)

Create PAL model state.

Parameters
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function name of the regression algorithm.

Valid options include:

  • 'SVM' : Support Vector Regression

  • 'MLP' : Multilayer Perceptron Regression

  • 'DT' : Decision Tree Regression

  • 'HGBT' : Hybrid Gradient Boosting Tree Regression

  • 'MLR' : Multiple Linear Regression

  • 'RDT' : Random Decision Trees Regression

Defaults to self.real_func.

Note

The default value may be invalid, in which case a ValueError is raised.

pal_funcnameint or str, optional

PAL function name.

Defaults to 'PAL_UNIFIED_REGRESSION'.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True, the existing state is deleted.

Defaults to False.

set_model_state(state)

Set the model state by state information.

Parameters
state: DataFrame or dict

If state is DataFrame, it has the following structure:

  • NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.

  • VALUE: VARCHAR(1000), the values according to NAME.

If state is dict, the keys must include STATE_ID, HINT, HOST and PORT.
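A small check like the following can catch an incomplete state dict before it is passed on. The helper missing_state_keys and all values shown are hypothetical; only the four required key names come from the description above.

```python
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [key for key in REQUIRED_STATE_KEYS if key not in state]


# Hypothetical placeholder values.
state = {"STATE_ID": "example-id", "HINT": "example-hint", "HOST": "example-host"}
print(missing_state_keys(state))
```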

delete_model_state(state=None)

Delete PAL model state.

Parameters
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

generate_html_report(filename)

Save model report as an HTML file.

Parameters
filenamestr

HTML file name.

generate_notebook_iframe_report()

Render model report as a notebook iframe.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_framework_version(framework_version)

Switch v1/v2 version of report.

Parameters
framework_version{'v2', 'v1'}, optional

'v2': uses the report builder framework. 'v1': uses a pure HTML template.

Defaults to 'v2'.

Inherited Methods from PALBase

Besides the methods mentioned above, the UnifiedRegression class also inherits methods from the PALBase class. Please refer to PAL Base for more details.