UnifiedRegression
- class hana_ml.algorithms.pal.unified_regression.UnifiedRegression(func, massive=False, group_params=None, **kwargs)
The Python wrapper for SAP HANA PAL unified-regression function.
Compared with the original regression interfaces, new features supported are listed below:
Easy switching between regression algorithms
Automatic dataset partitioning
A built-in model evaluation procedure
More metrics supported
- Parameters
- func : str
The name of a specified regression algorithm.
The following algorithms (case-insensitive) are supported:
'DecisionTree'
'HybridGradientBoostingTree'
'LinearRegression'
'RandomDecisionTree'
'MLP'
'SVM'
'GLM'
'GeometricRegression'
'PolynomialRegression'
'ExponentialRegression'
'LogarithmicRegression'
- massive : bool, optional
Specifies whether or not to use massive mode of unified regression.
For parameter setting in massive mode, you can use either group_params or the original parameters (or both). Original parameters apply to all groups; however, if you define any parameters for a specific group in group_params, none of the original parameter settings are applied to that group.
Defaults to False.
An example is as follows:
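The sketch below is illustrative only: the group name 'Group_1' and the chosen parameter values are assumptions, not taken from a particular dataset.
>>> umlr = UnifiedRegression(func='LinearRegression',
...                          massive=True,
...                          thread_ratio=0.5,
...                          group_params={'Group_1': {'solver': 'qr'}})
In this sketch, thread_ratio=0.5 applies to every group except 'Group_1', whose parameters come solely from group_params.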
- group_params : dict, optional
If massive mode is activated (massive is True), input data for regression is divided into different groups, with different regression parameters applied to each group. This parameter specifies the parameter values of the chosen regression algorithm func for the different groups in a dict, where each key corresponds to a group_key value and each value is a dict of regression algorithm parameter-value assignments for that group (see the example under massive above).
Valid only when massive is True. Defaults to None.
- **kwargs : keyword arguments
Arbitrary keyword arguments; please refer to the corresponding algorithm for the parameters' key-value pairs.
Note that some parameters are disabled in the regression algorithm:
'DecisionTree' :
DecisionTreeRegressor
Disabled parameters: output_rules
Parameters removed from initialization but can be specified in fit(): categorical_variable
'HybridGradientBoostingTree' :
HybridGradientBoostingRegressor
Disabled parameters: calculate_importance
Parameters removed from initialization but can be specified in fit(): categorical_variable
'LinearRegression' :
LinearRegression
Disabled parameters: pmml_export
Parameters removed from initialization but can be specified in fit(): categorical_variable
Parameters with changed meaning: json_export, where a False value now means 'Exports multiple linear regression model in PMML'.
'RandomDecisionTree' :
RDTRegressor
Disabled parameters: calculate_oob
Parameters removed from initialization but can be specified in fit(): categorical_variable
'MLP' :
MLPRegressor
Disabled parameters: functionality
Parameters removed from initialization but can be specified in fit(): categorical_variable
'SVM' :
SVR
Parameters removed from initialization but can be specified in fit(): categorical_variable
'GLM' :
GLM
Disabled parameters: output_fitted
Parameters removed from initialization but can be specified in fit(): categorical_variable
'GeometricRegression' :
BiVariateGeometricRegression
Disabled parameters: pmml_export
'PolynomialRegression' :
PolynomialRegression
Disabled parameters: pmml_export
'ExponentialRegression' :
ExponentialRegression
Disabled parameters: pmml_export
'LogarithmicRegression' :
BiVariateNaturalLogarithmicRegression
Disabled parameters: pmml_export
For more parameter mappings of hana_ml and HANA PAL, please refer to the doc page: Parameter Mappings.
Examples
Case 1: Training data for regression:
>>> data_tbl.collect()
   ID    X1 X2  X3       Y
0   0  0.00  A   1  -6.879
1   1  0.50  A   1  -3.449
2   2  0.54  B   1   6.635
3   3  1.04  B   1  11.844
4   4  1.50  A   1   2.786
5   5  0.04  B   2   2.389
6   6  2.00  A   2  -0.011
7   7  2.04  B   2   8.839
8   8  1.54  B   1   4.689
9   9  1.00  A   2  -5.507
Create an UnifiedRegression instance for linear regression problem:
>>> mlr_params = dict(solver = 'qr', adjusted_r2=False, thread_ratio=0.5)
>>> umlr = UnifiedRegression(func='LinearRegression', **mlr_params)
Fit the UnifiedRegression instance with the aforementioned training data:
>>> par_params = dict(partition_method='random', training_percent=0.7, partition_random_state=2, output_partition_result=True)
>>> umlr.fit(data = data_tbl, key = 'ID', label = 'Y', **par_params)
Check the resulting statistics on testing data:
>>> umlr.statistics_.collect()
        STAT_NAME           STAT_VALUE
0       TEST_EVAR    0.871459247598903
1        TEST_MAE   2.0088082000000003
2       TEST_MAPE   12.260003987804756
3  TEST_MAX_ERROR    5.329849599999999
4        TEST_MSE    9.551661310681718
5         TEST_R2   0.7774293644548433
6       TEST_RMSE     3.09057621013974
7      TEST_WMAPE   0.7188006440839695
Data for prediction:
>>> data_pred.collect()
   ID       X1 X2  X3
0   0    1.690  B   1
1   1    0.054  B   2
2   2  980.123  A   2
3   3    1.000  A   1
4   4    0.563  A   1
Perform prediction:
>>> pred_res = umlr.predict(data=data_pred, key='ID')
>>> pred_res.collect()
   ID        SCORE UPPER_BOUND LOWER_BOUND REASON
0   0     8.719607        None        None   None
1   1     1.416343        None        None   None
2   2  3318.371440        None        None   None
3   3    -2.050390        None        None   None
4   4    -3.533135        None        None   None
Data for scoring:
>>> data_score.collect()
   ID       X1 X2  X3    Y
0   0    1.690  B   1  1.2
1   1    0.054  B   2  2.1
2   2  980.123  A   2  2.4
3   3    1.000  A   1  1.8
4   4    0.563  A   1  1.0
Perform scoring:
>>> score_res = umlr.score(data = data_score, key = "ID", label = 'Y')
Check the statistics on scoring data:
>>> score_res[1].collect()
   STAT_NAME           STAT_VALUE
0       EVAR   -6284768.906191169
1        MAE    666.5116459919999
2       MAPE    278.9837795885635
3  MAX_ERROR   3315.9714402299996
4        MSE    2199151.795823181
5         R2    -7854112.55651136
6       RMSE   1482.9537402842952
7      WMAPE    392.0656741129411
Case 2: UnifiedReport for UnifiedRegression is shown as follows:
>>> hgr = UnifiedRegression(func='HybridGradientBoostingTree')
>>> gscv = GridSearchCV(estimator=hgr,
...                     param_grid={'learning_rate': [0.1, 0.4, 0.7, 1],
...                                 'n_estimators': [4, 6, 8, 10],
...                                 'split_threshold': [0.1, 0.4, 0.7, 1]},
...                     train_control=dict(fold_num=5,
...                                        resampling_method='cv',
...                                        random_state=1),
...                     scoring='rmse')
>>> gscv.fit(data=diabetes_train, key='ID', label='CLASS',
...          partition_method='random', partition_random_state=1,
...          build_report=True)
To see the model report:
>>> UnifiedReport(gscv.estimator).display()
Case 3: Local interpretability of models - linear SHAP
>>> umlr = UnifiedRegression(func='LinearRegression')
>>> umlr.fit(data=df_train, background_size=4)  # specify a positive background data size to activate local interpretability
>>> res = umlr.predict(data=df_predict,
...                    ...,
...                    top_k_attributions=5,
...                    sample_size=0,
...                    random_state=2022,
...                    ignore_correlation=False)  # consider correlations between features, only for linear SHAP
Case 4: Local interpretability of models - tree SHAP for tree model
>>> udtr = UnifiedRegression(func='DecisionTree')
>>> udtr.fit(data=df_train)
>>> res = udtr.predict(data=df_predict,
...                    ...,
...                    top_k_attributions=8,
...                    attribution_method='tree-shap',  # specify the attribution method to activate local interpretability
...                    random_state=2022)
Case 5: Local interpretability of models - kernel SHAP for non-linear/non-tree models
>>> usvr = UnifiedRegression(func='SVM')  # SVM model
>>> usvr.fit(data=df_train, background_size=8)  # specify a positive background data size to activate local interpretability
>>> res = usvr.predict(data=df_predict,
...                    ...,
...                    top_k_attributions=6,
...                    sample_size=6,
...                    random_state=2022)
- Attributes
- model_ : DataFrame
Model content.
- statistics_ : DataFrame
Names and values of statistics.
- optimal_param_ : DataFrame
Provides the optimal parameters selected.
Available only when parameter selection is triggered.
- partition_ : DataFrame
Partition result of the training data.
Available only when the training data has an ID column and random partition is applied.
- error_msg_ : DataFrame
Error message. Available only when massive is True.
Methods
build_report()
Build model report.
create_model_state([model, function, ...])
Create PAL model state.
delete_model_state([state])
Delete PAL model state.
disable_mlflow_autologging()
It will disable mlflow autologging.
enable_mlflow_autologging([schema, meta, ...])
It will enable mlflow autologging.
fit(data[, key, features, label, purpose, ...])
Fit function for unified regression.
generate_html_report(filename)
Save model report as an HTML file.
generate_notebook_iframe_report()
Render model report as a notebook iframe.
get_feature_importances()
Return the feature importances.
get_optimal_parameters()
Return the optimal parameters.
get_performance_metrics()
Return the performance metrics.
predict(data[, key, features, model, ...])
Predict with the regression model.
score(data[, key, features, label, model, ...])
Users can use the score function to evaluate the model quality.
set_framework_version(framework_version)
Switch v1/v2 version of report.
set_model_state(state)
Set the model state by state information.
update_cv_params(name, value, typ)
Update parameters for model-evaluation/parameter-selection.
set_shapley_explainer_of_predict_phase
set_shapley_explainer_of_score_phase
- disable_mlflow_autologging()
It will disable mlflow autologging.
- enable_mlflow_autologging(schema=None, meta=None, is_exported=False, registered_model_name=None)
It will enable mlflow autologging.
- Parameters
- schema : str, optional
Defines the model storage schema for mlflow autologging.
Defaults to the current schema.
- meta : str, optional
Defines the model storage meta table for mlflow autologging.
Defaults to 'HANAML_MLFLOW_MODEL_STORAGE'.
- is_exported : bool, optional
Determines whether to export the HANA model to mlflow.
Defaults to False.
- registered_model_name : str, optional
MLFlow registered_model_name.
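A minimal usage sketch; the schema name 'PAL_ML' is illustrative, and an already configured MLflow tracking setup is assumed:
>>> umlr = UnifiedRegression(func='HybridGradientBoostingTree')
>>> umlr.enable_mlflow_autologging(schema='PAL_ML', is_exported=True)
>>> umlr.fit(data=df_train, key='ID', label='Y')   # this fit run is autologged to mlflow
>>> umlr.disable_mlflow_autologging()              # turn autologging off again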
- update_cv_params(name, value, typ)
Update parameters for model-evaluation/parameter-selection.
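A hedged sketch of update_cv_params(); the parameter names follow those used for model-evaluation/parameter-selection elsewhere on this page (e.g. in the GridSearchCV example), and passing Python types as typ is an assumption:
>>> umlr.update_cv_params('resampling_method', 'cv', str)
>>> umlr.update_cv_params('fold_num', 5, int)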
- fit(data, key=None, features=None, label=None, purpose=None, partition_method=None, partition_random_state=None, training_percent=None, output_partition_result=None, categorical_variable=None, background_size=None, background_random_state=None, build_report=False, impute=False, strategy=None, strategy_by_col=None, als_factors=None, als_lambda=None, als_maxit=None, als_randomstate=None, als_exit_threshold=None, als_exit_interval=None, als_linsolver=None, als_cg_maxit=None, als_centering=None, als_scaling=None, group_key=None, group_params=None, output_coefcov=None)
Fit function for unified regression.
- Parameters
- data : DataFrame
Training data.
- key : str, optional
Name of the ID column.
In single mode, if key is not provided, then: if data is indexed by a single column, key defaults to that index column; otherwise data is assumed to contain no ID column.
In massive mode, defaults to the first non-group-key column of data if the index columns of data are not provided; otherwise, defaults to the second index column of data, with the first index column being used as group_key.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key, non-label columns.
- label : str or list of str, optional
Name of the dependent variable.
Should be a list of two strings for GLM models with family being 'binomial'.
If label is not provided, it defaults to:
the first non-key column of data, when the func parameter in the initialization function takes one of the following values: 'GeometricRegression', 'PolynomialRegression', 'LinearRegression', 'ExponentialRegression', 'GLM' (except when family is 'binomial');
the first two non-key columns of data, when the func parameter in the initialization function takes the value 'GLM' and family is specified as 'binomial'.
- purpose : str, optional
Indicates the name of the purpose column used for predefined data partition.
The meaning of the values in this column for each data instance is shown below:
1 : training
2 : testing
Mandatory and valid only when partition_method is 'predefined'.
- partition_method : {'no', 'predefined', 'random'}, optional
Defines the way to divide the dataset.
'no' : no partition.
'predefined' : predefined partition.
'random' : random partition.
Defaults to 'no'.
- partition_random_state : int, optional
Indicates the seed used to initialize the random number generator for data partition.
Valid only when partition_method is set to 'random'.
0 : Uses the system time.
Not 0 : Uses the specified seed.
Defaults to 0.
- training_percent : float, optional
The percentage of data used for training.
Value range: 0 <= value <= 1.
Defaults to 0.8.
- output_partition_result : bool, optional
Specifies whether or not to output the partition result of data in the data partition table.
Valid only when key is provided and partition_method is set to 'random'.
Defaults to False.
- categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical.
Other INTEGER columns will be treated as continuous.
No default value.
- background_size : int, optional
Specifies the size of the background data used for Shapley Additive Explanations (SHAP) value calculation.
Should be no larger than the size of the training data.
Valid only for Exponential Regression, Generalized Linear Models (GLM), Linear Regression, Multi-layer Perceptron and Support Vector Regression.
Defaults to 0 (no background data, in which case the calculation of SHAP values is disabled).
- background_random_state : int, optional
Specifies the seed for the random number generator in the background data sampling.
0 : Uses current time as seed
Others : The specified seed value
Valid only for Exponential Regression, Generalized Linear Models (GLM), Linear Regression, Multi-layer Perceptron and Support Vector Regression (SVR).
Defaults to 0.
- build_report : bool, optional
Whether to build a model report or not.
Defaults to False.
- impute : bool, optional
Specifies whether or not to impute missing values in the training data; see the imputation sketch below.
Defaults to False.
- strategy : {'non', 'most_frequent-mean', 'most_frequent-median', 'most_frequent-zero', 'most_frequent-als', 'delete'}, optional
Specifies the overall imputation strategy for the input training data.
'non' : No imputation for all columns.
'most_frequent-mean' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its mean.
'most_frequent-median' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its median.
'most_frequent-zero' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in all numerical columns by zeros.
'most_frequent-als' : Replacing missing values in any categorical column by its most frequently observed value, and filling the missing values in all numerical columns via a matrix completion technique called alternating least squares.
'delete' : Delete all rows with missing values.
Valid only when impute is True.
Defaults to 'most_frequent-mean'.
- strategy_by_col : ListOfTuples, optional
Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation.
Each tuple in the list should contain at least two elements, such that:
the 1st element is the name of a column;
the 2nd element is the imputation strategy of that column; valid strategies include: 'non', 'delete', 'most_frequent', 'categorical_const', 'mean', 'median', 'numerical_const', 'als'.
If the imputation strategy is 'categorical_const' or 'numerical_const', then a 3rd element must be included in the tuple, which specifies the constant value to be used to substitute the detected missing values in the column.
An example for illustration:
[('V1', 'categorical_const', '0'), ('V5', 'median')]
Valid only when impute is True.
No default value.
Note
The following parameters all have the prefix 'als_', and are invoked only when 'als' is selected as a valid imputation strategy in either strategy or strategy_by_col. These parameters set up the alternating-least-squares (ALS) model for data imputation.
- als_factors : int, optional
Length of factor vectors in the ALS model.
It should be less than the number of numerical columns, so that the imputation results are meaningful.
Defaults to 3.
- als_lambda : float, optional
L2 regularization applied to the factors in the ALS model.
Should be non-negative.
Defaults to 0.01.
- als_maxit : int, optional
Maximum number of iterations for solving the ALS model.
Defaults to 20.
- als_randomstate : int, optional
Specifies the seed of the random number generator used in the training of the ALS model:
0: Uses the current time as the seed,
Others: Uses the specified value as the seed.
Defaults to 0.
- als_exit_threshold : float, optional
Specifies a value for stopping the training of the ALS model. If the improvement of the cost function of the ALS model is less than this value between consecutive checks, the training process exits.
0 means the objective value is not checked while running the algorithm; it stops only when the maximum number of iterations has been reached.
Defaults to 0.
- als_exit_interval : int, optional
Specifies the number of iterations between consecutive checks of the cost function of the ALS model, so that one can see whether the pre-specified exit_threshold is reached.
Defaults to 5.
- als_linsolver : {'cholesky', 'cg'}, optional
Linear system solver for the ALS model.
'cholesky' is usually much faster.
'cg' is recommended when als_factors is large.
Defaults to 'cholesky'.
- als_cg_maxit : int, optional
Specifies the maximum number of iterations for the cg algorithm.
Invoked only when 'cg' is the chosen linear system solver for ALS.
Defaults to 3.
- als_centering : bool, optional
Whether to center the data by column before training the ALS model.
Defaults to True.
- als_scaling : bool, optional
Whether to scale the data by column before training the ALS model.
Defaults to True.
- group_key : str, optional
The column of group_key. The data type can be INT or NVARCHAR/VARCHAR. If the data type is INT, only parameters set in group_params are valid.
This parameter is only valid when massive is set to True in the class instance initialization.
Defaults to the first column of data if the index columns of data are not provided; otherwise, defaults to the first index column.
- group_params : dict, optional
If massive mode is activated (massive is set to True in the class instance initialization), input data for regression is divided into different groups, with different regression parameters applied to each group. This parameter specifies the parameter values of the chosen regression algorithm func for the different groups in a dict, where each key corresponds to a group_key value and each value is a dict of regression algorithm parameter-value assignments for that group (see the example under massive in the class parameters above).
Valid only when massive is set to True in the class instance initialization.
Defaults to None.
- output_coefcov : bool, optional
Specifies whether or not to output coefficient covariance information for Linear Regression.
Valid only if func is specified as 'LinearRegression' and json_export as True.
Defaults to False.
Note
To enable output of the confidence/prediction interval for a Linear Regression model in UnifiedRegression during the predict/score phase, output_coefcov must be set to True here.
- Returns
- A fitted object.
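For illustration, a minimal sketch of fit() with imputation enabled; the DataFrame df_train and the columns 'ID' and 'Y' are placeholders, and the columns 'V1' and 'V5' are reused from the strategy_by_col example above:
>>> umlr = UnifiedRegression(func='LinearRegression')
>>> umlr.fit(data=df_train, key='ID', label='Y',
...          impute=True,
...          strategy='most_frequent-als',
...          strategy_by_col=[('V1', 'categorical_const', '0'), ('V5', 'median')],
...          als_factors=2,
...          als_maxit=20)
Here the overall 'most_frequent-als' strategy governs all columns, while 'V1' and 'V5' are overridden with a constant and the median, respectively.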
- get_optimal_parameters()
Return the optimal parameters.
- get_performance_metrics()
Return the performance metrics.
- get_feature_importances()
Return the feature importances.
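Once the model is fitted, these getters provide quick access to the training results; a minimal sketch (the exact output formats are not shown here):
>>> umlr.get_performance_metrics()
>>> umlr.get_optimal_parameters()     # available only when parameter selection was triggered
>>> umlr.get_feature_importances()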
- predict(data, key=None, features=None, model=None, thread_ratio=None, prediction_type=None, significance_level=None, handle_missing=None, block_size=None, top_k_attributions=None, attribution_method=None, sample_size=None, random_state=None, ignore_correlation=None, impute=False, strategy=None, strategy_by_col=None, als_factors=None, als_lambda=None, als_maxit=None, als_randomstate=None, als_exit_threshold=None, als_exit_interval=None, als_linsolver=None, als_cg_maxit=None, als_centering=None, als_scaling=None, group_key=None, group_params=None, interval_type=None)
Predict with the regression model.
- Parameters
- data : DataFrame
Data to be predicted.
- key : str, optional
Name of the ID column.
In single mode, mandatory if data is not indexed, or if the index of data contains multiple columns; defaults to the single index column of data if not provided.
In massive mode, defaults to the first non-group-key column of data if the index columns of data are not provided; otherwise, defaults to the second index column of data, with the first index column being used as group_key.
- features : ListOfStrings, optional
Names of the feature columns in data for prediction.
Defaults to all non-ID columns in data if not provided.
- model : DataFrame, optional
Fitted regression model.
Defaults to self.model_.
- thread_ratio : float, optional
Controls the proportion of available threads to use for prediction.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to PAL's default value.
- prediction_type : str, optional
Specifies the type of prediction. Valid options include:
'response' : direct response (with link)
'link' : linear response (without link)
Valid only for GLM models.
Defaults to 'response'.
- significance_level : float, optional
Specifies the significance level for the confidence interval and prediction interval.
Valid only for the following 2 cases:
GLM model with the IRLS solver applied (i.e. func is specified as 'GLM' and solver as 'irls' during class instance initialization).
Linear Regression model with JSON model export enabled (i.e. func is specified as 'LinearRegression' and json_export as True during class instance initialization).
Defaults to 0.05.
- handle_missing : str, optional
Specifies the way to handle missing values. Valid options include:
'skip' : skip (i.e. remove) rows with missing values
'fill_zero' : replace missing values with 0.
Valid only for GLM models.
Defaults to 'fill_zero'.
- block_size : int, optional
Specifies the number of data rows loaded at a time during scoring.
0: load all data at once
Others: the specified number
This parameter is for reducing memory consumption, especially when the prediction data is huge or contains a large number of missing independent variables. However, some efficiency may be lost.
Valid only for RandomDecisionTree (RDT) models.
Defaults to 0.
- top_k_attributions : int, optional
Specifies the number of features with the highest attributions to output.
Defaults to 10.
- attribution_method : {'no', 'saabas', 'tree-shap'}, optional
Specifies which method to use for model reasoning.
'no' : No reasoning
'saabas' : Saabas method
'tree-shap' : Tree SHAP method
Valid only for tree-based models, i.e. DecisionTree, RandomDecisionTree and HybridGradientBoostingTree models.
Defaults to 'tree-shap'.
- sample_size : int, optional
Specifies the number of sampled combinations of features.
0 : Heuristically determined by the algorithm
Others : The specified sample size
Valid only for Exponential Regression, GLM, Linear Regression, MLP and Support Vector Regression.
Defaults to 0.
- random_state : int, optional
Specifies the seed for the random number generator when sampling the combinations of features.
0 : Uses current time as seed
Others : The specified seed
Valid only for Exponential Regression, GLM, Linear Regression, MLP and Support Vector Regression.
Defaults to 0.
- ignore_correlation : bool, optional
Specifies whether or not to ignore the correlation between the features.
Valid only for Exponential Regression, GLM and Linear Regression, which adopt linear SHAP for local interpretability of models.
Defaults to False.
- impute : bool, optional
Specifies whether or not to impute missing values in data.
Defaults to False.
- strategy : {'non', 'most_frequent-mean', 'most_frequent-median', 'most_frequent-zero', 'most_frequent-als', 'delete'}, optional
Specifies the overall imputation strategy for the input data.
'non' : No imputation for all columns.
'most_frequent-mean' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its mean.
'most_frequent-median' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its median.
'most_frequent-zero' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in all numerical columns by zeros.
'most_frequent-als' : Replacing missing values in any categorical column by its most frequently observed value, and filling the missing values in all numerical columns via a matrix completion technique called alternating least squares.
'delete' : Delete all rows with missing values.
Valid only when impute is True.
Defaults to 'most_frequent-mean'.
- strategy_by_col : ListOfTuples, optional
Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation.
Each tuple in the list should contain at least two elements, such that:
the 1st element is the name of a column;
the 2nd element is the imputation strategy of that column; valid strategies include: 'non', 'delete', 'most_frequent', 'categorical_const', 'mean', 'median', 'numerical_const', 'als'.
If the imputation strategy is 'categorical_const' or 'numerical_const', then a 3rd element must be included in the tuple, which specifies the constant value to be used to substitute the detected missing values in the column.
An example for illustration:
[('V1', 'categorical_const', '0'), ('V5', 'median')]
Valid only when impute is True.
No default value.
Note
The following parameters all have the prefix 'als_', and are invoked only when 'als' is selected as a valid imputation strategy in either strategy or strategy_by_col. These parameters set up the alternating-least-squares (ALS) model for data imputation.
- als_factors : int, optional
Length of factor vectors in the ALS model.
It should be less than the number of numerical columns, so that the imputation results are meaningful.
Defaults to 3.
- als_lambda : float, optional
L2 regularization applied to the factors in the ALS model.
Should be non-negative.
Defaults to 0.01.
- als_maxit : int, optional
Maximum number of iterations for solving the ALS model.
Defaults to 20.
- als_randomstate : int, optional
Specifies the seed of the random number generator used in the training of the ALS model:
0: Uses the current time as the seed,
Others: Uses the specified value as the seed.
Defaults to 0.
- als_exit_threshold : float, optional
Specifies a value for stopping the training of the ALS model. If the improvement of the cost function of the ALS model is less than this value between consecutive checks, the training process exits.
0 means the objective value is not checked while running the algorithm; it stops only when the maximum number of iterations has been reached.
Defaults to 0.
- als_exit_interval : int, optional
Specifies the number of iterations between consecutive checks of the cost function of the ALS model, so that one can see whether the pre-specified exit_threshold is reached.
Defaults to 5.
- als_linsolver : {'cholesky', 'cg'}, optional
Linear system solver for the ALS model.
'cholesky' is usually much faster.
'cg' is recommended when als_factors is large.
Defaults to 'cholesky'.
- als_cg_maxit : int, optional
Specifies the maximum number of iterations for the cg algorithm.
Invoked only when 'cg' is the chosen linear system solver for ALS.
Defaults to 3.
- als_centering : bool, optional
Whether to center the data by column before training the ALS model.
Defaults to True.
- als_scaling : bool, optional
Whether to scale the data by column before training the ALS model.
Defaults to True.
- group_key : str, optional
The column of group_key. The data type can be INT or NVARCHAR/VARCHAR. If the data type is INT, only parameters set in group_params are valid.
This parameter is only valid when massive is set to True in the class instance initialization.
Defaults to the first column of data if the index columns of data are not provided; otherwise, defaults to the first index column.
- group_params : dict, optional
If massive mode is activated (massive is set to True in the class instance initialization), input data for regression is divided into different groups, with different regression parameters applied to each group. This parameter specifies the parameter values of the chosen regression algorithm func for the different groups in a dict, where each key corresponds to a group_key value and each value is a dict of regression algorithm parameter-value assignments for that group (see the example under massive in the class parameters above).
Valid only when massive is set to True in the class instance initialization.
Defaults to None.
- interval_type : {'no', 'confidence', 'prediction'}, optional
Specifies the type of interval to output:
'no': do not calculate or output any interval
'confidence': calculate and output the confidence interval
'prediction': calculate and output the prediction interval
Valid only for either of the following 2 cases:
GLM model with the IRLS solver applied (i.e. func is specified as 'GLM' and solver as 'irls' during class instance initialization).
Linear Regression model with JSON model export enabled and coefficient covariance information computed (i.e. func is specified as 'LinearRegression', json_export specified as True during class instance initialization, and output_coefcov specified as True during the training phase).
Defaults to 'no'.
- Returns
- DataFrame
A collection of DataFrames listed as follows:
Prediction result.
Error message (only when massive is True).
Examples
Example 1 - Linear Regression predict with confidence interval:
>>> bsh_train = conn.table('BOSTON_HOUSING_TRAIN_DATA')
>>> bsh_test = conn.table('BOSTON_HOUSING_TEST_DATA')
>>> ulr = UnifiedRegression(func='LinearRegression',
...                         json_export=True)  # prediction/confidence interval only available for the JSON model
>>> ulr.fit(data=bsh_train,
...         key='ID',
...         label='MEDV',
...         output_coefcov=True)  # set to True to output coefficient covariance
>>> ulr.predict(data=bsh_test.deselect('MEDV'),
...             key='ID',
...             significance_level=0.05,
...             interval_type='confidence')  # specify the interval type as confidence
Example 2 - GLM model predict of response with prediction interval:
>>> bsh_train = conn.table('BOSTON_HOUSING_TRAIN_DATA')
>>> bsh_test = conn.table('BOSTON_HOUSING_TEST_DATA')
>>> uglm = UnifiedRegression(func='GLM', family='gaussian', link='identity')
>>> uglm.fit(data=bsh_train, key='ID', label='MEDV')
>>> uglm.predict(data=bsh_test.deselect('MEDV'),
...              key='ID',
...              significance_level=0.05,
...              prediction_type='response',  # set to 'response' for direct response
...              interval_type='prediction')  # specify the interval type as prediction
- score(data, key=None, features=None, label=None, model=None, prediction_type=None, significance_level=None, handle_missing=None, thread_ratio=None, block_size=None, top_k_attributions=None, attribution_method=None, sample_size=None, random_state=None, ignore_correlation=None, impute=False, strategy=None, strategy_by_col=None, als_factors=None, als_lambda=None, als_maxit=None, als_randomstate=None, als_exit_threshold=None, als_exit_interval=None, als_linsolver=None, als_cg_maxit=None, als_centering=None, als_scaling=None, group_key=None, group_params=None, interval_type=None)
Users can use the score function to evaluate the model quality. In unified regression, statistics and metrics are provided to show the model quality.
- Parameters
- data : DataFrame
Data for scoring.
- key : str, optional
Name of the ID column.
In single mode, mandatory if data is not indexed, or if the index of data contains multiple columns; defaults to the single index column of data if not provided.
In massive mode, defaults to the first non-group-key column of data if the index columns of data are not provided; otherwise, defaults to the second index column of data, with the first index column being used as group_key.
- features : ListOfStrings or str, optional
Names of the feature columns.
Defaults to all non-ID, non-label columns if not provided.
- label : str, optional
Name of the label column.
Defaults to the last non-ID column if not provided.
- model : DataFrame, optional
Fitted regression model.
Defaults to self.model_.
- thread_ratio : float, optional
Controls the proportion of available threads to use for prediction.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to PAL's default value.
- prediction_type : str, optional
Specifies the type of prediction. Valid options include:
'response' : direct response (with link)
'link' : linear response (without link)
Valid only for GLM models.
Defaults to 'response'.
- significance_level : float, optional
Specifies the significance level for the confidence interval and prediction interval.
Valid only for the following 2 cases:
GLM model with the IRLS solver applied (i.e. func is specified as 'GLM' and solver as 'irls' during class instance initialization).
Linear Regression model with JSON model export enabled (i.e. func is specified as 'LinearRegression' and json_export as True during class instance initialization).
Defaults to 0.05.
- handle_missing : str, optional
Specifies the way to handle missing values. Valid options include:
'skip' : skip rows with missing values
'fill_zero' : replace missing values with 0.
Valid only for GLM models.
Defaults to 'fill_zero'.
- block_size : int, optional
Specifies the number of data rows loaded at a time during scoring.
0: load all data at once
Others: the specified number
This parameter is for reducing memory consumption, especially when the prediction data is huge or contains a large number of missing independent variables. However, some efficiency may be lost.
Valid only for RandomDecisionTree models.
Defaults to 0.
- top_k_attributions : int, optional
Specifies the number of features with the highest attributions to output.
Defaults to 10.
- attribution_method : {'no', 'saabas', 'tree-shap'}, optional
Specifies which method to use for model reasoning.
'no' : No reasoning
'saabas' : Saabas method
'tree-shap' : Tree SHAP method
Valid only for tree-based models, i.e. DecisionTree, RandomDecisionTree and HybridGradientBoostingTree models.
Defaults to 'tree-shap'.
- sample_size : int, optional
Specifies the number of sampled combinations of features.
0 : Heuristically determined by the algorithm
Others : The specified sample size
Valid only for Exponential Regression, GLM, Linear Regression, MLP and Support Vector Regression.
Defaults to 0.
- random_state : int, optional
Specifies the seed for the random number generator when sampling the combinations of features.
0 : Uses current time as seed
Others : The specified seed
Valid only for Exponential Regression, GLM, Linear Regression, MLP and Support Vector Regression.
Defaults to 0.
- ignore_correlation : bool, optional
Specifies whether or not to ignore the correlation between the features.
Valid only for Exponential Regression, GLM and Linear Regression.
Defaults to False.
- impute : bool, optional
Specifies whether or not to impute missing values in data.
Defaults to False.
- strategy : {'non', 'most_frequent-mean', 'most_frequent-median', 'most_frequent-zero', 'most_frequent-als', 'delete'}, optional
Specifies the overall imputation strategy for data.
'non' : No imputation for all columns.
'most_frequent-mean' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its mean.
'most_frequent-median' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in any numerical column by its median.
'most_frequent-zero' : Replacing missing values in any categorical column by its most frequently observed value, and missing values in all numerical columns by zeros.
'most_frequent-als' : Replacing missing values in any categorical column by its most frequently observed value, and filling the missing values in all numerical columns via a matrix completion technique called alternating least squares.
'delete' : Delete all rows with missing values.
Valid only when impute is True.
Defaults to 'most_frequent-mean'.
- strategy_by_col : ListOfTuples, optional
Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation.
Each tuple in the list should contain at least two elements, such that:
the 1st element is the name of a column;
the 2nd element is the imputation strategy of that column; valid strategies include: 'non', 'delete', 'most_frequent', 'categorical_const', 'mean', 'median', 'numerical_const', 'als'.
If the imputation strategy is 'categorical_const' or 'numerical_const', then a 3rd element must be included in the tuple, which specifies the constant value to be used to substitute the detected missing values in the column.
An example for illustration:
[('V1', 'categorical_const', '0'), ('V5', 'median')]
Valid only when impute is True.
No default value.
Note
The following parameters all have the prefix 'als_', and are invoked only when 'als' is selected as a valid imputation strategy in either strategy or strategy_by_col. These parameters set up the alternating-least-squares (ALS) model for data imputation.
- als_factors : int, optional
Length of factor vectors in the ALS model.
It should be less than the number of numerical columns, so that the imputation results are meaningful.
Defaults to 3.
- als_lambda : float, optional
L2 regularization applied to the factors in the ALS model.
Should be non-negative.
Defaults to 0.01.
- als_maxit : int, optional
Maximum number of iterations for solving the ALS model.
Defaults to 20.
- als_randomstate : int, optional
Specifies the seed of the random number generator used in the training of the ALS model:
0: Uses the current time as the seed,
Others: Uses the specified value as the seed.
Defaults to 0.
- als_exit_threshold : float, optional
Specifies a value for stopping the training of the ALS model. If the improvement of the cost function of the ALS model is less than this value between consecutive checks, the training process exits.
0 means the objective value is not checked while running the algorithm; it stops only when the maximum number of iterations has been reached.
Defaults to 0.
- als_exit_interval : int, optional
Specifies the number of iterations between consecutive checks of the cost function of the ALS model, so that one can see whether the pre-specified exit_threshold is reached.
Defaults to 5.
- als_linsolver : {'cholesky', 'cg'}, optional
Linear system solver for the ALS model.
'cholesky' is usually much faster.
'cg' is recommended when als_factors is large.
Defaults to 'cholesky'.
- als_cg_maxit : int, optional
Specifies the maximum number of iterations for the cg algorithm.
Invoked only when 'cg' is the chosen linear system solver for ALS.
Defaults to 3.
- als_centering : bool, optional
Whether to center the data by column before training the ALS model.
Defaults to True.
- als_scaling : bool, optional
Whether to scale the data by column before training the ALS model.
Defaults to True.
- group_key : str, optional
The column of group_key. The data type can be INT or NVARCHAR/VARCHAR. If the data type is INT, only parameters set in group_params are valid.
This parameter is only valid when massive is set to True in the class instance initialization.
Defaults to the first column of data if the index columns of data are not provided; otherwise, defaults to the first index column.
- group_params : dict, optional
If massive mode is activated (massive is set to True in the class instance initialization), input data for regression is divided into different groups, with different regression parameters applied to each group. This parameter specifies the parameter values of the chosen regression algorithm func for the different groups in a dict, where each key corresponds to a group_key value and each value is a dict of regression algorithm parameter-value assignments for that group (see the example under massive in the class parameters above).
Valid only when massive is set to True in the class instance initialization.
Defaults to None.
- interval_type : {'no', 'confidence', 'prediction'}, optional
Specifies the type of interval to output:
'no': do not calculate or output any interval
'confidence': calculate and output the confidence interval
'prediction': calculate and output the prediction interval
Valid only for either of the following 2 cases:
GLM model with the IRLS solver applied (i.e. func is specified as 'GLM' and solver as 'irls' during class instance initialization).
Linear Regression model with JSON model export enabled and coefficient covariance information computed (i.e. func is specified as 'LinearRegression', json_export specified as True during class instance initialization, and output_coefcov specified as True during the training phase).
- Returns
- DataFrame
A collection of DataFrames listed as follows:
Prediction result computed by ignoring the true labels of the input data, structured the same as the result table of the predict() function.
Statistics results.
Error message (only when massive is True).
Examples
Example 1 - Linear Regression scoring with prediction interval:
>>> bsh_train = conn.table('BOSTON_HOUSING_TRAIN_DATA')
>>> bsh_test = conn.table('BOSTON_HOUSING_TEST_DATA')
>>> ulr = UnifiedRegression(func='LinearRegression',
...                         json_export=True)  # prediction/confidence interval only available for the JSON model
>>> ulr.fit(data=bsh_train,
...         key='ID',
...         label='MEDV',
...         output_coefcov=True)  # set to True to output interval information
>>> score_res = ulr.score(data=bsh_test,
...                       key='ID',
...                       label='MEDV',
...                       significance_level=0.05,
...                       interval_type='prediction')  # specify the interval type as prediction
Example 2 - GLM model scoring of linear response with confidence interval:
>>> bsh_train = conn.table('BOSTON_HOUSING_TRAIN_DATA')
>>> bsh_test = conn.table('BOSTON_HOUSING_TEST_DATA')
>>> uglm = UnifiedRegression(func='GLM', family='gaussian', link='identity')
>>> uglm.fit(data=bsh_train, key='ID', label='MEDV')
>>> score_res = uglm.score(data=bsh_test,
...                        key='ID',
...                        label='MEDV',
...                        significance_level=0.05,
...                        prediction_type='link',  # set to 'link' for linear response
...                        interval_type='confidence')  # specify the interval type as confidence
- build_report()
Build model report.
- create_model_state(model=None, function=None, pal_funcname='PAL_UNIFIED_REGRESSION', state_description=None, force=False)
Create PAL model state.
- Parameters
- model : DataFrame, optional
Specifies the model for the AFL state.
Defaults to self.model_.
- function : str, optional
Specifies the function name of the regression algorithm.
Valid options include:
'SVM' : Support Vector Regression
'MLP' : Multilayer Perceptron Regression
'DT' : Decision Tree Regression
'HGBT' : Hybrid Gradient Boosting Tree Regression
'MLR' : Multiple Linear Regression
'RDT' : Random Decision Trees Regression
Defaults to self.real_func.
Note
The default value could be invalid. In such a case, a ValueError will be raised.
- pal_funcname : int or str, optional
PAL function name.
Defaults to 'PAL_UNIFIED_REGRESSION'.
- state_description : str, optional
Description of the state as model container.
Defaults to None.
- force : bool, optional
If True, the existing state will be deleted.
Defaults to False.
- set_model_state(state)
Set the model state by state information.
- Parameters
- state : DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values according to NAME.
If state is a dict, the keys must include STATE_ID, HINT, HOST and PORT.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters
- state : DataFrame, optional
Specifies the state.
Defaults to self.state.
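A minimal sketch of the model-state workflow, assuming umlr has already been fitted; reusing the state on a second instance is shown for illustration only:
>>> umlr.create_model_state()          # create an AFL state from the fitted model
>>> state_df = umlr.state              # the state information is kept in self.state
>>> umlr2 = UnifiedRegression(func='LinearRegression')
>>> umlr2.set_model_state(state_df)    # attach the existing state to another instance
>>> umlr.delete_model_state()          # remove the state when it is no longer needed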
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- generate_html_report(filename)
Save the model report as an HTML file.
- Parameters
- filename : str
HTML file name.
- generate_notebook_iframe_report()
Render model report as a notebook iframe.
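A minimal reporting sketch; the filename is illustrative, and it assumes a report has been built first (either via build_report() or by passing build_report=True to fit()):
>>> umlr.build_report()
>>> umlr.generate_html_report('umlr_model_report')   # save the report as an HTML file
>>> umlr.generate_notebook_iframe_report()           # or render it inline in a notebook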
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
- set_framework_version(framework_version)
Switch v1/v2 version of report.
- Parameters
- framework_version : {'v2', 'v1'}, optional
'v2': uses the report builder framework. 'v1': uses the pure HTML template.
Defaults to 'v2'.
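A one-line usage sketch; calling it before rebuilding the report is an assumption:
>>> umlr.set_framework_version('v1')   # fall back to the pure HTML template
>>> umlr.build_report()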