LinearRegression

class hana_ml.algorithms.pal.linear_model.LinearRegression(solver=None, var_select=None, features_must_select=None, intercept=True, alpha_to_enter=None, alpha_to_remove=None, enet_lambda=None, enet_alpha=None, max_iter=None, tol=None, pho=None, stat_inf=False, adjusted_r2=False, dw_test=False, reset_test=None, bp_test=False, ks_test=False, thread_ratio=None, categorical_variable=None, pmml_export=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, handle_missing=None, json_export=None, precompute_lms_sketch=None, stable_sketch_alg=None, sparse_sketch_alg=None, resource=None, max_resource=None, reduction_rate=None, aggressive_elimination=None, ps_verbose=None, min_resource_rate=None)

Linear regression models the linear relationship between one variable, usually called the dependent variable, and one or more other variables, usually called independent variables and collected in a predictor vector.
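As a minimal illustration of the idea, independent of hana_ml and PAL, the sketch below fits a straight line to a single predictor using the closed-form least squares solution. The function name `fit_simple_ols` is hypothetical and is not part of the library.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Points lying exactly on the line y = 1 + 2x
intercept, slope = fit_simple_ols([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

With several predictors the same least squares problem is solved in matrix form, which is where the choice of solver becomes relevant.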

Note

Linear Regression supports model evaluation and parameter selection. For an explanation of this topic, see Model Evaluation and Parameter Selection.

Parameters:
solver{'QR', 'SVD', 'CD', 'Cholesky', 'ADMM'}, optional

Algorithm used to solve the least squares problem. Case-insensitive.

  • 'QR': QR decomposition (numerically stable, but fails when the design matrix A is rank-deficient).

  • 'SVD': singular value decomposition (numerically stable and can handle rank deficiency but computationally expensive).

  • 'CD': cyclical coordinate descent method to solve elastic net regularized multiple linear regression.

  • 'Cholesky': Cholesky decomposition (fast but numerically unstable).

  • 'ADMM': Alternating direction method of multipliers (ADMM) to solve elastic net regularized multiple linear regression. This method is faster than the cyclical coordinate descent method in many cases and recommended.

'CD' and 'ADMM' are supported only when var_select is 'all'.

Defaults to 'QR'.

var_select{'all', 'forward', 'backward', 'stepwise'}, optional

Method to perform variable selection.

  • 'all': all variables are included.

  • 'forward': forward selection.

  • 'backward': backward selection.

  • 'stepwise': stepwise selection.

'forward', 'backward' and 'stepwise' are supported only when solver is neither 'CD' nor 'ADMM' and intercept is True.

Defaults to 'all'.
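The compatibility rules between solver, var_select and intercept can be checked before constructing the estimator. The helper below is a hypothetical sketch (it is not part of hana_ml) that encodes only the rules stated above and assumes the documented defaults of 'QR' and 'all'.

```python
def check_selection_config(solver=None, var_select=None, intercept=True):
    """Return a list of rule violations; an empty list means the combination is allowed."""
    solver = (solver or 'QR').upper()   # solver is case-insensitive, default 'QR'
    var_select = var_select or 'all'    # default 'all'
    problems = []
    if var_select != 'all':
        # Variable selection is not available with the elastic-net solvers.
        if solver in ('CD', 'ADMM'):
            problems.append("solver %r requires var_select='all'" % solver)
        # Variable selection also requires an intercept term.
        if not intercept:
            problems.append("var_select=%r requires intercept=True" % var_select)
    return problems
```

For example, `check_selection_config('admm', 'stepwise')` reports one violation, while the default configuration reports none.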

features_must_select: str or list of str, optional

Specifies the column name that needs to be included in the final training model when executing the variable selection.

This parameter can be specified multiple times, each time with one column name as feature.

Only valid when var_select is not 'all'.

Note that this parameter is a hint. In exceptional cases, a specified mandatory feature may be excluded from the final model.

For instance, some mandatory features can be represented as a linear combination of other features, among which some are also mandatory features.

No default value.

interceptbool, optional
  • True : include the intercept term in the model.

  • False : ignore the intercept.

Defaults to True.

alpha_to_enterfloat, optional

P-value for forward and stepwise selection.

Valid only when var_select is 'forward' or 'stepwise'.

Defaults to 0.05 when var_select is 'forward', 0.15 when var_select is 'stepwise'.

alpha_to_removefloat, optional

P-value for backward and stepwise selection.

Valid only when var_select is 'backward' or 'stepwise'.

Defaults to 0.1 when var_select is 'backward', and 0.15 when var_select is 'stepwise'.

enet_lambdafloat, optional

Penalized weight. Value should be greater than or equal to 0.

Valid only when solver is 'CD' or 'ADMM'.

enet_alphafloat, optional

Elastic net mixing parameter.

Ranges from 0 (Ridge penalty) to 1 (LASSO penalty) inclusively.

Valid only when solver is 'CD' or 'ADMM'.

Defaults to 1.0.
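To make the roles of enet_lambda and enet_alpha concrete, the sketch below evaluates an elastic-net penalty for a given coefficient vector. It assumes the common parameterization lambda * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2); the exact scaling used by PAL is not stated here, and the function name is hypothetical.

```python
def elastic_net_penalty(coefficients, enet_lambda, enet_alpha=1.0):
    """Penalty lambda * (alpha * L1 + (1 - alpha) / 2 * L2^2), an assumed parameterization."""
    if enet_lambda < 0:
        raise ValueError("enet_lambda must be >= 0")
    if not 0.0 <= enet_alpha <= 1.0:
        raise ValueError("enet_alpha must lie in [0, 1]")
    l1 = sum(abs(b) for b in coefficients)        # LASSO part
    l2_squared = sum(b * b for b in coefficients)  # Ridge part
    return enet_lambda * (enet_alpha * l1 + (1.0 - enet_alpha) / 2.0 * l2_squared)
```

With enet_alpha=1 only the L1 term remains (LASSO), with enet_alpha=0 only the squared L2 term remains (Ridge), and enet_lambda=0 removes the penalty altogether.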

max_iterint, optional

Maximum number of passes over training data.

If convergence is not reached after the specified number of iterations, an error will be generated.

Valid only when solver is 'CD' or 'ADMM'.

Defaults to 1e5.

tolfloat, optional

Convergence threshold for coordinate descent.

Valid only when solver is 'CD'.

Defaults to 1.0e-7.

phofloat, optional

Step size for ADMM. Generally, it should be greater than 1.

Valid only when solver is 'ADMM'.

Defaults to 1.8.

stat_infbool, optional

If true, output t-value and Pr(>|t|) of coefficients.

Defaults to False.

adjusted_r2bool, optional

If true, include the adjusted R2 value in statistics.

Defaults to False.

dw_testbool, optional

If true, conduct Durbin-Watson test under null hypothesis that errors do not follow a first order autoregressive process.

Not available if elastic net regularization is enabled or intercept is False.

Defaults to False.
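For orientation, the Durbin-Watson statistic itself is simple to compute from the residuals of a fitted model. The sketch below uses the standard textbook definition in plain Python; how PAL reports the statistic and its p-value is not covered here, and the function name is hypothetical.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator
```

The statistic lies between 0 and 4. Values near 2 are consistent with no first-order autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.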

reset_testint, optional

Specifies the order of Ramsey RESET test.

Ramsey RESET test with power of variables ranging from 2 to this value (greater than 1) will be conducted.

Value 1 means RESET test will not be conducted. Not available if elastic net regularization is enabled or intercept is False.

Defaults to 1.

bp_testbool, optional

If true, conduct Breusch-Pagan test under null hypothesis that homoscedasticity is satisfied.

Not available if elastic net regularization is enabled or intercept is False.

Defaults to False.

ks_testbool, optional

If true, conduct Kolmogorov-Smirnov normality test under null hypothesis that errors follow a normal distribution.

Not available if elastic net regularization is enabled or intercept is False.

Defaults to False.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Valid only when solver is 'QR', 'CD', 'Cholesky' or 'ADMM'.

Defaults to 0.0.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

pmml_export{'no', 'multi-row'}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • 'no' or not provided: Does not export multiple linear regression model in PMML.

  • 'multi-row': Exports multiple linear regression model in PMML. The maximum length of each row is 5000 characters.

Currently either PMML or JSON format model can be exported. JSON format is preferred if both formats are to be exported.

Defaults to 'no'.

resampling_method{'cv', 'bootstrap', 'cv_sha', 'bootstrap_sha', 'cv_hyperband', 'bootstrap_hyperband'}, optional

Specifies the resampling method for model evaluation/parameter selection.

If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.

Must be set together with evaluation_metric.

No default value.

Note

Resampling methods that end with 'sha' or 'hyperband' are used for parameter selection only, not for model evaluation.

evaluation_metric{'rmse'}, optional

Specifies the evaluation metric for model evaluation or parameter selection.

Must be set together with resampling_method.

No default value.

fold_numint, optional

Specifies the fold number for the cross validation method. Mandatory and valid only when resampling_method is set to 'cv', 'cv_sha' or 'cv_hyperband'.

No default value.

repeat_timesint, optional

Specifies the number of repeat times for resampling.

Defaults to 1.

search_strategy{'grid', 'random'}, optional

Specifies the search strategy for parameter selection.

Mandatory if resampling_method is specified and ends with 'sha'.

Defaults to 'random' and cannot be changed if resampling_method is specified and ends with 'hyperband'; otherwise no default value, and parameter selection cannot be carried out if not specified.

random_search_timesint, optional

Specifies the number of times to randomly select candidate parameters for selection.

Mandatory and valid when search_strategy is set to 'random', or when resampling_method is 'cv_hyperband' or 'bootstrap_hyperband'.

No default value.

random_stateint, optional

Specifies the seed for random generation. Use system time when 0 is specified.

Defaults to 0.

timeoutint, optional

Specifies maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified.

Defaults to 0.

progress_indicator_idstr, optional

Sets an ID of progress indicator for model evaluation or parameter selection.

No progress indicator is active if no value is provided.

No default value.

param_valuesdict or list of tuples, optional

Specifies values of specific parameters to be selected.

Valid only when resampling_method and search_strategy are both specified.

Specified parameters could be enet_lambda and enet_alpha.

No default value.

param_rangedict or list of tuples, optional

Specifies range of specific parameters to be selected.

Valid only when resampling_method and search_strategy are both specified.

Specified parameters could be enet_lambda, enet_alpha.

No default value.
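The parameter-selection arguments described above are typically set together. The sketch below assembles them as keyword arguments for a 5-fold cross-validated grid search; the candidate values are illustrative, and the layout of param_values (each parameter name mapped to a list of candidates) is an assumption rather than something stated in this section.

```python
from itertools import product

# Illustrative candidates; assumes each parameter name maps to a list of values.
param_values = {
    'enet_lambda': [0.001, 0.01, 0.1],
    'enet_alpha': [0.0, 0.5, 1.0],
}

# Keyword arguments for a 5-fold cross-validated grid search scored by RMSE.
cv_grid_kwargs = dict(
    solver='CD',                # enet_lambda and enet_alpha need 'CD' or 'ADMM'
    resampling_method='cv',
    evaluation_metric='rmse',   # must be set together with resampling_method
    fold_num=5,                 # mandatory for 'cv'
    search_strategy='grid',
    param_values=param_values,
)

# A grid search considers every combination of the candidate values.
grid = list(product(*param_values.values()))
```

Given a HANA connection and a training DataFrame, these arguments would be passed as `LinearRegression(**cv_grid_kwargs)`. Here the grid contains 3 x 3 = 9 candidate pairs, each evaluated on 5 folds.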

handle_missingbool, optional
  • True : handle missing values.

  • False : do not handle missing values.

Defaults to True.

json_exportbool, optional
  • False : Does not export multiple linear regression model in JSON.

  • True : Exports multiple linear regression model in JSON.

Currently either PMML or JSON format model can be exported. JSON format is preferred if both formats are to be exported.

Defaults to False.

precompute_lms_sketchbool, optional
  • False : Do not perform LMS sketch.

  • True : Performs LMS sketch.

LMS sketch is performed only when resampling_method is set and the size of the data is larger than the number of features.

Defaults to True.

stable_sketch_algbool, optional

When computing the LMS sketch, there are two algorithms to choose from. One is more numerically stable than the other, but slower.

  • False : Do not use stable algorithm.

  • True : Uses stable algorithm.

Only valid when LMS sketch is performed (precompute_lms_sketch = True) and sparse_sketch_alg is False.

Defaults to True.

sparse_sketch_algbool, optional

Specifies whether to use the LMS sketch algorithm designed to cope with sparse data.

  • False : Do not use sparse LMS sketch algorithm.

  • True : Uses sparse LMS sketch algorithm.

Only valid when LMS sketch is performed (precompute_lms_sketch = True).

Defaults to False.

resourcestr, optional

Specifies the resource type used in successive-halving and hyperband algorithm for parameter selection.

Currently the only valid option is 'max_iter'.

Mandatory and valid only when resampling_method is set as 'cv_sha', 'bootstrap_sha', 'cv_hyperband' or 'bootstrap_hyperband'.

max_resourceint, optional

Maximum allowed resource budget for single hyper-parameter candidate, must be greater than 0.

Mandatory and valid only when resource is set.

reduction_ratefloat, optional

Specifies the reduction rate of the number of available hyper-parameter candidates. In each round, the number of available candidates is divided by the value of this parameter, so valid values must be greater than 1.0.

Valid only when resource is set.

Defaults to 3.0.
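The effect of reduction_rate can be sketched as a sequence of candidate counts, one per round. The function below is a hypothetical illustration: rounding down and always keeping at least one candidate are assumptions of this sketch, not documented PAL behaviour.

```python
def candidate_schedule(n_candidates, reduction_rate=3.0):
    """Candidate counts per round when each round divides the count by reduction_rate."""
    if reduction_rate <= 1.0:
        raise ValueError("reduction_rate must be greater than 1.0")
    counts = [n_candidates]
    while counts[-1] > 1:
        # Assumed rounding: truncate, but never drop below one candidate.
        counts.append(max(1, int(counts[-1] / reduction_rate)))
    return counts
```

Starting from 27 candidates with the default rate of 3.0, this yields 27, 9, 3 and finally 1 remaining candidate.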

aggressive_eliminationbool, optional

Specifies whether to perform aggressive elimination behavior for successive-halving algorithm or not.

When set to True, more parameter candidates are eliminated than expected (as defined via reduction_rate). This can improve run-time performance but could result in a sub-optimal hyper-parameter candidate.

Valid only when resampling_method is 'cv_sha' or 'bootstrap_sha'.

Defaults to False.

ps_verbosebool, optional

Specifies whether to output optimal hyper-parameter and all evaluation statistics of related hyper-parameter candidates in attribute statistics_ or not.

Defaults to True.

min_resource_ratefloat, optional

Specifies the minimum required resource budget compared to maximum resource for single hyper-parameter candidate. Valid value should be greater than or equal to 0, but less than 1.

Valid only when resource is set.

Defaults to 0.

Examples

Training data:

>>> df.collect()
  ID       Y    X1 X2  X3
0  0  -6.879  0.00  A   1
1  1  -3.449  0.50  A   1
2  2   6.635  0.54  B   1
3  3  11.844  1.04  B   1
4  4   2.786  1.50  A   1
5  5   2.389  0.04  B   2
6  6  -0.011  2.00  A   2
7  7   8.839  2.04  B   2
8  8   4.689  1.54  B   1
9  9  -5.507  1.00  A   2

Training the model:

>>> lr = LinearRegression(thread_ratio=0.5,
...                       categorical_variable=["X3"])
>>> lr.fit(data=df, key='ID', label='Y')

Prediction:

>>> df2.collect()
   ID     X1 X2  X3
0   0  1.690  B   1
1   1  0.054  B   2
2   2  0.123  A   2
3   3  1.980  A   1
4   4  0.563  A   1
>>> lr.predict(data=df2, key='ID').collect()
   ID      VALUE
0   0  10.314760
1   1   1.685926
2   2  -7.409561
3   3   2.021592
4   4  -3.122685

Biased Linear Model with Elastic-net Regularization

Relevant parameters: enet_alpha, enet_lambda

Training data:

>>> df.collect()
  ID  V1     V2    V3   V4
0  0 1.2    0.1 0.205  0.9
1  1 0.2 -1.705  -3.4  1.7
2  2 1.1    0.4   0.8  0.5
3  3 1.1    0.1 0.201  0.8
4  4 0.3 -0.306  -0.6  0.2

Class initialization and model training:

>>> lr = LinearRegression(solver='ADMM', enet_lambda=0.003194, enet_alpha=0.95)
>>> lr.fit(data = df)

Biased Linear Model with Variable Selection

Relevant parameters: var_select, features_must_select, alpha_to_enter, alpha_to_remove

>>> lr = LinearRegression(var_select='forward', alpha_to_enter=0.1)

Attributes:
coefficients_DataFrame

Fitted regression coefficients.

fitted_DataFrame

Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

statistics_DataFrame

Regression-related statistics, such as mean squared error.

optim_param_DataFrame

The selected optimal parameters. Available only if parameter selection is enabled.

pmml_DataFrame

PMML model. Deprecated, since the model can now also be exported in JSON format. Use semistructured_result_ instead to get the model.

semistructured_result_DataFrame

Linear regression model in PMML or JSON format.

Methods

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

fit(data[, key, features, label, ...])

Fit regression model based on training data.

predict(data[, key, features])

Predict dependent variable values based on fitted model.

score(data[, key, features, label])

Returns the coefficient of determination R2 of the prediction.

set_model_state(state)

Set the model state by state information.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Fit regression model based on training data.

Parameters:
dataDataFrame

Training data.

keystr, optional

Name of the ID column.

If key is not provided, then:

  • if data is indexed by a single column, then key defaults to that index column;

  • otherwise, it is assumed that data contains no ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable. If label is not provided, it defaults to the last column.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Returns:
LinearRegression

A fitted object.

predict(data, key=None, features=None)

Predict dependent variable values based on fitted model.

Parameters:
dataDataFrame

Independent variable values to predict for.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

Returns:
DataFrame

Predicted values, structured as follows:

  • ID column: with same name and type as data's ID column.

  • VALUE: type DOUBLE, representing predicted values.

score(data, key=None, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters:
dataDataFrame

Data on which to assess model performance.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the dependent variable.

If label is not provided, it defaults to the last column.

Returns:
float

Returns the coefficient of determination R2 of the prediction.
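The coefficient of determination follows the standard definition R2 = 1 - SS_res / SS_tot. The sketch below computes it in plain Python from observed and predicted values; it illustrates the quantity only and is not the library's implementation, and the function name is hypothetical.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: squared errors of the predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations from the mean of the observations.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot
```

A perfect prediction gives 1.0, always predicting the mean of the observed values gives 0.0, and worse predictions can give negative values.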

create_model_state(model=None, function=None, pal_funcname='PAL_LINEAR_REGRESSION', state_description=None, force=False)

Create PAL model state.

Parameters:
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

A placeholder parameter, not effective for Linear Regression.

pal_funcnameint or str, optional

PAL function name. Must be a valid PAL procedure that supports model state.

Defaults to 'PAL_LINEAR_REGRESSION'.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True, the existing state is deleted.

Defaults to False.

set_model_state(state)

Set the model state by state information.

Parameters:
state: DataFrame or dict

If state is DataFrame, it has the following structure:

  • NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.

  • VALUE: VARCHAR(1000), the values according to NAME.

If state is dict, its keys must include STATE_ID, HINT, HOST and PORT.
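A dict-based state can be sanity-checked for the required keys before it is passed on. The helper and the placeholder values below are hypothetical and serve only to illustrate the expected shape; real values come from a previously created model state.

```python
REQUIRED_STATE_KEYS = {'STATE_ID', 'HINT', 'HOST', 'PORT'}


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict, sorted by name."""
    return sorted(REQUIRED_STATE_KEYS - set(state))


# Placeholder values for illustration only.
example_state = {
    'STATE_ID': '<state-id>',
    'HINT': '<hint>',
    'HOST': '<host>',
    'PORT': '<port>',
}
```

An empty result from `missing_state_keys(example_state)` means all four required entries are present.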

delete_model_state(state=None)

Delete PAL model state.

Parameters:
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods mentioned above, the LinearRegression class also inherits methods from the PALBase class. Please refer to PAL Base for more details.