LinearRegression
- class hana_ml.algorithms.pal.linear_model.LinearRegression(solver=None, var_select=None, features_must_select=None, intercept=True, alpha_to_enter=None, alpha_to_remove=None, enet_lambda=None, enet_alpha=None, max_iter=None, tol=None, pho=None, stat_inf=False, adjusted_r2=False, dw_test=False, reset_test=None, bp_test=False, ks_test=False, thread_ratio=None, categorical_variable=None, pmml_export=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, handle_missing=None, json_export=None, precompute_lms_sketch=None, stable_sketch_alg=None, sparse_sketch_alg=None, resource=None, max_resource=None, reduction_rate=None, aggressive_elimination=None, ps_verbose=None, min_resource_rate=None, onehot_min_frequency=None, onehot_max_categories=None)
-
Linear regression models the linear relationship between one variable, usually called the dependent variable, and one or more other variables, usually called independent variables and collectively denoted as the predictor vector.
Note
Linear Regression supports model evaluation and parameter selection. For an explanation of this topic, see Model Evaluation and Parameter Selection.
- Parameters:
-
- solver: {'QR', 'SVD', 'CD', 'Cholesky', 'ADMM'}, optional
-
Algorithm used to solve the least squares problem. Case-insensitive.
-
'QR': QR decomposition (numerically stable, but fails when A is rank-deficient).
-
'SVD': singular value decomposition (numerically stable and can handle rank deficiency but computationally expensive).
-
'CD': cyclical coordinate descent method to solve elastic net regularized multiple linear regression.
-
'Cholesky': Cholesky decomposition (fast but numerically unstable).
-
'ADMM': Alternating direction method of multipliers (ADMM) to solve elastic net regularized multiple linear regression. This method is faster than the cyclical coordinate descent method in many cases and is recommended.
'CD' and 'ADMM' are supported only when var_select is 'all'.
Defaults to 'QR'.
-
- var_select: {'all', 'forward', 'backward', 'stepwise'}, optional
-
Method to perform variable selection.
-
'all': all variables are included.
-
'forward': forward selection.
-
'backward': backward selection.
-
'stepwise': stepwise selection.
'forward', 'backward' and 'stepwise' are supported only when solver is not 'CD' or 'ADMM' and intercept is True.
Defaults to 'all'.
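The constraints between solver, var_select and intercept described above can be checked before a model is constructed. The following is a minimal sketch in plain Python; check_solver_options is a hypothetical helper that is not part of hana_ml, and it only mirrors the rules stated in the parameter descriptions.

```python
def check_solver_options(solver="QR", var_select="all", intercept=True):
    """Return a list of conflicts between the three options (empty if none).

    Hypothetical helper, not part of hana_ml. It encodes only the rules
    stated in the documentation of solver and var_select.
    """
    problems = []
    solver_key = solver.lower()  # solver is documented as case-insensitive
    iterative = solver_key in ("cd", "admm")
    selecting = var_select != "all"
    if iterative and selecting:
        problems.append("'CD' and 'ADMM' require var_select='all'")
    if selecting and not intercept:
        problems.append("variable selection requires intercept=True")
    return problems


print(check_solver_options("ADMM", "stepwise"))
print(check_solver_options("QR", "forward"))
```

An empty list means the combination is consistent with the documented rules; it does not guarantee that the PAL procedure accepts the remaining parameters.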
-
- features_must_select: str or a list of str, optional
-
Specifies the column name that must be included in the final training model when variable selection is executed.
This parameter can be specified multiple times, each time with one column name as feature.
Only valid when var_select is not 'all'.
Note that this parameter is a hint. In exceptional cases, a specified mandatory feature may be excluded from the final model.
For instance, some mandatory features can be represented as a linear combination of other features, some of which are also mandatory.
No default value.
- intercept: bool, optional
-
-
True : include the intercept term in the model.
-
False : ignore the intercept.
Defaults to True.
-
- alpha_to_enter: float, optional
-
P-value for forward and stepwise selection.
Valid only when var_select is 'forward' or 'stepwise'.
Defaults to 0.05 when var_select is 'forward', and 0.15 when var_select is 'stepwise'.
- alpha_to_remove: float, optional
-
P-value for backward and stepwise selection.
Valid only when var_select is 'backward' or 'stepwise'.
Defaults to 0.1 when var_select is 'backward', and 0.15 when var_select is 'stepwise'.
- enet_lambda: float, optional
-
Penalized weight. Value should be greater than or equal to 0.
Valid only when solver is 'CD' or 'ADMM'.
- enet_alpha: float, optional
-
Elastic net mixing parameter.
Ranges from 0 (Ridge penalty) to 1 (LASSO penalty), inclusive.
Valid only when solver is 'CD' or 'ADMM'.
Defaults to 1.0.
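To make the roles of enet_lambda and enet_alpha concrete, the sketch below evaluates an elastic net penalty term. It assumes the widely used glmnet-style parameterization, lambda * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2); the exact scaling used by PAL is not stated here and may differ. The function name elastic_net_penalty is hypothetical.

```python
def elastic_net_penalty(coefs, enet_lambda, enet_alpha=1.0):
    """Penalty under an ASSUMED glmnet-style parameterization.

    enet_alpha = 1 leaves only the L1 (LASSO) term and
    enet_alpha = 0 leaves only the squared L2 (Ridge) term.
    """
    if enet_lambda < 0:
        raise ValueError("enet_lambda must be >= 0")
    if not 0.0 <= enet_alpha <= 1.0:
        raise ValueError("enet_alpha must lie in [0, 1]")
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return enet_lambda * (enet_alpha * l1 + (1.0 - enet_alpha) / 2.0 * l2_squared)


print(elastic_net_penalty([1.0, -2.0], enet_lambda=1.0, enet_alpha=1.0))  # 3.0
print(elastic_net_penalty([1.0, -2.0], enet_lambda=1.0, enet_alpha=0.0))  # 2.5
```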
- max_iter: int, optional
-
Maximum number of passes over training data.
If convergence is not reached after the specified number of iterations, an error will be generated.
Valid only when solver is 'CD' or 'ADMM'.
Defaults to 1e5.
- tol: float, optional
-
Convergence threshold for coordinate descent.
Valid only when solver is 'CD'.
Defaults to 1.0e-7.
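How max_iter and tol interact in a cyclical coordinate descent solver can be sketched in plain Python. This is an illustration under stated assumptions, not PAL's implementation: it assumes the glmnet-style objective (1 / (2n)) * ||y - Xb||^2 + lambda * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2), fits no intercept, and stops when the largest coefficient change in a pass falls below tol. The names soft_threshold and elastic_net_cd are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_cd(X, y, enet_lambda, enet_alpha=1.0, max_iter=100000, tol=1e-7):
    """Cyclical coordinate descent for elastic net, without an intercept."""
    n, p = len(X), len(X[0])
    coefs = [0.0] * p
    residual = list(y)  # equals y - X @ coefs because coefs starts at zero
    for _ in range(max_iter):
        largest_change = 0.0
        for j in range(p):
            column = [row[j] for row in X]
            z = sum(v * v for v in column) / n
            if z == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that excludes feature j.
            rho = sum(v * (r + v * coefs[j]) for v, r in zip(column, residual)) / n
            new_coef = soft_threshold(rho, enet_lambda * enet_alpha) / (
                z + enet_lambda * (1.0 - enet_alpha)
            )
            delta = new_coef - coefs[j]
            if delta != 0.0:
                residual = [r - v * delta for v, r in zip(column, residual)]
                coefs[j] = new_coef
            largest_change = max(largest_change, abs(delta))
        if largest_change < tol:
            return coefs
    # Mirrors the documented behavior: no convergence is treated as an error.
    raise RuntimeError("no convergence within max_iter passes")


X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]
print(elastic_net_cd(X, y, enet_lambda=0.0))    # unpenalized fit
print(elastic_net_cd(X, y, enet_lambda=100.0))  # heavy L1 penalty
```

With enet_lambda=0 the sketch reduces to ordinary least squares, while a large L1 penalty drives the coefficient to zero.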
- pho: float, optional
-
Step size for ADMM. Generally, it should be greater than 1.
Valid only when solver is 'ADMM'.
Defaults to 1.8.
- stat_inf: bool, optional
-
If true, output t-value and Pr(>|t|) of coefficients.
Defaults to False.
- adjusted_r2: bool, optional
-
If true, include the adjusted R2 value in statistics.
Defaults to False.
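For reference, the adjusted R2 statistic can be computed from the ordinary R2 as follows. This sketch uses the textbook definition for a model with an intercept; whether PAL counts features in exactly the same way is an assumption, and the function name is hypothetical.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Textbook adjusted R2 for a model with an intercept (assumed definition)."""
    degrees_of_freedom = n_samples - n_features - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / degrees_of_freedom


print(adjusted_r2(0.9, n_samples=11, n_features=2))
```

Unlike R2, the adjusted value decreases when a feature is added that does not improve the fit enough to offset the lost degree of freedom.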
- dw_test: bool, optional
-
If true, conduct the Durbin-Watson test under the null hypothesis that errors do not follow a first-order autoregressive process.
Not available if elastic net regularization is enabled or intercept is False.
Defaults to False.
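The Durbin-Watson statistic itself is simple to compute from the residuals. The sketch below uses the standard definition (sum of squared successive differences divided by the sum of squared residuals); it does not reproduce the p-value that a full test reports, and the function name is hypothetical.

```python
def durbin_watson(residuals):
    """Standard Durbin-Watson statistic; values near 2 suggest no
    first-order autocorrelation in the residuals."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    denominator = sum(e * e for e in residuals)
    if denominator == 0.0:
        raise ValueError("all residuals are zero")
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    return numerator / denominator


print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 3.0, alternating signs
```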
- reset_test: int, optional
-
Specifies the order of the Ramsey RESET test.
The Ramsey RESET test is conducted with powers of variables ranging from 2 to this value, which must therefore be greater than 1.
A value of 1 means the RESET test is not conducted.
Not available if elastic net regularization is enabled or intercept is False.
Defaults to 1.
- bp_test: bool, optional
-
If true, conduct the Breusch-Pagan test under the null hypothesis that homoscedasticity is satisfied.
Not available if elastic net regularization is enabled or intercept is False.
Defaults to False.
- ks_test: bool, optional
-
If true, conduct the Kolmogorov-Smirnov normality test under the null hypothesis that errors follow a normal distribution.
Not available if elastic net regularization is enabled or intercept is False.
Defaults to False.
- thread_ratio: float, optional
-
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Valid only when solver is 'QR', 'CD', 'Cholesky' or 'ADMM'.
Defaults to 0.0.
- categorical_variable: str or a list of str, optional
-
Specifies which INTEGER columns should be treated as categorical.
Other INTEGER columns will be treated as continuous.
- pmml_export: {'no', 'multi-row'}, optional
-
Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.
-
'no' or not provided: Does not export the multiple linear regression model in PMML.
-
'multi-row': Exports the multiple linear regression model in PMML. The maximum length of each row is 5000 characters.
Currently, the model can be exported in either PMML or JSON format. JSON format is preferred if both are requested.
Defaults to 'no'.
-
- resampling_method: {'cv', 'bootstrap', 'cv_sha', 'bootstrap_sha', 'cv_hyperband', 'bootstrap_hyperband'}, optional
-
Specifies the resampling method for model evaluation/parameter selection.
If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.
Must be set together with evaluation_metric.
No default value.
Note
Resampling methods that end with 'sha' or 'hyperband' are used for parameter selection only, not for model evaluation.
- evaluation_metric: {'rmse'}, optional
-
Specifies the evaluation metric for model evaluation or parameter selection.
Must be set together with resampling_method.
No default value.
- fold_num: int, optional
-
Specifies the fold number for the cross validation method.
Mandatory and valid only when resampling_method is set to 'cv', 'cv_sha' or 'cv_hyperband'.
No default value.
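As a reminder of what fold_num controls, the sketch below partitions row indices into folds and yields one train/validation split per fold. How PAL actually assigns rows to folds is not described here; the round-robin assignment and the name k_fold_indices are assumptions made for illustration.

```python
def k_fold_indices(n_rows, fold_num):
    """Return fold_num (train_indices, validation_indices) pairs.

    Rows are assigned round-robin, an ASSUMED scheme chosen for
    determinism; real implementations usually shuffle first.
    """
    if not 2 <= fold_num <= n_rows:
        raise ValueError("fold_num must be between 2 and n_rows")
    folds = [list(range(start, n_rows, fold_num)) for start in range(fold_num)]
    splits = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_rows) if i not in held]
        splits.append((train, held_out))
    return splits


for train, validation in k_fold_indices(6, 3):
    print(train, validation)
```

Each row appears in exactly one validation set, so every observation is used for evaluation once per repetition.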
- repeat_times: int, optional
-
Specifies the number of repeat times for resampling.
Defaults to 1.
- search_strategy: {'grid', 'random'}, optional
-
Specifies the search strategy for parameter selection.
Mandatory if resampling_method is specified and ends with 'sha'.
Defaults to 'random' and cannot be changed if resampling_method is specified and ends with 'hyperband'; otherwise there is no default value, and parameter selection cannot be carried out if it is not specified.
- random_search_times: int, optional
-
Specifies the number of times to randomly select candidate parameters for selection.
Mandatory and valid when search_strategy is set to 'random', or when resampling_method is 'cv_hyperband' or 'bootstrap_hyperband'.
No default value.
- random_state: int, optional
-
Specifies the seed for random generation. Use system time when 0 is specified.
Defaults to 0.
- timeout: int, optional
-
Specifies maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified.
Defaults to 0.
- progress_indicator_id: str, optional
-
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.
No default value.
- param_values: dict or list of tuples, optional
-
Specifies values of specific parameters to be selected.
Valid only when resampling_method and search_strategy are both specified.
Specified parameters could be enet_lambda and enet_alpha.
No default value.
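The sketch below shows how a param_values specification could be expanded into the candidate combinations that a 'grid' search evaluates. It assumes that each parameter name maps to a list of candidate values, in either the dict or the list-of-tuples form; expand_grid is a hypothetical helper, and the candidate values are made up for illustration.

```python
from itertools import product


def expand_grid(param_values):
    """Expand {name: [candidates]} or [(name, [candidates])] into a list
    of parameter dicts, one per combination (ASSUMED input layout)."""
    if isinstance(param_values, dict):
        items = list(param_values.items())
    else:
        items = list(param_values)
    names = [name for name, _ in items]
    candidate_lists = [values for _, values in items]
    return [dict(zip(names, combo)) for combo in product(*candidate_lists)]


grid = expand_grid({"enet_lambda": [0.01, 0.1], "enet_alpha": [0.5, 1.0]})
print(len(grid))  # 4
print(grid[0])
```

The number of candidates grows multiplicatively with each added parameter, which is the usual reason to prefer 'random' search or a 'sha'/'hyperband' resampling method for larger grids.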
- param_range: dict or list of tuples, optional
-
Specifies range of specific parameters to be selected.
Valid only when resampling_method and search_strategy are both specified.
Specified parameters could be enet_lambda and enet_alpha.
No default value.
- handle_missing: bool, optional
-
-
True : handle missing values.
-
False : do not handle missing values.
Defaults to True.
-
- json_export: bool, optional
-
-
False : Does not export the multiple linear regression model in JSON.
-
True : Exports the multiple linear regression model in JSON.
Currently, the model can be exported in either PMML or JSON format. JSON format is preferred if both are requested.
Defaults to False.
-
- precompute_lms_sketch: bool, optional
-
-
False : Does not perform LMS sketch.
-
True : Performs LMS sketch.
LMS sketch is performed only when resampling_method is set and the size of data is larger than the number of features.
Defaults to True.
-
- stable_sketch_alg: bool, optional
-
Two algorithms are available for computing the LMS sketch. One is more numerically stable than the other, but slower.
-
False : Does not use the stable algorithm.
-
True : Uses the stable algorithm.
Only valid when LMS sketch is performed (precompute_lms_sketch = True) and sparse_sketch_alg = False.
Defaults to True.
-
- sparse_sketch_alg: bool, optional
-
Specifies whether to use the LMS sketch algorithm designed for sparse data.
-
False : Does not use the sparse LMS sketch algorithm.
-
True : Uses the sparse LMS sketch algorithm.
Only valid when LMS sketch is performed (precompute_lms_sketch = True).
Defaults to False.
-
- resource: str, optional
-
Specifies the resource type used in the successive-halving and hyperband algorithms for parameter selection.
Currently the only valid option is 'max_iter'.
Mandatory and valid only when resampling_method is set to 'cv_sha', 'bootstrap_sha', 'cv_hyperband' or 'bootstrap_hyperband'.
- max_resource: int, optional
-
Maximum allowed resource budget for a single hyper-parameter candidate. Must be greater than 0.
Mandatory and valid only when resource is set.
- reduction_rate: float, optional
-
Specifies the reduction rate of the number of available hyper-parameter candidates. In each round, the number of available candidates is divided by the value of this parameter, so valid values must be greater than 1.0.
Valid only when resource is set.
Defaults to 3.0.
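The effect of reduction_rate on the number of surviving candidates can be sketched as below. The description only says that the candidate count is divided by reduction_rate each round; rounding down and stopping once a single candidate remains are assumptions, and candidates_per_round is a hypothetical name.

```python
import math


def candidates_per_round(n_candidates, reduction_rate=3.0):
    """Candidate counts per successive-halving round.

    ASSUMES floor rounding and a stop at one remaining candidate.
    """
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    if reduction_rate <= 1.0:
        raise ValueError("reduction_rate must be greater than 1.0")
    schedule = [n_candidates]
    while schedule[-1] > 1:
        schedule.append(max(1, math.floor(schedule[-1] / reduction_rate)))
    return schedule


print(candidates_per_round(27, 3.0))  # [27, 9, 3, 1]
```

A larger reduction_rate discards candidates faster, which shortens the search at the risk of eliminating a good candidate before it has received much of the resource budget.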
- aggressive_elimination: bool, optional
-
Specifies whether to perform aggressive elimination in the successive-halving algorithm.
When set to True, more parameter candidates are eliminated than expected (as defined by reduction_rate). This can improve run-time performance but may result in a sub-optimal hyper-parameter candidate.
Valid only when resampling_method is 'cv_sha' or 'bootstrap_sha'.
Defaults to False.
- ps_verbose: bool, optional
-
Specifies whether to output optimal hyper-parameter and all evaluation statistics of related hyper-parameter candidates in attribute statistics_ or not.
Defaults to True.
- min_resource_rate: float, optional
-
Specifies the minimum required resource budget, relative to the maximum resource, for a single hyper-parameter candidate. Valid values are greater than or equal to 0 and less than 1.
Valid only when resource is set.
Defaults to 0.
- onehot_min_frequency: int, optional
-
Specifies the minimum frequency below which a category will be considered infrequent.
Defaults to 1.
- onehot_max_categories: int, optional
-
Specifies an upper limit to the number of output features for each input feature. It includes the feature that combines infrequent categories.
Defaults to 0.
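The interplay of onehot_min_frequency and onehot_max_categories can be illustrated with a small grouping routine. The semantics below are modeled on the similarly named min_frequency and max_categories options of scikit-learn's OneHotEncoder and are assumptions about PAL's behavior; in particular, treating 0 as "no limit" is a guess based on the default value. The name output_categories is hypothetical.

```python
from collections import Counter


def output_categories(values, min_frequency=1, max_categories=0,
                      other="<infrequent>"):
    """List the one-hot output categories for one input feature.

    ASSUMED semantics: categories seen fewer than min_frequency times are
    merged into `other`; max_categories (0 = no limit) caps the number of
    outputs, counting the merged category.
    """
    counts = Counter(values)
    # Most frequent first; ties are broken by name for determinism.
    ranked = sorted(counts, key=lambda c: (-counts[c], str(c)))
    frequent = [c for c in ranked if counts[c] >= min_frequency]
    if max_categories > 0 and len(ranked) > max_categories:
        # Reserve one output slot for the merged infrequent category.
        frequent = frequent[:max_categories - 1]
    if len(frequent) < len(ranked):
        return frequent + [other]
    return frequent


colors = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
print(output_categories(colors, min_frequency=2))
print(output_categories(colors, max_categories=2))
```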
Examples
>>> from hana_ml.algorithms.pal.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> lr.fit(data=df_train, key='ID', label='Y')
>>> lr.predict(data=df_predict, key='ID').collect()
Biased Linear Model with Elastic-net Regularization
Relevant parameters: enet_alpha, enet_lambda
>>> lr = LinearRegression(solver='ADMM', enet_lambda=0.003194, enet_alpha=0.95)
>>> lr.fit(data=df_train)
Biased Linear Model with Variable Selection
Relevant parameters: var_select, features_must_select, alpha_to_enter, alpha_to_remove
>>> lr = LinearRegression(var_select='forward', alpha_to_enter=0.1)
- Attributes:
-
- coefficients_: DataFrame
-
Fitted regression coefficients.
- fitted_: DataFrame
-
Predicted dependent variable values for training data. Set to None if the training data has no row IDs.
- statistics_: DataFrame
-
Regression-related statistics, such as mean squared error.
- optim_param_: DataFrame
-
The optimal parameters selected, if parameter selection is enabled.
- pmml_: DataFrame
-
PMML model. Deprecated, as JSON format is also supported in the model. Please use semistructured_result_, described below, to get the model.
- semistructured_result_: DataFrame
-
Linear regression model in PMML or JSON format.
Methods
create_model_state([model, function, ...]): Create PAL model state.
delete_model_state([state]): Delete PAL model state.
fit(data[, key, features, label, ...]): Fit the model to the training dataset.
predict(data[, key, features]): Predict dependent variable values based on fitted model.
score(data[, key, features, label]): Returns the coefficient of determination R2 of the prediction.
set_model_state(state): Set the model state by state information.
- fit(data, key=None, features=None, label=None, categorical_variable=None)
-
Fit the model to the training dataset.
- Parameters:
-
- data: DataFrame
-
Training data.
- key: str, optional
-
Name of the ID column.
If key is not provided, then:
-
if data is indexed by a single column, then key defaults to that index column;
-
otherwise, it is assumed that data contains no ID column.
-
- features: a list of str, optional
-
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label: str, optional
-
Name of the dependent variable.
If label is not provided, it defaults to the last column.
- categorical_variable: str or a list of str, optional
-
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- Returns:
-
- A fitted object of class "LinearRegression".
- predict(data, key=None, features=None)
-
Predict dependent variable values based on fitted model.
- Parameters:
-
- data: DataFrame
-
Independent variable values to predict for.
- key: str, optional
-
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: a list of str, optional
-
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- Returns:
-
- DataFrame
-
Predicted values, structured as follows:
-
ID column: with same name and type as data's ID column.
-
VALUE: type DOUBLE, representing predicted values.
-
- score(data, key=None, features=None, label=None)
-
Returns the coefficient of determination R2 of the prediction.
- Parameters:
-
- data: DataFrame
-
Data on which to assess model performance.
- key: str, optional
-
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features: a list of str, optional
-
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label: str, optional
-
Name of the dependent variable.
If label is not provided, it defaults to the last column.
- Returns:
-
- float
-
Returns the coefficient of determination R2 of the prediction.
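For readers who want to verify a score locally, the coefficient of determination can be computed from observed and predicted values as follows. This is the standard definition, 1 - SS_res / SS_tot, written in plain Python; r2_score is a hypothetical local helper, not the method documented above.

```python
def r2_score(y_true, y_pred):
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0.0:
        raise ValueError("R2 is undefined when y_true is constant")
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


print(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0, perfect fit
print(r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0, no better than the mean
```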
- create_model_state(model=None, function=None, pal_funcname='PAL_LINEAR_REGRESSION', state_description=None, force=False)
-
Create PAL model state.
- Parameters:
-
- model: DataFrame, optional
-
Specifies the model for the AFL state.
Defaults to self.model_.
- function: str, optional
-
Specifies the function in the unified API.
A placeholder parameter, not effective for Linear Regression.
- pal_funcname: int or str, optional
-
PAL function name. Must be a valid PAL procedure that supports model state.
Defaults to 'PAL_LINEAR_REGRESSION'.
- state_description: str, optional
-
Description of the state as model container.
Defaults to None.
- force: bool, optional
-
If True, the existing state is deleted.
Defaults to False.
- set_model_state(state)
-
Set the model state by state information.
- Parameters:
-
- state: DataFrame or dict
-
If state is a DataFrame, it has the following structure:
-
NAME: VARCHAR(100), which must include STATE_ID, HINT, HOST and PORT.
-
VALUE: VARCHAR(1000), the values corresponding to NAME.
If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
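When state is passed as a dict, the required keys can be checked up front. The helper below is a hypothetical convenience, not part of hana_ml, and the values in the usage example are placeholders rather than a real state.

```python
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [key for key in REQUIRED_STATE_KEYS if key not in state]


# Placeholder values for illustration only.
state = {"STATE_ID": "example-id", "HINT": "", "HOST": "example-host"}
print(missing_state_keys(state))  # ['PORT']
```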
-
- delete_model_state(state=None)
-
Delete PAL model state.
- Parameters:
-
- state: DataFrame, optional
-
Specifies the state.
Defaults to self.state.
Inherited Methods from PALBase
Besides the methods mentioned above, the LinearRegression class also inherits methods from the PALBase class. Please refer to PAL Base for more details.