Linear Regression

hanaml.LinearRegression is an R wrapper for the SAP HANA PAL linear regression algorithm.

hanaml.LinearRegression(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  formula = NULL,
  solver = NULL,
  var.select = NULL,
  features.must.select = NULL,
  intercept = NULL,
  alpha.to.enter = NULL,
  alpha.to.remove = NULL,
  enet.lambda = NULL,
  enet.alpha = NULL,
  max.iter = NULL,
  tol = NULL,
  pho = NULL,
  stat.inf = NULL,
  adjusted.r2 = NULL,
  dw.test = NULL,
  reset.test = NULL,
  bp.test = NULL,
  ks.test = NULL,
  thread.ratio = NULL,
  categorical.variable = NULL,
  pmml.export = NULL,
  resampling.method = NULL,
  evaluation.metric = NULL,
  fold.num = NULL,
  repeat.times = NULL,
  param.search.strategy = NULL,
  random.search.times = NULL,
  random.state = NULL,
  timeout = NULL,
  progress.indicator.id = NULL,
  parameter.range = NULL,
  parameter.values = NULL,
  handle.missing = NULL,
  json.export = NULL,
  precompute.lms.sketch = NULL,
  stable.sketch.alg = NULL,
  sparse.sketch.alg = NULL,
  resource = NULL,
  max.resource = NULL,
  min.resource.rate = NULL,
  reduction.rate = NULL,
  aggressive.elimination = NULL,
  ps.verbose = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.

formula

formula type, optional
Formula to be used for model generation, in the format label~<feature_list>, e.g. formula=CATEGORY~V1+V2+V3.
Provide either the formula or a features and label combination, but not both.
Defaults to NULL.
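As an illustration of how a formula corresponds to a label and a feature list, the sketch below decomposes the example formula with base R only. The equivalence of the two commented calls is an assumption based on the description above, and df is a hypothetical hanaml DataFrame, so those calls are not run here.

```r
# Base R only: split a formula into its label and feature names.
fml <- CATEGORY ~ V1 + V2 + V3

label <- all.vars(fml[[2]])     # variable on the left-hand side
features <- all.vars(fml[[3]])  # variables on the right-hand side

print(label)
print(features)

# Assumed to be equivalent ways of specifying the model (not run; `df` is a
# hypothetical DataFrame):
# hanaml.LinearRegression(data = df, formula = fml)
# hanaml.LinearRegression(data = df, label = label, features = features)
```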

solver

c("QR", "SVD", "CD", "Cholesky", "ADMM"), optional
Algorithm used to solve the least squares problem.
"CD" and "ADMM" are supported only when var.select is "all".

"QR": QR decomposition (numerically stable, but fails when the design matrix A is rank-deficient).
"SVD": singular value decomposition (numerically stable and can handle rank deficiency, but computationally expensive).
"CD": cyclical coordinate descent method to solve elastic net regularized multiple linear regression.
"Cholesky": Cholesky decomposition (fast but numerically unstable).
"ADMM": alternating direction method of multipliers (ADMM) to solve elastic net regularized multiple linear regression. This method is faster than the cyclical coordinate descent method in many cases and is recommended.

Defaults to "QR".

var.select

c("all", "forward", "backward", "stepwise"), optional
Method to perform variable selection.

"all": all variables are included.
"forward": forward selection.
"backward": backward selection.
"stepwise": stepwise selection.

"forward", "backward" and "stepwise" are supported only when solver is "QR", "SVD" or "Cholesky".
Defaults to "all".
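The compatibility rule between solver and var.select can be checked before a call is made. The helper below is a hypothetical sketch in base R and is not part of hanaml; it only encodes the constraint stated above ("CD" and "ADMM" require var.select to be "all").

```r
# Hypothetical pre-flight check, not a hanaml function.
# Returns TRUE when the solver / var.select combination is documented as supported.
check_solver_var_select <- function(solver = "QR", var.select = "all") {
  solver <- match.arg(solver, c("QR", "SVD", "CD", "Cholesky", "ADMM"))
  var.select <- match.arg(var.select,
                          c("all", "forward", "backward", "stepwise"))
  # "CD" and "ADMM" are only supported without variable selection.
  !(solver %in% c("CD", "ADMM") && var.select != "all")
}

check_solver_var_select("QR", "stepwise")   # supported combination
check_solver_var_select("ADMM", "forward")  # unsupported combination
```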

features.must.select

character or list/vector of characters, optional
Specifies the names of the columns of data that must be included in the final trained model when variable selection is performed.
Only valid when var.select is not "all".
Note: This parameter is a hint. In exceptional cases a specified mandatory feature may be excluded from the final model, for instance when some mandatory features can be expressed as a linear combination of other features, some of which are also mandatory.
No default value.

intercept

logical, optional
If TRUE, include the intercept in the model.
Defaults to TRUE.

alpha.to.enter

double, optional
P-value for forward selection. When var.select is "forward", the default value is 0.05; when var.select is "stepwise", the default value is 0.15.

alpha.to.remove

double, optional
P-value for backward selection. When var.select is "backward", the default value is 0.1; when var.select is "stepwise", the default value is 0.15.

enet.lambda

double, optional
Penalized weight. Value should be greater than or equal to 0.
Valid only when solver is "CD" or "ADMM".

enet.alpha

double, optional
Elastic net mixing parameter. Ranges from 0 (Ridge penalty) to 1 (LASSO penalty) inclusively. Valid only when solver is "CD" or "ADMM".
Defaults to 1.0.
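To make the roles of enet.lambda and enet.alpha concrete, the sketch below evaluates an elastic net penalty in base R. It assumes the widely used parameterization lambda * (alpha * L1 + (1 - alpha) / 2 * L2^2); the exact scaling used inside PAL is not stated here and may differ. The function name enet_penalty is hypothetical.

```r
# Hypothetical illustration of an elastic net penalty term (assumed form,
# PAL's internal scaling may differ).
enet_penalty <- function(beta, lambda, alpha = 1.0) {
  stopifnot(lambda >= 0, alpha >= 0, alpha <= 1)
  l1 <- sum(abs(beta))  # LASSO part
  l2 <- sum(beta^2)     # Ridge part
  lambda * (alpha * l1 + (1 - alpha) / 2 * l2)
}

beta <- c(3, -4)
lasso_only <- enet_penalty(beta, lambda = 0.1, alpha = 1)  # pure LASSO penalty
ridge_only <- enet_penalty(beta, lambda = 0.1, alpha = 0)  # pure Ridge penalty
```

With alpha = 1 only the absolute values of the coefficients are penalized, and with alpha = 0 only their squares, which matches the LASSO and Ridge endpoints described above.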

max.iter

integer, optional
Maximum number of passes over training data. If convergence is not reached after the specified number of iterations, an error will be generated. Valid only when solver is "CD" or "ADMM".
Defaults to 1e5.

tol

double, optional
Convergence threshold for coordinate descent. Valid only when solver is "CD".
Defaults to 1.0e-7.

pho

double, optional
Step size for ADMM. Generally, it should be greater than 1. Valid only when solver is "ADMM".
Defaults to 1.8.

stat.inf

logical, optional
If TRUE, output t-value and Pr(>|t|) of coefficients.
Defaults to FALSE.

adjusted.r2

logical, optional
If TRUE, include the adjusted R^2 value in statistics.
Defaults to FALSE.

dw.test

logical, optional
If TRUE, conduct Durbin-Watson test under null hypothesis that errors do not follow a first order autoregressive process. Not available if elastic net regularization is enabled or intercept is ignored.
Defaults to FALSE.

reset.test

integer, optional
Specifies the order of Ramsey RESET test. Ramsey RESET test with power of variables ranging from 2 to this value (greater than 1) will be conducted. Value 1 means RESET test will not be conducted. Not available if elastic net regularization is enabled or intercept is ignored.
Defaults to 1.

bp.test

logical, optional
If TRUE, conduct Breusch-Pagan test under null hypothesis that homoscedasticity is satisfied. Not available if elastic net regularization is enabled or intercept is ignored.
Defaults to FALSE.

ks.test

logical, optional
If TRUE, conduct Kolmogorov-Smirnov normality test under null hypothesis that errors follow a normal distribution. Not available if elastic net regularization is enabled or intercept is ignored.
Defaults to FALSE.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the column type:

"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.

pmml.export

c("no", "multi-row"), optional
Controls whether to output a PMML representation of the model, and how to format the PMML.

"no": Does not export the multiple linear regression model in PMML.
"multi-row": Exports the multiple linear regression model in PMML. The maximum length of each row is 5000 characters.

Currently the model can be exported in either PMML or JSON format; if both are requested, JSON format is preferred.
Defaults to "no".

resampling.method

character, optional
Specifies the resampling method for model evaluation or parameter selection from the list below:
"cv", "cv_sha", "cv_hyperband", "bootstrap", "bootstrap_sha", "bootstrap_hyperband".
If no value is specified, neither model evaluation nor parameter selection is activated.

evaluation.metric

character, optional
Specifies the evaluation metric for model evaluation or parameter selection.
Currently the only valid option (also the default value) is "rmse".

fold.num

integer, optional
Specifies the fold number for cross-validation (cv).
Mandatory and valid only when resampling.method is one of the following: "cv", "cv_sha", "cv_hyperband".

repeat.times

numeric, optional
Specifies the number of repeat times for resampling.
Defaults to 1.

param.search.strategy

c("grid", "random"), optional
Specifies the method to activate parameter selection.
If not specified, model parameter selection shall not be triggered.
Defaults to "random" and cannot be changed if resampling.method is set as "cv_hyperband" or "bootstrap_hyperband"; otherwise no default value.

random.search.times

integer, optional
Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid only when param.search.strategy is set as "random", or when resampling.method is set as "cv_hyperband" or "bootstrap_hyperband".
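The "mandatory when" rules for fold.num and random.search.times can be checked up front. The helper below is a hypothetical base R sketch, not a hanaml function; it only encodes the two conditions described above and returns a character vector of problems found.

```r
# Hypothetical argument check, not part of hanaml.
check_search_args <- function(resampling.method = NULL,
                              fold.num = NULL,
                              param.search.strategy = NULL,
                              random.search.times = NULL) {
  problems <- character(0)
  cv_methods <- c("cv", "cv_sha", "cv_hyperband")
  hyperband_methods <- c("cv_hyperband", "bootstrap_hyperband")

  # fold.num is mandatory for the cross-validation based methods.
  if (!is.null(resampling.method) &&
      resampling.method %in% cv_methods && is.null(fold.num)) {
    problems <- c(problems, "fold.num is mandatory for cv-based resampling")
  }

  # Random search is used when requested explicitly or implied by hyperband.
  is_random <- identical(param.search.strategy, "random") ||
    (!is.null(resampling.method) && resampling.method %in% hyperband_methods)
  if (is_random && is.null(random.search.times)) {
    problems <- c(problems, "random.search.times is mandatory for random search")
  }
  problems
}

check_search_args(resampling.method = "cv")  # reports the missing fold.num
```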

random.state

numeric, optional
Specifies the seed for random generation.
Use system time when 0 is specified.

timeout

integer, optional
Specifies maximum running time for model evaluation or parameter selection in seconds.
No timeout when 0 is specified.

progress.indicator.id

character, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.

parameter.range

list, optional
Specifies range of the following parameters for parameter selection:
enet.lambda, enet.alpha.
Parameter range should be specified by 3 numbers in the form of c(start, step, end).
Examples:
parameter.range <- list(enet.lambda = c(0.01, 0.01, 0.1)), which means taking enet.lambda values from 0.01 to 0.1 with 0.01 being the step size, i.e. 0.01, 0.02, 0.03, ..., 0.09, 0.1.
If param.search.strategy is "random", then the middle term, i.e. step has no effect and thus can be omitted.
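The c(start, step, end) convention can be expanded with base R to see which candidate values a grid search would consider. This is only an illustration of the range notation; how PAL enumerates the values internally is not described here.

```r
# Expand a c(start, step, end) range into explicit candidate values.
rng <- c(0.01, 0.01, 0.1)
candidates <- seq(from = rng[1], to = rng[3], by = rng[2])
print(candidates)

# The same range as it would be passed to the function:
parameter.range <- list(enet.lambda = rng)
```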

parameter.values

list, optional
Specifies values of the following parameters for parameter selection:
enet.lambda, enet.alpha.
Example: parameter.values <- list(enet.lambda = c(0.001, 0.003, 0.007, 0.01))
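When values are listed for both parameters, a grid search would presumably consider their combinations. The sketch below enumerates that Cartesian product with base R's expand.grid; the enet.alpha values are made up for illustration, and the enumeration order used by PAL is an assumption.

```r
# Explicit candidate values for both tunable parameters.
# The enet.alpha values are hypothetical, chosen only for this example.
parameter.values <- list(
  enet.lambda = c(0.001, 0.003, 0.007, 0.01),
  enet.alpha  = c(0, 0.5, 1)
)

# All combinations a full grid would contain (4 x 3 rows).
grid <- expand.grid(parameter.values)
print(nrow(grid))
```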

handle.missing

logical, optional
Specifies whether or not to handle missing/null values in data.
Defaults to TRUE.

json.export

logical, optional

FALSE: Does not export the multiple linear regression model in JSON.
TRUE: Exports the multiple linear regression model in JSON.

Currently the model can be exported in either PMML or JSON format; if both are requested, JSON format is preferred.
Defaults to FALSE.

precompute.lms.sketch

logical, optional

FALSE: Does not perform LMS sketch.
TRUE: Performs LMS sketch.

Valid only when resampling.method is set, and data has more rows than columns.
Defaults to TRUE.

stable.sketch.alg

logical, optional
When computing the LMS sketch, there are two algorithms to choose from: one is more numerically stable than the other but slower. This parameter specifies whether or not to use the stable algorithm.

FALSE: Does not use the stable algorithm.
TRUE: Uses the stable algorithm.

Valid only when precompute.lms.sketch is valid and set as TRUE.
Defaults to TRUE.

sparse.sketch.alg

logical, optional
Specifies whether or not to invoke the sparse LMS sketch algorithm, which is intended for sparse data.

FALSE: Does not use the sparse LMS sketch algorithm.
TRUE: Uses the sparse LMS sketch algorithm.

Valid only when precompute.lms.sketch is valid and set as TRUE.
Warning: Use this option with caution. If the provided data are not sparse, turning on this option will cause LMS sketch to be extremely slow!

max.resource

integer, optional
Specifies the maximum allowed resource budget for a single hyper-parameter candidate, whose value must be greater than 0.
Mandatory and valid only when resource is set.

min.resource.rate

numeric, optional
Specifies the rate between minimum allowed resource and maximum allowed resource.
Valid range is [0, 1).
Valid only when resource is specified.
Defaults to 0.

reduction.rate

numeric, optional
Specifies the reduction rate of available size of hyper-parameter candidates.
For each round, the available parameter candidate size will be divided by the value of this parameter. Thus a valid value for this parameter must be greater than 1.0.
Valid only when resource is set.
Defaults to 3.0.
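The effect of reduction.rate can be illustrated by tracking how many candidates survive each round. The function below is a hypothetical base R sketch; rounding down and stopping at a single candidate are assumptions, and PAL's actual schedule may differ.

```r
# Hypothetical sketch of the candidate count per successive-halving round.
# Assumes the count is divided by reduction.rate and rounded down each round.
sha_rounds <- function(n_candidates, reduction.rate = 3.0) {
  stopifnot(n_candidates >= 1, reduction.rate > 1)
  sizes <- n_candidates
  while (tail(sizes, 1) > 1) {
    next_size <- max(1, floor(tail(sizes, 1) / reduction.rate))
    sizes <- c(sizes, next_size)
  }
  sizes
}

sizes <- sha_rounds(27)  # starting from 27 candidates with the default rate
print(sizes)
```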

aggressive.elimination

logical, optional
Specifies whether or not to perform aggressive elimination in the successive-halving algorithm.
When set to TRUE, more parameter candidates are eliminated than expected (as defined via reduction.rate).
This can improve run-time performance but may result in a sub-optimal hyper-parameter candidate.
Valid only when resampling.method is "cv_sha" or "bootstrap_sha".
Defaults to FALSE.

ps.verbose

logical, optional
Specifies whether to output optimal hyper-parameter and all evaluation statistics of related hyper-parameter candidates in attribute statistics or not.
Defaults to TRUE.

resource

character, optional
Specifies the resource type used in successive-halving and hyperband algorithm for parameter selection.
Currently the only valid option is "max_iter".
Mandatory and valid only when resampling.method is set as one of the following: "cv_sha", "bootstrap_sha", "cv_hyperband" or "bootstrap_hyperband".

Value

Returns an R6 object of class "LinearRegression" with the following attributes and methods:

Attributes

coefficients : DataFrame
Fitted regression coefficients.
fitted : DataFrame
Predicted dependent variable values for training data. Set to NULL if the training data has no row IDs.
statistics : DataFrame
Regression-related statistics, such as mean squared error.
optim.param : DataFrame
Optimal parameters selected. Available only when parameter selection is activated.
pmml : DataFrame
PMML model. Deprecated, as JSON format is also supported for the model. Please use semistructured.result shown below to get the model.
semistructured.result : DataFrame
The linear regression model in PMML or JSON format.

Methods

CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)

Usage:


   > mlr <- hanaml.LinearRegression(data=df)
   > mlr$CreateModelState()

Arguments:

model: DataFrame
DataFrame containing the model for parsing.
Defaults to self$model.
algorithm: character
Specifies the PAL algorithm associated with model.
Defaults to self$pal.algorithm.
func: character
Specifies the functionality for Unified Classification/Regression.
Valid only for object instance of R6Class "UnifiedClassification" or "UnifiedRegression".
Defaults to self$func.
state.description: character
A summary string for the generated model state.
Defaults to "ModelState".
force: logical
Specifies whether or not to replace the existing state for model.
Defaults to FALSE.

After calling this method, an attribute state that contains the parsed info for model shall be assigned to the corresponding R6 object.

DeleteModelState(state=NULL)

Usage:
Assuming we have trained a hanaml model and created its model state, like the following:


   > mlr <- hanaml.LinearRegression(data=df)
   > mlr$CreateModelState()

After using the model state for real-time scoring, we can delete the state by calling:


   > mlr$DeleteModelState()

Arguments:

state: DataFrame
DataFrame containing the state info.
Defaults to self$state.

After calling this method, the specified model state shall be cleaned up and associated memory be released.

Details

Linear regression models the linear relationship between a variable, usually referred to as the dependent variable, and one or more variables, usually referred to as independent variables and collectively denoted as the predictor vector.
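As a minimal illustration of the least squares fit behind this model, the sketch below solves a tiny synthetic problem with base R's qr.solve, which computes the solution via QR decomposition. It does not call SAP HANA PAL; the data are made up so that the true intercept and slope are known.

```r
# Synthetic data with an exact linear relationship: y = 1 + 2 * x.
x <- c(0, 1, 2, 3)
y <- 1 + 2 * x

# Design matrix with a column of ones for the intercept.
X <- cbind(1, x)

# Least squares solution via QR decomposition (base R).
beta <- qr.solve(X, y)
print(unname(beta))  # intercept first, then the slope
```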

Examples

Input DataFrame data:


> data$Collect()
   ID      Y    X1 X2  X3
1  0  -6.879  0.00  A   1
2  1  -3.449  0.50  A   1
3  2   6.635  0.54  B   1
4  3  11.844  1.04  B   1
5  4   2.786  1.50  A   1
6  5   2.389  0.04  B   2
7  6  -0.011  2.00  A   2
8  7   8.839  2.04  B   2
9  8   4.689  1.54  B   1
10 9  -5.507  1.00  A   2

Model training; a "LinearRegression" object lr is returned:


> lr <- hanaml.LinearRegression(data = data, key = "ID",
                                label = "Y", thread.ratio = 0.5,
                                categorical.variable = list("X3"))

Output:


> lr$coefficients$Select(c("VARIABLE_NAME",
                           "COEFFICIENT_VALUE"))
        VARIABLE_NAME COEFFICIENT_VALUE
1   __PAL_INTERCEPT__           -5.7045
2                  X1            3.0925
3  X2__PAL_DELIMIT__A            0.0000
4  X2__PAL_DELIMIT__B            9.3675
5  X3__PAL_DELIMIT__1            0.0000
6  X3__PAL_DELIMIT__2           -2.6895
