Linear Regression

hanaml.LinearRegression is an R wrapper for the SAP HANA PAL linear regression algorithm.

hanaml.LinearRegression(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  formula = NULL,
  solver = NULL,
  var.select = NULL,
  features.must.select = NULL,
  intercept = NULL,
  alpha.to.enter = NULL,
  alpha.to.remove = NULL,
  enet.lambda = NULL,
  enet.alpha = NULL,
  max.iter = NULL,
  tol = NULL,
  pho = NULL,
  stat.inf = NULL,
  adjusted.r2 = NULL,
  dw.test = NULL,
  reset.test = NULL,
  bp.test = NULL,
  ks.test = NULL,
  thread.ratio = NULL,
  categorical.variable = NULL,
  pmml.export = NULL,
  resampling.method = NULL,
  evaluation.metric = NULL,
  fold.num = NULL,
  repeat.times = NULL,
  param.search.strategy = NULL,
  random.search.times = NULL,
  random.state = NULL,
  timeout = NULL,
  progress.indicator.id = NULL,
  parameter.range = NULL,
  parameter.values = NULL
)

Arguments

data	`DataFrame` DataFrame containing the data.
key	`character, optional` Name of the ID column. If not provided, the data is assumed to have no ID column. No default value.
features	`character or list of characters, optional` Names of the feature columns. If not provided, it defaults to all non-key, non-label columns of data.
label	`character, optional` Name of the column which specifies the dependent variable. Defaults to the last column of data if not provided.
formula	`formula type, optional` Formula to be used for model generation, in the form label~<feature_list>, e.g. formula=CATEGORY~V1+V2+V3. Provide either the formula or a features and label combination, but not both. Defaults to NULL.
solver	`{"QR", "SVD", "CD", "Cholesky", "ADMM"}, optional` Algorithm used to solve the least squares problem. "QR": QR decomposition. "SVD": singular value decomposition. "CD": cyclical coordinate descent method. "Cholesky": Cholesky decomposition. "ADMM": alternating direction method of multipliers. Defaults to "QR".
var.select	`c("all", "forward", "backward", "stepwise"), optional` Method to perform variable selection. "all": all variables are included. "forward": forward selection. "backward": backward selection. "stepwise": stepwise selection. "forward", "backward" and "stepwise" are supported only when solver is "QR", "SVD" or "Cholesky". Defaults to "all".
features.must.select	`character or list/vector of characters, optional` Specifies the column names of data that need to be included in the final training model when executing variable selection. Only valid when var.select is not "all". Note: This parameter is a hint. In exceptional cases a specified mandatory feature may be excluded from the final model, for instance when some mandatory features can be represented as a linear combination of other features, some of which are also mandatory. No default value.
intercept	`logical, optional` If TRUE, include the intercept in the model. Defaults to TRUE.
alpha.to.enter	`double, optional` P-value for forward selection. When var.select is "forward", the default value is 0.05; when var.select is "stepwise", the default value is 0.15.
alpha.to.remove	`double, optional` P-value for backward selection. When var.select is "backward", the default value is 0.1; when var.select is "stepwise", the default value is 0.15.
enet.lambda	`double, optional` Penalized weight. Value should be greater than or equal to 0. Valid only when solver is "CD" or "ADMM".
enet.alpha	`double, optional` Elastic net mixing parameter. Ranges from 0 (Ridge penalty) to 1 (LASSO penalty) inclusively. Valid only when solver is "CD" or "ADMM". Defaults to 1.0.
max.iter	`integer, optional` Maximum number of passes over training data. If convergence is not reached after the specified number of iterations, an error will be generated. Valid only when solver is "CD" or "ADMM". Defaults to 1e5.
tol	`double, optional` Convergence threshold for coordinate descent. Valid only when solver is "CD". Defaults to 1.0e-7.
pho	`double, optional` Step size for ADMM. Generally, it should be greater than 1. Valid only when solver is "ADMM". Defaults to 1.8.
stat.inf	`logical, optional` If TRUE, output t-value and Pr(>\|t\|) of coefficients. Defaults to FALSE.
adjusted.r2	`logical, optional` If TRUE, include the adjusted R^2 value in statistics. Defaults to FALSE.
dw.test	`logical, optional` If TRUE, conduct Durbin-Watson test under null hypothesis that errors do not follow a first order autoregressive process. Not available if elastic net regularization is enabled or intercept is ignored. Defaults to FALSE.
reset.test	`integer, optional` Specifies the order of Ramsey RESET test. Ramsey RESET test with power of variables ranging from 2 to this value (greater than 1) will be conducted. Value 1 means RESET test will not be conducted. Not available if elastic net regularization is enabled or intercept is ignored. Defaults to 1.
bp.test	`logical, optional` If TRUE, conduct Breusch-Pagan test under null hypothesis that homoscedasticity is satisfied. Not available if elastic net regularization is enabled or intercept is ignored. Defaults to FALSE.
ks.test	`logical, optional` If TRUE, conduct Kolmogorov-Smirnov normality test under null hypothesis that errors follow a normal distribution. Not available if elastic net regularization is enabled or intercept is ignored. Defaults to FALSE.
thread.ratio	`double, optional` Controls the proportion of available threads that can be used by this function. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored. Defaults to 0.
categorical.variable	`character or list/vector of characters, optional` Indicates which features should be treated as categorical variables. By default, "VARCHAR" and "NVARCHAR" columns are treated as categorical, and "INTEGER" and "DOUBLE" columns as continuous. Valid only for variables of "INTEGER" type; ignored otherwise. No default value.
pmml.export	`{"no", "single-row", "multi-row"}, optional` Controls whether to output a PMML representation of the model, and how to format the PMML. `"no"`: No PMML model. `"single-row"`: Exports a PMML model in a maximum of one row. Fails if the model doesn't fit in one row. `"multi-row"`: Exports a PMML model, splitting it across multiple rows if it doesn't fit in one. Defaults to "no".
resampling.method	`c("cv", "bootstrap"), optional` Specifies the resampling method for model evaluation or parameter selection, chosen from the values listed above. If no value is specified, neither model evaluation nor parameter selection is activated.
evaluation.metric	`character, optional` Specifies the evaluation metric for model evaluation or parameter selection. Currently the only valid option (also the default value) is "rmse".
fold.num	`integer, optional` Specifies the number of folds for cross-validation (cv). Mandatory and valid only when `resampling.method` is "cv".
repeat.times	`numeric, optional` Specifies the number of repeat times for resampling. Defaults to 1.
param.search.strategy	`c("grid", "random"), optional` Specifies the method used for parameter selection. If not specified, model parameter selection is not triggered.
random.search.times	`integer, optional` Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid only when `param.search.strategy` is "random".
random.state	`numeric, optional` Specifies the seed for random number generation. The system time is used when 0 is specified.
timeout	`integer, optional` Specifies maximum running time for model evaluation or parameter selection in seconds. No timeout when 0 is specified.
progress.indicator.id	`character, optional` Sets an ID of progress indicator for model evaluation or parameter selection. No progress indicator is active if no value is provided.
parameter.range	`list, optional` Specifies the range of the following parameters for parameter selection: `enet.lambda, enet.alpha`. Each range is specified by 3 numbers in the form c(start, step, end). Example: parameter.range <- list(enet.lambda = c(0.01, 0.01, 0.1)), which takes `enet.lambda` values from 0.01 to 0.1 with a step size of 0.01, i.e. 0.01, 0.02, 0.03, ..., 0.09, 0.1. If `param.search.strategy` is "random", the middle term (step) has no effect and can be omitted.
parameter.values	`list, optional` Specifies values of the following parameters for parameter selection: `enet.lambda, enet.alpha`. Example: parameter.values <- list(enet.lambda = c(0.001, 0.003, 0.007, 0.01))
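
The c(start, step, end) convention of `parameter.range` can be illustrated in plain R. The sketch below is not part of hana.ml.r: `expand.range` is a hypothetical helper, and the assumption that a "grid" search evaluates every combination of the expanded values is ours, not a statement from the documentation above.

```r
# Hypothetical helper (not part of hana.ml.r): expand c(start, step, end)
# into the candidate values it describes.
expand.range <- function(r) seq(from = r[1], to = r[3], by = r[2])

parameter.range <- list(enet.lambda = c(0.01, 0.01, 0.1),
                        enet.alpha  = c(0, 0.5, 1))

# Candidate values per parameter.
grid <- lapply(parameter.range, expand.range)
print(grid$enet.lambda)
print(grid$enet.alpha)

# Assumption: a full grid search pairs every lambda with every alpha.
n.candidates <- prod(lengths(grid))
print(n.candidates)

# With a HANA connection and a DataFrame `data`, the same list could be
# passed on using only the arguments documented above, for example:
# lr <- hanaml.LinearRegression(data = data, key = "ID", label = "Y",
#                               solver = "CD",
#                               resampling.method = "cv", fold.num = 5,
#                               param.search.strategy = "grid",
#                               parameter.range = parameter.range)
```

`parameter.values` skips the expansion step: the listed values are the candidates themselves.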

Value

Returns a "LinearRegression" object with the following values:

coefficients : DataFrame
Fitted regression coefficients.
pmml : DataFrame
PMML model. Set to NULL if no PMML model was requested.
fitted : DataFrame
Predicted dependent variable values for training data. Set to NULL if the training data has no row IDs.
statistics : DataFrame
Regression-related statistics, such as mean squared error.
optim.param : DataFrame
Optimal parameters selected. Available only when parameter selection is activated.

Details

Linear regression models the linear relationship between one variable, usually called the dependent variable, and one or more other variables, usually called independent variables and collected in a predictor vector.
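
As a sketch in symbols, with notation that is assumed here rather than taken from the PAL documentation ($y$ for the dependent variable, $x_1, \dots, x_n$ for the predictors, $\beta_j$ for the coefficients, $\varepsilon$ for the error term), the model and a commonly used form of the elastic net penalty are:

```latex
% Linear model; \beta_0 is the intercept (dropped when intercept = FALSE)
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon

% Common elastic net penalty with \lambda = enet.lambda, \alpha = enet.alpha
% (assumption: PAL's exact scaling of the terms may differ)
\lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right)
```

With $\alpha = 1$ only the $\ell_1$ (LASSO) term remains, and with $\alpha = 0$ only the $\ell_2$ (Ridge) term, which matches the description of enet.alpha above.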

Examples

Input DataFrame data:

> data$Collect()
   ID      Y    X1 X2  X3
1  0  -6.879  0.00  A   1
2  1  -3.449  0.50  A   1
3  2   6.635  0.54  B   1
4  3  11.844  1.04  B   1
5  4   2.786  1.50  A   1
6  5   2.389  0.04  B   2
7  6  -0.011  2.00  A   2
8  7   8.839  2.04  B   2
9  8   4.689  1.54  B   1
10 9  -5.507  1.00  A   2

Model training; a "LinearRegression" object lr is returned:

> lr <- hanaml.LinearRegression(data = data, key = "ID",
                                label = "Y", thread.ratio = 0.5,
                                categorical.variable = list("X3"))

Output:

> lr$coefficients$Select(c("VARIABLE_NAME",
                           "COEFFICIENT_VALUE"))
        VARIABLE_NAME COEFFICIENT_VALUE
1   __PAL_INTERCEPT__           -5.7045
2                  X1            3.0925
3  X2__PAL_DELIMIT__A            0.0000
4  X2__PAL_DELIMIT__B            9.3675
5  X3__PAL_DELIMIT__1            0.0000
6  X3__PAL_DELIMIT__2           -2.6895
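
For a local cross-check without a HANA connection, the same data can be fitted with base R's lm(). This is a sketch and not part of hana.ml.r. It assumes that PAL, like R's default treatment contrasts, uses the first level of each categorical feature as the reference level, which the zero coefficients for X2__PAL_DELIMIT__A and X3__PAL_DELIMIT__1 above suggest.

```r
# Local copy of the example data shown above.
df <- data.frame(
  ID = 0:9,
  Y  = c(-6.879, -3.449, 6.635, 11.844, 2.786,
         2.389, -0.011, 8.839, 4.689, -5.507),
  X1 = c(0.00, 0.50, 0.54, 1.04, 1.50, 0.04, 2.00, 2.04, 1.54, 1.00),
  X2 = factor(c("A", "A", "B", "B", "A", "B", "A", "B", "B", "A")),
  # X3 is an integer column treated as categorical, as in the call above.
  X3 = factor(c(1, 1, 1, 1, 1, 2, 2, 2, 1, 2))
)

# Ordinary least squares with an intercept; ID is the key, not a feature.
fit <- lm(Y ~ X1 + X2 + X3, data = df)
print(coef(fit))
```

Here X2B and X32 are R's names for the non-reference levels (X2 = "B" and X3 = 2), so they are the terms to compare with X2__PAL_DELIMIT__B and X3__PAL_DELIMIT__2.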
