R: Linear Regression

hanaml.LinearRegression {hana.ml.r}

R Documentation

Linear Regression

Description

hanaml.LinearRegression is an R wrapper for the PAL linear regression algorithm.

Usage

hanaml.LinearRegression(conn.context, data = NULL, key = NULL, features = NULL,
                       label = NULL, formula = NULL, solver = NULL, var.select = NULL,
                       intercept = NULL, alpha.to.enter = NULL, alpha.to.remove = NULL,
                       enet.lambda = NULL, enet.alpha = NULL, max.iter = NULL,
                       tol = NULL, pho = NULL, stat.inf = NULL, adjusted.r2 = NULL,
                       dw.test = NULL, reset.test = NULL, bp.test = NULL,
                       ks.test = NULL, thread.ratio = NULL, categorical.variable = NULL,
                       pmml.export = NULL)

Arguments

`conn.context`	`ConnectionContext` The connection to the SAP HANA system.
`data`	`DataFrame` DataFrame containing the data.
`key`	`character` Name of the ID column of data. If not provided, then it is assumed that data has no ID column.
`features`	`list of character, optional` Names of the feature columns. If not provided, it defaults to all non-ID, non-label columns.
`label`	`character` Name of column in data that specifies the dependent variable. Defaults to the final column if not provided.
`formula`	`formula type, optional` Formula to be used for model generation, in the form label ~ <feature_list>, e.g. formula = LABEL~V1+V2+V3. Provide either the formula or a features and label combination, not both. Defaults to NULL.
`solver`	`{"QR", "SVD", "CD", "Cholesky", "ADMM"}, optional` Algorithm used to solve the least squares problem. Case-insensitive. "QR": QR decomposition. "SVD": singular value decomposition. "CD": cyclical coordinate descent method. "Cholesky": Cholesky decomposition. "ADMM": alternating direction method of multipliers. Defaults to "QR".
`var.select`	`{"all", "forward", "backward"}, optional` Method to perform variable selection. "all": all variables are included. "forward": forward selection. "backward": backward selection. "forward" and "backward" selection are supported only when solver is "QR", "SVD" or "Cholesky". Defaults to "all".
`intercept`	`logical, optional` If TRUE, include the intercept in the model. Defaults to TRUE.
`alpha.to.enter`	`double, optional` P-value for forward selection. Valid only when var.select is 'forward'. Defaults to 0.05.
`alpha.to.remove`	`double, optional` P-value for backward selection. Valid only when var.select is 'backward'. Defaults to 0.1.
`enet.lambda`	`double, optional` Penalized weight. Value should be greater than or equal to 0. Valid only when solver is 'CD' or 'ADMM'.
`enet.alpha`	`double, optional` Elastic net mixing parameter. Ranges from 0 (Ridge penalty) to 1 (LASSO penalty) inclusively. Valid only when solver is 'CD' or 'ADMM'. Defaults to 1.0.
`max.iter`	`integer, optional` Maximum number of passes over training data. If convergence is not reached after the specified number of iterations, an error will be generated. Valid only when solver is 'CD' or 'ADMM'. Defaults to 1e5.
`tol`	`double, optional` Convergence threshold for coordinate descent. Valid only when solver is 'CD'. Defaults to 1.0e-7.
`pho`	`double, optional` Step size for ADMM. Generally, it should be greater than 1. Valid only when solver is 'ADMM'. Defaults to 1.8.
`stat.inf`	`logical, optional` If TRUE, output t-value and Pr(>|t|) of coefficients. Defaults to FALSE.
`adjusted.r2`	`logical, optional` If TRUE, include the adjusted R^2 value in statistics. Defaults to FALSE.
`dw.test`	`logical, optional` If TRUE, conduct Durbin-Watson test under null hypothesis that errors do not follow a first order autoregressive process. Not available if elastic net regularization is enabled or intercept is ignored. Defaults to FALSE.
`reset.test`	`integer, optional` Specifies the order of Ramsey RESET test. Ramsey RESET test with power of variables ranging from 2 to this value (greater than 1) will be conducted. Value 1 means RESET test will not be conducted. Not available if elastic net regularization is enabled or intercept is ignored. Defaults to 1.
`bp.test`	`logical, optional` If TRUE, conduct Breusch-Pagan test under null hypothesis that homoscedasticity is satisfied. Not available if elastic net regularization is enabled or intercept is ignored. Defaults to FALSE.
`ks.test`	`logical, optional` If TRUE, conduct Kolmogorov-Smirnov normality test under null hypothesis that errors follow a normal distribution. Not available if elastic net regularization is enabled or intercept is ignored. Defaults to FALSE.
`thread.ratio`	`double, optional` Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Valid only when solver is 'QR', 'CD', 'Cholesky' or 'ADMM'. Defaults to 0.0.
`categorical.variable`	`character or list of characters, optional` Integer columns specified in this list will be treated as categorical data. Other integer columns will be treated as continuous.
`pmml.export`	`{'no', 'multi-row'}, optional` Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive. 'no' or not provided: No PMML model. 'multi-row': Exports a PMML model, splitting it across multiple rows if it doesn't fit in one. Defaults to 'no'.
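
To make the roles of enet.lambda and enet.alpha concrete, the following base-R sketch evaluates an elastic net penalty term. It assumes the common glmnet-style parameterization, lambda * (alpha * L1 + (1 - alpha) * L2 / 2); the exact scaling PAL applies is not stated on this page, and the helper name enet_penalty is hypothetical and not part of hana.ml.r.

```r
# Hypothetical helper, not part of hana.ml.r.
# Assumes the glmnet-style elastic net penalty:
#   lambda * (alpha * sum(|beta|) + (1 - alpha) * sum(beta^2) / 2)
enet_penalty <- function(beta, lambda, alpha) {
  # Same ranges as documented for enet.lambda and enet.alpha.
  stopifnot(lambda >= 0, alpha >= 0, alpha <= 1)
  l1 <- sum(abs(beta))   # LASSO part
  l2 <- sum(beta^2) / 2  # ridge part
  lambda * (alpha * l1 + (1 - alpha) * l2)
}

beta <- c(1, -2)
enet_penalty(beta, lambda = 1, alpha = 1)    # pure LASSO penalty
enet_penalty(beta, lambda = 1, alpha = 0)    # pure ridge penalty
enet_penalty(beta, lambda = 1, alpha = 0.5)  # equal mix of both
```

With enet.alpha = 1 only the absolute-value term remains, and with enet.alpha = 0 only the squared term remains, which matches the LASSO and Ridge endpoints described for enet.alpha.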

Format

R6Class object.

Details

Linear regression is an approach to modeling the linear relationship between a variable, usually referred to as the dependent variable, and one or more other variables, usually referred to as independent variables and collected in a predictor vector.
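
In symbols, this relationship is usually written as follows. The notation is the standard textbook one and is an assumption here, not taken from the PAL documentation: y is the dependent variable, x_1, ..., x_p are the independent variables, beta_0 is the intercept (dropped when intercept is FALSE), and epsilon is the error term. Without regularization, the coefficients are chosen to minimize the sum of squared residuals over the n training rows.

```latex
y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon,
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}
  \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}
```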

Value

Returns a "LinearRegression" object with the following values:

coefficients : DataFrame
Fitted regression coefficients.
pmml : DataFrame
PMML model. Set to NULL if no PMML model was requested.
fitted : DataFrame
Predicted dependent variable values for training data. Set to NULL if the training data has no row IDs.
statistics : DataFrame
Regression-related statistics, such as mean squared error.

Examples

## Not run: 
Input DataFrame df for training:

> df$Collect()
  ID      Y    X1 X2  X3
0  0  -6.879  0.00  A   1
1  1  -3.449  0.50  A   1
2  2   6.635  0.54  B   1
3  3  11.844  1.04  B   1
4  4   2.786  1.50  A   1
5  5   2.389  0.04  B   2
6  6  -0.011  2.00  A   2
7  7   8.839  2.04  B   2
8  8   4.689  1.54  B   1
9  9  -5.507  1.00  A   2

Model training; a "LinearRegression" object lr is returned:

> lr <- hanaml.LinearRegression(conn.context = conn, data = df, key = "ID",
                       label = "Y", thread.ratio = 0.5,
                       categorical.variable = list("X3"))

Output:

> lr$coefficients
COEFFICIENT COEFFICIENT          VALUE
1   __PAL_INTERCEPT__          -5.7045
2                  X1           3.0925
3  X2__PAL_DELIMIT__A           0.0000
4  X2__PAL_DELIMIT__B           9.3675
5  X3__PAL_DELIMIT__1           0.0000
6  X3__PAL_DELIMIT__2          -2.6895

## End(Not run)
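
For readers without a HANA connection, the following sketch fits the same kind of model locally with base R's lm(). It is an illustration only and does not call PAL or hana.ml.r. The names local_df, features, f and fit are assumptions introduced here, and the data values are copied from the df$Collect() output above. It shows how a label and a feature list correspond to a formula, and how an integer column can be treated as categorical, analogous to categorical.variable. The fitted coefficients are not claimed to match the PAL output.

```r
# Local copy of the example data (values copied from the df shown above).
local_df <- data.frame(
  ID = 0:9,
  Y  = c(-6.879, -3.449, 6.635, 11.844, 2.786,
         2.389, -0.011, 8.839, 4.689, -5.507),
  X1 = c(0.00, 0.50, 0.54, 1.04, 1.50, 0.04, 2.00, 2.04, 1.54, 1.00),
  X2 = c("A", "A", "B", "B", "A", "B", "A", "B", "B", "A"),
  X3 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L)
)

# Treat X2 and the integer column X3 as categorical,
# mirroring categorical.variable = list("X3").
local_df$X2 <- factor(local_df$X2)
local_df$X3 <- factor(local_df$X3)

# Default features: all non-ID, non-label columns.
features <- setdiff(names(local_df), c("ID", "Y"))

# Build the formula Y ~ X1 + X2 + X3 from label and features.
f <- reformulate(features, response = "Y")

# Ordinary least squares fit with an intercept.
fit <- lm(f, data = local_df)
coef(fit)
```

lm() drops the first level of each factor as the reference category, which plays the same role as the 0.0000 rows for X2__PAL_DELIMIT__A and X3__PAL_DELIMIT__1 in the PAL coefficient table.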

[Package hana.ml.r version 1.0.8 Index]