hanaml.GLM is an R wrapper for the SAP HANA PAL Generalized Linear Model.

hanaml.GLM(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  family = NULL,
  link = NULL,
  solver = NULL,
  handle.missing.fit = NULL,
  quasilikelihood = NULL,
  max.iter = NULL,
  tol = NULL,
  significance.level = NULL,
  output.fitted = NULL,
  alpha = NULL,
  lambda = NULL,
  num.lambda = NULL,
  lambda.min.ratio = NULL,
  categorical.variable = NULL,
  ordering = NULL,
  enet.alpha = NULL,
  enet.lambda = NULL,
  resampling.method = NULL,
  evaluation.metric = NULL,
  fold.num = NULL,
  repeat.times = NULL,
  param.search.strategy = NULL,
  random.search.times = NULL,
  random.state = NULL,
  timeout = NULL,
  progress.indicator.id = NULL,
  parameter.range = NULL,
  parameter.values = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

character or list of characters, optional
Names of the feature columns.
If not provided, defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.

family

character, optional
The kind of distribution the dependent variable outcomes are assumed to be drawn from. Must be one of the following:

  • "gaussian"

  • "poisson"

  • "binomial"

  • "gamma"

  • "inversegaussian"

  • "negativebinomial"

  • "ordinal"

Defaults to "gaussian".

link

character, optional
GLM link function. Determines the relationship between the linear predictor and the predicted response. Default and allowed values depend on family. 'inverse' is accepted as a synonym of 'reciprocal'.

  • gaussian: identity (default); allowed: identity, log, reciprocal

  • poisson: log (default); allowed: identity, log

  • binomial: logit (default); allowed: logit, probit, comploglog, log

  • gamma: reciprocal (default); allowed: identity, reciprocal, log

  • inversegaussian: inversesquare (default); allowed: inversesquare, identity, reciprocal, log

  • negativebinomial: log (default); allowed: identity, log, sqrt

  • ordinal: logit (default); allowed: logit, probit, comploglog
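
The link names above correspond to standard transformations g(mu) of the mean. A minimal base R sketch, assuming the conventional textbook definitions; the list `links` is a hypothetical helper for illustration, not part of hanaml, and PAL's internal definitions may differ in detail:

```r
# Conventional GLM link functions g(mu), keyed by the names used above.
# `links` is an illustrative helper, not a hanaml object.
links <- list(
  identity      = function(mu) mu,
  log           = function(mu) log(mu),
  reciprocal    = function(mu) 1 / mu,
  inversesquare = function(mu) 1 / mu^2,
  sqrt          = function(mu) sqrt(mu),
  logit         = function(mu) qlogis(mu),         # log(mu / (1 - mu))
  probit        = function(mu) qnorm(mu),          # standard normal quantile
  comploglog    = function(mu) log(-log(1 - mu))   # complementary log-log
)

# Linear predictor implied by a mean of 0.5 under each binomial link.
eta <- sapply(links[c("logit", "probit", "comploglog", "log")],
              function(g) g(0.5))
print(round(eta, 4))
```

The same mean maps to different linear predictors under different links, which is why coefficients from fits with different link values are not directly comparable.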

solver

c("irls","nr","cd"), optional
The optimization algorithm.

  • "irls" Iteratively re-weighted least squares.

  • "nr" Newton-Raphson.

  • "cd" Coordinate descent. (Picking coordinate descent activates elastic net regularization.)

Defaults to "irls", except when family is "ordinal". Ordinal regression requires (and defaults to) "nr", and Newton-Raphson is not supported for other values of family.
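
To make the "irls" option concrete, the following base R sketch runs iteratively re-weighted least squares for a Poisson model with log link on toy data and compares the result with stats::glm. The function `irls_poisson` and its convergence test are illustrative assumptions, not the PAL implementation:

```r
# Illustrative IRLS for a Poisson GLM with log link (not the PAL implementation).
irls_poisson <- function(X, y, max.iter = 100, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max.iter)) {
    eta <- drop(X %*% beta)       # linear predictor
    mu  <- exp(eta)               # inverse of the log link
    z   <- eta + (y - mu) / mu    # working response
    w   <- mu                     # working weights for Poisson with log link
    # Weighted least squares step: solve (X'WX) beta = X'Wz.
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

# Toy data: counts 0, 1, 2 observed at x = -1, 0, 1.
x <- c(-1, -1, 0, 0, 0, 0, 1, 1, 1)
y <- c(0, 0, 1, 1, 1, 1, 2, 2, 2)
X <- cbind(intercept = 1, x = x)

beta_irls <- irls_poisson(X, y)
beta_glm  <- unname(coef(glm(y ~ x, family = poisson(link = "log"))))
print(rbind(irls = beta_irls, glm = beta_glm))
```

On this toy data the two estimates agree to within numerical tolerance and land close to the coefficients reported in the Examples section.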

handle.missing.fit

c("skip", "abort", "fill_zero"), optional
How to handle data rows with missing independent variable values during fitting.

  • "skip" Don't use those rows for fitting.

  • "abort" Throw an error if missing independent variable values are found.

  • "fill_zero" Replace missing values with 0.

Defaults to "skip".
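
The three strategies can be mimicked on a local data.frame as follows. This is only a sketch of the described behavior in base R; the toy data and variable names are assumptions, and the actual handling happens inside PAL:

```r
# Toy data frame with a missing feature value in row 2.
df <- data.frame(ID = 1:4, X = c(0.5, NA, 1.5, 2.0), Y = c(1, 0, 2, 3))
features <- "X"
has_na <- !complete.cases(df[, features, drop = FALSE])

# "skip": drop rows whose feature values contain NA.
skipped <- df[!has_na, ]

# "fill_zero": replace missing feature values with 0.
filled <- df
filled[features] <- lapply(filled[features],
                           function(col) replace(col, is.na(col), 0))

# "abort": raise an error as soon as a missing feature value is present.
abort_result <- tryCatch({
  if (any(has_na)) stop("missing independent variable values found")
  "ok"
}, error = function(e) conditionMessage(e))

print(skipped)
print(filled)
print(abort_result)
```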

quasilikelihood

logical, optional
If TRUE, enables the use of quasi-likelihood to estimate overdispersion. Defaults to FALSE.
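
As an illustration of what a quasi-likelihood dispersion estimate looks like, the base R sketch below computes the Pearson chi-square statistic divided by the residual degrees of freedom and compares it with the value reported by a quasipoisson fit in stats::glm. Whether PAL uses exactly this estimator is an assumption; the toy data are made up:

```r
# Made-up count data for illustration only.
x <- 1:8
y <- c(0, 3, 1, 7, 2, 12, 4, 15)

# Ordinary Poisson fit, then the Pearson-based dispersion estimate.
fit <- glm(y ~ x, family = poisson(link = "log"))
phi_hat <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

# Reference value: dispersion reported for a quasipoisson fit.
quasi <- glm(y ~ x, family = quasipoisson(link = "log"))
phi_ref <- summary(quasi)$dispersion

print(c(pearson = phi_hat, quasipoisson = phi_ref))
```

Estimates well above 1 suggest overdispersion relative to the nominal Poisson variance.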

max.iter

integer, optional
Maximum number of optimization iterations.
Defaults to 100 for IRLS and Newton-Raphson. Defaults to 100000 for coordinate descent.

tol

double, optional
Stopping condition for optimization.
Defaults to 1e-8 for IRLS, 1e-6 for Newton-Raphson, and 1e-7 for coordinate descent.

significance.level

double, optional
Significance level for confidence intervals and prediction intervals.
Defaults to 0.05.

output.fitted

logical, optional
If TRUE, creates the fitted DataFrame of fitted response values for the training data.
Defaults to FALSE.

alpha

numeric, optional (deprecated)
Elastic net mixing parameter. Only accepted when using coordinate descent. Must be between 0 and 1 inclusive. Defaults to 1.0.
Will be replaced by enet.alpha in a future release.

lambda

numeric, optional (deprecated)
Coefficient for L1 and L2 regularization.
No default value. Will be replaced by enet.lambda in a future release.

num.lambda

integer, optional
The number of lambda values. Only accepted when using coordinate descent.
Defaults to 100.

lambda.min.ratio

double, optional
The smallest value of lambda, expressed as a fraction of lambda_max, where lambda_max is the smallest lambda for which all coefficients are zero. Only accepted when using coordinate descent.
Defaults to 0.01 when the number of observations is smaller than the number of covariates, and 0.0001 otherwise.
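
One common way to turn num.lambda and lambda.min.ratio into a sequence of lambda values is a log-spaced path from lambda_max down to lambda_max * lambda.min.ratio. The sketch below assumes that convention; the helper `lambda_path` and the lambda_max value are hypothetical, and PAL's actual spacing is not confirmed here:

```r
# Log-spaced lambda path (assumed convention, not confirmed for PAL).
lambda_path <- function(lambda_max, num.lambda = 100, lambda.min.ratio = 1e-4) {
  exp(seq(log(lambda_max),
          log(lambda_max * lambda.min.ratio),
          length.out = num.lambda))
}

# Small path for display: 5 values from 2.0 down to 2.0 * 0.01.
path <- lambda_path(lambda_max = 2.0, num.lambda = 5, lambda.min.ratio = 0.01)
print(signif(path, 4))
```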

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
By default, the treatment depends on the column type:

  • "VARCHAR" and "NVARCHAR": categorical

  • "INTEGER" and "DOUBLE": continuous.

Valid only for columns of "INTEGER" type; ignored otherwise.
No default value.

ordering

list of characters, optional
Specifies the (ascending) order of categories for ordinal regression.
Applicable only when the label column for ordinal regression is string-valued.
By default, numeric order is adopted for integer values and alphabetical order is adopted for strings.
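
The effect of an explicit ordering can be illustrated locally with ordered factors. This base R sketch only mirrors the idea; the label values are made up:

```r
# Made-up string-valued ordinal labels.
label <- c("medium", "low", "high", "low")

# Without an explicit ordering, levels are sorted alphabetically.
alphabetical <- factor(label, ordered = TRUE)
print(levels(alphabetical))

# With an explicit ascending ordering of the categories.
explicit <- factor(label, levels = c("low", "medium", "high"), ordered = TRUE)
print(as.integer(explicit))
```

Alphabetical order would place "high" below "low" here, which is why string-valued ordinal labels usually need ordering to be set.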

enet.alpha

numeric, optional
Elastic net mixing parameter. Only accepted when using coordinate descent. Should be between 0 and 1 inclusive.
Defaults to 1.0.

enet.lambda

numeric, optional
Coefficient for L1 and L2 regularization.
No default value.
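
For orientation, a widely used form of the elastic net penalty combines the L1 and L2 norms of the coefficients, weighted by enet.alpha and scaled by enet.lambda. The sketch below assumes that form; the helper `enet_penalty` is hypothetical and PAL's exact scaling may differ:

```r
# Elastic net penalty in a commonly used form (assumed, not confirmed for PAL):
#   lambda * (alpha * sum(|beta|) + (1 - alpha) / 2 * sum(beta^2))
enet_penalty <- function(beta, enet.lambda, enet.alpha = 1.0) {
  l1 <- sum(abs(beta))
  l2 <- sum(beta^2)
  enet.lambda * (enet.alpha * l1 + (1 - enet.alpha) / 2 * l2)
}

beta <- c(0.5, -1.0, 0.0)
lasso_pen <- enet_penalty(beta, enet.lambda = 0.1, enet.alpha = 1.0)  # pure L1
ridge_pen <- enet_penalty(beta, enet.lambda = 0.1, enet.alpha = 0.0)  # pure L2
print(c(lasso = lasso_pen, ridge = ridge_pen))
```

Under this form, enet.alpha = 1 gives a lasso penalty and enet.alpha = 0 gives a ridge penalty.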

resampling.method

c("cv", "bootstrap"), optional
Specifies the resampling method for model evaluation or parameter selection.
If no value is specified, neither model evaluation nor parameter selection is activated.

evaluation.metric

c("rmse", "mae", "error_rate"), optional
Specifies the evaluation metric for model evaluation or parameter selection.
"error_rate" can be specified only when family is "ordinal".
Must be specified together with resampling.method to activate model evaluation and parameter selection.
No default value.

fold.num

integer, optional
Specifies the number of folds for cross-validation (cv). Mandatory and valid only when resampling.method is "cv".
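
As a reminder of what fold.num controls, the sketch below assigns rows to folds locally in base R. The random assignment scheme is an assumption for illustration; PAL performs its own partitioning on the server:

```r
set.seed(42)   # fixed seed so the illustration is repeatable
n <- 9         # number of rows
fold.num <- 3  # number of folds

# Assign each row to one of fold.num folds of (nearly) equal size.
fold_id <- sample(rep(seq_len(fold.num), length.out = n))

# Each fold serves once as the validation set; the rest is used for training.
for (k in seq_len(fold.num)) {
  test_idx  <- which(fold_id == k)
  train_idx <- which(fold_id != k)
  cat("fold", k, "- train:", length(train_idx), "test:", length(test_idx), "\n")
}
```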

repeat.times

numeric, optional
Specifies the number of repeat times for resampling.
Defaults to 1.

param.search.strategy

c("grid", "random"), optional
Specifies the search method used for parameter selection. If not specified, parameter selection is not triggered.

random.search.times

integer, optional
Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid only when param.search.strategy is "random".

random.state

numeric, optional
Specifies the seed for random generation.
The system time is used as the seed when 0 is specified.

timeout

integer, optional
Specifies maximum running time for model evaluation or parameter selection in seconds.
No timeout when 0 is specified.

progress.indicator.id

character, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.

parameter.range

list, optional
Specifies range of the following parameters for parameter selection:
enet.lambda, enet.alpha.
Each range is specified by three numbers in the form c(start, step, end).
Example:
parameter.range <- list(enet.lambda = c(0.01, 0.01, 0.1)) takes enet.lambda values from 0.01 to 0.1 in steps of 0.01, i.e. 0.01, 0.02, 0.03, ..., 0.09, 0.1.
If param.search.strategy is "random", the middle term (step) has no effect and can be omitted.
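
The candidate values implied by a c(start, step, end) range can be previewed locally with seq. The helper `expand_range` is hypothetical and only illustrates the arithmetic; the expansion itself is done by PAL:

```r
# Expand a c(start, step, end) triple into its candidate values.
expand_range <- function(r) seq(from = r[1], to = r[3], by = r[2])

parameter.range <- list(enet.lambda = c(0.01, 0.01, 0.1))
candidates <- lapply(parameter.range, expand_range)
print(candidates$enet.lambda)
```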

parameter.values

list, optional
Specifies values of the following parameters for parameter selection:
enet.lambda, enet.alpha.
Example: parameter.values <- list(enet.lambda = c(0.001, 0.003, 0.007, 0.01))
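
When values are given for both parameters, a grid search considers every combination, while a random search draws a limited number of candidates. The sketch below previews both locally; the enet.alpha values are made up, and drawing random candidates from the listed combinations is an assumption about how the random strategy behaves:

```r
parameter.values <- list(
  enet.lambda = c(0.001, 0.003, 0.007, 0.01),
  enet.alpha  = c(0.0, 0.5, 1.0)   # illustrative values
)

# "grid": every combination of the listed values.
grid <- expand.grid(parameter.values)
print(nrow(grid))

# "random": draw random.search.times candidates (assumed sampling scheme).
set.seed(1)
random.search.times <- 5
picked <- grid[sample(nrow(grid), random.search.times), ]
print(picked)
```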

Value

A "GLM" object with the following attributes:

  • statistics: DataFrame
    Training statistics and model information other than the coefficients and covariance matrix.

  • coefficients: DataFrame
    Model coefficients.

  • covariance: DataFrame
    Covariance matrix. Set to NULL for coordinate descent.

  • fitted: DataFrame
    Predicted values for the training data. Set to NULL if output.fitted is FALSE.

Examples

Input DataFrame data:


> data$Collect()
   ID  Y  X
1   1  0 -1
2   2  0 -1
3   3  1  0
4   4  1  0
5   5  1  0
6   6  1  0
7   7  2  1
8   8  2  1
9   9  2  1

Call the function:


> glm <- hanaml.GLM(data = data, key = "ID", label = "Y", features = "X",
                    solver = "irls", output.fitted = TRUE,
                    family = "poisson", link = "log")

Output:


> glm$coefficients$Collect()
  VARIABLE_NAME COEFFICIENT        SE      SCORE PROBABILITY    CI_LOWER   CI_UPPER
1 __PAL_INTCP__  -0.2949583 0.4530004 -0.6511215  0.51496806 -1.18282271  0.5929061
2             X   1.0698198 0.5405998  1.9789496  0.04782168  0.01026362  2.1293760
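
The remaining columns of the coefficient table can be related to COEFFICIENT and SE through the usual Wald formulas with a normal reference distribution. That PAL computes them this way is an assumption, although the sketch below reproduces the printed values up to rounding:

```r
# Coefficients and standard errors copied from the output above.
coef_tbl <- data.frame(
  VARIABLE_NAME = c("__PAL_INTCP__", "X"),
  COEFFICIENT   = c(-0.2949583, 1.0698198),
  SE            = c(0.4530004, 0.5405998)
)

significance.level <- 0.05                   # the documented default
z_crit <- qnorm(1 - significance.level / 2)  # two-sided normal quantile

coef_tbl$SCORE       <- coef_tbl$COEFFICIENT / coef_tbl$SE   # Wald z statistic
coef_tbl$PROBABILITY <- 2 * pnorm(-abs(coef_tbl$SCORE))      # two-sided p-value
coef_tbl$CI_LOWER    <- coef_tbl$COEFFICIENT - z_crit * coef_tbl$SE
coef_tbl$CI_UPPER    <- coef_tbl$COEFFICIENT + z_crit * coef_tbl$SE
print(coef_tbl)
```

A smaller significance.level widens the interval, since z_crit grows as the two-sided tail probability shrinks.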

See also