Logistic Regression

hanaml.LogisticRegression is an R wrapper for SAP HANA PAL Logistic Regression.

hanaml.LogisticRegression(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  formula = NULL,
  enet.alpha = NULL,
  enet.lambda = NULL,
  tol = NULL,
  epsilon = NULL,
  solver = NULL,
  max.iter = NULL,
  thread.ratio = NULL,
  standardize = NULL,
  max.pass.number = NULL,
  lbfgs.m = NULL,
  pmml.export = NULL,
  stat.inf = NULL,
  categorical.variable = NULL,
  class.map0 = NULL,
  class.map1 = NULL,
  multi.class = FALSE,
  sgd.batch.number = NULL,
  precompute = NULL,
  handle.missing = NULL,
  resampling.method = NULL,
  evaluation.metric = NULL,
  fold.num = NULL,
  repeat.times = NULL,
  param.search.strategy = NULL,
  random.search.times = NULL,
  random.state = NULL,
  timeout = NULL,
  progress.indicator.id = NULL,
  parameter.range = NULL,
  parameter.values = NULL,
  json.export = NULL,
  resource = NULL,
  max.resource = NULL,
  min.resource.rate = NULL,
  reduction.rate = NULL,
  aggressive.elimination = NULL,
  ps.verbose = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.

formula

formula type, optional
Formula to be used for model generation, in the form label~<feature_list>, e.g. formula=CATEGORY~V1+V2+V3.
Provide either the formula or a features and label combination, but not both.
Defaults to NULL.
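As a small illustration of the two ways of naming columns, the base R sketch below splits a formula into the label and feature names it implies. The column names come from the example above; the commented calls assume a DataFrame named data.fit and are not executed here.

```r
# Formula from the example above: label on the left, features on the right.
fml <- CATEGORY ~ V1 + V2 + V3

# all.vars() returns the variable names in order of appearance,
# so the first entry is the label and the rest are the features.
vars <- all.vars(fml)
label <- vars[1]
features <- vars[-1]

print(label)
print(features)

# The two calls below are expected to describe the same model (assumption:
# data.fit is a DataFrame with these columns). Pass one form or the other,
# never both.
# lr <- hanaml.LogisticRegression(data = data.fit, formula = fml)
# lr <- hanaml.LogisticRegression(data = data.fit,
#                                 features = as.list(features),
#                                 label = label)
```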

enet.alpha

numeric, optional
Elastic net mixing parameter.

Not valid when solver is "stochastic" if multi.class is FALSE.
Not valid when solver is "lbfgs" if multi.class is TRUE.

Defaults to 1.0.

enet.lambda

numeric, optional
Penalized weight for elastic-net term.

Invalid when solver is "stochastic" if multi.class is FALSE.
Invalid when solver is "lbfgs" if multi.class is TRUE.

Defaults to 0.0.

tol

double, optional
Convergence threshold for exiting iterations.
Defaults to 1.0e-7 when solver is "cyclical", otherwise 1.0e-6.

epsilon

double, optional
Determines the accuracy with which the solution is to be found. Defaults to 1.0e-6 when solver is "newton", or 1.0e-5 when solver is "lbfgs".

solver

character, optional
Optimization algorithm.
Possible values include:

"auto" - Automatically determined from data and other parameters.
"newton" - Newton iteration method.
"cyclical" - Cyclical coordinate descent method to fit elastic net regularized logistic regression.
"lbfgs" - LBFGS method. Recommended when there are many independent variables.
"stochastic" - Stochastic gradient descent method. Recommended when dealing with very large datasets.
"proximal" - Proximal gradient descent method to fit elastic net regularized logistic regression.

All values are available when multi.class is FALSE, with the following restrictions:

Newton's method cannot solve binary logistic regression problems with a LASSO penalty.
The stochastic method can only solve problems without a penalty.

When multi.class is TRUE, only "lbfgs" and "cyclical" are available, and the LBFGS method can only solve multi-class logistic regression problems with a ridge penalty or without a penalty.
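The restrictions above can be sketched as a small pre-flight check. check_solver is a hypothetical helper, not part of hanaml, and it rests on an assumption the text does not state: a penalty counts as active only when enet.lambda is greater than 0, and it has a LASSO component only when enet.alpha is also greater than 0.

```r
# Hypothetical helper, not part of hanaml: returns TRUE when the combination
# is allowed by the rules above, otherwise a short reason string.
check_solver <- function(solver, multi.class = FALSE,
                         enet.alpha = 1.0, enet.lambda = 0.0) {
  # Assumption: a penalty is active only when enet.lambda > 0, and it has
  # a LASSO (L1) component only when enet.alpha > 0 as well.
  has.penalty <- enet.lambda > 0
  has.lasso <- has.penalty && enet.alpha > 0

  if (multi.class) {
    if (!solver %in% c("lbfgs", "cyclical")) {
      return("multi-class supports only 'lbfgs' and 'cyclical'")
    }
    if (solver == "lbfgs" && has.lasso) {
      return("'lbfgs' handles only ridge or no penalty in multi-class")
    }
  } else {
    if (solver == "newton" && has.lasso) {
      return("'newton' cannot handle a LASSO penalty")
    }
    if (solver == "stochastic" && has.penalty) {
      return("'stochastic' works only without a penalty")
    }
  }
  TRUE
}

print(check_solver("newton"))
print(check_solver("newton", enet.alpha = 1.0, enet.lambda = 0.1))
print(check_solver("lbfgs", multi.class = TRUE, enet.alpha = 0, enet.lambda = 0.1))
print(check_solver("proximal", multi.class = TRUE))
```

Returning TRUE or a reason string keeps the check easy to use in a condition such as isTRUE(check_solver(...)) while still reporting which rule was violated.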

max.iter

integer, optional
Maximum number of iterations taken for the solvers to converge. If convergence is not reached after this number, an error will be generated. For multi-class, the default value is 100.
For binary-class, the default value is 100000 when solver is "cyclical", 1000 when solver is "proximal", otherwise the default value is 100.

thread.ratio

double, optional
Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all the currently available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Only valid when multi.class is FALSE. Defaults to 1.0.

standardize

logical, optional
Controls whether to standardize the data to have zero mean and unit variance.

FALSE: does not standardize the data.
TRUE: standardizes the data to have zero mean and unit variance.

Defaults to TRUE.

max.pass.number

integer, optional
The maximum number of passes over the data. Defaults to 1.
Warning: only valid when solver is "stochastic" and multi.class is FALSE.

lbfgs.m

integer, optional
Number of past updates to be kept. Only available when solver is "lbfgs".
Defaults to 6.

pmml.export

"no", "single-row", "multi-row"
Controls whether to output a PMML representation of the model and how to format the PMML.
If multi.class is TRUE, valid options are:

"no" - No PMML model.
"multi-row" - Exports a PMML model, splitting it across multiple rows if it doesn't fit in one.

Otherwise if multi.class is FALSE, valid options are:

"no" - No PMML model.
"single-row" - Exports a PMML model in a maximum of one row. Fails if the model doesn't fit in one row.
"multi-row" - Exports a PMML model, splitting it across multiple rows if it doesn't fit in one.

Defaults to "no" for both binary and multi-class cases.

stat.inf

logical, optional
Indicates whether or not to calculate statistical inferences from the given data.

FALSE - Does not calculate statistical inference.
TRUE - Calculates statistical inference.

Defaults to FALSE.

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior is dependent on what input is given:

"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.

class.map0

character, optional
Categorical label to map to 0. Only valid when multi.class is FALSE. class.map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

class.map1

character, optional
Categorical label to map to 1. Only valid when multi.class is FALSE. class.map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

multi.class

logical, optional
If set to TRUE, multi-class logistic regression is performed.
Otherwise only binary logistic regression is performed.
Defaults to FALSE.

sgd.batch.number

integer, optional
The batch number of the stochastic gradient method. Valid only when multi.class is FALSE and solver is "stochastic". Defaults to 1.

precompute

logical, optional
Whether or not to precompute the Gram matrix for cyclical coordinate descent method.
Valid only when solver is "cyclical". Defaults to TRUE.

handle.missing

logical, optional
Whether or not to impute the missing values of the input training data. Defaults to TRUE.

resampling.method

character, optional
Specifies the resampling method for model evaluation or parameter selection.
Valid options are listed as follows:
"cv", "stratified_cv", "bootstrap", "stratified_bootstrap", "cv_sha", "stratified_cv_sha", "bootstrap_sha", "stratified_bootstrap_sha", "cv_hyperband", "stratified_cv_hyperband", "bootstrap_hyperband", "stratified_bootstrap_hyperband".
Note that resampling methods with suffix "sha" or "hyperband" are only applicable to parameter selection, not model evaluation.
If no value is specified, neither model evaluation nor parameter selection is activated.

evaluation.metric

character, optional
Specifies the evaluation metric for model evaluation or parameter selection.
Currently valid options are: "accuracy", "f1_score", "auc", "nll".
Must specify a value together with resampling.method to activate model evaluation or parameter selection.
No default value.

fold.num

integer, optional
Specifies the fold number for cross-validation (cv). Mandatory and valid only when resampling.method is specified and contains "cv" as a sub-string.

repeat.times

numeric, optional
Specifies the number of repeat times for resampling.
Defaults to 1.

param.search.strategy

c("grid", "random"), optional
Specifies the method to activate parameter selection.
If not specified, model parameter selection shall not be triggered.
Defaults to "random" and cannot be changed if resampling.method is set as one of the following: "cv_hyperband", "bootstrap_hyperband", "stratified_cv_hyperband", "stratified_bootstrap_hyperband"; otherwise no default value.

random.search.times

integer, optional
Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid only when param.search.strategy is set as "random", or when resampling.method is set as one of the following: "cv_hyperband", "bootstrap_hyperband", "stratified_cv_hyperband", "stratified_bootstrap_hyperband".

random.state

numeric, optional
Specifies the seed for random generation.
Use system time when 0 is specified.

timeout

integer, optional
Specifies maximum running time for model evaluation or parameter selection in seconds.
No timeout when 0 is specified.

progress.indicator.id

character, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.

parameter.range

list, optional
Specifies range of the following parameters for parameter selection:
enet.lambda, enet.alpha.
Parameter range should be specified by 3 numbers in the form of c(start, step, end).
Examples:
parameter.range <- list(enet.lambda = c(0.01, 0.01, 0.1)), which means taking enet.lambda values from 0.01 to 0.1 with 0.01 being the step size, i.e. 0.01, 0.02, 0.03, ..., 0.09, 0.1.
If param.search.strategy is "random", then the middle term, i.e. step has no effect and thus can be omitted.
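To see which candidate values a c(start, step, end) range describes, the base R sketch below expands the example range with seq(). This only illustrates the notation; how PAL enumerates the grid internally is not shown in this page.

```r
# Range from the example above, in the form c(start, step, end).
parameter.range <- list(enet.lambda = c(0.01, 0.01, 0.1))

# Expand each range with seq() to list the grid points it describes.
grid <- lapply(parameter.range, function(r) {
  seq(from = r[1], to = r[3], by = r[2])
})

print(grid$enet.lambda)
print(length(grid$enet.lambda))
```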

parameter.values

list, optional
Specifies values of the following parameters for parameter selection:
enet.lambda, enet.alpha.
Example: parameter.values <- list(enet.lambda = c(0.001, 0.003, 0.007, 0.01))

json.export

logical, optional

FALSE: Does not export the multi-class logistic regression model in JSON.
TRUE: Exports the multi-class logistic regression model in JSON.

Only valid when multi.class is TRUE.
Currently either PMML or JSON format model can be exported. JSON format is preferred if both formats are to be exported.
Defaults to FALSE.

resource

character, optional
Specifies the resource type used in successive-halving and hyperband algorithm for parameter selection.
Currently the valid options are "max.iter" and "max.pass.number".
Mandatory and valid only when resampling.method is specified with suffix "sha" or "hyperband".

max.resource

integer, optional
Specifies the maximum allowed resource budget for single hyper-parameter candidate, whose value must be greater than 0.
Mandatory and valid only when resource is set.

min.resource.rate

numeric, optional
Specifies the rate between minimum allowed resource and maximum allowed resource.
Valid range is [0, 1).
Valid only when resource is specified.
Defaults to 0.

reduction.rate

numeric, optional
Specifies the reduction rate of available size of hyper-parameter candidates.
For each round, the available parameter candidate size is divided by the value of this parameter, so a valid value must be greater than 1.0.
Valid only when resource is set.
Defaults to 3.0.
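The effect of reduction.rate can be pictured with a few lines of base R that divide a candidate pool by the rate each round until one candidate remains. The starting pool of 27 candidates and the rounding with floor() are assumptions made for illustration; the exact schedule PAL uses is not described here.

```r
# Illustrative only: counts how many candidates survive each round when the
# pool is divided by reduction.rate until a single candidate is left.
# The starting pool size is a made-up number.
n.candidates <- 27
reduction.rate <- 3.0

survivors <- n.candidates
while (survivors[length(survivors)] > 1) {
  remaining <- floor(survivors[length(survivors)] / reduction.rate)
  # Never drop below one surviving candidate.
  survivors <- c(survivors, max(remaining, 1))
}

print(survivors)
```

A larger reduction.rate discards candidates faster, which shortens the search at the risk of dropping a good candidate early.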

aggressive.elimination

logical, optional
Specifies whether to perform aggressive elimination behavior for successive-halving algorithm or not.
When set to TRUE, it will eliminate more parameter candidates than expected (defined via reduction.rate).
This can enhance the run-time performance but could result in sub-optimal hyper-parameter candidate.
Valid only when resampling.method is specified with suffix "sha". Defaults to FALSE.

ps.verbose

logical, optional
Specifies whether to output optimal hyper-parameter and all evaluation statistics of related hyper-parameter candidates in attribute statistics or not.
Defaults to TRUE.

Value

A "LogisticRegression" object with the following attributes:

result: DataFrame
Coefficient values for the logistic regression model (together with z-scores and p-values).
statistic.info: DataFrame
Related statistics for the logistic regression model and its solving process, including AIC, objective-value, log-likelihood, number of iterations used, solution status, etc.
optimal.param: DataFrame
Optimal model parameters selected. Reserved for model selection using cross-validation.
pmml: DataFrame
LogisticRegression model in PMML format. For multi-class logistic regression, use semistructured.result to get the model in PMML or JSON format.
semistructured.result: DataFrame
A multi-class logistic regression model in PMML or JSON format.

Examples

Call the function:


> lr <- hanaml.LogisticRegression(data=data.fit)
OR
> lr <- hanaml.LogisticRegression(data=data.fit,
                                  formula=CATEGORY~V1+V2+V3,
                                  solver="newton",
                                  thread.ratio=0.1,
                                  max.iter=1000,
                                  categorical.variable="V3",
                                  pmml.export="single-row",
                                  stat.inf=TRUE,
                                  tol=0.000001)
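A call that activates parameter selection might look like the sketch below, which combines resampling.method, evaluation.metric, fold.num, param.search.strategy and parameter.values as described in the Arguments section. The specific values are illustrative, and the call is guarded so that it only runs in a session where hanaml is loaded and a DataFrame named data.fit exists (both are assumptions).

```r
# Arguments for a grid search over enet.lambda with 5-fold cross-validation.
# The specific values are illustrative, not recommendations.
args <- list(
  formula = CATEGORY ~ V1 + V2 + V3,
  solver = "cyclical",
  resampling.method = "cv",
  evaluation.metric = "accuracy",
  fold.num = 5,
  param.search.strategy = "grid",
  parameter.values = list(enet.lambda = c(0.001, 0.003, 0.007, 0.01))
)

# Run only when hanaml is loaded and a DataFrame named data.fit exists
# (both are assumptions about the session).
if (exists("hanaml.LogisticRegression") && exists("data.fit")) {
  lr <- do.call(hanaml.LogisticRegression, c(list(data = data.fit), args))
}
```

After such a fit, the optimal.param attribute listed under Value is where the selected parameters are reported.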
