hanaml.GLM.Rd
hanaml.GLM is an R wrapper for the SAP HANA PAL Generalized Linear Model.
hanaml.GLM(
data = NULL,
key = NULL,
features = NULL,
label = NULL,
family = NULL,
link = NULL,
solver = NULL,
handle.missing.fit = NULL,
quasilikelihood = NULL,
max.iter = NULL,
tol = NULL,
significance.level = NULL,
output.fitted = NULL,
alpha = NULL,
lambda = NULL,
num.lambda = NULL,
lambda.min.ratio = NULL,
categorical.variable = NULL,
ordering = NULL,
enet.alpha = NULL,
enet.lambda = NULL,
resampling.method = NULL,
evaluation.metric = NULL,
fold.num = NULL,
repeat.times = NULL,
param.search.strategy = NULL,
random.search.times = NULL,
random.state = NULL,
timeout = NULL,
progress.indicator.id = NULL,
parameter.range = NULL,
parameter.values = NULL
)
DataFrame
DataFrame containing the data.
character
Name of the ID column.
character or list of characters, optional
Name of feature columns.
If not provided, it defaults to all non-key, non-label columns of data.
character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.
character, optional
The kind of distribution the dependent variable outcomes are
assumed to be drawn from. Must be one of the following:
"gaussian"
"poisson"
"binomial"
"gamma"
"inversegaussian"
"negativebinomial"
"ordinal"
Defaults to "gaussian".
character, optional
GLM link function. Determines the relationship between the linear
predictor and the predicted response. Default and allowed values
depend on family. 'inverse' is accepted as a synonym of
'reciprocal'.
Family: default link; allowed links
gaussian: identity (default); identity, log, reciprocal (allowed)
poisson: log (default); identity, log (allowed)
binomial: logit (default); logit, probit, comploglog, log (allowed)
gamma: reciprocal (default); identity, reciprocal, log (allowed)
inversegaussian: inversesquare (default); inversesquare, identity, reciprocal, log (allowed)
negativebinomial: log (default); identity, log, sqrt (allowed)
ordinal: logit (default); logit, probit, comploglog (allowed)
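The family/link rules listed above can be sketched as a plain R lookup. This is only an illustration: glm.links and resolve.link are hypothetical names that are not part of the package, and the real validation happens inside PAL.

```r
# Family -> allowed links, transcribed from the list above.
# By convention in this sketch, the first entry is the default link.
glm.links <- list(
  gaussian         = c("identity", "log", "reciprocal"),
  poisson          = c("log", "identity"),
  binomial         = c("logit", "probit", "comploglog", "log"),
  gamma            = c("reciprocal", "identity", "log"),
  inversegaussian  = c("inversesquare", "identity", "reciprocal", "log"),
  negativebinomial = c("log", "identity", "sqrt"),
  ordinal          = c("logit", "probit", "comploglog")
)

# Hypothetical helper: return the link that would be used for a family.
resolve.link <- function(family = "gaussian", link = NULL) {
  allowed <- glm.links[[family]]
  if (is.null(allowed)) stop("unknown family: ", family)
  if (is.null(link)) return(allowed[1])
  # 'inverse' is documented as a synonym of 'reciprocal'.
  if (identical(link, "inverse")) link <- "reciprocal"
  if (!link %in% allowed) {
    stop("link '", link, "' is not allowed for family '", family, "'")
  }
  link
}

resolve.link("gamma")                 # default link for gamma
resolve.link("gaussian", "inverse")   # synonym handling
```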
c("irls","nr","cd"), optional
The optimization algorithm.
"irls" Iteratively re-weighted least squares.
"nr" Newton-Raphson.
"cd" Coordinate descent. (Picking coordinate descent activates elastic net regularization.)
Defaults to "irls", except when family is "ordinal". Ordinal regression requires (and defaults to) "nr", and Newton-Raphson is not supported for other values of family.
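The solver rules described above amount to a small consistency check. The sketch below restates them in base R; resolve.solver is a hypothetical helper, not a function exported by the package.

```r
# Hypothetical helper: pick the default solver and reject the
# family/solver combinations that the text above rules out.
resolve.solver <- function(family = "gaussian", solver = NULL) {
  if (is.null(solver)) {
    # Ordinal regression defaults to Newton-Raphson, everything else to IRLS.
    solver <- if (family == "ordinal") "nr" else "irls"
  }
  if (!solver %in% c("irls", "nr", "cd")) stop("unknown solver: ", solver)
  if (family == "ordinal" && solver != "nr") {
    stop("ordinal regression requires solver = 'nr'")
  }
  if (family != "ordinal" && solver == "nr") {
    stop("solver = 'nr' is supported only for family = 'ordinal'")
  }
  solver
}

resolve.solver("poisson")         # IRLS by default
resolve.solver("ordinal")         # Newton-Raphson by default
resolve.solver("gaussian", "cd")  # coordinate descent (elastic net)
```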
c("skip", "abort", "fill_zero"), optional
How to handle data rows with missing independent variable values
during fitting.
"skip" Don't use those rows for fitting.
"abort" Throw an error if missing independent variable values are found.
"fill_zero" Replace missing values with 0.
Defaults to "skip".
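To make the three options concrete, the following base R sketch mimics them on a local data.frame. It is an analogy only: handle.missing is a hypothetical helper, and the actual handling is done by PAL on the database side.

```r
# Hypothetical local analogue of handle.missing.fit.
handle.missing <- function(df, features, how = "skip") {
  # TRUE for rows with at least one missing feature value.
  missing.rows <- !complete.cases(df[, features, drop = FALSE])
  if (how == "skip") {
    return(df[!missing.rows, , drop = FALSE])
  }
  if (how == "abort") {
    if (any(missing.rows)) stop("missing independent variable values found")
    return(df)
  }
  if (how == "fill_zero") {
    for (f in features) df[[f]][is.na(df[[f]])] <- 0
    return(df)
  }
  stop("unknown option: ", how)
}

df <- data.frame(ID = 1:4, X = c(0, NA, 1, 2), Y = c(-1, 0, 0, 1))
handle.missing(df, "X", "skip")       # row 2 is dropped
handle.missing(df, "X", "fill_zero")  # the NA in X becomes 0
```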
logical, optional
If TRUE, enables the use of quasi-likelihood to estimate overdispersion.
Defaults to FALSE.
integer, optional
Maximum number of optimization iterations.
Defaults to 100 for IRLS and Newton-Raphson, and to 100000 for coordinate descent.
double, optional
Stopping condition for optimization.
Defaults to 1e-8 for IRLS,
1e-6 for Newton-Raphson, and 1e-7 for coordinate descent.
double, optional
Significance level for confidence intervals and prediction intervals.
Defaults to 0.05.
logical, optional
If TRUE, create the `fitted_` DataFrame of fitted response values
for training data in fit.
Defaults to FALSE.
numeric, optional (deprecated)
Elastic net mixing parameter. Only accepted when using coordinate
descent. Should be between 0 and 1 inclusive. Defaults to 1.0.
Will be replaced by enet.alpha in a future release.
numeric, optional (deprecated)
Coefficient for L1 and L2 regularization.
No default value.
Will be replaced by enet.lambda in a future release.
integer, optional
The number of lambda values. Only accepted when using coordinate
descent.
Defaults to 100.
double, optional
The smallest value of lambda, as a fraction of lambda_max,
where lambda_max is the smallest value for which all coefficients
are zero. Only accepted when using coordinate descent.
Defaults to 0.01 when the number of observations is smaller than the number
of covariates, and 0.0001 otherwise.
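As an illustration of how num.lambda and lambda.min.ratio fit together, the sketch below builds a decreasing sequence of lambda values from lambda_max down to lambda_max * lambda.min.ratio. The logarithmic spacing is an assumption borrowed from common elastic net implementations; this page does not state how PAL spaces the values. default.ratio and lambda.path are hypothetical helpers.

```r
# Default ratio as documented above: 0.01 when there are fewer
# observations than covariates, 0.0001 otherwise.
default.ratio <- function(n.obs, n.covariates) {
  if (n.obs < n.covariates) 0.01 else 1e-4
}

# Assumption: values are spaced evenly on the log scale.
lambda.path <- function(lambda.max, num.lambda = 100, lambda.min.ratio = 1e-4) {
  exp(seq(log(lambda.max),
          log(lambda.max * lambda.min.ratio),
          length.out = num.lambda))
}

# lambda.max = 2.5 is an arbitrary illustrative value.
path <- lambda.path(lambda.max = 2.5,
                    lambda.min.ratio = default.ratio(n.obs = 9, n.covariates = 1))
range(path)
```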
character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the column type:
"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.
Valid only for columns of "INTEGER" type; ignored otherwise.
No default value.
list of characters, optional
Specifies the (ascending) order of categories for ordinal regression.
Applicable only when the label column for ordinal regression is string-valued.
By default, numeric order is adopted for integer values and alphabetical order
for strings.
numeric, optional
Elastic net mixing parameter. Only accepted when using coordinate
descent. Should be between 0 and 1 inclusive.
Defaults to 1.0.
numeric, optional
Coefficient for L1 and L2 regularization.
No default value.
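For orientation, the roles of enet.alpha and enet.lambda can be written out as a penalty function. The formula below is the common elastic net convention and is an assumption here: this page does not spell out the exact expression or scaling that PAL uses. enet.penalty is a hypothetical helper.

```r
# Assumed convention:
#   penalty = lambda * (alpha * sum(|beta|) + (1 - alpha) / 2 * sum(beta^2))
# alpha = 1 gives a pure L1 (lasso) penalty, alpha = 0 a pure L2 (ridge) penalty.
enet.penalty <- function(beta, enet.lambda, enet.alpha = 1.0) {
  stopifnot(enet.alpha >= 0, enet.alpha <= 1)
  l1 <- sum(abs(beta))
  l2 <- sum(beta^2)
  enet.lambda * (enet.alpha * l1 + (1 - enet.alpha) / 2 * l2)
}

beta <- c(1, -2)
enet.penalty(beta, enet.lambda = 0.5, enet.alpha = 1)  # L1 only
enet.penalty(beta, enet.lambda = 0.5, enet.alpha = 0)  # L2 only
```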
c("cv", "bootstrap"), optional
Specifies the resampling method for model evaluation or parameter selection.
If no value is specified, neither model evaluation
nor parameter selection is activated.
c("rmse", "mae", "error_rate"), optional
Specifies the evaluation metric for model evaluation or parameter selection.
"error_rate" can be specified only when family is "ordinal".
Must be specified together with resampling.method to activate
model evaluation and parameter selection.
No default value.
integer, optional
Specifies the number of folds for cross-validation (cv).
Mandatory and valid only when resampling.method is "cv".
numeric, optional
Specifies the number of repeat times for resampling.
Defaults to 1.
c("grid", "random"), optional
Specifies the method to activate parameter selection.
If not specified, model parameter selection shall not be triggered.
integer, optional
Specifies the number of times to randomly select candidate parameters for selection.
Mandatory and valid only when param.search.strategy
is "random".
numeric, optional
Specifies the seed for random generation.
Use system time when 0 is specified.
integer, optional
Specifies maximum running time for model evaluation or parameter selection in seconds.
No timeout when 0 is specified.
character, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.
list, optional
Specifies the range of the following parameters for parameter selection: enet.lambda, enet.alpha.
Parameter range should be specified by 3 numbers in the form of c(start, step, end).
Examples:
parameter.range <- list(enet.lambda = c(0.01, 0.01, 0.1)), which means taking
enet.lambda
values from 0.01 to 0.1 with 0.01 being the step size, i.e.
0.01, 0.02, 0.03, ..., 0.09, 0.1.
If param.search.strategy
is 'random', then the middle term,
i.e. step has no effect and thus can be omitted.
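The c(start, step, end) form described above can be expanded locally with seq() to see which candidate values a grid search would cover. candidates.from.range is a hypothetical helper used only for illustration.

```r
# Hypothetical helper: expand c(start, step, end) into candidate values.
candidates.from.range <- function(r) {
  stopifnot(length(r) == 3)
  seq(from = r[1], to = r[3], by = r[2])
}

parameter.range <- list(enet.lambda = c(0.01, 0.01, 0.1))
candidates <- lapply(parameter.range, candidates.from.range)
candidates$enet.lambda  # ten values from 0.01 to 0.1
```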
list, optional
Specifies values of the following parameters for parameter selection: enet.lambda, enet.alpha.
Example: parameter.values <- list(enet.lambda = c(0.001, 0.003, 0.007, 0.01))
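When values are given for both parameters, the candidate set can be pictured as their cross product. The sketch below assumes that "grid" evaluates every combination and that "random" draws random.search.times candidates; both are assumptions about the search, and the enet.alpha values are made up for illustration.

```r
parameter.values <- list(
  enet.lambda = c(0.001, 0.003, 0.007, 0.01),
  enet.alpha  = c(0.5, 1.0)  # illustrative values, not from the original example
)

# Assumed "grid" behaviour: every combination of the listed values.
grid <- expand.grid(parameter.values)
nrow(grid)  # 4 * 2 combinations

# Assumed "random" behaviour: draw a fixed number of candidates from the grid.
set.seed(1)
random.search.times <- 3
picked <- grid[sample(nrow(grid), random.search.times), ]
```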
A "GLM" object with the following attributes:
statistics: DataFrame
Training statistics and model information other than the
coefficients and covariance matrix.
coefficients: DataFrame
Model coefficients.
covariance: DataFrame
Covariance matrix. Set to NULL for coordinate descent.
fitted: DataFrame
Predicted values for the training data. Set to NULL if
output.fitted is FALSE.
Input DataFrame data:
> data$Collect()
ID X Y
1 1 0 -1
2 2 0 -1
3 3 1 0
4 4 1 0
5 5 1 0
6 6 1 0
7 7 2 1
8 8 2 1
9 9 2 1
Call the function:
> glm <- hanaml.GLM(data = data, key = "ID", label = "Y", features = "X",
solver = "irls", output.fitted = TRUE,
family = "poisson", link = "log")
Output:
> glm$coefficients$Collect()
VARIABLE_NAME COEFFICIENT SE SCORE PROBABILITY CI_LOWER CI_UPPER
1 __PAL_INTCP__ -0.2949583 0.4530004 -0.6511215 0.51496806 -1.18282271 0.5929061
2 X 1.0698198 0.5405998 1.9789496 0.04782168 0.01026362 2.1293760
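The remaining columns of the coefficient table appear to be standard Wald statistics derived from COEFFICIENT and SE at the default significance.level of 0.05. This is an inference from the printed numbers, not a statement about PAL internals. The base R sketch below recomputes them from the two printed columns and should agree with the output above up to rounding.

```r
# COEFFICIENT and SE copied from the output above.
coefs <- data.frame(
  VARIABLE_NAME = c("__PAL_INTCP__", "X"),
  COEFFICIENT   = c(-0.2949583, 1.0698198),
  SE            = c(0.4530004, 0.5405998)
)

significance.level <- 0.05
z <- qnorm(1 - significance.level / 2)  # two-sided normal quantile

coefs$SCORE       <- coefs$COEFFICIENT / coefs$SE       # Wald z-score
coefs$PROBABILITY <- 2 * pnorm(-abs(coefs$SCORE))       # two-sided p-value
coefs$CI_LOWER    <- coefs$COEFFICIENT - z * coefs$SE
coefs$CI_UPPER    <- coefs$COEFFICIENT + z * coefs$SE

print(coefs)
```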