hanaml.SVRanking.Rd
hanaml.SVRanking is an R wrapper of SAP HANA PAL SVM for ranking.
hanaml.SVRanking(
data = NULL,
key = NULL,
features = NULL,
label = NULL,
qid = NULL,
kernel = NULL,
thread.ratio = NULL,
degree = NULL,
gamma = NULL,
coef.lin = NULL,
coef.const = NULL,
c = NULL,
scale.info = NULL,
shrink = NULL,
handle.missing = NULL,
categorical.variable = NULL,
category.weight = NULL,
tol = NULL,
evaluation.seed = NULL,
probability = NULL,
compression = NULL,
max.bits = NULL,
max.quantization.iter = NULL,
resampling.method = NULL,
evaluation.metric = NULL,
fold.num = NULL,
repeat.times = NULL,
param.search.strategy = NULL,
random.search.times = NULL,
random.state = NULL,
timeout = NULL,
progress.indicator.id = NULL,
parameter.range = NULL,
parameter.values = NULL,
reduction.rate = NULL,
aggressive.elimination = NULL
)
DataFrame
DataFrame containing the data.
character, optional
Name of the ID column.
If not provided, the data is assumed to have no ID column.
No default value.
character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.
character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.
character
Column name of QueryID.
If not provided, it defaults to the last non-ID, non-label column.
{"linear", "poly", "rbf", "sigmoid"}, optional
Kernel function.
Defaults to "rbf".
double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads.
Values between 0 and 1 will use up to
that percentage of available threads. Values outside this
range are ignored.
Defaults to 0.
integer, optional
Coefficient for the 'poly' kernel function.
Only valid when kernel = 'poly'.
Value range: >= 1.
Defaults to 3.
double, optional
Coefficient for the 'rbf' kernel function.
Only valid when kernel = 'rbf'.
Defaults to 1.0/number of features in the dataset.
double, optional
Coefficient for the 'poly' or 'sigmoid' kernel function.
Only valid when kernel = 'poly' or 'sigmoid'.
Defaults to 0.
double, optional
Coefficient for the 'poly' or 'sigmoid' kernel function.
Only valid when kernel = 'poly' or 'sigmoid'.
Defaults to 0.
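As a rough illustration of how degree, gamma, coef.lin and coef.const enter the kernels, the sketch below uses the textbook kernel definitions in base R. The mapping of arguments to formula terms is an assumption based on the standard forms, not a statement about the PAL implementation, and the helper names are hypothetical.

```r
# Textbook kernel definitions (assumed mapping of the arguments above).
linear_kernel <- function(x, y) sum(x * y)
poly_kernel <- function(x, y, degree, coef.lin, coef.const) {
  (coef.lin * sum(x * y) + coef.const)^degree
}
rbf_kernel <- function(x, y, gamma) exp(-gamma * sum((x - y)^2))
sigmoid_kernel <- function(x, y, coef.lin, coef.const) {
  tanh(coef.lin * sum(x * y) + coef.const)
}

x <- c(1, 2)
y <- c(3, 4)
k_lin  <- linear_kernel(x, y)
k_poly <- poly_kernel(x, y, degree = 2, coef.lin = 1, coef.const = 1)
# gamma set to 1 / number of features, mirroring the documented default
k_rbf  <- rbf_kernel(x, y, gamma = 1 / length(x))
k_sig  <- sigmoid_kernel(x, y, coef.lin = 0.1, coef.const = 0)
```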
double, optional
Trade-off between training error and margin.
Value range: > 0.
Defaults to 100.
character, optional
"no": No scaling.
"standardization": The algorithm transforms the data to have zero mean and unit variance.
"rescale": The algorithm rescales each feature to the range [-1, 1].
Defaults to "standardization".
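The two scaling options can be pictured with the base R sketch below. It is only an illustration of the transformations described above; the helper names are hypothetical, and whether PAL uses the sample or the population standard deviation is not stated here (sd() uses the sample version).

```r
# "standardization": zero mean, unit variance (sample sd assumed)
standardize <- function(x) (x - mean(x)) / sd(x)

# "rescale": linear map of the observed range onto [-1, 1]
rescale <- function(x) 2 * (x - min(x)) / (max(x) - min(x)) - 1

x <- c(2, 4, 6, 8)
z <- standardize(x)
r <- rescale(x)
```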
logical, optional
Decides whether to use the shrink strategy or not.
Using the shrink strategy may accelerate the training process.
FALSE: Does not use shrink strategy.
TRUE: Uses shrink strategy.
Defaults to TRUE.
logical, optional
Specifies whether to impute the missing values of the input data or not.
If set to FALSE, all rows with missing values will be deleted.
Defaults to TRUE.
character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior is dependent on what input is given:
"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.
Valid only for variables of "INTEGER" type, ignored otherwise.
No default value.
double, optional
Represents the weight of category
attributes. The value must be greater than 0.
Defaults to 0.707.
double, optional
Specifies the error tolerance in the training
process. The value must be greater than 0.
Defaults to 0.001.
integer, optional (deprecated)
The random seed used in parameter selection (same as random.state).
The value must be no less than 0.
If set to 0, the system time is used for random generation.
Defaults to 0.
If evaluation.seed and random.state are set simultaneously, random.state takes priority.
logical, optional
If you want to output probability when scoring, set this to TRUE.
Defaults to FALSE.
logical, optional
Specifies if the model is stored in compressed format.
Default value depends on the SAP HANA Version.
Please refer to the corresponding documentation of SAP HANA PAL.
integer, optional
The maximum number of bits to quantize continuous features.
Equivalent to using \(2^{max.bits}\) bins.
Must be less than 31.
Valid only when the value of compression is TRUE.
Defaults to 12.
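To get a feel for the bit/bin relationship described above, the number of bins for a few values of max.bits can be computed directly. This is plain arithmetic in base R and does not call PAL.

```r
# Number of quantization bins implied by a given max.bits value
max.bits <- c(4, 8, 12)
n_bins <- 2^max.bits
names(n_bins) <- paste0("max.bits=", max.bits)
n_bins
```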
integer, optional
The maximum iteration steps for quantization.
Valid only when the value of compression is TRUE.
Defaults to 1000.
character, optional
Specifies the resampling method for model evaluation or parameter selection.
Valid options are listed as follows:
"cv", "cv_sha", "cv_hyperband", "bootstrap",
"bootstrap_sha", "bootstrap_hyperband".
Note that successive-halving (SHA) or hyperband resampling methods
are applicable to parameter selection only, not model evaluation.
If no value is specified, neither model evaluation
nor parameter selection is activated.
character, optional
Specifies the evaluation metric for model evaluation or parameter selection.
Currently the only valid option for SVM ranking is "error_rate".
Must be specified together with resampling.method to activate
model evaluation or parameter selection.
integer, optional
Specifies the fold number for cross-validation (cv).
Mandatory and valid only when resampling.method is "cv", "cv_sha" or "cv_hyperband".
numeric, optional
Specifies the number of repeat times for resampling.
Defaults to 1.
c("grid", "random"), optional
Specifies the method to activate parameter selection.
If not specified, model parameter selection shall not be triggered.
Defaults to "random" and cannot be changed if resampling.method
is set as "cv_hyperband" or "bootstrap_hyperband"; otherwise there is no
default value.
integer, optional
Specifies the number of times to randomly select candidate parameters for selection.
Mandatory and valid only when param.search.strategy is set as "random", or when
resampling.method is set as "cv_hyperband" or "bootstrap_hyperband".
numeric, optional
Specifies the seed for random generation.
Use system time when 0 is specified.
integer, optional
Specifies maximum running time for model evaluation or parameter selection in seconds.
No timeout when 0 is specified.
character, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.
list, optional
Specifies the range of the following parameters for parameter selection: c, gamma, degree, coef.lin, coef.const.
A parameter range should be specified by 3 numbers in the form of c(start, step, end).
Example:
parameter.range <- list(c = c(50, 10, 100)), which means taking c values from 50 to 100 with a step size of 10, i.e.
50, 60, 70, 80, 90, 100.
If param.search.strategy is 'random', then the middle term, i.e. step, has no effect and thus can be omitted.
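The c(start, step, end) convention can be checked locally with base R. The helper expand_range below is hypothetical and only mirrors the description above for the grid case; it is not part of the package.

```r
# Expand a c(start, step, end) triple into the candidate values it denotes
expand_range <- function(r) seq(from = r[1], to = r[3], by = r[2])

parameter.range <- list(c = c(50, 10, 100))
candidates <- lapply(parameter.range, expand_range)
candidates$c
```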
list, optional
Specifies values of the following parameters for parameter selection: c, gamma, degree, coef.lin, coef.const.
Example: parameter.values <- list(gamma = c(0.01, 0.05, 0.07))
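When several parameters are searched together, the size of the candidate set grows multiplicatively. The sketch below assumes that a grid search evaluates every combination of the listed values, which is the usual meaning of "grid" but is not spelled out above; it uses only base R.

```r
# Candidate values for two parameters (c expanded from a 50..100 range, step 10)
c_values <- seq(50, 100, by = 10)
parameter.values <- list(gamma = c(0.01, 0.05, 0.07))

# Every (c, gamma) combination a full grid would contain
grid <- expand.grid(c = c_values, gamma = parameter.values$gamma)
n_candidates <- nrow(grid)
```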
numeric, optional
Specifies the reduction rate of the successive-halving (SHA)
or hyperband method.
Valid only when resampling.method is specified as one of the following:
"cv_sha", "cv_hyperband", "bootstrap_sha", "bootstrap_hyperband".
Defaults to 3.0.
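The effect of the reduction rate can be pictured with a generic successive-halving schedule, in which each round keeps roughly 1/reduction.rate of the remaining candidates. This is a simplified sketch of the general technique with a hypothetical helper; the exact schedule used by PAL may differ.

```r
# Number of surviving candidates per round in a generic successive-halving run
sha_schedule <- function(n_candidates, reduction.rate = 3) {
  stopifnot(reduction.rate > 1)
  sizes <- n_candidates
  while (tail(sizes, 1) > 1) {
    sizes <- c(sizes, max(1, floor(tail(sizes, 1) / reduction.rate)))
  }
  sizes
}

schedule <- sha_schedule(27, reduction.rate = 3)
```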
logical, optional
Specifies whether or not to apply aggressive elimination in the successive-halving algorithm.
When set to TRUE, it will eliminate more parameter candidates than
expected (as defined via reduction.rate).
This can improve run-time performance but could result in a sub-optimal hyper-parameter candidate.
Valid only when resampling.method is specified as "cv_sha" or "bootstrap_sha".
Defaults to FALSE.
An R6 object of class "SVRanking" with the following attributes and methods:
Attributes
model: DataFrame
Model content
stat: DataFrame
Statistics.
Methods
CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)
Usage:
> srk <- hanaml.SVRanking(data=df)
> srk$CreateModelState()
Arguments:
model: DataFrame
DataFrame containing the model for parsing. Defaults to self$model.
algorithm: character
Specifies the PAL algorithm associated with model. Defaults to self$pal.algorithm.
func: character
Specifies the functionality for Unified Classification/Regression.
Valid only for object instance of R6Class "UnifiedClassification" or "UnifiedRegression".
Defaults to self$func.
state.description: character
A summary string for the generated model state.
Defaults to "ModelState".
force: logical
Specifies whether or not to replace the existing state for model.
Defaults to FALSE.
After calling this method, an attribute state that contains the parsed info for model
shall be assigned to the corresponding R6 object.
DeleteModelState(state=NULL)
Usage:
Assuming we have trained a hanaml model and created its model state, like the following:
> srk <- hanaml.SVRanking(data=df)
> srk$CreateModelState()
After using the model state for real-time scoring, we can delete the state by calling:
> srk$DeleteModelState()
Arguments:
state: DataFrame
DataFrame containing the state info.
Defaults to self$state.
After calling this method, the specified model state shall be cleaned up and the associated memory released.
Call the function:
> svranking <- hanaml.SVRanking(data,
qid="QID",
gamma = 0.005)
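A call that also activates grid-based parameter selection might look like the sketch below. It uses only arguments documented above, but the particular combination of values is illustrative. train_df is a hypothetical name for a DataFrame obtained from an existing SAP HANA connection, and the call is guarded so that the snippet can be sourced without the package or a connection.

```r
# Arguments for model training with 5-fold cross-validated grid search
svr_args <- list(
  qid = "QID",
  resampling.method = "cv",
  evaluation.metric = "error_rate",
  fold.num = 5,
  param.search.strategy = "grid",
  parameter.range = list(c = c(50, 10, 100)),
  parameter.values = list(gamma = c(0.01, 0.05, 0.07))
)

# Only attempt the call when the package function and the (hypothetical)
# training DataFrame are actually available in the session.
if (exists("hanaml.SVRanking") && exists("train_df")) {
  svranking <- do.call(hanaml.SVRanking, c(list(data = train_df), svr_args))
}
```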
When decision boundaries between different classes are complicated, there could be a large number of support vectors
involved in the final model, which leads to a large model size. Model compression aims at reducing the size of
the SVM model with minimum loss of accuracy.
Relevant Parameters
compression: This parameter serves as a trigger for SVM model compression. Set it to TRUE
if you want the trained model to be stored in compressed format.
max.bits: This parameter sets the maximum number of bits used to quantize values of support vectors, which is
equivalent to using \(2^{max.bits}\) bins to quantize values of support vectors.
Reducing the number of bins may affect the precision of support vectors and the accuracy of prediction.
max.quantization.iter: This parameter sets the maximum number of iterations of the quantization process.
If the specified number is too small, the quantization process may fail to converge,
which could further affect the accuracy.