Support Vector Ranking

hanaml.SVRanking is an R wrapper of SAP HANA PAL SVM for ranking.

hanaml.SVRanking(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  qid = NULL,
  kernel = NULL,
  thread.ratio = NULL,
  degree = NULL,
  gamma = NULL,
  coef.lin = NULL,
  coef.const = NULL,
  c = NULL,
  scale.info = NULL,
  shrink = NULL,
  handle.missing = NULL,
  categorical.variable = NULL,
  category.weight = NULL,
  tol = NULL,
  evaluation.seed = NULL,
  probability = NULL,
  compression = NULL,
  max.bits = NULL,
  max.quantization.iter = NULL,
  resampling.method = NULL,
  evaluation.metric = NULL,
  fold.num = NULL,
  repeat.times = NULL,
  param.search.strategy = NULL,
  random.search.times = NULL,
  random.state = NULL,
  timeout = NULL,
  progress.indicator.id = NULL,
  parameter.range = NULL,
  parameter.values = NULL,
  reduction.rate = NULL,
  aggressive.elimination = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.

qid

character, optional
Column name of QueryID. If not provided, it defaults to the last non-ID, non-label column.
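
Ranking SVMs are commonly trained on pairs of rows that share a query ID, so qid determines which rows are compared with each other. The base R sketch below illustrates only that grouping idea. The toy data frame and the helper make_pairs are hypothetical, and how PAL forms pairs internally is not described in this page.

```r
# Toy ranking data: two queries with a relevance label per row (hypothetical).
toy <- data.frame(
  ID    = 1:5,
  LABEL = c(3, 1, 2, 2, 1),
  QID   = c("q1", "q1", "q1", "q2", "q2")
)

# Hypothetical helper: list (better, worse) ID pairs within each query only.
make_pairs <- function(d) {
  per.query <- lapply(split(d, d$QID), function(g) {
    idx <- expand.grid(i = seq_len(nrow(g)), j = seq_len(nrow(g)))
    # Keep ordered pairs where row i has a strictly higher label than row j.
    idx <- idx[g$LABEL[idx$i] > g$LABEL[idx$j], , drop = FALSE]
    data.frame(QID = g$QID[idx$i], better = g$ID[idx$i], worse = g$ID[idx$j])
  })
  do.call(rbind, per.query)
}

pair.tbl <- make_pairs(toy)
print(pair.tbl)
```

Rows from different queries never appear in the same pair, which is why a wrong qid column changes what the model learns.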

kernel

{"linear", "poly", "rbf", "sigmoid"}, optional
Specifies the kernel function.
Defaults to "rbf".

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.

degree

integer, optional
Degree of the 'poly' kernel function.
Only valid when kernel = 'poly'. Value range: >= 1.
Defaults to 3.

gamma

double, optional
Coefficient for the 'rbf' kernel function.
Only valid when kernel = 'rbf'.
Defaults to 1.0/number of features in the dataset.

coef.lin

double, optional
Coefficient for the 'poly' or 'sigmoid' kernel function.
Only valid when kernel = 'poly' or 'sigmoid'.
Defaults to 0.

coef.const

double, optional
Coefficient for the 'poly' or 'sigmoid' kernel function.
Only valid when kernel = 'poly' or 'sigmoid'.
Defaults to 0.
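
To show how degree, gamma, coef.lin and coef.const enter the kernel, here is a minimal base R sketch using the textbook forms of the four kernels. The helper kernel_value is hypothetical, and the exact mapping of these arguments to the formulas is an assumption; consult the SAP HANA PAL documentation for the authoritative definitions.

```r
# Textbook kernel forms (assumed mapping of the arguments documented above).
kernel_value <- function(x, y, kernel = "rbf", degree = 3,
                         gamma = 1 / length(x), coef.lin = 0, coef.const = 0) {
  dot <- sum(x * y)
  switch(kernel,
    linear  = dot,
    poly    = (coef.lin * dot + coef.const)^degree,
    rbf     = exp(-gamma * sum((x - y)^2)),
    sigmoid = tanh(coef.lin * dot + coef.const),
    stop("unknown kernel: ", kernel)
  )
}

x <- c(1, 2)
y <- c(3, 4)
k.linear <- kernel_value(x, y, "linear")
k.poly   <- kernel_value(x, y, "poly", degree = 2, coef.lin = 1, coef.const = 1)
k.rbf    <- kernel_value(x, y, "rbf", gamma = 0.5)
print(c(linear = k.linear, poly = k.poly, rbf = k.rbf))
```

In this sketch each argument affects only the kernels listed in its description, which matches the "Only valid when" notes above.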

c

double, optional
Trade-off between training error and margin.
Value range: > 0.
Defaults to 100.

scale.info

character, optional

"no": No scaling.
"standardization": The algorithm transforms the data to have zero mean and unit variance.
"rescale": The algorithm rescales each feature to the range [-1, 1].

Defaults to "standardization".
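
The two scaling options can be illustrated on a single numeric feature in base R. This is a sketch of the usual definitions, not PAL's implementation: the helper names are hypothetical, and the use of the sample standard deviation (R's sd) is an assumption.

```r
# "standardization": zero mean and unit variance (sample variance assumed).
standardize <- function(v) (v - mean(v)) / sd(v)

# "rescale": linear map of the observed range onto [-1, 1].
rescale_pm1 <- function(v) 2 * (v - min(v)) / (max(v) - min(v)) - 1

feature <- c(2, 4, 6, 8)
z <- standardize(feature)
r <- rescale_pm1(feature)
print(z)
print(r)
```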

shrink

logical, optional
Decides whether to use the shrink strategy or not. Using the shrink strategy may accelerate the training process.

FALSE: Does not use shrink strategy.
TRUE: Uses shrink strategy.

Defaults to TRUE.

handle.missing

logical, optional
Specifies whether to impute the missing values of the input data or not. If set to FALSE, all rows with missing values will be deleted.
Defaults to TRUE.

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
By default, the treatment depends on the column type:

"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous

Valid only for columns of "INTEGER" type; ignored otherwise.
No default value.

category.weight

double, optional
Represents the weight of category attributes. The value must be greater than 0.
Defaults to 0.707.

tol

double, optional
Specifies the error tolerance in the training process. The value must be greater than 0.
Defaults to 0.001.

evaluation.seed

integer, optional (deprecated)
The random seed in parameter selection (same as random.state). The value must be no less than 0.
If set to 0, then system time is used for random generation.
Defaults to 0. If evaluation.seed and random.state are set simultaneously, random.state takes higher priority.

probability

logical, optional
If you want to output probability when scoring, set this to TRUE.
Defaults to FALSE.

compression

logical, optional
Specifies if the model is stored in compressed format.
Default value depends on the SAP HANA Version.
Please refer to the corresponding documentation of SAP HANA PAL.

max.bits

integer, optional
The maximum number of bits to quantize continuous features.
Equivalent to using 2^max.bits bins.
Must be less than 31.
Valid only when the value of compression is TRUE.
Defaults to 12.

max.quantization.iter

integer, optional
The maximum iteration steps for quantization.
Valid only when the value of compression is TRUE.
Defaults to 1000.

resampling.method

character, optional
Specifies the resampling method for model evaluation or parameter selection.
Valid options are listed as follows:
"cv", "cv_sha", "cv_hyperband", "bootstrap", "bootstrap_sha", "bootstrap_hyperband".
Note that successive-halving (SHA) or hyperband resampling methods are applicable to parameter selection only, not model evaluation.
If no value is specified, neither model evaluation nor parameter selection is activated.

evaluation.metric

character, optional
Specifies the evaluation metric for model evaluation or parameter selection.
Currently the only valid option for SVM ranking is "error_rate".
Must be specified together with resampling.method to activate model evaluation or parameter selection.

fold.num

integer, optional
Specifies the fold number for cross-validation (cv). Mandatory and valid only when resampling.method is "cv", "cv_sha" or "cv_hyperband".

repeat.times

numeric, optional
Specifies the number of repeat times for resampling.
Defaults to 1.

param.search.strategy

c("grid", "random"), optional
Specifies the method to activate parameter selection.
If not specified, model parameter selection shall not be triggered.
Defaults to "random" and cannot be changed if resampling.method is set as "cv_hyperband" or "bootstrap_hyperband"; otherwise no default value.

random.search.times

integer, optional
Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid only when param.search.strategy is set as "random", or when resampling.method is set as "cv_hyperband" or "bootstrap_hyperband".

random.state

numeric, optional
Specifies the seed for random generation.
Use system time when 0 is specified.

timeout

integer, optional
Specifies maximum running time for model evaluation or parameter selection in seconds.
No timeout when 0 is specified.

progress.indicator.id

character, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.

parameter.range

list, optional
Specifies range of the following parameters for parameter selection:
c, gamma, degree, coef.lin, coef.const.
Parameter range should be specified by 3 numbers in the form of c(start, step, end).
Examples:
parameter.range <- list(c = c(50, 10, 100)), which means taking c values from 50 to 100 with 10 being the step size, i.e. 50, 60, 70, 80, 90, 100.
If param.search.strategy is 'random', then the middle term, i.e. step, has no effect and thus can be omitted.
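
The c(start, step, end) convention in the example above can be expanded in base R as a quick check of which candidate values a range describes. This sketch only reproduces the arithmetic stated in the text; the variable names rng and candidates are illustrative.

```r
# Expand a c(start, step, end) triple into the candidate values it describes.
parameter.range <- list(c = c(50, 10, 100))
rng <- parameter.range$c
candidates <- seq(from = rng[1], to = rng[3], by = rng[2])
print(candidates)  # 50 60 70 80 90 100
```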

parameter.values

list, optional
Specifies values of the following parameters for parameter selection:
c, gamma, degree, coef.lin, coef.const.
Example: parameter.values <- list(gamma = c(0.01, 0.05, 0.07))

reduction.rate

numeric, optional
Specifies the reduction rate of the successive-halving (SHA) or hyperband method.
Valid only when resampling.method is specified as one of the following: "cv_sha", "cv_hyperband", "bootstrap_sha", "bootstrap_hyperband".
Defaults to 3.0.

aggressive.elimination

logical, optional
Specifies whether to apply aggressive elimination in the successive-halving algorithm or not.
When set to TRUE, it will eliminate more parameter candidates than expected (as defined via reduction.rate).
This can improve run-time performance but could result in a sub-optimal hyper-parameter candidate.
Valid only when resampling.method is specified as "cv_sha" or "bootstrap_sha".
Defaults to FALSE.
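
As a rough picture of what reduction.rate controls, the following base R sketch computes how many candidates survive each round of a textbook successive-halving schedule. The helper sha_schedule is hypothetical, and the rounding rule is an assumption; PAL's exact schedule and its aggressive-elimination variant may differ.

```r
# Textbook successive halving: keep roughly 1 / reduction.rate of the
# candidates after each round until a single candidate remains.
sha_schedule <- function(n.candidates, reduction.rate = 3) {
  stopifnot(n.candidates >= 1, reduction.rate > 1)
  sizes <- n.candidates
  while (tail(sizes, 1) > 1) {
    sizes <- c(sizes, max(1, floor(tail(sizes, 1) / reduction.rate)))
  }
  sizes
}

print(sha_schedule(27))     # 27 9 3 1
print(sha_schedule(20, 3))  # 20 6 2 1
```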

Value

An R6 object of class "SVRanking" with the following attributes and methods:

Attributes

model: DataFrame
Model content
stat: DataFrame
Statistics.

Methods

CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)

Usage:


   > srk <- hanaml.SVRanking(data=df)
   > srk$CreateModelState()

Arguments:

model: DataFrame
DataFrame containing the model for parsing. Defaults to self$model.
algorithm: character
Specifies the PAL algorithm associated with model. Defaults to self$pal.algorithm.
func: character
Specifies the functionality for Unified Classification/Regression.
Valid only for object instance of R6Class "UnifiedClassification" or "UnifiedRegression".
Defaults to self$func.
state.description: character
A summary string for the generated model state.
Defaults to "ModelState".
force: logical
Specifies whether or not to replace the existing state for the model.
Defaults to FALSE.

After calling this method, an attribute state that contains the parsed info for model shall be assigned to the corresponding R6 object.

DeleteModelState(state=NULL)

Usage:
Assuming we have trained a hanaml model and created its model state, like the following:


   > srk <- hanaml.SVRanking(data=df)
   > srk$CreateModelState()

After using the model state for real-time scoring, we can delete the state by calling:


   > srk$DeleteModelState()

Arguments:

state: DataFrame
DataFrame containing the state info.
Defaults to self$state.

After calling this method, the specified model state is cleaned up and the associated memory is released.

Examples

Call the function:


> svranking <- hanaml.SVRanking(data,
                                qid="QID",
                                gamma = 0.005)
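
A call that also activates parameter selection has to set several arguments together (resampling.method, evaluation.metric, fold.num and a search strategy). The sketch below assembles such an argument list using only argument names documented above. The specific values are illustrative, and train.df is a hypothetical DataFrame name; the training call runs only if hanaml is loaded and that object exists.

```r
# Arguments for cross-validated grid search over c (values are illustrative).
args <- list(
  qid                   = "QID",
  kernel                = "rbf",
  resampling.method     = "cv",
  evaluation.metric     = "error_rate",
  fold.num              = 5,
  param.search.strategy = "grid",
  parameter.range       = list(c = c(50, 10, 100))
)

# Mirror the documented rule: fold.num is mandatory for cv-based resampling.
cv.methods <- c("cv", "cv_sha", "cv_hyperband")
stopifnot(!(args$resampling.method %in% cv.methods) || !is.null(args$fold.num))

# Train only when hanaml and a DataFrame named train.df are available
# (both are assumptions of this sketch).
if (exists("hanaml.SVRanking") && exists("train.df")) {
  svranking <- do.call(hanaml.SVRanking, c(list(data = train.df), args))
}
```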

Model Compression

When decision boundaries between different classes are complicated, a large number of support vectors can be involved in the final model, which leads to a large model size. Model compression aims at reducing the size of the SVM model with minimum loss of accuracy.

Relevant Parameters

compression: This parameter serves as a trigger for SVM model compression. Set it to TRUE if you want the trained model to be stored in compressed format.
max.bits: This parameter sets the maximum number of bits used to quantize values of support vectors, which is equivalent to using $2^{max.bits}$ bins. Reducing the number of bins may affect the precision of support vectors and the accuracy of prediction.
max.quantization.iter: This parameter sets the maximum number of iterations of the quantization process. If the specified number is too small, the quantization process may fail to converge, which could further affect the accuracy.
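
The trade-off between max.bits and precision can be seen with a simple uniform quantizer in base R. This is only an illustration: the helper quantize_uniform and the random data are hypothetical, and PAL's quantizer is iterative (hence max.quantization.iter), so its bins need not be uniform.

```r
# Uniform quantization into 2^bits bins, each value replaced by its bin midpoint.
quantize_uniform <- function(v, bits) {
  n.bins <- 2^bits
  lo <- min(v)
  width <- (max(v) - lo) / n.bins
  bin <- pmin(floor((v - lo) / width), n.bins - 1)  # bin index 0 .. n.bins - 1
  lo + (bin + 0.5) * width
}

set.seed(1)
sv <- rnorm(1000)  # stand-in for support-vector values (hypothetical data)

# Worst-case reconstruction error for several bit widths.
bits <- c(2, 4, 8, 12)
err <- sapply(bits, function(b) max(abs(sv - quantize_uniform(sv, b))))
print(data.frame(bits = bits, bins = 2^bits, max.error = err))
```

Fewer bits mean fewer distinct stored values and a larger worst-case error, which is the precision loss the max.bits description refers to.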
