hanaml.FRM is an R wrapper for SAP HANA PAL Factorized Polynomial Regression Models (FRM).

hanaml.FRM(
  data = NULL,
  key = NULL,
  user.info = NULL,
  item.info = NULL,
  categorical.variable = NULL,
  user.categorical.variable = NULL,
  item.categorical.variable = NULL,
  solver = NULL,
  factor.num = NULL,
  init.variance = NULL,
  random.state = NULL,
  learning.rate = NULL,
  linear.lambda = NULL,
  poly2.lambda = NULL,
  max.iter = NULL,
  sgd.tol = NULL,
  sgd.exit.interval = NULL,
  momentum = NULL,
  thread.ratio = NULL,
  resampling.method = NULL,
  evaluation.metric = NULL,
  fold.num = NULL,
  param.search.strategy = NULL,
  repeat.times = NULL,
  progress.indicator.id = NULL,
  random.search.times = NULL,
  timeout = NULL,
  parameter.values = NULL,
  parameter.range = NULL,
  reduction.rate = NULL,
  min.resource.rate = NULL,
  aggressive.elimination = NULL
)

Arguments

data

DataFrame
DataFrame containing user-item interaction data and global side features, structured as follows:

  • ID

  • USER ID column

  • ITEM ID column

  • Side feature columns

  • Feedback/rating column

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

user.info

DataFrame
DataFrame containing side features that describe users, structured as follows:

  • USER ID column

  • Side feature columns

item.info

DataFrame
DataFrame containing side features that describe items, structured as follows:

  • ITEM ID column

  • Side feature columns

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the column type:

  • "VARCHAR" and "NVARCHAR": categorical

  • "INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.

user.categorical.variable

list/vector of characters, optional
Names of the columns in user.info that should be treated as categorical variables even though their data type is INTEGER.
By default, 'VARCHAR' and 'NVARCHAR' columns are categorical variables, and 'INTEGER' and 'DOUBLE' columns are continuous variables.

item.categorical.variable

list/vector of characters, optional
Names of the columns in item.info that should be treated as categorical variables even though their data type is INTEGER.
By default, 'VARCHAR' and 'NVARCHAR' columns are categorical variables, and 'INTEGER' and 'DOUBLE' columns are continuous variables.

solver

c("sgd", "momentum", "nag", "adagrad"), optional
Specifies the optimization solver used to train the model:

  • "sgd" Stochastic Gradient Descent.

  • "momentum" Momentum.

  • "nag" Nesterov Accelerated Gradient.

  • "adagrad" Adaptive gradient algorithm.

Defaults to "sgd".

factor.num

integer, optional
Length of the factor vectors.
Defaults to 8.

init.variance

numeric, optional
Variance of the normal distribution used to initialize the model parameters.
Defaults to 1e-2.
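
Because init.variance is a variance rather than a standard deviation, the default of 1e-2 corresponds to a standard deviation of 0.1. The following plain-R sketch illustrates what such an initialization could look like. It is not PAL's code: the helper name init_factors, the n.features argument, the zero mean, and the one-row-per-feature layout are all assumptions made for illustration.

```r
# Hypothetical helper (not part of the package): draw an
# n.features x factor.num matrix of initial factors from N(0, init.variance).
# The zero mean and the matrix layout are assumptions.
init_factors <- function(n.features, factor.num = 8, init.variance = 1e-2,
                         seed = 1) {
  set.seed(seed)
  # rnorm() expects a standard deviation, so convert the variance first.
  matrix(rnorm(n.features * factor.num, mean = 0, sd = sqrt(init.variance)),
         nrow = n.features, ncol = factor.num)
}

V <- init_factors(n.features = 1000)
dim(V)   # 1000 rows, 8 columns
sd(V)    # close to sqrt(1e-2) = 0.1
```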

random.state

numeric, optional
Specifies the seed for random number generation. A value of 0 means the current system time is used as the seed; any other value is used directly as the seed.
Defaults to 0.

learning.rate

numeric, optional
Specifies the learning rate (step size) for the optimization process. If it is left at the default value 0, the function chooses the step size automatically, based on a small part of the dataset.
Defaults to 0.

linear.lambda

numeric, optional
Specifies the penalization assigned to the L2 regularization term of linear weights.
Defaults to 1e-10.

poly2.lambda

numeric, optional
Specifies the penalization assigned to the L2 regularization term of quadratic factors.
Defaults to 1e-8.

max.iter

integer, optional
Specifies the maximum number of iterations for the optimization process.
Defaults to 50.

sgd.tol

numeric, optional
Specifies the stop criterion. The algorithm exits when the cost function has not decreased more than sgd.tol in sgd.exit.interval steps.
Defaults to 1e-5.

sgd.exit.interval

numeric, optional
Specifies the stop criterion. The algorithm exits when the cost function has not decreased more than sgd.tol in sgd.exit.interval steps.
Defaults to 5.
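
One way to read the stop criterion shared by sgd.tol and sgd.exit.interval is sketched below in plain R. The helper should_stop and the idea of comparing the current cost with the cost from sgd.exit.interval steps earlier are assumptions for illustration; PAL's exact bookkeeping may differ.

```r
# Hypothetical helper: TRUE when the cost has not decreased by more than
# `tol` over the last `interval` steps. `cost` is the per-iteration history.
should_stop <- function(cost, tol = 1e-5, interval = 5) {
  n <- length(cost)
  if (n <= interval) return(FALSE)   # not enough history yet
  decrease <- cost[n - interval] - cost[n]
  decrease <= tol
}

should_stop(c(1.0, 0.5, 0.3, 0.2, 0.15, 0.12, 0.10))   # FALSE: still improving
should_stop(c(1.0, 0.5, rep(0.3, 6)))                  # TRUE: flat for 5 steps
```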

momentum

numeric, optional
Specifies the momentum value for the Momentum or NAG method. Valid only when solver is "momentum" or "nag".
Defaults to 0.9.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.

resampling.method

character, optional
Specifies the resampling method for model evaluation and parameter selection.
Valid options include:
"cv", "bootstrap", "cv_sha", "bootstrap_sha", "cv_hyperband", "bootstrap_hyperband".
If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.
No default value.

evaluation.metric

"rmse", optional
Specifies the evaluation metric for model evaluation or parameter selection, currently the only valid option is "rmse".
If not specified, neither model evaluation nor parameter selection is activated.

fold.num

integer, optional
Specifies the number of folds for cross-validation (cv). Mandatory and valid only when resampling.method is "cv", "cv_sha" or "cv_hyperband".
No default value.

param.search.strategy

c("grid", "random"), optional
Specifies the parameter search strategy to activate parameter selection.
When resampling.method is "cv_hyperband" or "bootstrap_hyperband", the strategy defaults to "random" and cannot be changed; otherwise there is no default value.

repeat.times

numeric, optional
Specifies the number of repeat times for resampling.
Defaults to 1.

progress.indicator.id

character, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.

random.search.times

integer, optional
Specifies the number of times to randomly select candidate parameters for selection.
Mandatory and valid when param.search.strategy is set to "random".

timeout

integer, optional
Specifies maximum running time for model evaluation or parameter selection in seconds. No timeout when 0 is specified.

parameter.values

named list/vector, optional
Specifies values of the following parameters for parameter selection:
factor.num, linear.lambda, poly2.lambda, momentum.

parameter.range

named list/vector, optional
Specifies range of the following parameters for parameter selection:
factor.num, linear.lambda, poly2.lambda, momentum.
Parameter range should be specified by 3 numbers in the form of c(start, step, end).
Examples:
parameter.range <- list(factor.num = c(3, 1, 6)).
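
To make the c(start, step, end) convention concrete, the plain-R sketch below expands a range into the candidate values it describes. The helper expand_range is hypothetical, and treating the end point as inclusive is an assumption about how the candidates are enumerated. Note that the argument order differs from base R's seq(from, to, by).

```r
# Hypothetical helper: expand c(start, step, end) into candidate values.
# The inclusive end point is an assumption.
expand_range <- function(r) {
  stopifnot(length(r) == 3, r[2] > 0, r[1] <= r[3])
  seq(from = r[1], to = r[3], by = r[2])
}

parameter.range <- list(factor.num = c(3, 1, 6))
lapply(parameter.range, expand_range)
# $factor.num
# [1] 3 4 5 6
```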

reduction.rate

numeric, optional
Specifies the reduction rate of available size of hyper-parameter candidates.
In each round, the number of available parameter candidates is divided by the value of this parameter. Valid values must therefore be greater than 1.0.
Valid only when parameter selection is activated and resampling.method is specified with suffix "sha" or "hyperband".
Defaults to 3.0.
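
The effect of reduction.rate on the size of the candidate pool can be sketched in plain R as follows. The helper candidates_per_round is hypothetical, and rounding down with floor() (with a minimum of one candidate) is an assumption; PAL may round differently.

```r
# Hypothetical helper: number of surviving candidates per round when each
# round divides the candidate count by reduction.rate.
candidates_per_round <- function(n.candidates, reduction.rate = 3.0) {
  stopifnot(n.candidates >= 1, reduction.rate > 1)
  sizes <- n.candidates
  while (sizes[length(sizes)] > 1) {
    # floor() is an assumed rounding rule; keep at least one candidate.
    nxt <- max(1, floor(sizes[length(sizes)] / reduction.rate))
    sizes <- c(sizes, nxt)
  }
  sizes
}

candidates_per_round(27)       # 27 9 3 1
candidates_per_round(18, 2)    # 18 9 4 2 1
```

With a rate of exactly 1.0 the pool would never shrink, which is why the value must be greater than 1.0.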

min.resource.rate

numeric, optional
Specifies the minimum resource rate that should be used in SHA or hyperband iteration.
Valid only when parameter selection is activated and resampling.method is specified with suffix "sha" or "hyperband".
Defaults to 0.0.

aggressive.elimination

logical, optional
Specifies whether or not to perform aggressive elimination in the successive-halving algorithm.
When set to TRUE, more parameter candidates are eliminated than expected (as defined via reduction.rate).
This can improve run-time performance but may result in a sub-optimal hyper-parameter candidate.
Valid only when resampling.method is specified with suffix "sha". Defaults to FALSE.
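
Several of the arguments above depend on one another. The plain-R sketch below collects a few of the documented constraints into a pre-flight check. The function check_frm_search_args is hypothetical and is not part of the package; hanaml.FRM performs its own validation, which may differ.

```r
# Hypothetical pre-flight check mirroring constraints documented above.
# Returns a character vector of problems (empty when none are found).
check_frm_search_args <- function(resampling.method = NULL,
                                  fold.num = NULL,
                                  param.search.strategy = NULL,
                                  random.search.times = NULL,
                                  reduction.rate = NULL) {
  problems <- character(0)
  if (!is.null(resampling.method)) {
    is.cv <- resampling.method %in% c("cv", "cv_sha", "cv_hyperband")
    if (is.cv && is.null(fold.num))
      problems <- c(problems, "fold.num is mandatory for cv-based resampling")
    if (grepl("hyperband$", resampling.method) &&
        !is.null(param.search.strategy) && param.search.strategy != "random")
      problems <- c(problems, "hyperband only supports the random strategy")
  }
  if (identical(param.search.strategy, "random") && is.null(random.search.times))
    problems <- c(problems, "random.search.times is mandatory for random search")
  if (!is.null(reduction.rate) && reduction.rate <= 1)
    problems <- c(problems, "reduction.rate must be greater than 1.0")
  problems
}

# A consistent grid-search setup yields no problems.
ok <- check_frm_search_args(resampling.method = "cv", fold.num = 5,
                            param.search.strategy = "grid")
# Missing fold.num, missing random.search.times, and an invalid rate.
bad <- check_frm_search_args(resampling.method = "cv_sha",
                             param.search.strategy = "random",
                             reduction.rate = 1)
length(ok)    # 0
length(bad)   # 3
```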

Value

An R6 object of class "FRM", with the following attributes and methods:

Attributes

  • model.meta.data: DataFrame containing the metadata of the trained model.

  • model: DataFrame containing the weights of the trained model.

  • model.factors: DataFrame containing the factors of the trained model.

  • iter.info: DataFrame containing information recorded during the optimization process.

  • stat: DataFrame containing statistics for model evaluation/parameter selection. Available only when model evaluation/parameter selection is enabled.

  • optim.param: DataFrame containing the selected optimal parameters. Available only when model evaluation/parameter selection is enabled.

Methods

CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)

Usage:


   > fpr <- hanaml.FRM(data=df)
   > fpr$CreateModelState()


Arguments:

  • model: DataFrame
    DataFrame containing the model for parsing.
    Defaults to self$model.

  • algorithm: character
    Specifies the PAL algorithm associated with model.
    Defaults to self$pal.algorithm.

  • func: character
    Specifies the functionality for Unified Classification/Regression.
    Valid only for object instance of R6Class "UnifiedClassification" or "UnifiedRegression".
    Defaults to self$func.

  • state.description: character
    A summary string for the generated model state.
    Defaults to "ModelState".

  • force: logical
    Specifies whether or not to replace the existing state for the model.
    Defaults to FALSE.

After this method is called, an attribute named state, which contains the parsed model information, is assigned to the corresponding R6 object.

DeleteModelState(state=NULL)

Usage:
Assuming we have trained a hanaml model and created its model state, like the following:


   > fpr <- hanaml.FRM(data=df)
   > fpr$CreateModelState()


After using the model state for real-time scoring, we can delete the state by calling:


   > fpr$DeleteModelState()


Arguments:

  • state: DataFrame
    DataFrame containing the state info.
    Defaults to self$state.

After this method is called, the specified model state is cleaned up and its associated memory is released.

Details

Factorized Polynomial Regression Models (FRM) have proven to be a powerful tool for prediction applications such as recommendation. They combine the advantages of polynomial regression models with those of factorization models. Unlike SVMs, they allow reliable parameter estimation under very sparse data, where only a few observations of higher-order effects are available. Because the higher-order interactions are factorized, these models can be computed with linear complexity.

In recommendation applications, side features may describe:

  • each user-item rating/transaction, for example the location or time at which a movie was rated by or lent to a user.

  • each user, e.g. gender, age, education, etc.

  • each item, such as the genre of a movie.
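
The linear-complexity claim in the description above can be illustrated with the standard second-order factorization machine form, which is assumed here to be the model family this function fits. The plain-R sketch below computes the same prediction twice: once with an explicit loop over all feature pairs, and once with the well-known reformulation that needs only one pass over the features per factor. The function names and the toy data are hypothetical, and this is not PAL's implementation.

```r
# w0: bias, w: linear weights (length p), V: p x k factor matrix, x: features.

# Naive version: loops over all feature pairs, O(k * p^2).
fm_predict_naive <- function(x, w0, w, V) {
  p <- length(x)
  pair <- 0
  for (i in seq_len(p - 1)) {
    for (j in (i + 1):p) {
      pair <- pair + sum(V[i, ] * V[j, ]) * x[i] * x[j]
    }
  }
  w0 + sum(w * x) + pair
}

# Linear version: O(k * p), using
# sum_{i<j} <v_i, v_j> x_i x_j
#   = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ].
fm_predict_linear <- function(x, w0, w, V) {
  s  <- as.vector(crossprod(V, x))       # sum_i v_if x_i, one value per factor
  s2 <- as.vector(crossprod(V^2, x^2))   # sum_i v_if^2 x_i^2
  w0 + sum(w * x) + 0.5 * sum(s^2 - s2)
}

set.seed(42)
p <- 6; k <- 3
x  <- c(1, 0, 0, 1, 0, 2.5)              # sparse toy feature vector
w0 <- 0.1
w  <- rnorm(p)
V  <- matrix(rnorm(p * k, sd = 0.1), nrow = p, ncol = k)

all.equal(fm_predict_naive(x, w0, w, V),
          fm_predict_linear(x, w0, w, V))   # TRUE
```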

Examples


> data$Head(5)$Collect()
 ID USER    MOVIE TIMESTAMP RATING
1 1    A   Movie1         3    4.8
2 2    A   Movie2         3    4.0
3 3    A   Movie4         1    4.0
4 4    A   Movie5         2    4.0
5 5    A   Movie6         3    4.8

> user.info$Collect()
     USER     USER_SIDE_FEATURE
1      NA                    NA

> item.info$Head(5)$Collect()
   MOVIE  GENRES
1 Movie1  Sci-Fi
2 Movie2  Drama,Romance
3 Movie3  Drama,Sci-Fi
4 Movie4  Crime,Drama
5 Movie5  Crime,Drama

Call the function:

FM <- hanaml.FRM(data = data,
                 user.info = user.info,
                 item.info = item.info,
                 categorical.variable = "TIMESTAMP",
                 resampling.method = "cv",
                 solver = "momentum",
                 learning.rate = 0,
                 max.iter = 100,
                 param.search.strategy = "grid",
                 evaluation.metric = "rmse",
                 fold.num = 5, repeat.times = 1, timeout = 0,
                 progress.indicator.id = "PAL_FRM",
                 thread.ratio = 0.5, random.state = 1,
                 parameter.range = list(factor.num= c(1,1,3)),
                 parameter.values = list(linear.lambda = c(1e-6, 1e-8, 1e-10),
                                         poly2.lambda = c(1e-4, 1e-6, 1e-8),
                                         momentum = c(0.8, 0.9)))

Output:


> FM$model$Head(5)$Collect()
 ID MAP      WEIGHT
1 0   A -0.01932846
2 1   B  0.73047553
3 2   C -0.22821216
4 3   D  0.05358953
5 4   E  0.03182115

> FM$optim.param$Collect()
             PARAM_NAME INT_VALUE DOUBLE_VALUE STRING_VALUE
1              MOMENTUM        NA        9e-01         <NA>
2        REGULARIZATION        NA        1e-08         <NA>
3 LINEAR_REGULARIZATION        NA        1e-06         <NA>
4         FACTOR_NUMBER         1           NA         <NA>

See also