Factorized Polynomial Regression Models

hanaml.FRM is an R wrapper for SAP HANA PAL Factorized Polynomial Regression Models (FRM).

hanaml.FRM(
  data = NULL,
  key = NULL,
  user.info = NULL,
  item.info = NULL,
  categorical.variable = NULL,
  user.categorical.variable = NULL,
  item.categorical.variable = NULL,
  solver = NULL,
  factor.num = NULL,
  init.variance = NULL,
  random.state = NULL,
  learning.rate = NULL,
  linear.lambda = NULL,
  poly2.lambda = NULL,
  max.iter = NULL,
  sgd.tol = NULL,
  sgd.exit.interval = NULL,
  momentum = NULL,
  thread.ratio = NULL,
  resampling.method = NULL,
  evaluation.metric = NULL,
  fold.num = NULL,
  param.search.strategy = NULL,
  repeat.times = NULL,
  progress.indicator.id = NULL,
  random.search.times = NULL,
  timeout = NULL,
  parameter.values = NULL,
  parameter.range = NULL
)

Arguments

data	`DataFrame` DataFrame containing user-item interaction data and global side features, structured as follows: ID column, USER ID column, ITEM ID column, side feature columns, feedback/rating column.
key	`character, optional` Name of the ID column. If not provided, the data is assumed to have no ID column. No default value.
user.info	`DataFrame` DataFrame containing side features about users, structured as follows: USER ID column, side feature columns.
item.info	`DataFrame` DataFrame containing side features about items, structured as follows: ITEM ID column, side feature columns.
categorical.variable	`character or list/vector of characters, optional` Indicates which features should be treated as categorical variables. By default, "VARCHAR" and "NVARCHAR" columns are categorical, and "INTEGER" and "DOUBLE" columns are continuous. Valid only for columns of "INTEGER" type; ignored otherwise. No default value.
user.categorical.variable	`list/vector of characters, optional` Names of columns in user.info that should be treated as categorical variables even though their data type is INTEGER. By default, 'VARCHAR' and 'NVARCHAR' columns are categorical, and 'INTEGER' and 'DOUBLE' columns are continuous.
item.categorical.variable	`list/vector of characters, optional` Names of columns in item.info that should be treated as categorical variables even though their data type is INTEGER. By default, 'VARCHAR' and 'NVARCHAR' columns are categorical, and 'INTEGER' and 'DOUBLE' columns are continuous.
solver	`c("sgd", "momentum", "nag", "adagrad"), optional` Specifies the optimization solver used to train the model: `"sgd"`: Stochastic Gradient Descent; `"momentum"`: Momentum; `"nag"`: Nesterov Accelerated Gradient; `"adagrad"`: Adaptive Gradient. Defaults to "sgd".
factor.num	`integer, optional` Length of the factor vectors. Defaults to 8.
init.variance	`numeric, optional` Variance of the normal distribution used to initialize the model parameters. Defaults to 1e-2.
random.state	`numeric, optional` Specifies the seed for random number generation. A value of 0 means the current system time is used as the seed; any other value is used directly as the seed. Defaults to 0.
learning.rate	`numeric, optional` Specifies the learning rate (step size) for the optimization process. If set to the default value 0, the function chooses the step size automatically, based on a small part of the dataset. Defaults to 0.
linear.lambda	`numeric, optional` Specifies the penalization assigned to the L2 regularization term of linear weights. Defaults to 1e-10.
poly2.lambda	`numeric, optional` Specifies the penalization assigned to the L2 regularization term of quadratic factors. Defaults to 1e-8.
max.iter	`integer, optional` Specifies the maximum number of iterations for optimization process. Defaults to 50.
sgd.tol	`numeric, optional` Specifies the tolerance of the stopping criterion. The algorithm exits when the cost function has not decreased by more than sgd.tol in sgd.exit.interval steps. Defaults to 1e-5.
sgd.exit.interval	`numeric, optional` Specifies the interval of the stopping criterion. The algorithm exits when the cost function has not decreased by more than sgd.tol in sgd.exit.interval steps. Defaults to 5.
momentum	`numeric, optional` Specifies the momentum value for the Momentum or NAG method. Only valid when solver is "momentum" or "nag". Defaults to 0.9.
thread.ratio	`double, optional` Controls the proportion of available threads that can be used by this function. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored. Defaults to 0.
resampling.method	`{"cv", "bootstrap"}, optional` Specifies the resampling method used for model evaluation and parameter selection. If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.
evaluation.metric	`{"rmse"}, optional` Specifies the evaluation metric for model evaluation or parameter selection. Only RMSE is supported. If not specified, neither model evaluation nor parameter selection is activated.
fold.num	`integer, optional` Specifies the number of folds for cross-validation (cv). Mandatory and valid only when `resampling.method` is "cv".
param.search.strategy	`{"grid", "random"}, optional` Specifies the method used for parameter selection. If not specified, parameter selection is not triggered.
repeat.times	`numeric, optional` Specifies the number of repeat times for resampling. Defaults to 1.
progress.indicator.id	`character, optional` Sets an ID of progress indicator for model evaluation or parameter selection. No progress indicator is active if no value is provided.
random.search.times	`integer, optional` Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid only when `param.search.strategy` is set to "random".
timeout	`integer, optional` Specifies the maximum running time for model evaluation or parameter selection, in seconds. No timeout when 0 is specified.
parameter.values	`named list/vector, optional` Specifies values of the following parameters for parameter selection: `factor.num, linear.lambda, poly2.lambda, momentum`.
parameter.range	`named list/vector, optional` Specifies ranges of the following parameters for parameter selection: `factor.num, linear.lambda, poly2.lambda, momentum`. Each range is specified by 3 numbers in the form c(start, step, end). Example: parameter.range <- list(factor.num = c(3, 1, 6)).
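
As a rough illustration of the c(start, step, end) form, the sketch below expands a range into candidate values with base R and crosses it with an explicit value list, which is roughly what a grid search would enumerate. The helper name expand.range is hypothetical and not part of hana.ml.r. The sketch assumes the end value is inclusive; PAL may enumerate candidates differently.

```r
# Hypothetical helper (not part of hana.ml.r): expand a c(start, step, end)
# triple into a vector of candidate values.
# Assumption: the end value is inclusive.
expand.range <- function(rng) {
  stopifnot(length(rng) == 3, rng[2] > 0, rng[1] <= rng[3])
  seq(from = rng[1], to = rng[3], by = rng[2])
}

# Range taken from the parameter.range description above.
parameter.range <- list(factor.num = c(3, 1, 6))
# Explicit candidates, as they would be passed in parameter.values.
parameter.values <- list(momentum = c(0.8, 0.9))

# Combine both sources and build the full grid of candidate combinations.
candidates <- c(lapply(parameter.range, expand.range), parameter.values)
grid <- expand.grid(candidates)

print(candidates$factor.num)  # 3 4 5 6
print(nrow(grid))             # 4 factor.num values x 2 momentum values = 8
```

The size of the grid is the product of the number of candidates per parameter, so it is worth checking it locally before combining wide ranges with a large fold.num or repeat.times.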

Value

An "FRM" object with the following attributes:

model.meta.data: DataFrame containing the metadata of the trained model.
model: DataFrame containing the weights of the trained model.
model.factors: DataFrame containing the factors of the trained model.
iter.info: DataFrame containing information recorded during the optimization process.
stat: DataFrame containing statistics for model evaluation/parameter selection. Available only when model evaluation/parameter selection is enabled.
optim.param: DataFrame containing the selected optimal parameters. Available only when model evaluation/parameter selection is enabled.

Details

Factorized Polynomial Regression Models (FRM) have proven to be a powerful tool for prediction applications such as recommendation. FRM combines the advantages of polynomial regression models with those of factorization models. Unlike SVM, it allows reliable parameter estimation under very sparse data, where only a few observations of higher-order effects are available. Because those higher-order interactions are factorized, the model can be computed with linear complexity.
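
The linear-complexity property can be made concrete with the standard second-order factorization machine formulation. This is a sketch under the assumption that FRM follows that textbook form; the symbols below (n features, factor length k corresponding to factor.num, linear weights w, factor vectors v) are introduced here for illustration and do not come from the PAL documentation.

```latex
% Second-order model: bias, linear terms, and factorized pairwise interactions
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
  + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j,
\qquad
\langle \mathbf{v}_i, \mathbf{v}_j \rangle = \sum_{f=1}^{k} v_{i,f} \, v_{j,f}

% The pairwise term can be rewritten so that it costs O(kn) instead of O(kn^2)
\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j
  = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f} \, x_i \right)^{2}
  - \sum_{i=1}^{n} v_{i,f}^{2} \, x_i^{2} \right]
```

Under this reading, linear.lambda penalizes the linear weights w and poly2.lambda penalizes the factor vectors v. With sparse input, the inner sums only run over the non-zero entries of x.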

Side features can describe:

each user-item rating/transaction, for example the location where or the time at which a movie was rated by or lent to a user
each user, for example gender, age, or education
each item, for example the genre of a movie

Examples

> data$Head(5)$Collect()
 ID USER    MOVIE TIMESTAMP RATING
1 1    A   Movie1         3    4.8
2 2    A   Movie2         3    4.0
3 3    A   Movie4         1    4.0
4 4    A   Movie5         2    4.0
5 5    A   Movie6         3    4.8

> user.info$Collect()
     USER     USER_SIDE_FEATURE
1      NA                    NA

> item.info$Head(5)$Collect()
   MOVIE  GENRES
1 Movie1  Sci-Fi
2 Movie2  Drama,Romance
3 Movie3  Drama,Sci-Fi
4 Movie4  Crime,Drama
5 Movie5  Crime,Drama

Call the function:

FM <- hanaml.FRM(data = data,
                 user.info = user.info,
                 item.info = item.info,
                 categorical.variable = "TIMESTAMP",
                 resampling.method = "cv",
                 solver = "momentum",
                 learning.rate = 0,
                 max.iter = 100,
                 param.search.strategy = "grid",
                 evaluation.metric = "rmse",
                 fold.num = 5, repeat.times = 1, timeout = 0,
                 progress.indicator.id = "PAL_FRM",
                 thread.ratio = 0.5, random.state = 1,
                 parameter.range = list(factor.num = c(1, 1, 3)),
                 parameter.values = list(linear.lambda = c(1e-6, 1e-8, 1e-10),
                                         poly2.lambda = c(1e-4, 1e-6, 1e-8),
                                         momentum = c(0.8, 0.9)))

Output:

> FM$model$Head(5)$Collect()
 ID MAP      WEIGHT
1 0   A -0.01932846
2 1   B  0.73047553
3 2   C -0.22821216
4 3   D  0.05358953
5 4   E  0.03182115

> FM$optim.param$Collect()
             PARAM_NAME INT_VALUE  DOUBLE_VALUE STRING_VALUE
1              MOMENTUM        NA         9e-01         <NA>
2        REGULARIZATION        NA         1e-08         <NA>
3 LINEAR_REGULARIZATION        NA         1e-06         <NA>
4         FACTOR_NUMBER         1            NA         <NA>
