hanaml.FRM is an R wrapper
for SAP HANA PAL Factorized Polynomial Regression Models (FRM).
hanaml.FRM(
data = NULL,
key = NULL,
user.info = NULL,
item.info = NULL,
categorical.variable = NULL,
user.categorical.variable = NULL,
item.categorical.variable = NULL,
solver = NULL,
factor.num = NULL,
init.variance = NULL,
random.state = NULL,
learning.rate = NULL,
linear.lambda = NULL,
poly2.lambda = NULL,
max.iter = NULL,
sgd.tol = NULL,
sgd.exit.interval = NULL,
momentum = NULL,
thread.ratio = NULL,
resampling.method = NULL,
evaluation.metric = NULL,
fold.num = NULL,
param.search.strategy = NULL,
repeat.times = NULL,
progress.indicator.id = NULL,
random.search.times = NULL,
timeout = NULL,
parameter.values = NULL,
parameter.range = NULL
)
Arguments
| data |
DataFrame
DataFrame containing user-item interaction data and global
side features, structured as follows:
ID column
USER ID column
ITEM ID column
Side feature columns
Feedback/rating column
|
| key |
character, optional
Name of the ID column.
If not provided, the data is assumed to have no ID column.
No default value.
|
| user.info |
DataFrame
DataFrame containing side features of users,
structured as follows:
USER ID column
Side feature columns
|
| item.info |
DataFrame
DataFrame containing side features of items,
structured as follows:
ITEM ID column
Side feature columns
|
| categorical.variable |
character or list/vector of characters, optional
Names of the columns in data that should be treated as categorical variables.
By default, 'VARCHAR' or 'NVARCHAR' columns are categorical, and 'INTEGER' or 'DOUBLE' columns are continuous.
Valid only for columns of 'INTEGER' type; ignored otherwise.
No default value. |
| user.categorical.variable |
list/vector of characters, optional
Names of the columns in user.info that should be treated as
categorical variables even though their data type is INTEGER.
By default, 'VARCHAR' or 'NVARCHAR' columns are categorical,
and 'INTEGER' or 'DOUBLE' columns are continuous.
|
| item.categorical.variable |
list/vector of characters, optional
Names of the columns in item.info that should be treated as
categorical variables even though their data type is INTEGER.
By default, 'VARCHAR' or 'NVARCHAR' columns are categorical,
and 'INTEGER' or 'DOUBLE' columns are continuous.
|
| solver |
c("sgd", "momentum", "nag", "adagrad"), optional
Specifies the optimization solver used to train the model:
"sgd" Stochastic Gradient Descent.
"momentum" Momentum.
"nag" Nesterov Accelerated Gradient.
"adagrad" Adaptive gradient algorithm.
Defaults to "sgd". |
| factor.num |
integer, optional
Length of the factor vectors.
Defaults to 8.
|
| init.variance |
numeric, optional
Variance of the normal distribution used to initialize
the model parameters.
Defaults to 1e-2.
|
| random.state |
numeric, optional
Specifies the seed for random number generation.
0 means the current system time is used as the seed;
any other value is used directly as the seed.
Defaults to 0.
|
| learning.rate |
numeric, optional
Specifies the learning rate (step size) for the optimization
process.
If set to the default value 0, the function
chooses the step size automatically, based on a small
part of the dataset.
Defaults to 0.
|
| linear.lambda |
numeric, optional
Specifies the penalization assigned to the L2 regularization term of linear weights.
Defaults to 1e-10.
|
| poly2.lambda |
numeric, optional
Specifies the penalization assigned to the L2 regularization term of quadratic factors.
Defaults to 1e-8.
|
| max.iter |
integer, optional
Specifies the maximum number of iterations for
optimization process.
Defaults to 50.
|
| sgd.tol |
numeric, optional
Specifies the tolerance of the stopping criterion.
The algorithm exits when the cost function has not
decreased by more than sgd.tol in
sgd.exit.interval steps.
Defaults to 1e-5.
|
| sgd.exit.interval |
numeric, optional
Specifies the number of steps over which the stopping criterion is checked.
The algorithm exits when the cost function has not
decreased by more than sgd.tol in sgd.exit.interval steps.
Defaults to 5.
|
| momentum |
numeric, optional
Specifies the momentum value for the Momentum or NAG method.
Only valid when solver is "momentum" or "nag".
Defaults to 0.9.
|
| thread.ratio |
double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads.
Values between 0 and 1 will use up to
that percentage of available threads. Values outside this
range are ignored.
Defaults to 0.
|
| resampling.method |
{"cv", "bootstrap"}, optional
Specifies the resampling method used for model
evaluation and parameter selection.
If no value is specified for this parameter, neither model
evaluation nor parameter selection is activated.
|
| evaluation.metric |
{"rmse"}, optional
Specifies the evaluation metric for model evaluation or
parameter selection, only RMSE is supported.
If not specified, neither model evaluation nor parameter
selection is activated.
|
| fold.num |
integer, optional
Specifies the number of folds for cross-validation (cv).
Mandatory and valid only when resampling.method
is "cv".
|
| param.search.strategy |
{"grid", "random"}, optional
Specifies the method to activate parameter selection.
If not specified, parameter selection shall not be
triggered.
|
| repeat.times |
numeric, optional
Specifies the number of repeat times for resampling.
Defaults to 1.
|
| progress.indicator.id |
character, optional
Sets an ID of progress indicator for model evaluation or
parameter selection.
No progress indicator is active if no value is provided.
|
| random.search.times |
integer, optional
Specifies the number of times to randomly select candidate
parameters for selection.
Mandatory and valid only when param.search.strategy is
set to "random".
|
| timeout |
integer, optional
Specifies maximum running time for model evaluation or
parameter selection in seconds.
No timeout when 0 is specified.
|
| parameter.values |
named list/vector, optional
Specifies values of the following parameters for parameter
selection:
factor.num, linear.lambda, poly2.lambda,
momentum.
|
| parameter.range |
named list/vector, optional
Specifies range of the following parameters for parameter
selection:
factor.num, linear.lambda, poly2.lambda,
momentum.
Parameter range should be specified by 3 numbers in the
form of c(start, step, end).
Examples:
parameter.range <- list(factor.num = c(3, 1, 6)).
|
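The c(start, step, end) convention for parameter.range can be pictured with a small base R sketch. This only illustrates how such a triple could expand into candidate values and how a grid search would combine them; the helper name expand.range, the inclusive end point, and the momentum values are assumptions made for the illustration and are not part of hanaml.FRM.

```r
# Hypothetical helper: expand a c(start, step, end) triple into candidates.
# Assumption: the end point is included when the step lands on it exactly.
expand.range <- function(r) {
  stopifnot(length(r) == 3, r[2] > 0, r[1] <= r[3])
  seq(from = r[1], to = r[3], by = r[2])
}

# Range taken from the example above; the momentum values are hypothetical.
parameter.range <- list(factor.num = c(3, 1, 6))
parameter.values <- list(momentum = c(0.8, 0.9))

# A grid search would evaluate every combination of the candidates.
candidates <- c(lapply(parameter.range, expand.range), parameter.values)
grid <- expand.grid(candidates)
print(grid)
```

Each row of grid is one candidate configuration, so the number of rows gives a rough idea of how the cost of a grid search grows as ranges and value lists are added.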
Value
A "FRM" object with the following attributes:
model.meta.data: DataFrame. Metadata of the trained model.
model: DataFrame. Weights of the trained model.
model.factors: DataFrame. Factors of the trained model.
iter.info: DataFrame. Information recorded during the
optimization process.
stat: DataFrame. Statistics for model evaluation/parameter
selection.
Available only when model evaluation/parameter selection
is enabled.
optim.param: DataFrame. The selected optimal parameters.
Available only when model evaluation/parameter selection
is enabled.
Details
Factorized Polynomial
Regression Models have proven to be a powerful
tool for prediction applications such as recommendation.
They combine the advantages of polynomial regression
models with those of factorization models. Unlike SVMs, they allow
reliable parameter estimation under very sparse data,
where just a few observations for higher-order effects are
available. Because those higher-order interactions are
factorized, the model can be computed with linear
complexity.
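For orientation, the standard second-order factorization machine from the literature is written out below. Whether PAL uses exactly this parameterization is an assumption; the symbols are introduced here for illustration, with n the number of features, k the length of the factor vectors (presumably what factor.num controls), w the linear weights, and v_i the factor vector of feature i.

```latex
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
  + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j

\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j
  = \frac{1}{2} \sum_{f=1}^{k} \left[ \Big( \sum_{i=1}^{n} v_{i,f} \, x_i \Big)^{2}
  - \sum_{i=1}^{n} v_{i,f}^{2} \, x_i^{2} \right]
```

The second identity is what makes the pairwise term computable in O(kn) rather than O(kn^2) operations.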
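Side features can be supplied at three levels:
Global side features: features of each user-item rating/transaction, for
example the location or time at which a movie was rated by
or lent to a user.
User side features: features of each user, e.g. gender, age, education.
Item side features: features of each item, such as the genre of a movie.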
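The linear-complexity remark in this section can be checked numerically with a short base R sketch. It assumes the standard second-order factorization machine form of the pairwise interaction term, which may differ from PAL's internal implementation; the sizes and the randomly drawn values are hypothetical.

```r
# Hypothetical sizes and random parameters, for illustration only.
set.seed(1)
n <- 6                                    # number of features
k <- 3                                    # length of factor vectors (cf. factor.num)
x <- rnorm(n)                             # one feature vector
V <- matrix(rnorm(n * k, sd = 0.1), nrow = n, ncol = k)  # factor matrix

# Naive pairwise sum over all i < j: O(k * n^2) operations.
pairwise.naive <- 0
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    pairwise.naive <- pairwise.naive + sum(V[i, ] * V[j, ]) * x[i] * x[j]
  }
}

# Factorized form: O(k * n) operations.
s <- as.vector(crossprod(V, x))           # per factor f: sum_i v_{i,f} * x_i
s2 <- as.vector(crossprod(V^2, x^2))      # per factor f: sum_i v_{i,f}^2 * x_i^2
pairwise.fast <- 0.5 * sum(s^2 - s2)

# Both forms should agree up to floating-point error.
print(all.equal(pairwise.naive, pairwise.fast))
```

The naive loop visits every feature pair, while the factorized form only needs two matrix-vector products, which is why sparse, high-dimensional inputs stay tractable.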
Examples
> data$Head(5)$Collect()
ID USER MOVIE TIMESTAMP RATING
1 1 A Movie1 3 4.8
2 2 A Movie2 3 4.0
3 3 A Movie4 1 4.0
4 4 A Movie5 2 4.0
5 5 A Movie6 3 4.8
> user.info$Collect()
USER USER_SIDE_FEATURE
1 NA NA
> item.info$Head(5)$Collect()
MOVIE GENRES
1 Movie1 Sci-Fi
2 Movie2 Drama,Romance
3 Movie3 Drama,Sci-Fi
4 Movie4 Crime,Drama
5 Movie5 Crime,Drama
Call the function:
FM <- hanaml.FRM(data = data,
                 user.info = user.info,
                 item.info = item.info,
                 categorical.variable = "TIMESTAMP",
                 resampling.method = "cv",
                 solver = "momentum",
                 learning.rate = 0,
                 max.iter = 100,
                 param.search.strategy = "grid",
                 evaluation.metric = "rmse",
                 fold.num = 5, repeat.times = 1, timeout = 0,
                 progress.indicator.id = "PAL_FRM",
                 thread.ratio = 0.5, random.state = 1,
                 parameter.range = list(factor.num = c(1, 1, 3)),
                 parameter.values = list(linear.lambda = c(1e-6, 1e-8, 1e-10),
                                         poly2.lambda = c(1e-4, 1e-6, 1e-8),
                                         momentum = c(0.8, 0.9)))
Output:
> FM$model$Head(5)$Collect()
ID MAP WEIGHT
1 0 A -0.01932846
2 1 B 0.73047553
3 2 C -0.22821216
4 3 D 0.05358953
5 4 E 0.03182115
> FM$optim.param$Collect()
PARAM_NAME INT_VALUE DOUBLE_VALUE STRING_VALUE
1 MOMENTUM NA 9e-01 <NA>
2 REGULARIZATION NA 1e-08 <NA>
3 LINEAR_REGULARIZATION NA 1e-06 <NA>
4 FACTOR_NUMBER 1 NA <NA>
See also