hanaml.FRM.Rd
hanaml.FRM is an R wrapper for SAP HANA PAL Factorized Polynomial Regression Models (FRM).
hanaml.FRM(
data = NULL,
key = NULL,
user.info = NULL,
item.info = NULL,
categorical.variable = NULL,
user.categorical.variable = NULL,
item.categorical.variable = NULL,
solver = NULL,
factor.num = NULL,
init.variance = NULL,
random.state = NULL,
learning.rate = NULL,
linear.lambda = NULL,
poly2.lambda = NULL,
max.iter = NULL,
sgd.tol = NULL,
sgd.exit.interval = NULL,
momentum = NULL,
thread.ratio = NULL,
resampling.method = NULL,
evaluation.metric = NULL,
fold.num = NULL,
param.search.strategy = NULL,
repeat.times = NULL,
progress.indicator.id = NULL,
random.search.times = NULL,
timeout = NULL,
parameter.values = NULL,
parameter.range = NULL,
reduction.rate = NULL,
min.resource.rate = NULL,
aggressive.elimination = NULL
)
DataFrame
DataFrame containing the user-item interaction data and global
side features, structured as follows:
ID column
USER ID column
ITEM ID column
Side feature columns
Feedback/rating column
character, optional
Name of the ID column.
If not provided, the data is assumed to have no ID column.
No default value.
DataFrame
DataFrame containing side features about users,
structured as follows:
USER ID column
Side feature columns
DataFrame
DataFrame containing side features about items,
structured as follows:
ITEM ID column
Side feature columns
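The three input layouts described above can be pictured with plain local data frames. This is only an illustrative sketch: the values are made up, the column names follow the example data further down this page, and the objects are ordinary base R data.frames rather than HANA DataFrames (uploading them to SAP HANA is out of scope here).

```r
# Local stand-ins for data, user.info and item.info (illustrative values only).
interactions <- data.frame(
  ID        = 1:4,
  USER      = c("A", "A", "B", "B"),
  MOVIE     = c("Movie1", "Movie2", "Movie1", "Movie3"),
  TIMESTAMP = c(3L, 3L, 1L, 2L),       # global side feature
  RATING    = c(4.8, 4.0, 3.5, 4.2),   # feedback column comes last
  stringsAsFactors = FALSE
)
user_info <- data.frame(
  USER = c("A", "B"),
  AGE  = c(31L, 45L),                  # hypothetical user side feature
  stringsAsFactors = FALSE
)
item_info <- data.frame(
  MOVIE  = c("Movie1", "Movie2", "Movie3"),
  GENRES = c("Sci-Fi", "Drama,Romance", "Drama,Sci-Fi"),
  stringsAsFactors = FALSE
)

# Side features attach to interactions through the USER ID and ITEM ID columns.
joined <- merge(merge(interactions, user_info, by = "USER", all.x = TRUE),
                item_info, by = "MOVIE", all.x = TRUE)
print(joined)
```

The join is shown only to make the key relationships visible; it is not something the function requires the caller to do.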
character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the column type:
"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.
Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.
list/vector of characters, optional
Names of the columns in user.info that should be treated as
categorical variables even though their data type is INTEGER.
By default, 'VARCHAR' or 'NVARCHAR' columns are categorical
variables, and 'INTEGER' or 'DOUBLE' columns are continuous variables.
list/vector of characters, optional
Names of the columns in item.info that should be treated as
categorical variables even though their data type is INTEGER.
By default, 'VARCHAR' or 'NVARCHAR' columns are categorical
variables, and 'INTEGER' or 'DOUBLE' columns are continuous variables.
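To illustrate what the categorical/continuous distinction means for an INTEGER column, the following base R sketch encodes the same values both ways. It is a conceptual illustration only; the indicator encoding shown here is an assumption for explanation and may differ from how PAL represents categorical features internally.

```r
timestamp <- c(3L, 3L, 1L, 2L, 3L)

# Continuous treatment: one numeric column, the value itself is the feature.
x_cont <- matrix(as.numeric(timestamp), ncol = 1,
                 dimnames = list(NULL, "TIMESTAMP"))

# Categorical treatment: one indicator column per distinct value.
x_cat <- model.matrix(~ factor(timestamp) - 1)
colnames(x_cat) <- paste0("TIMESTAMP_", sort(unique(timestamp)))

print(x_cont)
print(x_cat)
```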
c("sgd", "momentum", "nag", "adagrad"), optional
Specifies the optimization solver used to train the model:
"sgd": Stochastic Gradient Descent solver.
"momentum": Momentum.
"nag": Nesterov Accelerated Gradient.
"adagrad": Adaptive gradient algorithm.
Defaults to "sgd".
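The four solver names correspond to well-known gradient update rules. The sketch below applies the textbook form of each rule to a one-dimensional toy problem; it is not PAL's implementation, and the function name run_solver, the toy objective, and the step sizes are assumptions chosen for illustration.

```r
# Toy objective f(w) = 0.5 * (w - 3)^2, whose gradient is (w - 3).
grad <- function(w) w - 3

run_solver <- function(solver, steps = 200, lr = 0.1, mu = 0.9, eps = 1e-8) {
  w <- 0    # parameter being optimised
  v <- 0    # velocity (momentum and NAG)
  g2 <- 0   # accumulated squared gradients (Adagrad)
  for (i in seq_len(steps)) {
    if (solver == "sgd") {
      w <- w - lr * grad(w)
    } else if (solver == "momentum") {
      v <- mu * v - lr * grad(w)
      w <- w + v
    } else if (solver == "nag") {
      v <- mu * v - lr * grad(w + mu * v)   # gradient at the look-ahead point
      w <- w + v
    } else if (solver == "adagrad") {
      g <- grad(w)
      g2 <- g2 + g^2
      w <- w - lr * g / (sqrt(g2) + eps)    # per-parameter shrinking step
    } else {
      stop("unknown solver: ", solver)
    }
  }
  w
}

results <- c(
  sgd      = run_solver("sgd"),
  momentum = run_solver("momentum"),
  nag      = run_solver("nag"),
  adagrad  = run_solver("adagrad", lr = 1)  # Adagrad shrinks its own step, so start larger
)
print(results)
```

All four runs should approach the minimiser w = 3; they differ in how the step is formed from the gradient history, which is what the momentum parameter controls for "momentum" and "nag".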
integer, optional
Length of the factor vectors.
Defaults to 8.
numeric, optional
Variance of the normal distribution used to initialize
the model parameters.
Defaults to 1e-2.
numeric, optional
Specifies the seed for random number generation, where
0 means the current system time
is used as the seed, and any other value is used directly
as the seed.
Defaults to 0.
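As a rough picture of how factor.num, init.variance and a fixed seed interact, the sketch below draws an initial factor matrix in base R. It assumes a zero-mean normal distribution and uses R's own random number generator, so it will not reproduce the values PAL generates; init_factors is a hypothetical helper.

```r
# Draw an n_features x factor_num matrix of initial factors.
# Assumption: zero mean; the variance is init_variance, so sd = sqrt(init_variance).
init_factors <- function(n_features, factor_num = 8,
                         init_variance = 1e-2, seed = 1) {
  set.seed(seed)  # a fixed non-zero seed makes the draw reproducible
  matrix(rnorm(n_features * factor_num, mean = 0, sd = sqrt(init_variance)),
         nrow = n_features, ncol = factor_num)
}

V1 <- init_factors(1000)
V2 <- init_factors(1000)
print(dim(V1))
print(identical(V1, V2))
```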
numeric, optional
Specifies the learning rate (step size) for the optimization
process.
If you set it to the default value 0, the function
chooses the step size automatically, based on a small
part of the dataset.
Defaults to 0.
numeric, optional
Specifies the penalization assigned to the L2 regularization term of linear weights.
Defaults to 1e-10.
numeric, optional
Specifies the penalization assigned to the L2 regularization term of quadratic factors.
Defaults to 1e-8.
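The roles of linear.lambda and poly2.lambda can be sketched as two separate L2 penalties added to a squared-error cost. The exact scaling PAL applies to each term is not stated on this page, so the form below (and the helper name penalised_cost) is an assumption for illustration.

```r
# residuals: prediction errors; w: linear weights; V: factor matrix.
penalised_cost <- function(residuals, w, V,
                           linear_lambda = 1e-10, poly2_lambda = 1e-8) {
  sum(residuals^2) +              # data-fit term
    linear_lambda * sum(w^2) +    # L2 penalty on linear weights
    poly2_lambda * sum(V^2)       # L2 penalty on quadratic factors
}

# Small worked example with large lambdas so each term is visible: 2 + 2 + 1.
cost <- penalised_cost(residuals = c(1, -1), w = 2, V = matrix(1, 2, 2),
                       linear_lambda = 0.5, poly2_lambda = 0.25)
print(cost)
```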
integer, optional
Specifies the maximum number of iterations for the
optimization process.
Defaults to 50.
numeric, optional
Specifies the tolerance of the stop criterion.
The algorithm exits when the cost function has not
decreased by more than sgd.tol in sgd.exit.interval steps.
Defaults to 1e-5.
numeric, optional
Specifies the interval, in steps, of the stop criterion.
The algorithm exits when the cost function has not
decreased by more than sgd.tol in sgd.exit.interval steps.
Defaults to 5.
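One way to read the combined stop criterion is sketched below: compare the current cost with the cost recorded sgd.exit.interval steps earlier and stop when the decrease does not exceed sgd.tol. This is an interpretation of the description above, not PAL's code; should_stop is a hypothetical helper.

```r
# cost_history: cost function value after each step, oldest first.
should_stop <- function(cost_history, sgd_tol = 1e-5, sgd_exit_interval = 5) {
  n <- length(cost_history)
  if (n <= sgd_exit_interval) return(FALSE)   # not enough history yet
  decrease <- cost_history[n - sgd_exit_interval] - cost_history[n]
  decrease <= sgd_tol
}

still_improving <- should_stop(c(10, 5, 2, 1, 0.5, 0.2, 0.1))
stalled         <- should_stop(c(10, rep(1, 6)))
print(c(still_improving = still_improving, stalled = stalled))
```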
numeric, optional
Specifies the momentum value for Momentum or NAG method.
Only valid when solver is "momentum" or "nag".
Defaults to 0.9.
double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads.
Values between 0 and 1 will use up to
that percentage of available threads. Values outside this
range are ignored.
Defaults to 0.
character, optional
Specifies the resampling method for model
evaluation and parameter selection.
Valid options include:
"cv", "bootstrap", "cv_sha", "bootstrap_sha", "cv_hyperband",
"bootstrap_hyperband".
If no value is specified for this parameter, neither model
evaluation nor parameter selection is activated.
No default value.
"rmse", optional
Specifies the evaluation metric for model evaluation or
parameter selection, currently the only valid option is "rmse".
If not specified, neither model evaluation nor parameter
selection is activated.
integer, optional
Specifies the number of folds for cross-validation (cv).
Mandatory and valid only when resampling.method
is "cv", "cv_sha" or "cv_hyperband".
No default value.
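For readers unfamiliar with the "cv" resampling method and the "rmse" metric, the following base R sketch shows the general idea of k-fold cross-validation scored by RMSE. The model is a deliberately trivial placeholder (predicting the training mean) and the ratings are made up; it does not reflect how PAL fits or evaluates FRM internally.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

kfold_rmse <- function(y, fold_num = 5, seed = 1) {
  set.seed(seed)
  # Assign every observation to one of fold_num folds at random.
  fold <- sample(rep(seq_len(fold_num), length.out = length(y)))
  sapply(seq_len(fold_num), function(k) {
    train <- y[fold != k]
    test  <- y[fold == k]
    # Placeholder model: predict the mean of the training ratings.
    rmse(test, rep(mean(train), length(test)))
  })
}

ratings <- c(4.8, 4.0, 4.0, 4.0, 4.8, 3.5, 4.2, 3.9, 4.5, 4.1)
scores <- kfold_rmse(ratings, fold_num = 5)
print(scores)
print(mean(scores))   # average RMSE over the folds
```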
c("grid", "random"), optional
Specifies the parameter search strategy to activate parameter selection.
Defaults to "random" and cannot be changed if resampling.method
is specified as "cv_hyperband" or "bootstrap_hyperband"; otherwise
there is no default value.
numeric, optional
Specifies the number of repeat times for resampling.
Defaults to 1.
character, optional
Sets the ID of the progress indicator for model evaluation or
parameter selection.
No progress indicator is active if no value is provided.
integer, optional
Specifies the number of times to randomly select candidate
parameters for selection.
Mandatory and valid when param.search.strategy is set to "random".
integer, optional
Specifies the maximum running time for model evaluation or
parameter selection, in seconds.
No timeout when 0 is specified.
named list/vector, optional
Specifies values of the following parameters for parameter
selection: factor.num, linear.lambda, poly2.lambda, momentum.
named list/vector, optional
Specifies ranges of the following parameters for parameter
selection: factor.num, linear.lambda, poly2.lambda, momentum.
Each parameter range should be specified by 3 numbers in the
form of c(start, step, end).
Example:
parameter.range <- list(factor.num = c(3, 1, 6))
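The sketch below shows how a c(start, step, end) range and explicit value lists could be expanded into candidate parameter sets, using the values from the example call on this page. It assumes the end value is inclusive, which this page does not state; expand_range is a hypothetical helper and the enumeration is done locally in base R, not by PAL.

```r
# Expand c(start, step, end) into the candidate values it denotes.
# Assumption: the end value is included.
expand_range <- function(r) seq(from = r[1], to = r[3], by = r[2])

parameter_range  <- list(factor.num = c(1, 1, 3))
parameter_values <- list(linear.lambda = c(1e-6, 1e-8, 1e-10),
                         poly2.lambda  = c(1e-4, 1e-6, 1e-8),
                         momentum      = c(0.8, 0.9))

candidates <- c(lapply(parameter_range, expand_range), parameter_values)

# "grid": every combination of candidate values.
grid <- expand.grid(candidates)
print(nrow(grid))

# "random": a fixed number of combinations drawn at random,
# which is the role random.search.times plays.
set.seed(1)
random_pick <- grid[sample(nrow(grid), 10), ]
print(head(random_pick, 3))
```

With three values for each of factor.num, linear.lambda and poly2.lambda and two for momentum, the grid has 3 x 3 x 3 x 2 = 54 candidates, which gives a feel for why a random or successive-halving strategy can be attractive.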
numeric, optional
Specifies the reduction rate for the number of hyper-parameter candidates.
In each round, the number of remaining candidates is divided by the value of this parameter,
so valid values must be greater than 1.0.
Valid only when parameter selection is activated and resampling.method
is specified with suffix "sha" or "hyperband".
Defaults to 3.0.
numeric, optional
Specifies the minimum resource rate that should be used in SHA or hyperband iteration.
Valid only when parameter selection is activated and resampling.method
is specified with suffix "sha" or "hyperband".
Defaults to 0.0.
logical, optional
Specifies whether to apply aggressive elimination in the successive-halving algorithm.
When set to TRUE, it eliminates more parameter candidates than
expected (as defined via reduction.rate).
This can improve run-time performance but may result in a sub-optimal hyper-parameter candidate.
Valid only when resampling.method
is specified with suffix "sha".
Defaults to FALSE.
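The effect of reduction.rate on the number of surviving candidates can be sketched as a simple schedule. The rounding rule (floor, with at least one survivor) is an assumption made for illustration, as is the helper name sha_schedule; PAL's actual schedule, and the effect of min.resource.rate and aggressive.elimination, are not modelled here.

```r
# Number of candidates remaining after each successive-halving round.
sha_schedule <- function(n_candidates, reduction_rate = 3) {
  stopifnot(reduction_rate > 1)   # otherwise the candidate set never shrinks
  sizes <- n_candidates
  while (tail(sizes, 1) > 1) {
    remaining <- max(1, floor(tail(sizes, 1) / reduction_rate))
    sizes <- c(sizes, remaining)
  }
  sizes
}

print(sha_schedule(27))   # 27 9 3 1
print(sha_schedule(54))   # 54 18 6 2 1
```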
An R6 object of class "FRM", with the following attributes and methods:
Attributes
model.meta.data: DataFrame
Metadata of the trained model.
model: DataFrame
Weights of the trained model.
model.factors: DataFrame
Factors of the trained model.
iter.info: DataFrame
Information recorded during the optimization process.
stat: DataFrame
Statistics for model evaluation/parameter selection.
Available only when model evaluation/parameter selection
is enabled.
optim.param: DataFrame
The selected optimal parameters.
Available only when model evaluation/parameter selection
is enabled.
Methods
CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)
Usage:
> fpr <- hanaml.FRM(data=df)
> fpr$CreateModelState()
Arguments:
model: DataFrame
DataFrame containing the model for parsing.
Defaults to self$model.
algorithm: character
Specifies the PAL algorithm associated with model.
Defaults to self$pal.algorithm.
func: character
Specifies the functionality for Unified Classification/Regression.
Valid only for object instances of R6Class "UnifiedClassification" or "UnifiedRegression".
Defaults to self$func.
state.description: character
A summary string for the generated model state.
Defaults to "ModelState".
force: logical
Specifies whether or not to replace the existing state for model.
Defaults to FALSE.
After calling this method, an attribute state that contains the parsed info for model
is assigned to the corresponding R6 object.
DeleteModelState(state=NULL)
Usage:
Assuming we have trained a hanaml model and created its model state, as follows:
> fpr <- hanaml.FRM(data=df)
> fpr$CreateModelState()
After using the model state for real-time scoring, we can delete the state by calling:
> fpr$DeleteModelState()
Arguments:
state: DataFrame
DataFrame containing the state info.
Defaults to self$state.
After calling this method, the specified model state is cleaned up and the associated memory is released.
Factorized Polynomial Regression Models (FRM) have proven to be
a powerful tool for prediction applications such as recommendation.
They combine the advantages of polynomial regression
models with factorization models. Unlike SVM, they allow
reliable parameter estimation under very sparse data,
where just a few observations for higher-order effects are
available. Due to the factorization of those
higher-order interactions, the models can be calculated with linear
complexity.
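The linear-complexity claim rests on a standard algebraic identity for factorized second-order interactions: the sum over all feature pairs of <v_i, v_j> x_i x_j equals half of the sum, over factor dimensions, of (sum_i v_if x_i)^2 minus sum_i v_if^2 x_i^2. The sketch below checks the identity numerically in base R on random data. The function names are hypothetical, and this illustrates the identity only, not PAL's implementation.

```r
# Naive pairwise term: loops over all feature pairs, O(k * n^2).
fm_pairwise_naive <- function(x, V) {
  n <- length(x)
  total <- 0
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      total <- total + sum(V[i, ] * V[j, ]) * x[i] * x[j]
    }
  }
  total
}

# Reformulated term: one pass over features per factor dimension, O(k * n).
fm_pairwise_fast <- function(x, V) {
  0.5 * sum(crossprod(V, x)^2 - crossprod(V^2, x^2))
}

set.seed(1)
x <- rnorm(6)                       # feature vector with 6 features
V <- matrix(rnorm(6 * 3), 6, 3)     # factor matrix with 3 factors per feature
print(all.equal(fm_pairwise_naive(x, V), fm_pairwise_fast(x, V)))
```

For sparse inputs the sums only need to run over the non-zero entries of x, which is why the approach scales to recommendation data with very many mostly-empty indicator features.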
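Global side features: describe each user-item rating/transaction, for example the location or time at which a movie was rated by or lent to a user.
User side features: describe each user, e.g. gender, age, education, etc.
Item side features: describe each item, such as the genre of a movie.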
> data$Head(5)$Collect()
ID USER MOVIE TIMESTAMP RATING
1 1 A Movie1 3 4.8
2 2 A Movie2 3 4.0
3 3 A Movie4 1 4.0
4 4 A Movie5 2 4.0
5 5 A Movie6 3 4.8
> user.info$Collect()
USER USER_SIDE_FEATURE
1 NA NA
> item.info$Head(5)$Collect()
MOVIE GENRES
1 Movie1 Sci-Fi
2 Movie2 Drama,Romance
3 Movie3 Drama,Sci-Fi
4 Movie4 Crime,Drama
5 Movie5 Crime,Drama
Call the function:
FM <- hanaml.FRM(data = data,
user.info = user.info,
item.info = item.info,
categorical.variable = "TIMESTAMP",
resampling.method = "cv",
solver = "momentum",
learning.rate = 0,
max.iter = 100,
param.search.strategy = "grid",
evaluation.metric = "rmse",
fold.num = 5, repeat.times = 1, timeout = 0,
progress.indicator.id = "PAL_FRM",
thread.ratio = 0.5, random.state = 1,
parameter.range = list(factor.num= c(1,1,3)),
parameter.values = list(linear.lambda = c(1e-6, 1e-8, 1e-10),
poly2.lambda = c(1e-4, 1e-6, 1e-8),
momentum = c(0.8, 0.9)))
Output:
> FM$model$Head(5)$Collect()
ID MAP WEIGHT
1 0 A -0.01932846
2 1 B 0.73047553
3 2 C -0.22821216
4 3 D 0.05358953
5 4 E 0.03182115
> FM$optim.param$Collect()
PARAM_NAME INT_VALUE DOUBLE_VALUE STRING_VALUE
1 MOMENTUM NA 9e-01 <NA>
2 REGULARIZATION NA 1e-08 <NA>
3 LINEAR_REGULARIZATION NA 1e-06 <NA>
4 FACTOR_NUMBER 1 NA <NA>