hanaml.FFMRanker.Rd
hanaml.FFMRanker is an R wrapper for the SAP HANA PAL Field-Aware Factorization Machine (FFM) for ranking problems.
hanaml.FFMRanker(
data = NULL,
key = NULL,
features = NULL,
label = NULL,
categorical.variable = NULL,
delimiter = NULL,
ordering = NULL,
normalize = NULL,
include.constant = NULL,
include.linear = NULL,
early.stop = NULL,
factor.num = NULL,
train.ratio = NULL,
learning.rate = NULL,
random.state = NULL,
max.iter = NULL,
linear.lambda = NULL,
poly2.lambda = NULL,
sgd.tol = NULL,
sgd.exit.interval = NULL,
handle.missing = NULL
)
DataFrame
DataFrame containing the data.
character, optional
Name of the ID column.
If not provided, the data is assumed to have no ID column.
No default value.
character or list of characters, optional
Names of feature columns.
If not provided, it defaults to all non-key, non-label columns of data.
character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.
character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
By default, the treatment depends on the column type:
"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.
Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.
character, optional
The delimiter used to separate string features. For example,
"China, USA" indicates two feature values, "China"
and "USA".
Valid only for string features.
Defaults to "," (comma).
list/vector of characters/integers, optional
Specifies the order of the values of a categorical variable, in ascending order.
By default, characters are ordered alphabetically, while integers
are ordered numerically.
No default value.
logical, optional
Specifies whether to normalize each instance so
that its L1 norm is 1.
Defaults to TRUE.
logical, optional
Specifies whether or not to include the constant part in FFM model.
Defaults to TRUE.
logical, optional
Specifies whether or not to include the linear weights in FFM model.
Defaults to TRUE.
logical, optional
Specifies whether or not to stop the SGD
optimization early.
Always TRUE if train.ratio is less than 1.
Defaults to TRUE.
integer, optional
Length of the factor vectors.
Defaults to 4.
double, optional
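The ratio of the data set used for training; the remaining data
is used for validation. For example, 0.8 indicates that
80% of the data is used for training and the remaining 20% for validation.
Defaults to 0.8 if the number of instances is not less than 40, 1.0 otherwise.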
double, optional
Specifies the learning rate (step size) for the optimization
process.
Defaults to 0.2.
double, optional
Specifies the seed for random number generation,
where 0 means the current system time
is used as the seed, and any other value is used
directly as the seed.
Defaults to 0.
integer, optional
Specifies the maximum number of iterations for
optimization process.
Defaults to 20.
double, optional
Specifies the penalization assigned to the L2 regularization term for linear weights.
Defaults to 1e-5.
double, optional
Specifies the penalization assigned to the L2 regularization term for quadratic factors.
Defaults to 1e-5.
double, optional
Specifies the stopping criterion for SGD algorithm.
The algorithm exits when the cost function has not
decreased more than sgd.tol
in sgd.exit.interval
steps.
Defaults to 1e-5.
double, optional
Specifies the number of steps over which the stopping criterion for the SGD algorithm is evaluated.
The algorithm exits when the cost function has not
decreased more than sgd.tol
in sgd.exit.interval
steps.
Defaults to 5.
c("remove", "replace"), optional
Specifies how to handle missing feature values in data:
"remove": remove rows with missing values.
"replace": replace missing values with 0.
An "FFMRanker" object with the following attributes:
meta: DataFrame
Metadata of the trained model.
coef: DataFrame
Coefficients of the trained model.
stats: DataFrame
Statistical information about the trained model.
FFM has proven to be a powerful tool for click-through rate (CTR) and
conversion rate (CVR) prediction tasks.
Based on factorization machine (FM) models, which reduce the weights of sparse
higher-order interactions to vectors using matrix factorization, the
Field-Aware Factorization Machine introduces the concept of a field, which
represents a group of similar features; e.g., the field of user properties
includes gender, age, occupation, etc.
By making factor vectors depend not only on features but
also on fields, the model has to learn, for each feature, a separate
factor vector for every field.
This raises the cost of evaluating the model to O(kn^2), where n is
the number of features and k is the factor number, i.e., the length of the
factor vectors.
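For reference, the field-aware model described above is usually written as follows. This is the standard formulation from the FFM literature, not a formula taken from the PAL documentation, and the symbols are assumptions introduced here: x_i is the value of feature i, f_i is the field that feature i belongs to, w_0 is the constant part, w_i are the linear weights, and v_{i,f} is the length-k factor vector that feature i uses when interacting with a feature of field f.

```latex
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i
           + \sum_{i=1}^{n} \sum_{j=i+1}^{n}
             \langle v_{i, f_j}, v_{j, f_i} \rangle \, x_i x_j ,
\qquad v_{i, f} \in \mathbb{R}^{k}
```

The double sum runs over O(n^2) feature pairs and each inner product costs O(k), which matches the O(kn^2) figure. Presumably include.constant and include.linear control whether the w_0 and w_i terms are present.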
In practice, features derived from the same categorical
variable are considered to belong to the same field. FFM is best suited
to categorical features. A numeric feature is either regarded as a
single field or discretized into categories. If all features are numeric
and each one is treated as its own field, so that each field consists of
only one feature, FFM degenerates to FM.
FFM can be applied to a variety of prediction tasks,
for example, binary classification, regression, and ranking.
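To make the field-aware interaction concrete, here is a minimal base-R sketch that scores one instance with a toy FFM. It illustrates the general technique only and is not how PAL computes its result; the function name ffm_score, its arguments, and the toy values are all hypothetical.

```r
# Hypothetical helper: raw FFM score for a single instance.
#   x     numeric feature vector (length n)
#   field integer field index of each feature (length n)
#   w0    constant term
#   w     linear weights (length n)
#   V     factor vectors, array with dim c(n, number_of_fields, k)
ffm_score <- function(x, field, w0, w, V) {
  n <- length(x)
  score <- w0 + sum(w * x)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      # Zero-valued features contribute nothing, so skip them.
      if (x[i] != 0 && x[j] != 0) {
        # Feature i uses its vector for the field of j, and vice versa.
        inter <- sum(V[i, field[j], ] * V[j, field[i], ])
        score <- score + inter * x[i] * x[j]
      }
    }
  }
  score
}

# Toy data (assumed values): 3 features in 2 fields, k = 2.
x     <- c(1, 0, 1)
field <- c(1L, 1L, 2L)
w     <- c(0.1, 0.2, 0.3)
V     <- array(1, dim = c(3, 2, 2))

# 0.5 (constant) + 0.4 (linear) + 2 (one active pair) = 2.9
score <- ffm_score(x, field, w0 = 0.5, w = w, V = V)
```

Only features 1 and 3 are active, so a single pairwise term is added. With one vector per feature instead of one per feature and field, the same loop would compute a plain FM score.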
Input DataFrame data:
> data$Head(5)$Collect()
  ID USER  MOVIE TIMESTAMP     RANK
1  1    A Movie1         3   medium
2  2    A Movie2         3 too high
3  3    A Movie4         1   medium
4  4    A Movie5         2  too low
5  5    A Movie6         3      low
Call the function:
FFMRank <- hanaml.FFMRanker(data = data,
                            categorical.variable = "TIMESTAMP",
                            ordering = list("too low", "low",
                                            "medium", "high",
                                            "too high"),
                            delimiter = ",", factor.num = 4,
                            early.stop = TRUE, learning.rate = 0.2,
                            max.iter = 20, train.ratio = 0.8,
                            linear.lambda = 1e-5,
                            poly2.lambda = 1e-6, random.state = 1)
Output:
> FFMRank$coefficient$Head(5)$Collect()
  COEFF_INDEX         FEATURE FIELD  K COEFFICIENT
1           0   c:too low|low  <NA> NA -0.30807566
2           1    c:low|medium  <NA> NA  0.62193219
3           2   c:medium|high  <NA> NA  1.47438523
4           3 c:high|too high  <NA> NA  2.50841306
5           4          USER:A  <NA> NA -0.03842879