hanaml.FFMRanker is an R wrapper for the SAP HANA PAL Field-Aware Factorization Machine (FFM) for ranking problems.

hanaml.FFMRanker(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  categorical.variable = NULL,
  delimiter = NULL,
  ordering = NULL,
  normalize = NULL,
  include.constant = NULL,
  include.linear = NULL,
  early.stop = NULL,
  factor.num = NULL,
  train.ratio = NULL,
  learning.rate = NULL,
  random.state = NULL,
  max.iter = NULL,
  linear.lambda = NULL,
  poly2.lambda = NULL,
  sgd.tol = NULL,
  sgd.exit.interval = NULL,
  handle.missing = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the column type:

  • "VARCHAR" and "NVARCHAR": categorical

  • "INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.

delimiter

character, optional
The delimiter used to separate the values of a string feature. For example, "China, USA" indicates two feature values, "China" and "USA". Valid only for string features.
Defaults to "," (comma).
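
As a rough local illustration of what the delimiter implies, the base R sketch below splits a multi-valued string into its individual values. The helper split_feature is hypothetical, and trimming the surrounding whitespace is an assumption rather than documented PAL behavior.

```r
# Hypothetical helper: split a delimited string feature into its values.
# Trimming whitespace around each value is an assumption, not documented PAL behavior.
split_feature <- function(x, delimiter = ",") {
  trimws(strsplit(x, delimiter, fixed = TRUE)[[1]])
}

values <- split_feature("China, USA")
print(values)
```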

ordering

list/vector of characters/integers, optional
Specifies the categories of the categorical variable in ascending order.
By default, characters are ordered alphabetically and integers numerically.
No default value.
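
The idea of an explicit ascending order can be sketched locally with an ordered factor in base R. The level names are taken from the example on this page; the observed values are made up for illustration, and nothing here calls PAL.

```r
# Rank labels listed from lowest to highest.
rank_levels <- c("too low", "low", "medium", "high", "too high")

# A few observed labels (illustrative values only).
observed <- c("medium", "too high", "too low", "low")

# An ordered factor makes the ascending order explicit instead of alphabetical.
ranks <- factor(observed, levels = rank_levels, ordered = TRUE)

print(sort(ranks))
print(ranks[1] < ranks[2])  # is "medium" ranked below "too high"?
```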

normalize

logical, optional
Specifies whether to normalize each instance so that its L1 norm is 1.
Defaults to TRUE.
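
For reference, L1 normalization of a single instance can be sketched in base R as follows. The helper l1_normalize is hypothetical and only illustrates the arithmetic; leaving all-zero instances unchanged is an assumption.

```r
# Hypothetical helper: scale a numeric feature vector so that its L1 norm is 1.
l1_normalize <- function(x) {
  norm <- sum(abs(x))
  if (norm == 0) return(x)  # assumption: leave all-zero instances unchanged
  x / norm
}

instance <- c(1, 0, 3, -4)
normalized <- l1_normalize(instance)
print(normalized)
print(sum(abs(normalized)))
```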

include.constant

logical, optional
Specifies whether or not to include the constant part in FFM model.
Defaults to TRUE.

include.linear

logical, optional
Specifies whether or not to include the linear weights in FFM model.
Defaults to TRUE.

early.stop

logical, optional
Specifies whether or not to stop the SGD optimization early.
Always TRUE if train.ratio is less than 1.
Defaults to TRUE.

factor.num

integer, optional
Length of the factor vectors.
Defaults to 4.

train.ratio

double, optional
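The ratio of the data set used for training; the remaining data is used for validation. For example, 0.8 indicates that 80% of the data is used for training and the remaining 20% for validation.
Defaults to 0.8 if the number of instances is not less than 40, 1.0 otherwise.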
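
The default rule and the resulting split sizes can be sketched in base R as below. Both helpers are hypothetical, and the random row selection is an assumption; how PAL actually chooses the validation rows is not stated here.

```r
# Hypothetical helper mirroring the documented default:
# 0.8 when there are at least 40 instances, 1.0 otherwise.
default_train_ratio <- function(n) if (n >= 40) 0.8 else 1.0

# Hypothetical helper: randomly split row indices into training and validation parts.
split_rows <- function(n, train.ratio = default_train_ratio(n), seed = 1) {
  set.seed(seed)
  n_train <- floor(n * train.ratio)
  train_idx <- sort(sample.int(n, n_train))
  list(train = train_idx, validation = setdiff(seq_len(n), train_idx))
}

parts <- split_rows(100)
print(length(parts$train))
print(length(parts$validation))
print(default_train_ratio(30))
```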

learning.rate

double, optional
Specifies the learning rate (step size) for the optimization process.
Defaults to 0.2.

random.state

double, optional
Specifies the seed for random number generation, where 0 means the current system time is used as the seed, and any other value is used as the seed itself.
Defaults to 0.

max.iter

integer, optional
Specifies the maximum number of iterations for the optimization process.
Defaults to 20.

linear.lambda

double, optional
Specifies the penalization assigned to the L2 regularization term for linear weights.
Defaults to 1e-5.

poly2.lambda

double, optional
Specifies the penalization assigned to the L2 regularization term for quadratic factors.
Defaults to 1e-5.

sgd.tol

double, optional
Specifies the tolerance of the stopping criterion for the SGD algorithm.
The algorithm exits when the cost function has not decreased more than sgd.tol in sgd.exit.interval steps.
Defaults to 1e-5.

sgd.exit.interval

double, optional
Specifies the number of steps over which the stopping criterion of the SGD algorithm is checked.
The algorithm exits when the cost function has not decreased more than sgd.tol in sgd.exit.interval steps.
Defaults to 5.
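
One possible reading of how sgd.tol and sgd.exit.interval interact is sketched below in base R. The helper should_stop and the cost histories are hypothetical; the exact comparison PAL performs internally is an assumption.

```r
# Hypothetical helper: TRUE when the cost has not decreased by more than `tol`
# over the last `interval` steps.
should_stop <- function(costs, tol = 1e-5, interval = 5) {
  n <- length(costs)
  if (n <= interval) return(FALSE)  # not enough history yet
  (costs[n - interval] - costs[n]) <= tol
}

# Made-up cost histories, one value per step.
still_improving <- c(1.00, 0.80, 0.65, 0.55, 0.48, 0.43, 0.40)
stalled <- c(1.00, 0.50, 0.40, 0.40, 0.40, 0.40, 0.40, 0.40)

print(should_stop(still_improving))
print(should_stop(stalled))
```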

handle.missing

c("remove", "replace"), optional
Specifies how to handle missing feature values in data:

  • "remove": remove rows with missing values

  • "replace": replace missing values with 0

Value

An "FFMRanker" object with the following attributes:

  • meta: DataFrame
    Metadata of the trained model.

  • coef: DataFrame
    Coefficients of the trained model.

  • stats: DataFrame
    Statistical information about the trained model.

Details

FFM has been proven to be a powerful tool for click-through rate (CTR) and conversion rate (CVR) prediction tasks. Building on FM models, which use matrix factorization to reduce the weights of sparse higher-order interactions to vectors, the Field-Aware Factorization Machine (FFM) introduces the concept of a field, which represents a group of similar features; for example, the field of user properties includes gender, age, occupation, etc.
By making factor vectors depend not only on features but also on fields, the model has to learn, for each feature, a separate factor vector per field. This increases the complexity of the model to O(kn^2), where n is the number of features and k is the factor number, i.e., the length of the factor vectors.
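
The field-aware pairwise term can be sketched in base R as follows. This follows the standard FFM formulation, in which the interaction between features i and j uses the vector of i for the field of j and the vector of j for the field of i. It is not PAL's implementation, and the function name, array layout, and toy values are all assumptions.

```r
# Illustrative FFM interaction term for one instance.
# W[i, f, ] is the length-k factor vector that feature i uses against field f.
ffm_interaction <- function(x, field, W) {
  n <- length(x)
  score <- 0
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      if (x[i] == 0 || x[j] == 0) next  # skip inactive features
      score <- score + sum(W[i, field[j], ] * W[j, field[i], ]) * x[i] * x[j]
    }
  }
  score
}

# Toy setup (all values made up): 3 features, 2 fields, factor length k = 2.
x <- c(1, 1, 1)
field <- c(1, 1, 2)                # features 1 and 2 share a field
W <- array(0.5, dim = c(3, 2, 2))  # every factor entry set to 0.5
print(ffm_interaction(x, field, W))
```

The double loop over feature pairs, each costing k multiplications, is where the O(kn^2) cost comes from.
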
In practice, features derived from the same categorical variable are considered to belong to the same field. Note that FFM is best suited to categorical features. A numeric feature is either regarded as a single field or discretized into a categorical one. If all features are numeric and each is treated as its own field, so that every field consists of only one feature, FFM degenerates to FM.
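
The mapping from categorical variables to fields can be illustrated in base R. The toy data frame is made up, and the "VARIABLE:value" naming merely mirrors the feature names shown in the example output on this page; the internal encoding used by PAL is an assumption.

```r
# Toy data (made up): two categorical variables.
df <- data.frame(USER = c("A", "A", "B"),
                 MOVIE = c("Movie1", "Movie2", "Movie1"))

# Each distinct value of a variable becomes one feature ...
features <- unlist(lapply(names(df), function(v) {
  paste(v, unique(df[[v]]), sep = ":")
}))

# ... and all features derived from the same variable share one field.
n_levels <- vapply(df, function(col) length(unique(col)), integer(1))
fields <- rep(names(df), times = n_levels)

print(data.frame(feature = features, field = fields))
```
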
FFM can be applied to a variety of prediction tasks, for example, binary classification, regression, and ranking.

Examples

Input DataFrame data:

> data$Head(5)$Collect()
  ID USER  MOVIE TIMESTAMP      RANK
1  1    A Movie1         3    medium
2  2    A Movie2         3  too high
3  3    A Movie4         1    medium
4  4    A Movie5         2   too low
5  5    A Movie6         3       low

Call the function:

FFMRank <- hanaml.FFMRanker(data = data,
                            categorical.variable = "TIMESTAMP",
                            ordering = list("too low", "low",
                                            "medium", "high",
                                            "too high"),
                            delimiter = ",", factor.num = 4,
                            early.stop = TRUE, learning.rate = 0.2,
                            max.iter = 20, train.ratio = 0.8,
                            linear.lambda = 1e-5,
                            poly2.lambda = 1e-6, random.state = 1)

Output:


> FFMRank$coefficient$Head(5)$Collect()
  COEFF_INDEX         FEATURE FIELD  K COEFFICIENT
1           0   c:too low|low  <NA> NA -0.30807566
2           1    c:low|medium  <NA> NA  0.62193219
3           2   c:medium|high  <NA> NA  1.47438523
4           3 c:high|too high  <NA> NA  2.50841306
5           4          USER:A  <NA> NA -0.03842879