FFMClassifier

class hana_ml.algorithms.pal.recommender.FFMClassifier(ordering=None, normalise=None, include_linear=None, include_constant=None, early_stop=None, random_state=None, factor_num=None, max_iter=None, train_size=None, learning_rate=None, linear_lamb=None, poly2_lamb=None, tol=None, exit_interval=None, handle_missing=None)

Field-aware factorization machine (FFM) for classification tasks.

Parameters:
factor_num : int, optional

The factorization dimensionality. Defaults to 4.

random_state : int, optional

Specifies the seed for the random number generator.

  • 0: Uses the current time as the seed.

  • Others: Uses the specified value as the seed.

Defaults to 0.

train_size : float, optional

The proportion of the dataset used for training; the remainder is used for validation.

For example, 0.8 means that 80% of the data is used for training and the remaining 20% for validation.

Defaults to 0.8 if the number of instances is at least 40, and to 1.0 otherwise.
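
The default rule for train_size can be written as a small helper. This is an illustrative sketch only: `default_train_size` and `split_counts` are hypothetical names that are not part of hana_ml, and rounding the training share down is an assumption, not documented PAL behaviour.

```python
def default_train_size(n_instances):
    """Default train_size as described above: 0.8 for 40 or more instances, else 1.0."""
    return 0.8 if n_instances >= 40 else 1.0


def split_counts(n_instances, train_size):
    """Return (training rows, validation rows); rounding down is an assumption."""
    n_train = int(n_instances * train_size)
    return n_train, n_instances - n_train


# The example table further down has 37 rows, so the default would be 1.0
# (the example itself overrides this with train_size=0.8).
print(default_train_size(37))
print(split_counts(100, 0.8))
```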

max_iter : int, optional

Specifies the maximum number of iterations for the alternating least squares algorithm.

Defaults to 20.

ordering : list of str, optional (deprecated)

Specifies the order of the categories for ranking.

This parameter is meaningless for classification problems and will be removed in a future release.

No default value.

normalise : bool, optional

Specifies whether to normalize each instance so that its L1 norm is 1.

Defaults to True.
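
As a rough illustration of what L1 normalization of an instance means, the sketch below divides each feature value by the sum of absolute values. The helper name `l1_normalise` is hypothetical, and leaving an all-zero instance unchanged is an assumption rather than documented PAL behaviour.

```python
def l1_normalise(values):
    """Scale a feature vector so that the sum of its absolute values is 1."""
    norm = sum(abs(v) for v in values)
    if norm == 0:
        return list(values)  # all-zero instance left unchanged (assumption)
    return [v / norm for v in values]


row = [1.0, 3.0, -4.0]
scaled = l1_normalise(row)
print(scaled)
print(sum(abs(v) for v in scaled))
```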

include_constant : bool, optional

Specifies whether to include the constant term w0.

Defaults to True.

include_linear : bool, optional

Specifies whether to include the linear part of the regression model.

Defaults to True.
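
To show what the constant and linear switches turn on or off, the following sketch evaluates the FFM score as it is usually written in the literature: a constant w0, a linear part, and field-aware pairwise interactions. The function `ffm_score`, its argument layout, and the toy numbers are all hypothetical; whether PAL computes the score in exactly this way is an assumption.

```python
def ffm_score(x, field, w0, w, v, include_constant=True, include_linear=True):
    """Raw FFM score for one instance.

    x:     dict mapping feature index -> value (non-zero features only)
    field: dict mapping feature index -> field index
    w0:    constant term
    w:     dict mapping feature index -> linear weight
    v:     dict mapping (feature index, field index) -> latent vector
    """
    score = w0 if include_constant else 0.0
    if include_linear:
        score += sum(w[i] * xi for i, xi in x.items())
    feats = sorted(x)
    for a, i in enumerate(feats):
        for j in feats[a + 1:]:
            # Feature i uses the latent vector it keeps for j's field, and vice versa.
            vi = v[(i, field[j])]
            vj = v[(j, field[i])]
            score += sum(p * q for p, q in zip(vi, vj)) * x[i] * x[j]
    return score


# Toy model: two features, each in its own field, two latent factors.
x = {0: 1.0, 1: 1.0}
field = {0: 0, 1: 1}
w0 = 0.5
w = {0: 0.25, 1: 0.5}
v = {(0, 1): [1.0, 0.0], (1, 0): [0.5, 2.0]}

full = ffm_score(x, field, w0, w, v)
no_linear = ffm_score(x, field, w0, w, v, include_linear=False)
no_constant = ffm_score(x, field, w0, w, v, include_constant=False)
print(full, no_linear, no_constant)
```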

early_stop : bool, optional

Specifies whether to stop the SGD optimization early.

Valid only if the value of thread_ratio is less than 1.

Defaults to True.

learning_rate : float, optional

The learning rate for SGD iterations.

Defaults to 0.2.

linear_lamb : float, optional

The L2 regularization parameter for the linear coefficient vector.

Defaults to 1e-5.

poly2_lamb : float, optional

The L2 regularization parameter for the factorized coefficient matrix of the quadratic term.

Defaults to 1e-5.

tol : float, optional

The convergence criterion for SGD.

Defaults to 1e-5.

exit_interval : int, optional

The interval, in iterations, between the two iterations that are compared to determine convergence.

Defaults to 5.
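
One plausible reading of how the convergence tolerance and the comparison interval work together is sketched below: the latest loss is compared with the loss recorded a fixed number of iterations earlier. The helper `has_converged` is hypothetical, and the use of an absolute loss difference is an assumption; the exact criterion PAL applies is not stated here.

```python
def has_converged(losses, tol=1e-5, exit_interval=5):
    """Compare the latest loss with the loss exit_interval iterations earlier."""
    if len(losses) <= exit_interval:
        return False  # not enough history to compare yet
    # Absolute difference is an assumption; a relative test is equally plausible.
    return abs(losses[-1 - exit_interval] - losses[-1]) < tol


flat_tail = [1.0, 0.5, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]
still_improving = [1.0, 0.5, 0.4, 0.4, 0.4, 0.4]
print(has_converged(flat_tail))
print(has_converged(still_improving))
```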

handle_missing : str, optional

Specifies how to handle missing feature values:

  • 'skip': skip rows with missing values.

  • 'fill_zero': replace missing values with 0.

Defaults to 'fill_zero'.
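
The two missing-value options can be pictured with plain Python rows. This is only a sketch: `apply_handle_missing` is a hypothetical helper, missing values are represented by None here, and how PAL treats a missing categorical value internally is not covered.

```python
def apply_handle_missing(rows, handle_missing="fill_zero"):
    """rows: list of dicts; a missing value is represented by None."""
    if handle_missing == "skip":
        # Drop every row that contains at least one missing value.
        return [r for r in rows if all(val is not None for val in r.values())]
    if handle_missing == "fill_zero":
        # Keep every row and substitute 0 for each missing value.
        return [{k: (0 if val is None else val) for k, val in r.items()} for r in rows]
    raise ValueError("handle_missing must be 'skip' or 'fill_zero'")


rows = [
    {"USER": "C", "TIMESTAMP": None},
    {"USER": "C", "TIMESTAMP": 3.0},
]
skipped = apply_handle_missing(rows, "skip")
filled = apply_handle_missing(rows, "fill_zero")
print(skipped)
print(filled)
```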

Examples

Input dataframe for classification training:

>>> df_train_classification.collect()
   USER                   MOVIE  TIMESTAMP        CTR
0     A                  Movie1        3.0      Click
1     A                  Movie2        3.0      Click
2     A                  Movie4        1.0  Not click
3     A                  Movie5        2.0      Click
4     A                  Movie6        3.0      Click
5     A                  Movie8        2.0  Not click
6     A          Movie0, Movie3        1.0      Click
7     B                  Movie2        3.0      Click
8     B                  Movie3        2.0      Click
9     B                  Movie4        2.0  Not click
10    B                    None        4.0  Not click
11    B                  Movie7        1.0      Click
12    B                  Movie8        2.0  Not click
13    B                  Movie0        3.0  Not click
14    C                  Movie1        2.0      Click
15    C  Movie2, Movie5, Movie7        4.0  Not click
16    C                  Movie4        3.0  Not click
17    C                  Movie5        1.0  Not click
18    C                  Movie6        NaN      Click
19    C                  Movie7        3.0  Not click
20    C                  Movie8        1.0      Click
21    C                  Movie0        2.0      Click
22    D                  Movie1        3.0      Click
23    D                  Movie3        2.0      Click
24    D          Movie4, Movie7        2.0      Click
25    D                  Movie6        2.0      Click
26    D                  Movie7        4.0  Not click
27    D                  Movie8        3.0  Not click
28    D                  Movie0        3.0  Not click
29    E                  Movie1        2.0  Not click
30    E                  Movie2        2.0      Click
31    E                  Movie3        2.0      Click
32    E                  Movie4        4.0      Click
33    E                  Movie5        3.0      Click
34    E                  Movie6        2.0  Not click
35    E                  Movie7        4.0  Not click
36    E                  Movie8        3.0  Not click

Creating FFMClassifier instance:

>>> from hana_ml.algorithms.pal.recommender import FFMClassifier
>>> ffm = FFMClassifier(linear_lamb=1e-5, poly2_lamb=1e-6, random_state=1,
...                     factor_num=4, early_stop=True, learning_rate=0.2,
...                     max_iter=20, train_size=0.8)

Performing fit() on given dataframe:

>>> ffm.fit(data=df_train_classification, categorical_variable='TIMESTAMP')
>>> ffm.stats_.collect()
     STAT_NAME          STAT_VALUE
0         task      classification
1  feature_num                  18
2    field_num                   3
3        k_num                   4
4     category    Click, Not click
5         iter                   3
6      tr-loss  0.6409316561278655
7      va-loss  0.7452354780967997

Performing predict() on given predicting dataframe:

>>> res = ffm.predict(data=df_predict, key='ID', thread_ratio=1)
>>> res.collect()
   ID      SCORE  CONFIDENCE
0   1  Not click    0.543537
1   2  Not click    0.545470
2   3      Click    0.542737
3   4      Click    0.519458
4   5      Click    0.511001
5   6  Not click    0.534610
6   7      Click    0.537739
7   8  Not click    0.536781
8   9  Not click    0.635412

Attributes:
meta_ : DataFrame

Model metadata content.

coef_ : DataFrame

DataFrame that provides the following information:

  • Feature name,

  • Field name,

  • The factorization number,

  • The parameter value.

stats_ : DataFrame

Statistical values.

cross_valid_ : DataFrame

Cross-validation content.

Methods

fit(data[, key, features, label, ...])

Fit the FFMClassifier model with the input training data.

predict(data[, key, features, thread_ratio, ...])

Prediction for the input data with the trained FFMClassifier model.

fit(data, key=None, features=None, label=None, categorical_variable=None, delimiter=None)

Fit the FFMClassifier model with the input training data. Model parameters are set when the model is initialized.

Parameters:
data : DataFrame

Training data.

key : str, optional

Name of the ID column.

If key is not provided, then:

  • if data is indexed by a single column, then key defaults to that index column;

  • otherwise, it is assumed that data contains no ID column.

features : str or a list of str, optional

Names of the feature columns.

delimiter : str, optional

The delimiter used to separate the values of a string feature.

For example, "China, USA" indicates two feature values "China" and "USA".

Defaults to ','.
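
The effect of the delimiter on a multi-valued string feature can be sketched as follows. `split_feature` is a hypothetical helper, and stripping the whitespace around each value is an assumption made so that "China, USA" yields the two values given above.

```python
def split_feature(value, delimiter=","):
    """Split a delimited string feature into its individual values."""
    if value is None:
        return []  # a missing feature has no values
    # Stripping surrounding whitespace is an assumption.
    return [part.strip() for part in value.split(delimiter) if part.strip()]


print(split_feature("China, USA"))
print(split_feature("Movie2, Movie5, Movie7"))
```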

label : str, optional

Specifies the dependent variable.

For classification, the label column can be of any data type.

Defaults to the name of the last column.

categorical_variable : str or a list of str, optional

Specifies the columns that should be treated as categorical variables even though their data type is INTEGER.

By default, 'VARCHAR' and 'NVARCHAR' columns are treated as categorical variables, and 'INTEGER' and 'DOUBLE' columns as continuous variables.

Returns:
Fitted object.

predict(data, key=None, features=None, thread_ratio=None, handle_missing=None)

Prediction for the input data with the trained FFMClassifier model.

Parameters:
data : DataFrame

Data to be predicted.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : str or a list of str, optional

Names of the feature columns used for prediction.

thread_ratio : float, optional

Specifies the ratio of available threads to use.

  • 0: single thread

  • 0~1: percentage of available threads

  • Others: heuristically determined

Defaults to -1.

handle_missing : str, optional

Specifies how to handle missing feature values:

  • 'skip': skip rows with missing values.

  • 'fill_zero': replace missing values with 0.

Defaults to 'fill_zero'.

Returns:
DataFrame

Prediction result, structured as follows:

  • 1st column : ID

  • 2nd column : SCORE, i.e. predicted class labels

  • 3rd column : CONFIDENCE, the confidence for assigning class labels.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods mentioned above, the FFMClassifier class also inherits methods from the PALBase class. Please refer to PAL Base for more details.