FFMClassifier
- class hana_ml.algorithms.pal.recommender.FFMClassifier(ordering=None, normalise=None, include_linear=None, include_constant=None, early_stop=None, random_state=None, factor_num=None, max_iter=None, train_size=None, learning_rate=None, linear_lamb=None, poly2_lamb=None, tol=None, exit_interval=None, handle_missing=None)
Field-Aware Factorization Machine with the task of classification.
- Parameters:
- factor_num : int, optional
The factorization dimensionality. Defaults to 4.
- random_state : int, optional
Specifies the seed for the random number generator.
0: Uses the current time as the seed.
Others: Uses the specified value as the seed.
Defaults to 0.
- train_size : float, optional
The proportion of the dataset used for training; the remainder is used for validation.
For example, 0.8 means that 80% of the data is used for training and the remaining 20% for validation.
Defaults to 0.8 if the number of instances is not less than 40, and to 1.0 otherwise.
- max_iter : int, optional
Specifies the maximum number of iterations for the alternating least squares algorithm.
Defaults to 20.
- ordering : a list of str, optional (deprecated)
Specifies the order of the categories for ranking.
This parameter is meaningless for classification problems and will be removed in a future release.
No default value.
- normalise : bool, optional
Specifies whether to normalize each instance so that its L1 norm is 1.
Defaults to True.
- include_constant : bool, optional
Specifies whether to include the w0 constant part.
Defaults to True.
- include_linear : bool, optional
Specifies whether to include the linear part of the model.
Defaults to True.
- early_stop : bool, optional
Specifies whether to stop the SGD optimization early.
Valid only if the value of thread_ratio is less than 1.
Defaults to True.
- learning_rate : float, optional
The learning rate for SGD iteration.
Defaults to 0.2.
- linear_lamb : float, optional
The L2 regularization parameter for the linear coefficient vector.
Defaults to 1e-5.
- poly2_lamb : float, optional
The L2 regularization parameter for the factorized coefficient matrix of the quadratic term.
Defaults to 1e-5.
- tol : float, optional
The criterion to determine the convergence of SGD.
Defaults to 1e-5.
- exit_interval : int, optional
The interval between the two iterations that are compared to determine convergence.
Defaults to 5.
- handle_missing : str, optional
Specifies how to handle missing features:
'skip': skip rows with missing values.
'fill_zero': replace missing values with 0.
Defaults to 'fill_zero'.
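The parameters above (factor_num, include_constant, include_linear, normalise) correspond to terms of a field-aware factorization machine score. The following is a minimal sketch of the standard FFM formulation from the literature, not PAL's actual implementation. The function names `ffm_score` and `l1_normalise` and the dictionary-based weight layout are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def l1_normalise(x):
    """Scale a sparse instance {feature: value} so that its L1 norm is 1."""
    total = sum(abs(v) for v in x.values())
    if total == 0:
        return dict(x)  # nothing to scale
    return {feature: v / total for feature, v in x.items()}


def ffm_score(x, fields, w0, w_linear, w_latent,
              include_constant=True, include_linear=True):
    """Raw FFM score of one sparse instance.

    x        : {feature: value} for the non-zero features
    fields   : {feature: field the feature belongs to}
    w0       : constant term
    w_linear : {feature: linear weight}
    w_latent : {(feature, field): list of factor_num latent weights}
    """
    score = w0 if include_constant else 0.0
    if include_linear:
        score += sum(w_linear[i] * v for i, v in x.items())
    # Quadratic term: feature i uses the latent vector it keeps for the
    # field of feature j, and vice versa.
    for i, j in combinations(sorted(x), 2):
        v_i = w_latent[(i, fields[j])]
        v_j = w_latent[(j, fields[i])]
        dot = sum(a * b for a, b in zip(v_i, v_j))
        score += dot * x[i] * x[j]
    return score


# Hypothetical two-feature instance with factor_num = 2.
x = l1_normalise({0: 1.0, 1: 3.0})
fields = {0: 0, 1: 1}
w_linear = {0: 0.1, 1: 0.2}
w_latent = {(0, 1): [1.0, 2.0], (1, 0): [0.5, 0.25]}
raw = ffm_score(x, fields, 0.5, w_linear, w_latent)
```

In this layout, each latent vector has factor_num entries, and setting include_constant or include_linear to False drops the corresponding term from the score.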
Examples
Input dataframe for classification training:
>>> df_train_classification.collect()
    USER                   MOVIE  TIMESTAMP        CTR
0      A                  Movie1        3.0      Click
1      A                  Movie2        3.0      Click
2      A                  Movie4        1.0  Not click
3      A                  Movie5        2.0      Click
4      A                  Movie6        3.0      Click
5      A                  Movie8        2.0  Not click
6      A          Movie0, Movie3        1.0      Click
7      B                  Movie2        3.0      Click
8      B                  Movie3        2.0      Click
9      B                  Movie4        2.0  Not click
10     B                    None        4.0  Not click
11     B                  Movie7        1.0      Click
12     B                  Movie8        2.0  Not click
13     B                  Movie0        3.0  Not click
14     C                  Movie1        2.0      Click
15     C  Movie2, Movie5, Movie7        4.0  Not click
16     C                  Movie4        3.0  Not click
17     C                  Movie5        1.0  Not click
18     C                  Movie6        NaN      Click
19     C                  Movie7        3.0  Not click
20     C                  Movie8        1.0      Click
21     C                  Movie0        2.0      Click
22     D                  Movie1        3.0      Click
23     D                  Movie3        2.0      Click
24     D          Movie4, Movie7        2.0      Click
25     D                  Movie6        2.0      Click
26     D                  Movie7        4.0  Not click
27     D                  Movie8        3.0  Not click
28     D                  Movie0        3.0  Not click
29     E                  Movie1        2.0  Not click
30     E                  Movie2        2.0      Click
31     E                  Movie3        2.0      Click
32     E                  Movie4        4.0      Click
33     E                  Movie5        3.0      Click
34     E                  Movie6        2.0  Not click
35     E                  Movie7        4.0  Not click
36     E                  Movie8        3.0  Not click
Creating FFMClassifier instance:
>>> ffm = FFMClassifier(linear_lamb=1e-5, poly2_lamb=1e-6, random_state=1, factor_num=4, early_stop=True, learning_rate=0.2, max_iter=20, train_size=0.8)
Performing fit() on given dataframe:
>>> ffm.fit(data=df_train_classification, categorical_variable='TIMESTAMP')
>>> ffm.stats_.collect()
     STAT_NAME          STAT_VALUE
0         task      classification
1  feature_num                  18
2    field_num                   3
3        k_num                   4
4     category    Click, Not click
5         iter                   3
6      tr-loss  0.6409316561278655
7      va-loss  0.7452354780967997
Performing predict() on given predicting dataframe:
>>> res = ffm.predict(data=df_predict, key='ID', thread_ratio=1)
>>> res.collect()
   ID      SCORE  CONFIDENCE
0   1  Not click    0.543537
1   2  Not click    0.545470
2   3      Click    0.542737
3   4      Click    0.519458
4   5      Click    0.511001
5   6  Not click    0.534610
6   7      Click    0.537739
7   8  Not click    0.536781
8   9  Not click    0.635412
- Attributes:
- meta_ : DataFrame
Model metadata content.
- coef_ : DataFrame
DataFrame that provides the following information:
Feature name,
Field name,
The factorization number,
The parameter value.
- stats_ : DataFrame
Statistic values.
- cross_valid_ : DataFrame
Cross validation content.
Methods
fit(data[, key, features, label, ...])    Fit the FFMClassifier model with the input training data.
predict(data[, key, features, thread_ratio, ...])    Prediction for the input data with the trained FFMClassifier model.
- fit(data, key=None, features=None, label=None, categorical_variable=None, delimiter=None)
Fit the FFMClassifier model with the input training data. Model parameters should be given by initializing the model first.
- Parameters:
- data : DataFrame
Data to be fit.
- key : str, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features : str or a list of str, optional
Name of the feature columns.
- delimiter : str, optional
The delimiter used to separate multiple feature values within a string.
For example, "China, USA" indicates two feature values "China" and "USA".
Defaults to ','.
- label : str, optional
Specifies the dependent variable.
For classification, the label column can be of any data type.
Defaults to the name of the last column.
- categorical_variable : str or a list of str, optional
Specifies the columns that should be treated as categorical variables even though their data type is INTEGER.
By default, 'VARCHAR' and 'NVARCHAR' columns are treated as categorical variables, and 'INTEGER' and 'DOUBLE' columns as continuous variables.
- Returns:
- Fitted object.
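To illustrate the delimiter parameter, the sketch below shows how a single string cell could be interpreted as several feature values. It is only an illustration of the documented behaviour, not the PAL implementation; the helper name `split_feature_values` is hypothetical, and trimming the whitespace around each value is an assumption based on the "China, USA" example.

```python
def split_feature_values(cell, delimiter=","):
    """Split one string cell into its individual feature values.

    Assumption: surrounding whitespace is not part of a feature value,
    and a missing cell (None) yields no feature values.
    """
    if cell is None:
        return []
    parts = (part.strip() for part in cell.split(delimiter))
    return [part for part in parts if part]


# The MOVIE column of the training data above mixes single and
# multi-valued cells.
single = split_feature_values("Movie1")
multi = split_feature_values("Movie2, Movie5, Movie7")
```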
- predict(data, key=None, features=None, thread_ratio=None, handle_missing=None)
Prediction for the input data with the trained FFMClassifier model.
- Parameters:
- data : DataFrame
Data to be predicted.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : str or a list of str, optional
Global side features column name in the training dataframe.
- thread_ratio : float, optional
The ratio of available threads.
0: single thread.
0~1: percentage of available threads.
Others: heuristically determined.
Defaults to -1.
- handle_missing : str, optional
Specifies how to handle missing features:
'skip': skip rows with missing values.
'fill_zero': replace missing values with 0.
Defaults to 'fill_zero'.
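The two handle_missing options can be pictured with a small plain-Python sketch. It only mirrors the semantics described above on a list of row dictionaries; the helper name `apply_handle_missing` is hypothetical, and the actual handling happens inside PAL and may differ in detail.

```python
def apply_handle_missing(rows, handle_missing="fill_zero"):
    """Mimic the documented 'skip' and 'fill_zero' options.

    rows: list of {column: value} dictionaries, where None marks a
    missing value.
    """
    if handle_missing == "skip":
        # Drop every row that contains at least one missing value.
        return [dict(row) for row in rows
                if all(value is not None for value in row.values())]
    if handle_missing == "fill_zero":
        # Keep every row and replace missing values with 0.
        return [{col: (0 if value is None else value)
                 for col, value in row.items()}
                for row in rows]
    raise ValueError("handle_missing must be 'skip' or 'fill_zero'")


# Hypothetical numeric rows; the second one has a missing TIMESTAMP.
rows = [
    {"ID": 1, "TIMESTAMP": 3.0},
    {"ID": 2, "TIMESTAMP": None},
]
skipped = apply_handle_missing(rows, "skip")
filled = apply_handle_missing(rows, "fill_zero")
```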
- Returns:
- DataFrame
Prediction result, structured as follows:
1st column : ID
2nd column : SCORE, i.e. predicted class labels
3rd column : CONFIDENCE, the confidence for assigning class labels.
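As a rough illustration of working with the three result columns, the sketch below keeps only predictions whose CONFIDENCE reaches a chosen threshold. The rows are copied from the example output earlier in this page and held as plain tuples so that the snippet runs without a database connection. The helper name `confident_predictions` and the threshold of 0.54 are arbitrary choices for illustration.

```python
# (ID, SCORE, CONFIDENCE) rows taken from the predict() example output.
predictions = [
    (1, "Not click", 0.543537),
    (2, "Not click", 0.545470),
    (3, "Click", 0.542737),
    (4, "Click", 0.519458),
    (5, "Click", 0.511001),
    (6, "Not click", 0.534610),
    (7, "Click", 0.537739),
    (8, "Not click", 0.536781),
    (9, "Not click", 0.635412),
]


def confident_predictions(rows, threshold):
    """Return (ID, SCORE) pairs whose confidence is at least `threshold`."""
    return [(row_id, score)
            for row_id, score, confidence in rows
            if confidence >= threshold]


selected = confident_predictions(predictions, 0.54)
```

With a collected result, the same filter could be expressed directly on the DataFrame returned by `res.collect()`.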
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
Inherited Methods from PALBase
In addition to the methods described above, the FFMClassifier class inherits methods from the PALBase class. Refer to PAL Base for more details.