FFMClassifier
- class hana_ml.algorithms.pal.recommender.FFMClassifier(ordering=None, normalise=None, include_linear=None, include_constant=None, early_stop=None, random_state=None, factor_num=None, max_iter=None, train_size=None, learning_rate=None, linear_lamb=None, poly2_lamb=None, tol=None, exit_interval=None, handle_missing=None)
Field-Aware Factorization Machine with the task of classification.
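The following is a minimal sketch of how a field-aware factorization machine scores one instance, assuming the textbook FFM formulation (a constant term, a linear term, and pairwise terms whose latent vectors depend on the field of the other feature). It is not taken from the PAL implementation. The function name `ffm_score` and the dictionary-based data layout are hypothetical and chosen only for illustration; the keyword arguments mirror the `normalise`, `include_linear` and `include_constant` parameters, and the length of each latent vector plays the role of `factor_num`.

```python
from itertools import combinations


def ffm_score(x, fields, w0, w, v,
              normalise=True, include_linear=True, include_constant=True):
    """Raw FFM score for one instance (textbook formulation, not PAL's code).

    x      : dict, feature index -> value (non-zero features only)
    fields : dict, feature index -> field index
    w0     : constant term
    w      : dict, feature index -> linear weight
    v      : dict, (feature index, field index) -> latent vector;
             the vector length plays the role of factor_num
    """
    if normalise:
        # Scale the instance so that its L1 norm is 1.
        l1 = sum(abs(val) for val in x.values())
        if l1 > 0:
            x = {i: val / l1 for i, val in x.items()}

    score = w0 if include_constant else 0.0
    if include_linear:
        score += sum(w[i] * val for i, val in x.items())

    # Field-aware pairwise term: feature i uses the latent vector it keeps
    # for the field of feature j, and vice versa.
    for i, j in combinations(sorted(x), 2):
        dot = sum(a * b for a, b in zip(v[(i, fields[j])], v[(j, fields[i])]))
        score += dot * x[i] * x[j]
    return score


# Toy instance (made-up numbers): two features in two fields, factor_num = 2.
x = {0: 1.0, 1: 1.0}
fields = {0: 0, 1: 1}
w0 = 0.5
w = {0: 0.2, 1: -0.4}
v = {(0, 1): [1.0, 2.0], (1, 0): [0.5, 0.25]}

normalised_score = ffm_score(x, fields, w0, w, v)
raw_score = ffm_score(x, fields, w0, w, v, normalise=False)
```

For classification, a score of this kind is typically mapped to a class probability with a logistic function; whether PAL does exactly this is an assumption.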
- Parameters
- factor_num : int, optional
The factorization dimensionality. Default to 4.
- random_state : int, optional
Specifies the seed for the random number generator.
0: Uses the current time as the seed.
Others: Uses the specified value as the seed.
Default to 0.
- train_size : float, optional
The proportion of the dataset used for training; the remainder is used for validation.
For example, 0.8 means that 80% of the data is used for training and the remaining 20% for validation.
Default to 0.8 if the number of instances is not less than 40, 1.0 otherwise.
- max_iter : int, optional
Specifies the maximum number of training iterations.
Default to 20.
- ordering : ListOfStrings, optional (deprecated)
Specifies the order of categories for ranking.
This parameter is meaningless for classification problems and will be removed in a future release.
No default value.
- normalise : bool, optional
Specifies whether to normalize each instance so that its L1 norm is 1.
Default to True.
- include_constant : bool, optional
Specifies whether to include the w0 constant part.
Default to True.
- include_linear : bool, optional
Specifies whether to include the linear part of the model.
Default to True.
- early_stop : bool, optional
Specifies whether to early stop the SGD optimization.
Valid only if the value of thread_ratio is less than 1.
Default to True.
- learning_rate : float, optional
The learning rate for SGD iteration.
Default to 0.2.
- linear_lamb : float, optional
The L2 regularization parameter for the linear coefficient vector.
Default to 1e-5.
- poly2_lamb : float, optional
The L2 regularization parameter for the factorized coefficient matrix of the quadratic term.
Default to 1e-5.
- tol : float, optional
The criterion to determine the convergence of SGD.
Default to 1e-5.
- exit_interval : int, optional
The interval, in iterations, between the two iterations that are compared to determine convergence.
Default to 5.
- handle_missing : str, optional
Specifies how to handle missing feature values:
'skip': skip rows with missing values.
'fill_zero': replace missing values with 0.
Default to 'fill_zero'.
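The interaction of `max_iter`, `tol` and `exit_interval` can be sketched as follows. This is one plausible reading of the descriptions above, not PAL's documented behaviour: training is assumed to stop once the loss has changed by less than `tol` between the current iteration and the one `exit_interval` iterations earlier. The helper names `should_stop`, `iterations_run` and `default_train_size` are hypothetical; only the default rule for `train_size` is taken directly from the parameter description.

```python
def should_stop(losses, tol=1e-5, exit_interval=5):
    """Assumed convergence rule: compare the latest loss with the loss
    recorded exit_interval iterations earlier."""
    if len(losses) <= exit_interval:
        return False  # not enough history to compare yet
    return abs(losses[-1 - exit_interval] - losses[-1]) < tol


def iterations_run(loss_per_iter, max_iter=20, tol=1e-5, exit_interval=5):
    """Count how many iterations would run for a given loss sequence."""
    history = []
    for loss in loss_per_iter[:max_iter]:
        history.append(loss)
        if should_stop(history, tol, exit_interval):
            break
    return len(history)


def default_train_size(n_instances):
    """Default stated for train_size: 0.8 from 40 instances on, else 1.0."""
    return 0.8 if n_instances >= 40 else 1.0


# Made-up loss curve that flattens out after the third iteration.
losses = [1.0, 0.5, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]
stopped_at = iterations_run(losses, max_iter=20, tol=1e-5, exit_interval=5)
```

Under this assumed rule, a larger `exit_interval` makes the check more patient, because the loss must stay flat over a longer window before training stops.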
Examples
Input dataframe for classification training:
>>> df_train_classification.collect()
    USER  MOVIE                   TIMESTAMP  CTR
0   A     Movie1                  3.0        Click
1   A     Movie2                  3.0        Click
2   A     Movie4                  1.0        Not click
3   A     Movie5                  2.0        Click
4   A     Movie6                  3.0        Click
5   A     Movie8                  2.0        Not click
6   A     Movie0, Movie3          1.0        Click
7   B     Movie2                  3.0        Click
8   B     Movie3                  2.0        Click
9   B     Movie4                  2.0        Not click
10  B     None                    4.0        Not click
11  B     Movie7                  1.0        Click
12  B     Movie8                  2.0        Not click
13  B     Movie0                  3.0        Not click
14  C     Movie1                  2.0        Click
15  C     Movie2, Movie5, Movie7  4.0        Not click
16  C     Movie4                  3.0        Not click
17  C     Movie5                  1.0        Not click
18  C     Movie6                  NaN        Click
19  C     Movie7                  3.0        Not click
20  C     Movie8                  1.0        Click
21  C     Movie0                  2.0        Click
22  D     Movie1                  3.0        Click
23  D     Movie3                  2.0        Click
24  D     Movie4, Movie7          2.0        Click
25  D     Movie6                  2.0        Click
26  D     Movie7                  4.0        Not click
27  D     Movie8                  3.0        Not click
28  D     Movie0                  3.0        Not click
29  E     Movie1                  2.0        Not click
30  E     Movie2                  2.0        Click
31  E     Movie3                  2.0        Click
32  E     Movie4                  4.0        Click
33  E     Movie5                  3.0        Click
34  E     Movie6                  2.0        Not click
35  E     Movie7                  4.0        Not click
36  E     Movie8                  3.0        Not click
Creating FFMClassifier instance:
>>> ffm = FFMClassifier(linear_lamb=1e-5, poly2_lamb=1e-6, random_state=1, factor_num=4, early_stop=True, learning_rate=0.2, max_iter=20, train_size=0.8)
Performing fit() on given dataframe:
>>> ffm.fit(data=df_train_classification, categorical_variable='TIMESTAMP')
>>> ffm.stats_.collect()
    STAT_NAME    STAT_VALUE
0   task         classification
1   feature_num  18
2   field_num    3
3   k_num        4
4   category     Click, Not click
5   iter         3
6   tr-loss      0.6409316561278655
7   va-loss      0.7452354780967997
Performing predict() on given predicting dataframe:
>>> res = ffm.predict(data=df_predict, key='ID', thread_ratio=1)
>>> res.collect()
   ID  SCORE      CONFIDENCE
0  1   Not click  0.543537
1  2   Not click  0.545470
2  3   Click      0.542737
3  4   Click      0.519458
4  5   Click      0.511001
5  6   Not click  0.534610
6  7   Click      0.537739
7  8   Not click  0.536781
8  9   Not click  0.635412
- Attributes
- metadata_ : DataFrame
Model metadata content.
- coef_ : DataFrame
- DataFrame that provides the following information:
Feature name,
Field name,
The factorization number,
The parameter value.
- stats_ : DataFrame
Statistic values.
- cross_valid_ : DataFrame
Cross validation content.
Methods
fit(data[, key, features, label, ...]): Fit the FFMClassifier model with the input training data.
predict(data[, key, features, thread_ratio, ...]): Prediction for the input data with the trained FFMClassifier model.
- fit(data, key=None, features=None, label=None, categorical_variable=None, delimiter=None)
Fit the FFMClassifier model with the input training data. Model parameters should be given by initializing the model first.
- Parameters
- data : DataFrame
Data to be fit.
- key : str, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features : str/ListOfStrings, optional
Name of the feature columns.
- delimiter : str, optional
The delimiter to separate string features.
For example, "China, USA" indicates two feature values "China" and "USA".
Default to ','.
- label : str, optional
Specifies the dependent variable.
For classification, the label column can be any kind of data type.
Default to last column name.
- categorical_variable : str/ListOfStrings, optional
Specifies which INTEGER columns should be treated as categorical rather than continuous.
By default, 'VARCHAR' and 'NVARCHAR' columns are treated as categorical, and 'INTEGER' and 'DOUBLE' columns as continuous.
- Returns
- Fitted object.
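The effect of `delimiter` on a string feature can be illustrated with a small sketch. The helper `split_features` is hypothetical and only mimics the behaviour described above, where "China, USA" yields the two values "China" and "USA". Trimming surrounding whitespace and returning no values for a missing entry are assumptions made for the illustration, not confirmed PAL behaviour.

```python
def split_features(value, delimiter=","):
    """Split one string cell into its individual feature values."""
    if value is None:
        return []  # assumption: a missing cell contributes no feature values
    parts = (part.strip() for part in value.split(delimiter))
    return [part for part in parts if part]


# Cells taken from the MOVIE column of the training data in the Examples.
single = split_features("Movie1")
multiple = split_features("Movie2, Movie5, Movie7")
countries = split_features("China, USA")
```

Each resulting value is then treated as a separate feature belonging to the same field, here the MOVIE column.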
- predict(data, key=None, features=None, thread_ratio=None, handle_missing=None)
Prediction for the input data with the trained FFMClassifier model.
- Parameters
- data : DataFrame
Data to be predicted.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : str/ListOfStrings, optional
Global side features column name in the training dataframe.
- thread_ratio : float, optional
The ratio of available threads.
0: single thread
0~1: percentage
Others: heuristically determined
Default to -1.
- handle_missing : str, optional
Specifies how to handle missing feature values:
'skip': skip rows with missing values.
'fill_zero': replace missing values with 0.
Default to 'fill_zero'.
- Returns
- DataFrame
Prediction result, structured as follows:
1st column : ID
2nd column : SCORE, i.e. predicted class labels
3rd column : CONFIDENCE, the confidence for assigning class labels.
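The two `handle_missing` options can be pictured with plain Python rows. This is a sketch of the described semantics only: the helper `apply_handle_missing` is hypothetical, and treating both `None` and `NaN` as missing is an assumption. PAL applies the equivalent logic inside the database, so no such preprocessing is needed in client code.

```python
import math


def _is_missing(value):
    """Treat None and NaN as missing (assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def apply_handle_missing(rows, handle_missing="fill_zero"):
    """Mimic the two documented options on a list of dict rows."""
    if handle_missing == "skip":
        # Drop every row that has at least one missing value.
        return [r for r in rows
                if not any(_is_missing(v) for v in r.values())]
    if handle_missing == "fill_zero":
        # Keep every row and replace missing values with 0.
        return [{k: (0 if _is_missing(v) else v) for k, v in r.items()}
                for r in rows]
    raise ValueError("handle_missing must be 'skip' or 'fill_zero'")


# Rows modelled on the training data in the Examples (rows 9, 10 and 18).
rows = [
    {"USER": "B", "MOVIE": "Movie4", "TIMESTAMP": 2.0},
    {"USER": "B", "MOVIE": None, "TIMESTAMP": 4.0},
    {"USER": "C", "MOVIE": "Movie6", "TIMESTAMP": float("nan")},
]
skipped = apply_handle_missing(rows, "skip")
filled = apply_handle_missing(rows, "fill_zero")
```

With 'skip' only the complete row survives, whereas 'fill_zero' keeps all three rows and a zero-valued feature simply contributes nothing to the score.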
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.