FFMRanker
- class hana_ml.algorithms.pal.recommender.FFMRanker(ordering=None, normalise=None, include_linear=None, early_stop=None, random_state=None, factor_num=None, max_iter=None, train_size=None, learning_rate=None, linear_lamb=None, poly2_lamb=None, tol=None, exit_interval=None, handle_missing=None)
Field-Aware Factorization Machine with the task of ranking.
- Parameters
- factor_num : int, optional
The factorization dimensionality.
Default to 4.
- random_state : int, optional
Specifies the seed for random number generator.
0: Uses the current time as the seed.
Others: Uses the specified value as the seed.
Default to 0.
- train_size : float, optional
The proportion of the data used for training; the remainder is used for validation.
For example, 0.8 means that 80% of the data is used for training and the remaining 20% for validation.
Default to 0.8 if the number of instances is not less than 40, 1.0 otherwise.
- max_iter : int, optional
Specifies the maximum number of iterations for the SGD algorithm.
Default to 20.
- ordering : ListOfStrings, optional
Specifies the order of the categories (ascending) used for ranking.
No default value.
- normalise : bool, optional
Specifies whether to normalize each instance so that its L1 norm is 1.
Default to True.
- include_linear : bool, optional
Specifies whether to include the linear part of the model.
Default to True.
- early_stop : bool, optional
Specifies whether to stop the SGD optimization early.
Valid only if the value of train_size is less than 1.
Default to True.
- learning_rate : float, optional
The learning rate for SGD iteration.
Default to 0.2.
- linear_lamb : float, optional
The L2 regularization parameter for the linear coefficient vector.
Default to 1e-5.
- poly2_lamb : float, optional
The L2 regularization parameter for the factorized coefficient matrix of the quadratic term.
Default to 1e-5.
- tol : float, optional
The criterion to determine the convergence of SGD.
Default to 1e-5.
- exit_interval : int, optional
The number of iterations between the two iterations that are compared to determine convergence.
Default to 5.
- handle_missing : {'skip', 'fill_zero'}, optional
Specifies how to handle missing feature values:
'skip': remove rows with missing values.
'fill_zero': replace missing values with 0.
Default to 'fill_zero'.
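The two data-dependent rules above (the default of train_size and the two handle_missing options) can be pictured with a small client-side sketch. This is only an illustration of the documented behaviour, not the PAL implementation: `effective_train_size` and `apply_handle_missing` are hypothetical helper names, and the rows are plain Python dictionaries with numeric values rather than HANA DataFrames.

```python
from typing import Dict, List, Optional


def effective_train_size(n_rows: int, train_size: Optional[float] = None) -> float:
    """Return the training proportion following the documented default rule.

    Hypothetical helper: 0.8 when there are at least 40 instances, 1.0 otherwise,
    unless the caller passes an explicit value.
    """
    if train_size is not None:
        return train_size
    return 0.8 if n_rows >= 40 else 1.0


def apply_handle_missing(rows: List[Dict[str, Optional[float]]],
                         how: str = "fill_zero") -> List[Dict[str, Optional[float]]]:
    """Imitate the two documented options on numeric rows (illustration only)."""
    if how == "skip":
        # 'skip': drop every row that has at least one missing value.
        return [r for r in rows if all(v is not None for v in r.values())]
    if how == "fill_zero":
        # 'fill_zero': keep every row and replace missing values with 0.
        return [{k: (0 if v is None else v) for k, v in r.items()} for r in rows]
    raise ValueError("how must be 'skip' or 'fill_zero'")


rows = [{"TIMESTAMP": 3.0}, {"TIMESTAMP": None}]
print(effective_train_size(37))            # fewer than 40 instances
print(apply_handle_missing(rows, "skip"))
print(apply_handle_missing(rows, "fill_zero"))
```

With fewer than 40 instances and no explicit train_size, the documented default leaves no validation split, so early_stop would have no effect.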
Examples
Input dataframe for ranking training:
>>> df_train_ranker.collect()
   USER                   MOVIE  TIMESTAMP       CTR
0     A                  Movie1        3.0    medium
1     A                  Movie2        3.0  too high
2     A                  Movie4        1.0    medium
3     A                  Movie5        2.0   too low
4     A                  Movie6        3.0       low
5     A                  Movie8        2.0       low
6     A          Movie0, Movie3        1.0  too high
7     B                  Movie2        3.0      high
8     B                  Movie3        2.0      high
9     B                  Movie4        2.0    medium
10    B                    None        4.0    medium
11    B                  Movie7        1.0      high
12    B                  Movie8        2.0      high
13    B                  Movie0        3.0      high
14    C                  Movie1        2.0    medium
15    C  Movie2, Movie5, Movie7        4.0       low
16    C                  Movie4        3.0   too low
17    C                  Movie5        1.0      high
18    C                  Movie6        NaN  too high
19    C                  Movie7        3.0      high
20    C                  Movie8        1.0  too high
21    C                  Movie0        2.0    medium
22    D                  Movie1        3.0  too high
23    D                  Movie3        2.0  too high
24    D          Movie4, Movie7        2.0  too high
25    D                  Movie6        2.0  too high
26    D                  Movie7        4.0  too high
27    D                  Movie8        3.0   too low
28    D                  Movie0        3.0   too low
29    E                  Movie1        2.0   too low
30    E                  Movie2        2.0  too high
31    E                  Movie3        2.0    medium
32    E                  Movie4        4.0       low
33    E                  Movie5        3.0  too high
34    E                  Movie6        2.0       low
35    E                  Movie7        4.0       low
36    E                  Movie8        3.0   too low
Creating FFMRanker instance:
>>> ffm = FFMRanker(ordering=['too low', 'low', 'medium', 'high', 'too high'],
...                 factor_num=4, early_stop=True, learning_rate=0.2,
...                 max_iter=20, train_size=0.8, linear_lamb=1e-5,
...                 poly2_lamb=1e-6, random_state=1)
Performing fit() on given dataframe:
>>> ffm.fit(data=df_train_ranker, categorical_variable='TIMESTAMP')
>>> ffm.stats_.collect()
     STAT_NAME                             STAT_VALUE
0         task                                ranking
1  feature_num                                     18
2    field_num                                      3
3        k_num                                      4
4     category  too low, low, medium, high, too high
5         iter                                     14
6      tr-loss                     1.3432013591533276
7      va-loss                     1.5509792122994928
Performing predict() on given predicting dataframe:
>>> res = ffm.predict(data=df_predict, key='ID', thread_ratio=1)
>>> res.collect()
   ID     SCORE  CONFIDENCE
0   1      high    0.294206
1   2    medium    0.209893
2   3   too low    0.316609
3   4      high    0.219671
4   5  too high    0.222545
5   6      high    0.385621
6   7   too low    0.407695
7   8   too low    0.295200
8   9      high    0.282633
- Attributes
- metadata_ : DataFrame
Model metadata content.
- coef_ : DataFrame
DataFrame containing the following information:
Feature name,
Field name,
The factorization number,
The parameter value.
- stats_ : DataFrame
Statistic values.
- cross_valid_ : DataFrame
Cross validation content.
Methods
fit(data[, key, features, label, ...])
Fit the FFMRanker model with the input training data.
predict(data[, key, features, thread_ratio, ...])
Prediction for the input data with the trained FFMRanker model.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
- fit(data, key=None, features=None, label=None, categorical_variable=None, delimiter=None)
Fit the FFMRanker model with the input training data. Model parameters should be given by initializing the model first.
- Parameters
- data : DataFrame
Data to be fit.
- key : str, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features : str/ListOfStrings, optional
Name of the feature columns.
- delimiter : str, optional
The delimiter that separates multiple values within a string feature.
For example, "China, USA" indicates two feature values "China" and "USA".
Default to ','.
- label : str, optional
Specifies the dependent variable.
For ranking, the label column must have categorical data type.
Default to last column name.
- categorical_variable : str/ListOfStrings, optional
Specifies which columns should be treated as categorical variables even though their data type is INTEGER.
By default, 'VARCHAR' and 'NVARCHAR' columns are treated as categorical variables, and 'INTEGER' and 'DOUBLE' columns as continuous variables.
- Returns
- Fitted object.
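As a rough picture of what the delimiter parameter means for a string feature column such as MOVIE, the sketch below splits one cell into its individual feature values. It is an illustration only: `split_features` is a hypothetical helper, and trimming the surrounding whitespace is an assumption made so that "China, USA" yields "China" and "USA" as described above; the actual parsing happens inside PAL.

```python
from typing import List, Optional


def split_features(value: Optional[str], delimiter: str = ",") -> List[str]:
    """Split one string cell into feature values (hypothetical helper).

    Assumption: whitespace around each value is not significant and
    empty pieces are ignored.
    """
    if value is None:
        return []
    return [part.strip() for part in value.split(delimiter) if part.strip()]


print(split_features("China, USA"))
print(split_features("Movie2, Movie5, Movie7"))
print(split_features("Movie1"))
```

A cell without the delimiter is treated as a single feature value, and a missing cell contributes no feature values in this sketch.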
- predict(data, key=None, features=None, thread_ratio=None, handle_missing=None)
Prediction for the input data with the trained FFMRanker model.
- Parameters
- data : DataFrame
Data to be predicted.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : str/ListOfStrings, optional
Global side features column name in the training dataframe.
- thread_ratio : float, optional
The ratio of available threads.
0: single thread
0~1: percentage
Others: heuristically determined
Default to -1.
- handle_missing : str, optional
Specifies how to handle missing feature values:
'skip': remove rows with missing values.
'fill_zero': replace missing values with 0.
Default to 'fill_zero'.
- Returns
- DataFrame
Prediction result, structured as follows:
1st column : ID
2nd column : SCORE, i.e. predicted ranking
3rd column : CONFIDENCE, the confidence for ranking.
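Because SCORE is a category label rather than a number, ordering the collected predictions usually means mapping each label back to its position in the ordering list passed to the constructor. The sketch below does this on the client side with the values from the result table in the Examples section; `rank_of` and `ranked` are illustrative names, and breaking ties by higher CONFIDENCE is a design choice made here, not something the library prescribes.

```python
# Category order passed to FFMRanker(ordering=...), lowest first.
ordering = ['too low', 'low', 'medium', 'high', 'too high']
rank_of = {label: position for position, label in enumerate(ordering)}

# (ID, SCORE, CONFIDENCE) rows copied from the prediction example.
predictions = [
    (1, 'high', 0.294206),
    (2, 'medium', 0.209893),
    (3, 'too low', 0.316609),
    (4, 'high', 0.219671),
    (5, 'too high', 0.222545),
    (6, 'high', 0.385621),
    (7, 'too low', 0.407695),
    (8, 'too low', 0.295200),
    (9, 'high', 0.282633),
]

# Sort by predicted category (best first), then by confidence (highest first).
ranked = sorted(predictions, key=lambda row: (-rank_of[row[1]], -row[2]))
ranked_ids = [row[0] for row in ranked]
print(ranked_ids)  # [5, 6, 1, 9, 4, 2, 7, 3, 8]
```

Note that CONFIDENCE describes how certain the model is about the predicted category, so comparing it across different categories is only a heuristic.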