FRM

class hana_ml.algorithms.pal.recommender.FRM(solver=None, factor_num=None, init=None, random_state=None, learning_rate=None, linear_lamb=None, lamb=None, max_iter=None, sgd_tol=None, sgd_exit_interval=None, thread_ratio=None, momentum=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, reduction_rate=None, min_resource_rate=None, aggressive_elimination=None)

Factorized Polynomial Regression Models or Factorization Machines approach.
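For orientation, the sketch below evaluates the textbook second-order factorization machine score (bias, linear terms, and pairwise interactions weighted by dot products of factor vectors) in plain Python. It is illustrative only: the names `fm_predict`, `fm_predict_fast`, `w0`, `w`, `V` and the toy numbers are assumptions, and nothing here describes how PAL stores or evaluates the model internally. The length of each row of `V` plays the role of factor_num.

```python
from itertools import combinations


def fm_predict(x, w0, w, V):
    """Second-order factorization machine score for one feature vector x.

    w0 is the bias, w the linear weights, and V[i] the factor vector of
    feature i. All names are hypothetical, chosen for this sketch.
    """
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for i, j in combinations(range(len(x)), 2):
        dot = sum(a * b for a, b in zip(V[i], V[j]))
        pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise


def fm_predict_fast(x, w0, w, V):
    """Same score, using the linear-time rewrite of the pairwise term."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(V[0])):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Toy input: a one-hot user (feature 0) and a one-hot item (feature 3),
# with two factors per feature.
x = [1.0, 0.0, 0.0, 1.0]
w0 = 3.5
w = [0.1, -0.2, 0.3, 0.05]
V = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1], [-0.2, 0.4]]
score = fm_predict(x, w0, w, V)
```

Both functions return the same value; the second avoids the explicit loop over feature pairs.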

Parameters:

solver : {'sgd', 'momentum', 'nag', 'adagrad'}, optional

Specifies the method for solving the objective minimization problem.

Default to 'sgd'.

factor_num : int, optional

Length of factor vectors in the model.

Default to 8.

init : float, optional

Variance of the normal distribution used to initialize the model parameters.

Default to 1e-2.

random_state : int, optional

Specifies the seed for random number generator.

0: Uses the current time as the seed.

Others: Uses the specified value as the seed.

Note that, due to the inherent randomness of parallel SGD, models from different training runs might differ even with the same random number generator seed.

Default to 0.

lamb : float, optional

L2 regularization of the factors.

Default to 1e-8.

linear_lamb : float, optional

L2 regularization of the linear terms.

Default to 1e-10.

thread_ratio : float, optional

Controls the proportion of available threads that can be used.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.
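The rules above can be read as the small mapping sketched below. The helper name `threads_for_ratio`, the rounding down, and the use of `None` for "PAL decides heuristically" are assumptions made for illustration; the page does not state how PAL rounds fractional thread counts.

```python
def threads_for_ratio(thread_ratio, available_threads):
    """Illustrative reading of thread_ratio (hypothetical helper).

    Returns the number of threads that may be used, or None when the
    value is outside [0, 1] and the choice is left to PAL's heuristic.
    """
    if thread_ratio == 0:
        return 1  # 0 means a single thread
    if 0 < thread_ratio <= 1:
        # Rounding down (but never below one thread) is an assumption.
        return max(1, int(available_threads * thread_ratio))
    return None  # outside the range: heuristically determined


half = threads_for_ratio(0.5, 8)
```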

max_iter : int, optional

Specifies the maximum number of iterations for the solver.

Default value is 50.

sgd_tol : float, optional

Exit threshold.

The algorithm exits when the cost function has not decreased more than this threshold in sgd_exit_interval steps.

Default to 1e-5.

sgd_exit_interval : int, optional

The algorithm exits when the cost function has not decreased more than sgd_tol in sgd_exit_interval steps.

Default to 5.
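One plausible reading of how sgd_tol and sgd_exit_interval work together is sketched below: compare the current cost with the cost sgd_exit_interval steps earlier and stop when the drop is no larger than sgd_tol. The function `should_exit` and the list `costs` are hypothetical; the exact bookkeeping inside PAL is not described on this page.

```python
def should_exit(costs, sgd_tol=1e-5, sgd_exit_interval=5):
    """Return True when the cost has not decreased by more than sgd_tol
    over the last sgd_exit_interval steps (illustrative reading only)."""
    if len(costs) <= sgd_exit_interval:
        return False  # not enough history to compare yet
    decrease = costs[-1 - sgd_exit_interval] - costs[-1]
    return decrease <= sgd_tol


# Cost stalls at 0.4 for the last six recorded steps.
stalled = [1.0, 0.5, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]
# Cost keeps improving at every step.
improving = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
```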

momentum : float, optional

The momentum factor used by the 'momentum' and 'nag' solvers.

Valid only when solver is 'momentum' or 'nag'.

Default to 0.9.
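To illustrate what the momentum factor does, the sketch below applies the textbook classical-momentum and Nesterov accelerated gradient (NAG) updates to a one-dimensional quadratic. It assumes the standard formulations of both methods; the function `minimize`, its arguments, and the toy objective are hypothetical and are not taken from PAL.

```python
def minimize(grad, w, solver="momentum", learning_rate=0.1,
             momentum=0.9, steps=500):
    """Gradient descent with classical momentum or NAG (textbook forms)."""
    v = 0.0  # velocity carried over between steps
    for _ in range(steps):
        if solver == "momentum":
            g = grad(w)  # gradient at the current point
        elif solver == "nag":
            g = grad(w + momentum * v)  # gradient at the look-ahead point
        else:
            raise ValueError("solver must be 'momentum' or 'nag'")
        v = momentum * v - learning_rate * g
        w = w + v
    return w


def grad(w):
    """Gradient of the toy objective f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w_momentum = minimize(grad, 0.0, solver="momentum")
w_nag = minimize(grad, 0.0, solver="nag")
```

Both runs approach the minimizer w = 3; a larger momentum factor keeps more of the previous step's velocity.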

resampling_method : {'cv', 'bootstrap'}, optional

Specifies the resampling method for model evaluation or parameter selection.

If not specified, neither model evaluation nor parameter selection is activated.

No default value.

evaluation_metric : {'rmse'}, optional

Specifies the evaluation metric for model evaluation or parameter selection.

If not specified, neither model evaluation nor parameter selection is activated.

No default value.

fold_num : int, optional

Specifies the fold number for the cross validation method.

Mandatory and valid only when resampling_method is set to 'cv'.

Default to 1.

repeat_times : int, optional

Specifies the number of repeat times for resampling.

Default to 1.

search_strategy : {'grid', 'random'}, optional

Specifies the method to activate parameter selection.

No default value.

random_search_times : int, optional

Specifies the number of times to randomly select candidate parameters for selection.

Mandatory and valid only when search_strategy is set to 'random'.

No default value.

timeout : int, optional

Specifies maximum running time for model evaluation or parameter selection, in seconds.

No timeout when 0 is specified.

Default to 0.

progress_indicator_id : str, optional

Sets an ID of progress indicator for model evaluation or parameter selection.

No progress indicator is active if no value is provided.

No default value.

param_values : dict or ListOfTuples, optional

Specifies values of parameters to be selected.

Input should be a dict or a list of two-element tuples, where the key (or first element) is the parameter name and the value (or second element) is a list of values to select from.

Valid only when resampling_method and search_strategy are both specified.

Valid parameter names include: 'factor_num', 'lamb', 'linear_lamb', 'momentum'.

No default value.
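The two accepted input shapes for param_values carry the same information, as the sketch below shows. The helper `normalize_param_values` is hypothetical and only mirrors the description above; the candidate values are made up, and the expansion into a grid is an illustration of what search_strategy='grid' has to enumerate, not PAL's actual search code.

```python
from itertools import product

# Parameter names listed on this page as valid for selection.
ALLOWED = {"factor_num", "lamb", "linear_lamb", "momentum"}


def normalize_param_values(param_values):
    """Accept a dict or a list of (name, values) tuples; return a dict."""
    as_dict = dict(param_values)  # works for both input shapes
    unknown = set(as_dict) - ALLOWED
    if unknown:
        raise ValueError("unsupported parameter names: %s" % sorted(unknown))
    return {name: list(values) for name, values in as_dict.items()}


# The same candidates written both ways (values are illustrative).
as_dict = {"factor_num": [2, 4, 8], "lamb": [1e-8, 1e-6]}
as_tuples = [("factor_num", [2, 4, 8]), ("lamb", [1e-8, 1e-6])]

grid = normalize_param_values(as_dict)
# Every combination a grid search would need to evaluate.
candidates = [dict(zip(grid, combo)) for combo in product(*grid.values())]
```

Either form would be passed as the param_values argument together with resampling_method and search_strategy, since the parameter is valid only when both are specified.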

param_range : dict or ListOfTuples, optional

Specifies ranges of parameters to be selected.

Input should be a dict or a list of two-element tuples, where the key (or first element) is the parameter name and the value (or second element) is a list of numerical values indicating the range for selection.

Valid only when resampling_method and search_strategy are both specified.

Valid parameter names include: 'factor_num', 'lamb', 'linear_lamb', 'momentum'.

No default value.

reduction_rate : float, optional

Specifies reduction rate in SHA or Hyperband method.

In each round, the number of remaining parameter candidates is divided by the value of this parameter. Valid values must therefore be greater than 1.0.

Valid only when resampling_method takes one of the following values: 'cv_sha', 'bootstrap_sha', 'cv_hyperband', 'bootstrap_hyperband'.

Defaults to 3.0.
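The effect of reduction_rate on the number of surviving candidates can be sketched as follows. The helper `sha_schedule` is hypothetical, and rounding down after each division is an assumption; the page only states that the candidate count is divided by reduction_rate in each round.

```python
def sha_schedule(num_candidates, reduction_rate=3.0):
    """Candidate counts per successive-halving round (illustrative)."""
    if reduction_rate <= 1.0:
        raise ValueError("reduction_rate must be greater than 1.0")
    sizes = [num_candidates]
    while sizes[-1] > 1:
        # Rounding down, with at least one survivor, is an assumption.
        sizes.append(max(1, int(sizes[-1] / reduction_rate)))
    return sizes


rounds = sha_schedule(27, 3.0)
```

With 27 candidates and the default rate of 3.0, three elimination rounds leave a single candidate. A rate of 1.0 or less would never shrink the candidate set, which is why it is rejected.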

min_resource_rate : float, optional

Specifies the minimum resource rate that should be used in SHA or Hyperband iteration.

Valid only when resampling_method takes one of the following values: 'cv_sha', 'cv_hyperband', 'bootstrap_sha', 'bootstrap_hyperband'.

Defaults to 0.0.

aggressive_elimination : bool, optional

Specifies whether to apply aggressive elimination while using SHA method.

Aggressive elimination applies when the data size and the number of parameter candidates do not match, so that many candidates still remain to be searched when the data size reaches its upper limit. If aggressive elimination is applied, the lower bound of the data size is used multiple times first to reduce the number of candidates.

Valid only when resampling_method is 'cv_sha' or 'bootstrap_sha'.

Defaults to False.

Examples

Input dataframe for training:

>>> df_train.collect()
  USER       MOVIE  FEEDBACK
  A      Movie1       4.8
  A      Movie2       4.0
  A      Movie4       4.0
  A      Movie5       4.0
  A      Movie6       4.8
  A      Movie8       3.8
  A   Bad_Movie       2.5
  B      Movie2       4.8
  B      Movie3       4.8
  B      Movie4       5.0
  B      Movie5       5.0
  B      Movie7       3.5
  B      Movie8       4.8
  B   Bad_Movie       2.8
  C      Movie1       4.1
  C      Movie2       4.2
  C      Movie4       4.2
  C      Movie5       4.0
  C      Movie6       4.2
  C      Movie7       3.2
  C      Movie8       3.0
  C   Bad_Movie       2.5
  D      Movie1       4.5
  D      Movie3       3.5
  D      Movie4       4.5
  D      Movie6       3.9
  D      Movie7       3.5
  D      Movie8       3.5
  D   Bad_Movie       2.5
  E      Movie1       4.5
  E      Movie2       4.0
  E      Movie3       3.5
  E      Movie4       4.5
  E      Movie5       4.5
  E      Movie6       4.2
  E      Movie7       3.5
  E      Movie8       3.5

Input user dataframe for training:

>>> usr_info.collect()
    USER            USER_SIDE_FEATURE
    -- There is no side information for user provided. --

Input item dataframe for training:

>>> item_info.collect()
    MOVIE              GENRES
   Movie1              Sci-Fi
   Movie2       Drama,Romance
   Movie3        Drama,Sci-Fi
   Movie4         Crime,Drama
   Movie5         Crime,Drama
   Movie6              Sci-Fi
   Movie7         Crime,Drama
   Movie8     Sci-Fi,Thriller
Bad_Movie    Romance,Thriller

Creating FRM instance:

>>> from hana_ml.algorithms.pal.recommender import FRM
>>> frm = FRM(factor_num=2, solver='adagrad',
...           learning_rate=0, max_iter=100,
...           thread_ratio=0.5, random_state=1)

Performing fit() on given dataframe:

>>> frm.fit(df_train, usr_info, item_info, categorical_variable='TIMESTAMP')

>>> frm.factors_.collect().head(10)
   FACTOR_ID      FACTOR
           0   -0.083550
           1   -0.083654
           2    0.582244
           3   -0.102799
           4   -0.441795
           5   -0.013341
           6   -0.099548
           7    0.245046
           8   -0.056534
           9   -0.342042

Performing predict() on given predicting dataframe:

>>> res = frm.predict(df_predict, usr_info, item_info, thread_ratio=0.5, key='ID')
>>> res.collect()
   ID USER  ITEM  PREDICTION
    1    A  None    3.486804
    2    A     4    3.490246
    3    B     2    5.436991
    4    B     3    5.287031
    5    C     2    3.015121
    6    D     1    3.602543
    7    D     3    4.097683
    8    E     2    2.317224

Attributes:

metadata_ : DataFrame

Model metadata content.

model_ : DataFrame

Model (Map, Weight).

factors_ : DataFrame

Decomposed factors.

optim_param_ : DataFrame

Optimal parameters selected.

stats_ : DataFrame

Statistic values.

iter_info_ : DataFrame

Cost function value and RMSE of the corresponding iteration.

Methods

`create_model_state`([model, function, ...])	Create PAL model state.
`delete_model_state`([state])	Delete PAL model state.
`fit`(data, usr_info, item_info[, key, usr, ...])	Fit the FRM model with input training data.
`predict`(data, usr_info, item_info[, key, ...])	Prediction for the input data with the trained FRM model.
`set_model_state`(state)	Set the model state by state information.

fit(data, usr_info, item_info, key=None, usr=None, item=None, feedback=None, features=None, usr_features=None, item_features=None, usr_key=None, item_key=None, categorical_variable=None, usr_categorical_variable=None, item_categorical_variable=None)

Fit the FRM model with input training data. Model parameters should be given by initializing the model first.

Parameters:

data : DataFrame

Data to be fit.

usr_info : DataFrame

DataFrame containing user side features.

item_info : DataFrame

DataFrame containing item side features.

key : str, optional

Name of the ID column.

If key is not provided, then:

if data is indexed by a single column, then key defaults to that index column;

otherwise, it is assumed that data contains no ID column.

usr : str, optional

Name of the user column.

Defaults to the first non-key column of data.

item : str, optional

Name of the item column.

Defaults to the first non-key and non-usr column of the input data.

feedback : str, optional

Name of the feedback column.

Defaults to the last column of the input data.

features : str or a list of str, optional

Global side features column name in the training dataframe.

Defaults to the rest of input data removing key, usr, item and feedback columns.
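The defaulting rules for key, usr, item, feedback and features can be traced on a plain list of column names, as sketched below. The helper `resolve_columns` is hypothetical and is not part of hana_ml. It also does not model the index-based default for key, since a plain list has no index, and the five-column layout in the second call is made up for illustration.

```python
def resolve_columns(columns, key=None, usr=None, item=None,
                    feedback=None, features=None):
    """Apply the documented column defaults to a list of column names."""
    remaining = [c for c in columns if c != key]
    if usr is None:
        usr = remaining[0]  # first non-key column
    remaining = [c for c in remaining if c != usr]
    if item is None:
        item = remaining[0]  # first non-key, non-usr column
    if feedback is None:
        feedback = columns[-1]  # last column of the input data
    if features is None:
        # Whatever is left after removing key, usr, item and feedback.
        features = [c for c in columns
                    if c not in (key, usr, item, feedback)]
    return {"key": key, "usr": usr, "item": item,
            "feedback": feedback, "features": features}


# Column layout of df_train in the example above: no ID column.
plain = resolve_columns(["USER", "MOVIE", "FEEDBACK"])
# A hypothetical layout with an ID column and one global side feature.
with_key = resolve_columns(
    ["ID", "USER", "MOVIE", "TIMESTAMP", "FEEDBACK"], key="ID")
```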

usr_features : str or a list of str, optional

Name(s) of the user side feature column(s) in usr_info.

Defaults to all columns in usr_info exclusive of the one specified by usr_key.

item_features : str or a list of str, optional

Name(s) of the item side feature column(s) in item_info.

Defaults to all columns in item_info exclusive of the one specified by item_key.

usr_key : str, optional

Specifies the column in usr_info that contains user names or IDs.

Defaults to the 1st column of usr_info.

item_key : str, optional

Specifies the column in item_info that contains item names or IDs.

Defaults to the 1st column of item_info.

categorical_variable : str or a list of str, optional

Specifies the INTEGER columns in data that should be treated as categorical.

By default, a column of type 'VARCHAR' or 'NVARCHAR' is categorical, and a column of type 'INTEGER' or 'DOUBLE' is continuous.

usr_categorical_variable : str or a list of str, optional

Name of user side feature columns of INTEGER type that should be treated as categorical.

item_categorical_variable : str or a list of str, optional

Name of item side feature columns of INTEGER type that should be treated as categorical.

Returns:

Fitted object.

predict(data, usr_info, item_info, key=None, usr=None, item=None, features=None, thread_ratio=None)

Prediction for the input data with the trained FRM model.

Parameters:

data : DataFrame

Data to be predicted.

usr_info : DataFrame

User side features.

item_info : DataFrame

Item side features.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

usr : str, optional

Name of the column containing user name or user ID. If not provided, it defaults to 1st non-ID column of data.

item : str, optional

Name of the column containing item name or item ID.

If not provided, it defaults to the 1st non-ID, non-usr column of data.

features : str or a list of str, optional

Global side features column name in the prediction dataframe.

Defaults to all columns of data other than the key, usr and item columns.

thread_ratio : float, optional

Specifies the upper limit of thread usage in proportion of current available threads.

The valid range of the value is [0,1].

Default to 0.

Returns:

DataFrame

Prediction result of FRM algorithm, structured as follows:

1st column : Data ID

2nd column : User name/ID

3rd column : Item name/ID

4th column : Predicted rating

create_model_state(model=None, function=None, pal_funcname='PAL_FRM', state_description=None, force=False)

Create PAL model state.

Parameters:

model : DataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

function : str, optional

Specify the function in the unified API.

A placeholder parameter, not effective for FRM.

pal_funcname : int or str, optional

PAL function name.

Defaults to 'PAL_FRM'.

state_description : str, optional

Description of the state as model container.

Defaults to None.

force : bool, optional

If True it will delete the existing state.

Defaults to False.

set_model_state(state)

Set the model state by state information.

Parameters:

state : DataFrame or dict

If state is DataFrame, it has the following structure:

NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.

VALUE: VARCHAR(1000), the values according to NAME.

If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
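A quick way to check the dict form before passing it to set_model_state is sketched below. The helper `missing_state_keys` is hypothetical, and the values are placeholders; real values come from an existing PAL model state.

```python
# Keys the state information must contain, per the description above.
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [k for k in REQUIRED_STATE_KEYS if k not in state]


# Placeholder values only.
state = {
    "STATE_ID": "<state id>",
    "HINT": "<hint>",
    "HOST": "<host>",
    "PORT": "<port>",
}
missing = missing_state_keys(state)
incomplete = missing_state_keys({"STATE_ID": "<state id>"})
```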

delete_model_state(state=None)

Delete PAL model state.

Parameters:

state : DataFrame, optional

Specifies the state.

Defaults to self.state.

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods mentioned above, the FRM class also inherits methods from the PALBase class. Please refer to PAL Base for more details.