ALS
- class hana_ml.algorithms.pal.recommender.ALS(random_state=None, max_iter=None, tol=None, exit_interval=None, implicit=None, linear_solver=None, cg_max_iter=None, thread_ratio=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, factor_num=None, lamb=None, alpha=None, reduction_rate=None, min_resource_rate=None, aggressive_elimination=None)
Alternating least squares (ALS) is a powerful matrix factorization algorithm for building both explicit and implicit feedback based recommender systems.
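As an illustration of the alternating updates, the following is a minimal pure-Python sketch of explicit-feedback ALS restricted to a single latent factor (the factor_num=1 case). It is not the PAL implementation: the function name als_rank1, the objective (squared error on observed entries plus an L2 penalty weighted by lamb), the fixed initialization, and the toy ratings are all assumptions made for this example.

```python
def als_rank1(ratings, lamb=1e-2, max_iter=20):
    """Explicit-feedback ALS with one latent factor (illustrative only).

    ratings maps (user, item) -> observed feedback value.
    Returns (user_factors, item_factors, costs), where costs[k] is the
    regularized squared error after k full iterations.
    """
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    p = {u: 1.0 for u in users}  # user factors (fixed start, an assumption)
    q = {i: 1.0 for i in items}  # item factors

    def cost():
        err = sum((r - p[u] * q[i]) ** 2 for (u, i), r in ratings.items())
        reg = lamb * (sum(v * v for v in p.values())
                      + sum(v * v for v in q.values()))
        return err + reg

    costs = [cost()]
    for _ in range(max_iter):
        # Hold item factors fixed: each user factor has a closed-form
        # ridge-regression solution over the items that user rated.
        for u in users:
            num = sum(r * q[i] for (uu, i), r in ratings.items() if uu == u)
            den = lamb + sum(q[i] ** 2 for (uu, i), _ in ratings.items() if uu == u)
            p[u] = num / den
        # Hold user factors fixed and solve for each item factor the same way.
        for i in items:
            num = sum(r * p[u] for (u, ii), r in ratings.items() if ii == i)
            den = lamb + sum(p[u] ** 2 for (u, ii), _ in ratings.items() if ii == i)
            q[i] = num / den
        costs.append(cost())
    return p, q, costs


# Toy ratings invented for this sketch.
ratings = {("A", "Movie1"): 4.8, ("A", "Movie2"): 4.0,
           ("B", "Movie1"): 4.5, ("B", "Movie3"): 2.0,
           ("C", "Movie2"): 3.5, ("C", "Movie3"): 2.5}
p, q, costs = als_rank1(ratings, lamb=1e-2, max_iter=20)
predicted = p["A"] * q["Movie3"]  # estimate for an unobserved (user, item) pair
```

Because each half-step exactly minimizes the objective over one set of factors while the other is held fixed, the recorded cost never increases from one iteration to the next.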
- Parameters:
- factor_numint, optional
Length of factor vectors in the model.
Defaults to 8.
- random_stateint, optional
Specifies the seed for random number generator.
0: Uses the current time as the seed.
Others: Uses the specified value as the seed.
Defaults to 0.
- lambfloat, optional
Specifies the L2 regularization of the factors.
Defaults to 1e-2.
- thread_ratiofloat, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
- max_iterint, optional
Specifies the maximum number of iterations for the ALS algorithm.
Defaults to 20.
- tolfloat, optional
Specifies the tolerance for exiting the iterative algorithm.
The algorithm exits if the value of the cost function has not decreased by more than this value since the last check.
If tol is set to 0, there is no check, and the algorithm exits only on reaching the maximum number of iterations.
Note that evaluating the cost function requires additional calculations; set this parameter to 0 to avoid them.
Defaults to 0.
- exit_intervalint, optional
Specifies the number of iterations between consecutive convergence checks.
The algorithm evaluates the cost function every exit_interval iterations to check whether the tolerance has been reached.
Note that evaluating the cost function requires additional calculations.
Only valid when tol is not 0.
Defaults to 5.
- implicitbool, optional
Specifies whether to use implicit ALS (True) or explicit ALS (False).
Defaults to False.
- linear_solver{'cholesky', 'cg'}, optional
Specifies the linear system solver.
Defaults to 'cholesky'.
- cg_max_iterint, optional
Specifies the maximum number of iterations of the cg solver.
Only valid when linear_solver is specified.
Defaults to 3.
- alphafloat, optional
Used when computing the confidence level in implicit ALS.
Only valid when implicit is set to True.
Defaults to 1.0.
- resampling_methodstr, optional
Specifies the resampling method for model evaluation or parameter selection.
Valid resampling methods include: 'cv', 'bootstrap', 'cv_sha', 'bootstrap_sha', 'cv_hyperband', 'bootstrap_hyperband'. Note that the latter four methods are designed for parameter selection only, not for model evaluation.
If not specified, neither model evaluation nor parameter selection is activated.
No default value.
- evaluation_metric{'rmse'}, optional
Specifies the evaluation metric for model evaluation or parameter selection.
If not specified, neither model evaluation nor parameter selection is activated.
No default value.
- fold_numint, optional
Specifies the fold number for the cross validation method.
Mandatory and valid only when resampling_method is set as 'cv'.
Defaults to 1.
- repeat_timesint, optional
Specifies the number of repeat times for resampling.
Defaults to 1.
- search_strategy{'grid', 'random'}, optional
Specifies the method to activate parameter selection.
Mandatory when resampling_method is set as 'cv_sha' or 'bootstrap_sha'.
Defaults to 'random' and cannot be changed if resampling_method is set as 'cv_hyperband' or 'bootstrap_hyperband'; otherwise there is no default value.
- random_search_timesint, optional
Specifies the number of times to randomly select candidate parameters for selection.
Mandatory and valid when search_strategy is set as 'random'.
No default value.
- timeoutint, optional
Specifies maximum running time for model evaluation or parameter selection, in seconds.
No timeout when 0 is specified.
Defaults to 0.
- progress_indicator_idstr, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.
No default value.
- param_valuesdict or ListOfTuples, optional
Specifies values of parameters to be selected.
Input should be a dict or a list of size-two tuples, with the key/1st element of each tuple being the target parameter name, and the value/2nd element being a list of values for selection.
Valid only when resampling_method and search_strategy are both specified.
Valid parameter names include: alpha, factor_num, lamb.
No default value.
- param_rangedict or ListOfTuples, optional
Specifies ranges of parameters to be selected.
Input should be a dict or a list of size-two tuples, with the key/1st element of each tuple being the name of the target parameter, and the value/2nd element being a list that specifies the range of the parameter in one of the following formats:
[start, step, end] or [start, end].
Valid only when resampling_method and search_strategy are both specified.
Valid parameter names include: alpha, factor_num, lamb.
No default value.
- reduction_ratefloat, optional
Specifies reduction rate in SHA or Hyperband method.
In each round, the number of available parameter candidates is divided by the value of this parameter. Valid values must therefore be greater than 1.0.
Valid only when resampling_method takes one of the following values: 'cv_sha', 'bootstrap_sha', 'cv_hyperband', 'bootstrap_hyperband'.
Defaults to 3.0.
- min_resource_ratefloat, optional
Specifies the minimum resource rate that should be used in SHA or Hyperband iteration.
Valid only when resampling_method takes one of the following values: 'cv_sha', 'cv_hyperband', 'bootstrap_sha', 'bootstrap_hyperband'.
Defaults to 0.0.
- aggressive_eliminationbool, optional
Specifies whether to apply aggressive elimination when using the SHA method.
Aggressive elimination applies when the data size and the number of parameter candidates to be searched do not match, so that many candidates remain to be searched after the data size has reached its upper limit. If aggressive elimination is applied, the lower limit of the data size is used multiple times first to reduce the number of candidates.
Valid only when resampling_method is 'cv_sha' or 'bootstrap_sha'.
Defaults to False.
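The effect of reduction_rate on the number of surviving parameter candidates can be sketched as follows. This is only an illustration of the "divide by reduction_rate each round" rule stated above: the helper name sha_candidate_schedule, the rounding down, and the choice to continue until one candidate remains are assumptions, not documented PAL behavior.

```python
def sha_candidate_schedule(n_candidates, reduction_rate=3.0):
    """Number of candidates kept in each successive-halving round (sketch)."""
    if reduction_rate <= 1.0:
        raise ValueError("reduction_rate must be greater than 1.0")
    schedule = [n_candidates]
    while schedule[-1] > 1:
        # Divide the candidate count by reduction_rate; rounding down and
        # keeping at least one candidate are assumptions of this sketch.
        schedule.append(max(1, int(schedule[-1] / reduction_rate)))
    return schedule


print(sha_candidate_schedule(27, 3.0))  # [27, 9, 3, 1]
print(sha_candidate_schedule(10, 2.0))  # [10, 5, 2, 1]
```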
Examples
Input DataFrame df_train:
>>> df_train.collect()
   USER   MOVIE  FEEDBACK
0     A  Movie1       4.8
1     A  Movie2       4.0
...
35    E  Movie7       3.5
36    E  Movie8       3.5
Create an ALS instance:
>>> from hana_ml.algorithms.pal.recommender import ALS
>>> als = ALS(factor_num=2, lamb=1e-2, max_iter=20, tol=1e-6, exit_interval=5, linear_solver='cholesky', thread_ratio=0, random_state=1)
Perform fit():
>>> als.fit(data=df_train)
>>> als.factors_.collect().head(10)
   FACTOR_ID    FACTOR
0          0  1.108775
1          1 -0.582392
...
8          8  1.151257
9          9  0.315342
Perform predict():
>>> res = als.predict(data=df_predict, thread_ratio=1, key='ID')
Output:
>>> res.collect()
   ID USER      MOVIE  PREDICTION
0   1    A     Movie3    3.868747
1   2    A     Movie7    2.870243
...
6   7    D     Movie5    4.325851
7   8    E  Bad_Movie    2.545807
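The parameter-selection arguments described under Parameters can be assembled as in the sketch below. The helper normalize_param_spec is hypothetical and only mirrors the documented input shapes (a dict or a list of size-two tuples) and the documented parameter names; the candidate values and the fold number are arbitrary choices for illustration. The ALS constructor call is left as a comment because it requires hana_ml to be installed.

```python
VALID_SEARCH_PARAMS = {"alpha", "factor_num", "lamb"}


def normalize_param_spec(spec):
    """Accept a dict or a list of (name, values) tuples and return a dict."""
    pairs = spec.items() if isinstance(spec, dict) else spec
    normalized = {}
    for name, values in pairs:
        if name not in VALID_SEARCH_PARAMS:
            raise ValueError(f"unsupported parameter name: {name!r}")
        normalized[name] = list(values)
    return normalized


# Grid search with 5-fold cross validation, scored by RMSE.
search_kwargs = dict(
    resampling_method="cv",
    evaluation_metric="rmse",
    fold_num=5,
    search_strategy="grid",
    param_values=normalize_param_spec(
        [("factor_num", [2, 4, 8]), ("lamb", [1e-2, 1e-1])]
    ),
)

# With hana_ml available, the arguments would be passed to the constructor:
# from hana_ml.algorithms.pal.recommender import ALS
# als = ALS(**search_kwargs)
```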
- Attributes:
- metadata_DataFrame
Model metadata content.
- map_DataFrame
Map info.
- factors_DataFrame
Decomposed factors.
- optim_param_DataFrame
Optimal parameters selected.
- stats_DataFrame
Statistics.
- iter_info_DataFrame
Cost function value and RMSE of corresponding iterations.
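The per-iteration cost values relate to the tol and exit_interval parameters. The sketch below replays a list of cost values and reports the iteration at which the documented stopping rule would trigger. The function name stopping_iteration and the sample cost values are hypothetical, and treating the first check as a baseline with nothing to compare against is an assumption.

```python
def stopping_iteration(costs, max_iter=20, tol=0.0, exit_interval=5):
    """Return the iteration at which the loop would stop (sketch).

    costs[k] is the cost function value after iteration k + 1.
    """
    last_checked = None
    for it in range(1, max_iter + 1):
        # With tol == 0 no check is made; otherwise check every
        # exit_interval iterations.
        if tol > 0 and it % exit_interval == 0:
            current = costs[it - 1]
            # Exit if the cost has not decreased by more than tol
            # since the last check.
            if last_checked is not None and last_checked - current <= tol:
                return it
            last_checked = current
    return max_iter


# Hypothetical cost trace: decreasing at first, flat from iteration 10 on.
costs = [100, 50, 25, 12, 6, 3, 2, 1.5, 1.2, 1.0] + [1.0] * 10
print(stopping_iteration(costs, max_iter=20, tol=1e-6, exit_interval=5))  # 15
print(stopping_iteration(costs, max_iter=20, tol=0.0, exit_interval=5))   # 20
```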
Methods
create_model_state([model, function, ...])    Create PAL model state.
delete_model_state([state])    Delete PAL model state.
fit(data[, key, usr, item, feedback])    Fit the model to the training dataset.
predict(data[, key, usr, item, thread_ratio])    Prediction for the input data with the trained ALS model.
set_model_state(state)    Set the model state by state information.
- fit(data, key=None, usr=None, item=None, feedback=None)
Fit the model to the training dataset.
- Parameters:
- dataDataFrame
Data to be fitted for ALS model.
It provides the observed feedback of users for different items, thus should contain at least the following three columns:
the column for user names/IDs
the column for item names/IDs
the column for users' feedback values w.r.t. items
- keystr, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- usrstr, optional
Name of the user column.
Defaults to the first non-key column of the input data.
- itemstr, optional
Name of the item column.
Defaults to the first non-key and non-usr column of the input data.
- feedbackstr, optional
Name of the feedback column, where each value reflects the feedback (scoring) value of a user w.r.t. an item.
Defaults to the last column of the input data.
- Returns:
- A fitted object of class "ALS".
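The documented defaults for usr, item and feedback can be traced with a small helper. The function resolve_fit_columns is hypothetical and is not part of hana_ml; it only restates the default rules listed above for a plain list of column names.

```python
def resolve_fit_columns(columns, key=None, usr=None, item=None, feedback=None):
    """Apply the documented column defaults of fit() to a list of names."""
    remaining = [c for c in columns if c != key]
    if usr is None:
        usr = remaining[0]  # first non-key column
    if item is None:
        item = next(c for c in remaining if c != usr)  # first non-key, non-usr column
    if feedback is None:
        feedback = columns[-1]  # last column of the input data
    return usr, item, feedback


print(resolve_fit_columns(["ID", "USER", "MOVIE", "FEEDBACK"], key="ID"))
# ('USER', 'MOVIE', 'FEEDBACK')
```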
- predict(data, key=None, usr=None, item=None, thread_ratio=None)
Prediction for the input data with the trained ALS model.
- Parameters:
- dataDataFrame
Data to be predicted, structured like the input data for fit but without the feedback column.
- keystr, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- usrstr, optional
Name of the user column.
Defaults to the first non-key column of the input data.
- itemstr, optional
Name of the item column.
Defaults to the first non-key and non-usr column of the input data.
- thread_ratiofloat, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
- Returns:
- DataFrame
Prediction result for the missing values (e.g. user feedback) in the input data, structured as follows:
1st column : Data ID
2nd column : User name/ID
3rd column : Item name/ID
4th column : Predicted feedback values
- create_model_state(model=None, function=None, pal_funcname='PAL_ALS', state_description=None, force=False)
Create PAL model state.
- Parameters:
- modelDataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- functionstr, optional
Specify the function in the unified API.
A placeholder parameter, not effective for ALS.
- pal_funcnameint or str, optional
PAL function name.
Defaults to 'PAL_ALS'.
- state_descriptionstr, optional
Description of the state as model container.
Defaults to None.
- forcebool, optional
If True it will delete the existing state.
Defaults to False.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is DataFrame, it has the following structure:
NAME: VARCHAR(100), it must contain STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values corresponding to NAME.
If state is dict, its keys must include STATE_ID, HINT, HOST and PORT.
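A dict passed as state can be checked for the required keys before use. The helper missing_state_keys below is hypothetical, and the values are placeholders rather than real state information.

```python
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [k for k in REQUIRED_STATE_KEYS if k not in state]


# Placeholder values only; real values come from an existing model state.
state = {"STATE_ID": "<state-id>", "HINT": "<hint>",
         "HOST": "<host>", "PORT": "<port>"}
print(missing_state_keys(state))               # []
print(missing_state_keys({"STATE_ID": "<state-id>"}))  # ['HINT', 'HOST', 'PORT']
```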
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- stateDataFrame, optional
Specifies the state.
Defaults to self.state.
Inherited Methods from PALBase
Besides the methods mentioned above, the ALS class also inherits methods from the PALBase class; refer to PAL Base for more details.