OnlineMultiLogisticRegression
- class hana_ml.algorithms.pal.linear_model.OnlineMultiLogisticRegression(class_label, init_learning_rate=None, decay=None, drop_rate=None, step_boundaries=None, constant_values=None, enet_alpha=None, enet_lambda=None, shuffle=None, shuffle_seed=None, weight_avg=None, weight_avg_begin=None, learning_rate_type=None, general_learning_rate=None, stair_case=None, cycle=None, epsilon=None, window_size=None)
This algorithm is the online version of Multi-Class Logistic Regression; the standard Multi-Class Logistic Regression is the offline (batch) version. The difference lies in the training phase: the offline/batch algorithm requires all training data to be fed in as a single batch, and then outputs one model that best fits that data. This implies that all data can be obtained in one batch and that the computer has enough memory to store it. The online version applies to scenarios where either or both of these assumptions do not hold.
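To make the online/batch distinction concrete, here is a minimal plain-Python sketch of a softmax classifier that is updated one small batch at a time with plain SGD. It is a toy under stated assumptions, not PAL's implementation: the class name TinyOnlineSoftmax, its fixed learning rate, and the sample data are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyOnlineSoftmax:
    """Hypothetical toy model: multi-class logistic regression trained by SGD."""

    def __init__(self, class_label, n_features, learning_rate=0.1):
        self.class_label = list(class_label)
        self.learning_rate = learning_rate
        # One weight vector per class; index 0 holds the bias term.
        self.weights = [[0.0] * (n_features + 1) for _ in self.class_label]

    def _probabilities(self, x):
        row = [1.0] + list(x)
        scores = [sum(w * v for w, v in zip(ws, row)) for ws in self.weights]
        return softmax(scores)

    def partial_fit(self, rows, labels):
        """Update the model with one batch; earlier batches are not needed again."""
        for x, y in zip(rows, labels):
            row = [1.0] + list(x)
            probs = self._probabilities(x)
            for k, name in enumerate(self.class_label):
                error = probs[k] - (1.0 if name == y else 0.0)
                for j, value in enumerate(row):
                    self.weights[k][j] -= self.learning_rate * error * value
        return self

    def predict(self, rows):
        result = []
        for x in rows:
            probs = self._probabilities(x)
            result.append(self.class_label[probs.index(max(probs))])
        return result


# Two small batches arrive repeatedly; the model never sees them all at once.
model = TinyOnlineSoftmax(['0', '1'], n_features=1)
for _ in range(20):
    model.partial_fit([(-2.0,), (2.0,)], ['0', '1'])
    model.partial_fit([(-1.0,), (1.0,)], ['0', '1'])
print(model.predict([(-1.5,), (1.5,)]))
```

Each call to partial_fit only touches the rows it is given, so memory use is bounded by the batch size rather than by the total amount of training data.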
- Parameters:
- class_label : a list of str
The class labels. At least two class labels must be provided.
- init_learning_rate : float
The initial learning rate for the learning rate schedule. The value should be larger than 0. Only valid when learning_rate_type is 'Inverse_time_decay', 'Exponential_decay', or 'Polynomial_decay'.
- decay : float
Specifies the learning rate decay speed for the learning rate schedule. A larger value indicates faster decay. The value should be larger than 0. When learning_rate_type is 'Exponential_decay', the value should be larger than 1. Only valid when learning_rate_type is 'Inverse_time_decay', 'Exponential_decay', or 'Polynomial_decay'.
- drop_rate : int
Specifies the decay frequency. It has an apparent effect when stair_case is True. The value should be larger than 0. Only valid when learning_rate_type is 'Inverse_time_decay', 'Exponential_decay', or 'Polynomial_decay'.
- step_boundaries : list of int, optional
Specifies the step boundaries for regions where the step size remains constant. The format of this parameter is a list of integers.
The step value starts from 0 (which does not need to be specified), and the values should be in ascending order (e.g. [5, 8, 15, 23]).
An empty value for this parameter is allowed.
Only valid when learning_rate_type is 'Piecewise_constant_decay'.
- constant_values : list of float, optional
Specifies the constant values for each region defined by step_boundaries. The format of this parameter is a list of float numbers. There should always be one more value than in step_boundaries, since n boundary points define n+1 regions in total. Only valid when learning_rate_type is 'Piecewise_constant_decay'.
- enet_alpha : float, optional
Elastic-Net mixing parameter. The valid range is [0, 1]. A value of 0 means Ridge penalty; a value of 1 means Lasso penalty.
Only valid when enet_lambda is not 0.0.
Defaults to 1.0.
- enet_lambda : float, optional
Penalized constant. The value should be larger than or equal to 0.0. The higher the value, the stronger the regularization. When it is equal to 0.0, there is no regularization.
Defaults to 0.0.
- shuffle : bool, optional
Indicates whether to shuffle the row order of the observation data. False keeps the original order; True performs the shuffle operation.
Defaults to False.
- shuffle_seed : int, optional
The seed used to initialize the random generator for the shuffle operation. The value should be larger than or equal to 0. To reproduce the result when shuffling, set this value to non-zero. Only valid when shuffle is True.
Defaults to 0.
- weight_avg : bool, optional
Indicates whether to apply an averaging operator to the output model. False outputs the model directly; True applies the averaging operator to the output model. Currently only Polyak-Ruppert averaging is supported.
Defaults to False.
- weight_avg_begin : int, optional
Specifies the step counter at which the averaging operator starts to be applied to the model. The value should be larger than or equal to 0. While the current step counter is less than this value, the model is output directly. Only valid when weight_avg is True.
Defaults to 0.
- learning_rate_type : str, optional
Specifies the learning rate type for the SGD algorithm. Valid options are:
'Inverse_time_decay'
'Exponential_decay'
'Polynomial_decay'
'Piecewise_constant_decay'
'AdaGrad'
'AdaDelta'
'RMSProp'
Defaults to 'RMSProp'.
- general_learning_rate : float, optional
Specifies the general learning rate used in AdaGrad and RMSProp. The value should be larger than 0. Only valid when learning_rate_type is 'AdaGrad' or 'RMSProp'.
Defaults to 0.001.
- stair_case : bool, optional
Indicates how the step size drops. False means the step size drops smoothly. Only valid when learning_rate_type is 'Inverse_time_decay' or 'Exponential_decay'.
Defaults to False.
- cycle : bool, optional
Indicates whether to cycle from the start when the specified end learning rate is reached. False means do not cycle from the start; True means cycle from the start.
Only valid when learning_rate_type is 'Polynomial_decay'.
Defaults to False.
- epsilon : float, optional
This parameter serves different purposes depending on the learning rate type. The value should be within (0, 1). When learning_rate_type is 'Inverse_time_decay' or 'Exponential_decay' (learning rate types 0 and 1), it represents the smallest allowable step size; once the step size reaches this value, it no longer changes. When learning_rate_type is 'Polynomial_decay', it represents the end learning rate. When learning_rate_type is 'AdaGrad', 'AdaDelta', or 'RMSProp', it is used to avoid division by 0. Only valid when learning_rate_type is not 'Piecewise_constant_decay'.
Defaults to 1E-8.
- window_size : float, optional
This parameter controls the moving window size of recent steps. The value should be in the range (0, 1). A larger value means more steps are kept track of. Only valid when learning_rate_type is 'AdaDelta' or 'RMSProp'.
Defaults to 0.9.
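The relationship between step_boundaries and constant_values (n boundaries, n+1 values) can be illustrated with a small standalone lookup. This is only a sketch: the helper name piecewise_constant_rate and the rate values are hypothetical, and the treatment of a step that falls exactly on a boundary (counted here as part of the next region) is an assumption rather than documented PAL behavior.

```python
from bisect import bisect_right


def piecewise_constant_rate(step, step_boundaries, constant_values):
    """Return the learning rate that applies at `step`."""
    if len(constant_values) != len(step_boundaries) + 1:
        raise ValueError("constant_values needs one more entry than step_boundaries")
    if list(step_boundaries) != sorted(step_boundaries):
        raise ValueError("step_boundaries must be in ascending order")
    # Assumption: a step equal to a boundary already belongs to the next region.
    region = bisect_right(step_boundaries, step)
    return constant_values[region]


boundaries = [5, 8, 15, 23]                 # four boundaries ...
rates = [0.1, 0.05, 0.01, 0.005, 0.001]     # ... define five regions
for step in (0, 4, 5, 20, 100):
    print(step, piecewise_constant_rate(step, boundaries, rates))
```

An empty step_boundaries list paired with a single constant value degenerates to a fixed learning rate, which matches the note that an empty value is allowed.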
Examples
>>> omlr = OnlineMultiLogisticRegression(class_label=['0','1','2'], enet_lambda=0.01, enet_alpha=0.2, weight_avg=True, weight_avg_begin=8, learning_rate_type='rmsprop', general_learning_rate=0.1, window_size=0.9, epsilon=1e-6)
In each round, you can invoke partial_fit() to train the model with a new DataFrame. Using df_1 as an example:
>>> omlr.partial_fit(data=df_1, label='Y', features=['X1', 'X2'])
Output:
>>> omlr.coef_.collect()
>>> omlr.online_result_.collect()
Perform predict():
>>> omlr.predict(data=df_predict).collect()
Perform score():
>>> omlr.score(data=df_score)
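Since the model is meant to be trained round by round, the partial_fit() call above would typically sit inside a loop over incoming DataFrames. The sketch below shows that pattern with a hypothetical helper, train_in_rounds, and a hypothetical stand-in class, RecordingModel, so that it runs without a HANA connection; with a real connection, the omlr object and actual DataFrames would take their place.

```python
def train_in_rounds(model, batches, label, features):
    """Call model.partial_fit once per batch, in arrival order."""
    rounds = 0
    for batch in batches:
        model.partial_fit(data=batch, label=label, features=features)
        rounds += 1
    return rounds


class RecordingModel:
    """Hypothetical stand-in that only records what partial_fit receives."""

    def __init__(self):
        self.seen = []

    def partial_fit(self, data, key=None, features=None, label=None):
        self.seen.append((data, label, tuple(features)))
        return self


# Strings stand in for DataFrames such as df_1, df_2, df_3.
stub = RecordingModel()
n_rounds = train_in_rounds(stub, ["df_1", "df_2", "df_3"],
                           label="Y", features=["X1", "X2"])
print(n_rounds, stub.seen[0])
```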
- Attributes:
- coef_ : DataFrame
Values of the coefficients.
- online_result_ : DataFrame
Online model content.
Methods
- partial_fit(data[, key, features, label, ...]): Online training based on each round of data.
- predict(data[, key, features]): Predict dependent variable values based on a fitted model.
- score(data[, key, features, label]): Returns the coefficient of determination R2 of the prediction.
- partial_fit(data, key=None, features=None, label=None, thread_ratio=None, progress_indicator_id=None)
Online training based on each round of data.
- Parameters:
- data : DataFrame
Training data.
- key : str, optional
Name of the ID column. If key is not provided, then:
if data is indexed by a single column, key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features : a list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable.
If label is not provided, it defaults to the last column.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.0.
- progress_indicator_id : str, optional
The ID of the progress indicator for model evaluation/parameter selection.
The progress indicator is deactivated if no value is provided.
- Returns:
- A fitted object of class "OnlineMultiLogisticRegression".
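The defaulting rules for key, features, and label described above can be summarized in a few lines of plain Python. This is only an illustration of the documented behavior, not hana_ml's own code; the helper resolve_columns and its index argument (standing in for the DataFrame's index column or columns) are hypothetical.

```python
def resolve_columns(columns, index=None, key=None, features=None, label=None):
    """Apply the documented defaults for key, features and label."""
    columns = list(columns)
    if key is None:
        if isinstance(index, str):
            key = index                    # single index column becomes the ID column
        elif index is not None and len(index) == 1:
            key = index[0]
        # otherwise: no index or a multi-column index, so no ID column is assumed
    if label is None:
        label = columns[-1]                # defaults to the last column
    if features is None:
        # defaults to all non-ID, non-label columns
        features = [c for c in columns if c != key and c != label]
    return key, features, label


print(resolve_columns(["ID", "X1", "X2", "Y"], index="ID"))
print(resolve_columns(["X1", "X2", "Y"]))
```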
- predict(data, key=None, features=None)
Predict dependent variable values based on a fitted model.
- Parameters:
- data : DataFrame
Independent variable values to predict for.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- Returns:
- DataFrame
Predicted values, structured as follows:
ID column: with the same name and type as data's ID column.
VALUE: type DOUBLE, representing predicted values.
- score(data, key=None, features=None, label=None)
Returns the coefficient of determination R2 of the prediction.
- Parameters:
- data : DataFrame
Data on which to assess model performance.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable.
If label is not provided, it defaults to the last column.
- Returns:
- float
Returns the coefficient of determination R2 of the prediction.
Inherited Methods from PALBase
In addition to the methods described above, the OnlineMultiLogisticRegression class inherits methods from the PALBase class. Please refer to PAL Base for more details.