OnlineMultiLogisticRegression

class hana_ml.algorithms.pal.linear_model.OnlineMultiLogisticRegression(class_label, init_learning_rate=None, decay=None, drop_rate=None, step_boundaries=None, constant_values=None, enet_alpha=None, enet_lambda=None, shuffle=None, shuffle_seed=None, weight_avg=None, weight_avg_begin=None, learning_rate_type=None, general_learning_rate=None, stair_case=None, cycle=None, epsilon=None, window_size=None)

This algorithm is the online version of Multi-Class Logistic Regression; the standard Multi-Class Logistic Regression is the offline (batch) version. The difference lies in the training phase: the batch algorithm requires all training data to be fed in at once, and then produces a single model that best fits that data. This implies that the machine has enough memory to hold all the data and that all the data is available at the same time. The online version is intended for scenarios where either or both of these assumptions fail.
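
To make the contrast with batch training concrete, here is a minimal pure-Python sketch of online training. The class `TinyOnlineSoftmax` and its toy data are hypothetical and are not PAL's implementation; the sketch only assumes plain SGD on a softmax model, where each round of data updates the weights and can then be discarded.

```python
import math


class TinyOnlineSoftmax:
    """Hypothetical stand-in for an online multi-class logistic model (not PAL's code)."""

    def __init__(self, n_features, classes, learning_rate=0.1):
        self.classes = list(classes)
        self.lr = learning_rate
        # One weight row per class: [intercept, w_1, ..., w_n].
        self.weights = [[0.0] * (n_features + 1) for _ in self.classes]

    def _probabilities(self, x):
        scores = [w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) for w in self.weights]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def partial_fit(self, rows, labels):
        """Update the weights from one round of data; the rows are not stored."""
        for x, y in zip(rows, labels):
            probs = self._probabilities(x)
            for k, cls in enumerate(self.classes):
                error = probs[k] - (1.0 if cls == y else 0.0)
                self.weights[k][0] -= self.lr * error
                for j, xj in enumerate(x):
                    self.weights[k][j + 1] -= self.lr * error * xj
        return self

    def predict(self, x):
        probs = self._probabilities(x)
        return self.classes[probs.index(max(probs))]


# Two rounds of made-up data arriving separately.
model = TinyOnlineSoftmax(n_features=2, classes=["0", "1"])
round_1 = ([[2.0, 0.0], [-2.0, 0.0]], ["0", "1"])
round_2 = ([[1.5, 0.5], [-1.5, -0.5]], ["0", "1"])
for rows, labels in (round_1, round_2):
    model.partial_fit(rows, labels)
```

Only the weights persist between rounds, which is why memory use does not grow with the total amount of training data.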

Parameters

class_label : list of str

The class labels. At least two class labels are required.

init_learning_rate : float

The initial learning rate for the learning rate schedule. The value should be larger than 0.

Only valid when learning_rate_type is 'Inverse_time_decay', 'Exponential_decay' or 'Polynomial_decay'.

decay : float

Specifies the learning rate decay speed for the learning rate schedule. A larger value indicates faster decay. The value should be larger than 0. When learning_rate_type is 'Exponential_decay', the value should be larger than 1.

Only valid when learning_rate_type is 'Inverse_time_decay', 'Exponential_decay' or 'Polynomial_decay'.

drop_rate : int

Specifies the decay frequency. Its effect is apparent when stair_case is True. The value should be larger than 0.

Only valid when learning_rate_type is 'Inverse_time_decay', 'Exponential_decay' or 'Polynomial_decay'.

step_boundaries : list of int, optional

Specifies the step boundaries of the regions in which the step size remains constant. The format of this parameter is a list of integers.

The step value starts from 0 (which does not need to be specified), and the values should be in ascending order (e.g. [5, 8, 15, 23]).

Empty value for this parameter is allowed.

Only valid when learning_rate_type is 'Piecewise_constant_decay'.

constant_values : list of float, optional

Specifies the constant values for each region defined by step_boundaries. The format of this parameter is a list of floats.

There should always be one more value than in step_boundaries, since n boundary points define n+1 regions.

Only valid when learning_rate_type is 'Piecewise_constant_decay'.
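
The relationship between step_boundaries and constant_values can be sketched with a small lookup function. This is only an illustration: `piecewise_constant_rate` is a hypothetical helper, the rate values are made up, and it assumes that a step equal to a boundary already belongs to the region that starts at that boundary (the text above does not say which side a boundary step falls on).

```python
from bisect import bisect_right


def piecewise_constant_rate(step, step_boundaries, constant_values):
    """Return the constant learning rate of the region that contains `step`."""
    if len(constant_values) != len(step_boundaries) + 1:
        raise ValueError("constant_values needs one more entry than step_boundaries")
    # bisect_right counts how many boundaries are <= step, which is the region index
    # under the assumption that a boundary step opens the next region.
    return constant_values[bisect_right(step_boundaries, step)]


boundaries = [5, 8, 15, 23]                # n = 4 boundary points
rates = [0.1, 0.05, 0.02, 0.01, 0.005]     # n + 1 = 5 regions (made-up values)
first_region = piecewise_constant_rate(0, boundaries, rates)
last_region = piecewise_constant_rate(100, boundaries, rates)
```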

enet_alpha : float, optional

Elastic-Net mixing parameter. The valid range is [0, 1]. A value of 0 gives the Ridge penalty; a value of 1 gives the Lasso penalty.

Only valid when enet_lambda is not 0.0.

Defaults to 1.0.

enet_lambda : float, optional

Penalty constant. The value should be larger than or equal to 0.0. The higher the value, the stronger the regularization. When it is equal to 0.0, there is no regularization.

Defaults to 0.0.
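
How enet_alpha and enet_lambda interact can be illustrated with the commonly used Elastic-Net penalty, lambda * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2). This form is an assumption taken from the usual textbook definition; the exact scaling PAL applies is not stated here, and `elastic_net_penalty` is a hypothetical helper.

```python
def elastic_net_penalty(coefficients, enet_lambda=0.0, enet_alpha=1.0):
    """Assumed Elastic-Net penalty; not necessarily PAL's exact scaling."""
    l1 = sum(abs(w) for w in coefficients)   # Lasso part
    l2 = sum(w * w for w in coefficients)    # Ridge part
    return enet_lambda * (enet_alpha * l1 + 0.5 * (1.0 - enet_alpha) * l2)


w = [1.0, -2.0]
lasso_only = elastic_net_penalty(w, enet_lambda=0.1, enet_alpha=1.0)  # pure L1
ridge_only = elastic_net_penalty(w, enet_lambda=0.1, enet_alpha=0.0)  # pure L2
no_penalty = elastic_net_penalty(w, enet_lambda=0.0, enet_alpha=0.5)  # lambda = 0 disables it
```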

shuffle : bool, optional

Indicates whether to shuffle the row order of the observation data. False keeps the original order; True performs the shuffle.

Defaults to False.

shuffle_seed : int, optional

The seed used to initialize the random generator for the shuffle operation. The value should be larger than or equal to 0. To reproduce results when shuffling, set this value to a non-zero number. Only valid when shuffle is True.

Defaults to 0.
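
The effect of shuffle and shuffle_seed can be mimicked in plain Python. The helper `shuffled_rows` below is hypothetical, and it assumes that a seed of 0 means "do not fix the seed", which is one reading of the advice above to use a non-zero value for reproducibility.

```python
import random


def shuffled_rows(rows, shuffle=False, shuffle_seed=0):
    """Return rows in original or shuffled order (illustrative only)."""
    if not shuffle:
        return list(rows)
    # Assumption: seed 0 leaves the generator unseeded, any other value fixes it.
    rng = random.Random(shuffle_seed) if shuffle_seed != 0 else random.Random()
    result = list(rows)
    rng.shuffle(result)
    return result


rows = list(range(10))
run_a = shuffled_rows(rows, shuffle=True, shuffle_seed=42)
run_b = shuffled_rows(rows, shuffle=True, shuffle_seed=42)
```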

weight_avg : bool, optional

Indicates whether to apply an averaging operator to the output model. False outputs the model directly; True outputs the averaged model. Currently only Polyak-Ruppert averaging is supported.

Defaults to False.

weight_avg_begin : int, optional

Specifies the step counter at which averaging of the model begins. The value should be larger than or equal to 0. While the current step counter is less than this value, the model is output directly. Only valid when weight_avg is True.

Defaults to 0.
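
Polyak-Ruppert averaging with a delayed start can be sketched as a running mean over the weight vectors seen from step weight_avg_begin onwards. `PolyakRuppertAverager` is a hypothetical class written for illustration; how PAL counts steps internally is not described here.

```python
class PolyakRuppertAverager:
    """Running average of weight vectors, starting at a given step (illustrative)."""

    def __init__(self, weight_avg_begin=0):
        self.begin = weight_avg_begin
        self.count = 0
        self.average = None

    def update(self, step, weights):
        """Return the model to output after the SGD update at `step`."""
        if step < self.begin:
            return list(weights)          # before the start step: output the raw model
        self.count += 1
        if self.average is None:
            self.average = list(weights)
        else:
            # Incremental mean: avg += (w - avg) / count
            self.average = [a + (w - a) / self.count
                            for a, w in zip(self.average, weights)]
        return list(self.average)


averager = PolyakRuppertAverager(weight_avg_begin=2)
outputs = [averager.update(step, [w])
           for step, w in enumerate([1.0, 3.0, 4.0, 6.0, 8.0])]
```

In this toy run the first two outputs are the raw weights, and the remaining ones are the means of the weights seen since step 2.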

learning_rate_type : str, optional

Specifies the learning rate type for the SGD algorithm. Valid values are:

'Inverse_time_decay'

'Exponential_decay'

'Polynomial_decay'

'Piecewise_constant_decay'

'AdaGrad'

'AdaDelta'

'RMSProp'

Defaults to 'RMSProp'.

general_learning_rate : float, optional

Specifies the general learning rate used in AdaGrad and RMSProp. The value should be larger than 0.

Only valid when learning_rate_type is 'AdaGrad' or 'RMSProp'.

Defaults to 0.001.

stair_case : bool, optional

Indicates how the step size drops. False means the step size drops smoothly.

Only valid when learning_rate_type is 'Inverse_time_decay' or 'Exponential_decay'.

Defaults to False.
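
The interplay of init_learning_rate, decay, drop_rate and stair_case can be sketched as follows. The formulas are assumptions modelled on the schedules found in common machine-learning libraries, adjusted so that decay > 1 shrinks the exponential schedule as the parameter description requires; PAL's exact formulas are not given here, and both function names are hypothetical. The sketch also assumes that stair_case=True rounds step / drop_rate down, so the rate changes only once every drop_rate steps.

```python
import math


def inverse_time_decay(init_lr, decay, drop_rate, step, stair_case=False):
    """Assumed form: init_lr / (1 + decay * step / drop_rate)."""
    progress = step / drop_rate
    if stair_case:
        progress = math.floor(progress)   # hold the rate within each drop_rate window
    return init_lr / (1.0 + decay * progress)


def exponential_decay(init_lr, decay, drop_rate, step, stair_case=False):
    """Assumed form: init_lr * decay ** (-step / drop_rate), with decay > 1."""
    progress = step / drop_rate
    if stair_case:
        progress = math.floor(progress)
    return init_lr * decay ** (-progress)


smooth = inverse_time_decay(0.1, decay=0.5, drop_rate=10, step=5)
stepped = inverse_time_decay(0.1, decay=0.5, drop_rate=10, step=5, stair_case=True)
halved = exponential_decay(0.1, decay=2.0, drop_rate=10, step=10)
```

With these assumptions, the smooth schedule has already lowered the rate at step 5, while the staircase schedule still returns the initial rate until step 10.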

cycle : bool, optional

Indicates whether to cycle from the start when the specified end learning rate is reached. False means do not cycle; True means cycle from the start.

Only valid when learning_rate_type is 'Polynomial_decay'.

Defaults to False.

epsilon : float, optional

This parameter has multiple purposes depending on the learning rate type. The value should be within (0, 1). When learning_rate_type is 'Inverse_time_decay' or 'Exponential_decay', it is the smallest allowable step size: once the step size reaches this value, it no longer changes. When learning_rate_type is 'Polynomial_decay', it is the end learning rate. When learning_rate_type is 'AdaGrad', 'AdaDelta' or 'RMSProp', it is used to avoid division by 0.

Only valid when learning_rate_type is not 'Piecewise_constant_decay'.

Defaults to 1E-8.

window_size : float, optional

Controls the moving window size of recent steps. The value should be in the range (0, 1). A larger value means more steps are tracked.

Only valid when learning_rate_type is 'AdaDelta', 'RMSProp'.

Defaults to 0.9.
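
For orientation, the textbook RMSProp update shows where general_learning_rate, window_size and epsilon would enter. Mapping window_size onto the decay coefficient of the squared-gradient average is an assumption, and `rmsprop_step` is a hypothetical helper rather than PAL's code.

```python
import math


def rmsprop_step(weights, gradients, cache,
                 general_learning_rate=0.001, window_size=0.9, epsilon=1e-8):
    """One textbook RMSProp update; returns (new_weights, new_cache)."""
    # Exponential moving average of squared gradients; a larger window_size
    # gives more weight to older steps.
    new_cache = [window_size * c + (1.0 - window_size) * g * g
                 for c, g in zip(cache, gradients)]
    # epsilon keeps the denominator away from zero.
    new_weights = [w - general_learning_rate * g / (math.sqrt(c) + epsilon)
                   for w, g, c in zip(weights, gradients, new_cache)]
    return new_weights, new_cache


new_w, new_cache = rmsprop_step([1.0], [2.0], [0.0], general_learning_rate=0.1)
```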

Examples

First, initialize an online multi logistic regression instance:

>>> from hana_ml.algorithms.pal.linear_model import OnlineMultiLogisticRegression
>>> omlr = OnlineMultiLogisticRegression(class_label=['0','1','2'], enet_lambda=0.01,
...                                      enet_alpha=0.2, weight_avg=True,
...                                      weight_avg_begin=8, learning_rate_type='rmsprop',
...                                      general_learning_rate=0.1,
...                                      window_size=0.9, epsilon=1e-6)

Four rounds of data:

>>> df_1.collect()
         X1        X2    Y
1.160456 -0.079584  0.0
1.216722 -1.315348  2.0
1.018474 -0.600647  1.0
0.884580  1.546115  1.0
2.432160  0.425895  1.0
1.573506 -0.019852  0.0
1.285611 -2.004879  1.0
0.478364 -1.791279  2.0

>>> df_2.collect()
         X1        X2    Y
-1.799803  1.225313  1.0
0.552956 -2.134007  2.0
0.750153 -1.332960  2.0
2.024223 -1.406925  2.0
1.204173 -1.395284  1.0
1.745183  0.647891  0.0
1.406053  0.180530  0.0
1.880983 -1.627834  2.0

>>> df_3.collect()
         X1        X2    Y
1.860634 -2.474313  2.0
0.710662 -3.317885  2.0
1.153588  0.539949  0.0
1.297490 -1.811933  2.0
2.071784  0.351789  0.0
1.552456  0.550787  0.0
1.202615 -1.256570  2.0
-2.348316  1.384935  1.0

>>> df_4.collect()
         X1        X2    Y
-2.132380  1.457749  1.0
0.549665  0.174078  1.0
1.422629  0.815358  0.0
1.318544  0.062472  0.0
0.501686 -1.286537  1.0
1.541711  0.737517  1.0
1.709486 -0.036971  0.0
1.708367  0.761572  0.0

Round 1, invoke partial_fit() for training the model with df_1:

>>> omlr.partial_fit(df_1, label='Y', features=['X1', 'X2'])

Output:

>>> omlr.coef_.collect()
       VARIABLE_NAME CLASSLABEL  COEFFICIENT
__PAL_INTERCEPT__          0    -0.245137
__PAL_INTERCEPT__          1     0.112396
__PAL_INTERCEPT__          2    -0.236284
               X1          0    -0.189930
               X1          1     0.218920
               X1          2    -0.372500
               X2          0     0.279547
               X2          1     0.458214
               X2          2    -0.185378

>>> omlr.online_result_.collect()
   SEQUENCE                          UPDATED_SERIALIZED_RESULT
0         0  {"SGD":{"data":{"avg_feature_coefficient":[0.0...

Round 2, invoke partial_fit() for training the model with df_2:

>>> omlr.partial_fit(df_2, label='Y', features=['X1', 'X2'])

Output:

>>> omlr.coef_.collect()
        VARIABLE_NAME CLASSLABEL  COEFFICIENT
__PAL_INTERCEPT__          0    -0.359296
__PAL_INTERCEPT__          1     0.163218
__PAL_INTERCEPT__          2    -0.182423
               X1          0    -0.045149
               X1          1    -0.046508
               X1          2    -0.122690
               X2          0     0.420425
               X2          1     0.594954
               X2          2    -0.451050

>>> omlr.online_result_.collect()
   SEQUENCE                          UPDATED_SERIALIZED_RESULT
0         0  {"SGD":{"data":{"avg_feature_coefficient":[-0....

Round 3, invoke partial_fit() for training the model with df_3:

>>> omlr.partial_fit(df_3, label='Y', features=['X1', 'X2'])

Output:

>>> omlr.coef_.collect()
       VARIABLE_NAME CLASSLABEL  COEFFICIENT
__PAL_INTERCEPT__          0    -0.225687
__PAL_INTERCEPT__          1     0.031453
__PAL_INTERCEPT__          2    -0.173944
               X1          0     0.100580
               X1          1    -0.208257
               X1          2    -0.097395
               X2          0     0.628975
               X2          1     0.576544
               X2          2    -0.582955

>>> omlr.online_result_.collect()
   SEQUENCE                          UPDATED_SERIALIZED_RESULT
0         0  {"SGD":{"data":{"avg_feature_coefficient":[0.1...

Round 4, invoke partial_fit() for training the model with df_4:

>>> omlr.partial_fit(df_4, label='Y', features=['X1', 'X2'])

Output:

>>> omlr.coef_.collect()
      VARIABLE_NAME CLASSLABEL  COEFFICIENT
__PAL_INTERCEPT__          0    -0.204118
__PAL_INTERCEPT__          1     0.071965
__PAL_INTERCEPT__          2    -0.263698
               X1          0     0.239740
               X1          1    -0.326290
               X1          2    -0.139859
               X2          0     0.696389
               X2          1     0.590014
               X2          2    -0.643752

>>> omlr.online_result_.collect()
   SEQUENCE                          UPDATED_SERIALIZED_RESULT
0         0  {"SGD":{"data":{"avg_feature_coefficient":[0.2...

Data for prediction:

>>> df_predict.collect()
   ID   X1   X2
0   1  1.2  0.7
1   2  1.0 -2.0

Invoke predict():

>>> fitted = omlr.predict(df_predict, key='ID', features=['X1', 'X2'])
>>> fitted.collect()
   ID CLASS  PROBABILITY
0   1     0     0.539350
1   2     2     0.830026
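
As a sanity check, the predictions above can be reproduced by hand from the round-4 coef_ output, assuming the model applies a softmax over per-class linear scores (intercept + coefficient * feature). The helper `predict_row` is hypothetical; the coefficient values are copied from the table printed after round 4.

```python
import math

# class label -> (intercept, X1 coefficient, X2 coefficient), from the round-4 coef_ output
COEF = {
    "0": (-0.204118, 0.239740, 0.696389),
    "1": (0.071965, -0.326290, 0.590014),
    "2": (-0.263698, -0.139859, -0.643752),
}


def predict_row(x1, x2, coef=COEF):
    """Return (most likely class, its softmax probability) for one row."""
    scores = {c: b + w1 * x1 + w2 * x2 for c, (b, w1, w2) in coef.items()}
    top = max(scores.values())                       # for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    best = max(exps, key=exps.get)
    return best, exps[best] / total


row_1 = predict_row(1.2, 0.7)    # ID 1 in df_predict
row_2 = predict_row(1.0, -2.0)   # ID 2 in df_predict
```

Under that assumption, the computed class and probability for both rows agree with the CLASS and PROBABILITY columns above, up to the rounding of the printed coefficients.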

Attributes

coef_ (DataFrame): Values of the coefficients.
online_result_ (DataFrame): Online model content.

Methods

`partial_fit`(data[, key, features, label, ...])	Online training based on each round of data.
`predict`(data[, key, features])	Predict dependent variable values based on a fitted model.
`score`(data[, key, features, label])	Returns the coefficient of determination R2 of the prediction.

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

partial_fit(data, key=None, features=None, label=None, thread_ratio=None, progress_indicator_id=None)

Online training based on each round of data.

Parameters

data : DataFrame

Training data.

key : str, optional

Name of the ID column.

If key is not provided, then:

if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

If label is not provided, it defaults to the last column.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.0.

progress_indicator_id : str, optional

The ID of the progress indicator for model evaluation/parameter selection.

The progress indicator is deactivated if no value is provided.

Returns

OnlineMultiLogisticRegression: A fitted object.

predict(data, key=None, features=None)

Predict dependent variable values based on a fitted model.

Parameters

data : DataFrame

Independent variable values to predict for.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

Returns

DataFrame

Predicted values, structured as follows (matching the predict() output in the Examples section):

ID column: with the same name and type as data's ID column.

CLASS: the predicted class label.

PROBABILITY: the probability of the predicted class.

score(data, key=None, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters

data : DataFrame

Data on which to assess model performance.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

If label is not provided, it defaults to the last column.

Returns

float: Returns the coefficient of determination R2 of the prediction.

Inherited Methods from PALBase

Besides the methods mentioned above, the OnlineMultiLogisticRegression class also inherits methods from the PALBase class. Please refer to PAL Base for more details.