OnlineLinearRegression

class hana_ml.algorithms.pal.linear_model.OnlineLinearRegression(enet_lambda=None, enet_alpha=None, max_iter=None, tol=None)

Online linear regression (stateless) is an online version of linear regression, used when the training data are obtained in multiple rounds. Each round of training supplies additional data. By combining the current linear model with the data obtained in each round, online linear regression adapts the model to make its predictions as precise as possible.
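
One way to picture round-by-round training is to keep running sums of X^T X and X^T y and re-solve the normal equations after each round. The sketch below is only an illustration in NumPy, not PAL's actual algorithm: it ignores the elastic net penalty, and the class name OnlineOLS is hypothetical. The rows are copied from the example data on this page.

```python
import numpy as np

# Three rounds of (X1, X2, Y) rows, copied from the example data on this page.
rounds = [
    np.array([[7, 26, 130], [1, 29, 124], [11, 56, 262], [11, 31, 162]], dtype=float),
    np.array([[7, 52, 234], [11, 55, 258], [3, 71, 298], [1, 31, 132]], dtype=float),
    np.array([[2, 54, 227], [21, 47, 256], [1, 40, 168],
              [11, 66, 302], [10, 68, 307]], dtype=float),
]


class OnlineOLS:
    """Hypothetical helper: unpenalized least squares updated round by round."""

    def __init__(self, n_features):
        d = n_features + 1               # +1 for the intercept column
        self.xtx = np.zeros((d, d))      # running sum of X^T X
        self.xty = np.zeros(d)           # running sum of X^T y
        self.coef = np.zeros(d)          # [intercept, coefficients...]

    def partial_fit(self, X, y):
        Xb = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
        self.xtx += Xb.T @ Xb
        self.xty += Xb.T @ y
        # Solve the normal equations on the accumulated statistics.
        self.coef = np.linalg.lstsq(self.xtx, self.xty, rcond=None)[0]
        return self


model = OnlineOLS(n_features=2)
for block in rounds:
    model.partial_fit(block[:, :2], block[:, 2])

# Reference: a single batch fit on all rows at once.
all_rows = np.vstack(rounds)
X_all = np.column_stack([np.ones(len(all_rows)), all_rows[:, :2]])
batch_coef = np.linalg.lstsq(X_all, all_rows[:, 2], rcond=None)[0]
```

Because only the running sums are kept, earlier rounds never need to be revisited, and the result matches a batch fit on all rows. The example data satisfy Y = 5 + 3*X1 + 4*X2 exactly, so this unpenalized fit recovers (5, 3, 4); the coefficients shown in the Examples section are close to but not identical to these values, which is consistent with the nonzero enet_lambda used there.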

Note

hana-ml currently supports the stateless version of Online Linear Regression in SAP HANA Cloud. The stateful version available in SAP HANA SPS05/06 is not yet supported in hana-ml.

Parameters
enet_lambda : float, optional

Penalty weight. Value should be greater than or equal to 0.

Defaults to 0.

enet_alpha : float, optional

Elastic net mixing parameter. Ranges from 0 (Ridge penalty) to 1 (LASSO penalty), inclusive.

Defaults to 0.

max_iter : int, optional

Maximum number of iterations. Defaults to 1000.

tol : float, optional

Convergence threshold. Defaults to 1.0e-5.
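
To make the roles of enet_lambda and enet_alpha concrete, here is a minimal sketch of the conventional elastic net penalty. The exact objective PAL minimizes is not stated on this page, so the formula below (the common glmnet-style form) is an assumption, and the function name elastic_net_penalty is hypothetical.

```python
import numpy as np


def elastic_net_penalty(coef, enet_lambda, enet_alpha):
    """Conventional elastic net penalty (assumed form, not taken from PAL)."""
    coef = np.asarray(coef, dtype=float)
    l1 = np.sum(np.abs(coef))       # LASSO part, weighted by enet_alpha
    l2 = 0.5 * np.sum(coef ** 2)    # Ridge part, weighted by (1 - enet_alpha)
    return enet_lambda * (enet_alpha * l1 + (1.0 - enet_alpha) * l2)


w = [3.0, -4.0]
ridge_only = elastic_net_penalty(w, enet_lambda=0.1, enet_alpha=0.0)
lasso_only = elastic_net_penalty(w, enet_lambda=0.1, enet_alpha=1.0)
mixed = elastic_net_penalty(w, enet_lambda=0.1, enet_alpha=0.5)
```

Under this assumed form, enet_lambda=0 switches the penalty off entirely, while enet_alpha only decides how the penalty is split between the two norms.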

Examples

First, initialize an online linear regression instance:

>>> onlinelr = OnlineLinearRegression(enet_lambda=0.1,
...                                   enet_alpha=0.5,
...                                   max_iter=1200,
...                                   tol=1E-6)

Three rounds of data:

>>> df_1.collect()
  ID      Y    X1    X2
0  1  130.0   7.0  26.0
1  2  124.0   1.0  29.0
2  3  262.0  11.0  56.0
3  4  162.0  11.0  31.0
>>> df_2.collect()
   ID      Y    X1    X2
0   5  234.0   7.0  52.0
1   6  258.0  11.0  55.0
2   7  298.0   3.0  71.0
3   8  132.0   1.0  31.0
>>> df_3.collect()
   ID      Y    X1    X2
0   9  227.0   2.0  54.0
1  10  256.0  21.0  47.0
2  11  168.0   1.0  40.0
3  12  302.0  11.0  66.0
4  13  307.0  10.0  68.0

Round 1, invoke partial_fit() for training the model with df_1:

>>> onlinelr.partial_fit(df_1, key='ID', label='Y', features=['X1', 'X2'])

Output:

>>> onlinelr.coefficients_.collect()
       VARIABLE_NAME  COEFFICIENT_VALUE
0  __PAL_INTERCEPT__           5.076245
1                 X1           2.987277
2                 X2           4.000540
>>> onlinelr.intermediate_result_.collect()
  SEQUENCE                                 INTERMEDIATE_MODEL
0        0  {"algorithm":"batch_algorithm","batch_algorith...

Round 2, invoke partial_fit() for training the model with df_2:

>>> onlinelr.partial_fit(df_2, key='ID', label='Y', features=['X1', 'X2'])

Output:

>>> onlinelr.coefficients_.collect()
       VARIABLE_NAME  COEFFICIENT_VALUE
0  __PAL_INTERCEPT__           5.094444
1                 X1           2.988419
2                 X2           3.999563
>>> onlinelr.intermediate_result_.collect()
  SEQUENCE                                 INTERMEDIATE_MODEL
0        0  {"algorithm":"batch_algorithm","batch_algorith...

Round 3, invoke partial_fit() for training the model with df_3:

>>> onlinelr.partial_fit(df_3, key='ID', label='Y', features=['X1', 'X2'])

Output:

>>> onlinelr.coefficients_.collect()
       VARIABLE_NAME  COEFFICIENT_VALUE
0  __PAL_INTERCEPT__           5.073338
1                 X1           2.994118
2                 X2           3.999389
>>> onlinelr.intermediate_result_.collect()
  SEQUENCE                                 INTERMEDIATE_MODEL
0        0  {"algorithm":"batch_algorithm","batch_algorith...
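
The three rounds above repeat the same partial_fit() call with different data, so they can be written as a loop. The sketch below assumes only the partial_fit(data, key, features, label) signature documented on this page. The helper fit_in_rounds and the stand-in class RecordingModel are hypothetical; the stand-in exists only so the sketch runs without a database connection.

```python
def fit_in_rounds(model, batches, key, label, features):
    """Feed each batch to model.partial_fit() in order and return the model."""
    for batch in batches:
        model.partial_fit(batch, key=key, label=label, features=features)
    return model


class RecordingModel:
    """Hypothetical stand-in for OnlineLinearRegression that records its calls."""

    def __init__(self):
        self.calls = []

    def partial_fit(self, data, key=None, features=None, label=None, thread_ratio=None):
        self.calls.append((data, key, label, tuple(features)))
        return self


# Strings stand in for the DataFrames df_1, df_2 and df_3.
stub = fit_in_rounds(RecordingModel(), ["df_1", "df_2", "df_3"],
                     key="ID", label="Y", features=["X1", "X2"])
```

With a live connection, the same helper would presumably be called as fit_in_rounds(onlinelr, [df_1, df_2, df_3], key='ID', label='Y', features=['X1', 'X2']).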

Data to predict on, df_predict:

>>> df_predict.collect()
   ID    X1    X2
0  14     2    67
1  15     3    51

Invoke predict():

>>> fitted = onlinelr.predict(df_predict, key='ID', features=['X1', 'X2'])
>>> fitted.collect()
   ID       VALUE
0  14  279.020611
1  15  218.024511
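
As a sanity check, the predicted values can be reproduced by hand from the round 3 coefficients, assuming predict() evaluates the usual linear form intercept + coefficient * feature. The NumPy sketch below uses the coefficients as printed above, which are rounded to six decimals.

```python
import numpy as np

# Coefficients after round 3, as printed above (rounded to six decimals).
intercept = 5.073338
coef = np.array([2.994118, 3.999389])  # X1, X2

# Rows of df_predict: (X1, X2) for IDs 14 and 15.
X_new = np.array([[2.0, 67.0], [3.0, 51.0]])

# Linear prediction: intercept plus the weighted sum of the features.
predicted = intercept + X_new @ coef
```

The results agree with the VALUE column to about four decimal places; the small remaining difference comes from the rounding of the displayed coefficients.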

Call score():

>>> score = onlinelr.score(df_2, key='ID', label='Y', features=['X1', 'X2'])
>>> score
0.9999997918249237

Attributes
intermediate_result_ : DataFrame

Intermediate model.

coefficients_ : DataFrame

Fitted regression coefficients.

Methods

partial_fit(data[, key, features, label, ...])

Online training based on each round of training data.

predict(data[, key, features])

Predict dependent variable values based on a fitted model.

score(data[, key, features, label])

Returns the coefficient of determination R² of the prediction.

partial_fit(data, key=None, features=None, label=None, thread_ratio=None)

Online training based on each round of training data.

Parameters
data : DataFrame

Training data.

key : str, optional

Name of the ID column.

If key is not provided, then:

  • if data is indexed by a single column, then key defaults to that index column;

  • otherwise, it is assumed that data contains no ID column.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

If label is not provided, it defaults to the last column.

thread_ratio : float, optional

Controls the proportion of available threads to use.

Valid values range from 0 to 1, where 0 means using only one thread and 1 means using up to all currently available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.0.
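
The defaulting rules for key, features and label described above can be summarized in a few lines. The helper below, resolve_columns, is hypothetical and only mirrors the documented behavior; it is not hana-ml's implementation, and the index argument (a list of index column names) is an assumption made for illustration.

```python
def resolve_columns(columns, key=None, features=None, label=None, index=None):
    """Hypothetical helper mirroring the documented defaults; not hana-ml code."""
    if key is None and index is not None and len(index) == 1:
        key = index[0]           # a single index column becomes the ID column
    if label is None:
        label = columns[-1]      # the last column is the dependent variable
    if features is None:
        # every column that is neither the ID nor the label
        features = [c for c in columns if c not in (key, label)]
    return key, features, label


# Label in the last position: the defaults pick the intended columns.
tidy = resolve_columns(["ID", "X1", "X2", "Y"], key="ID")

# Column order of the example data on this page: the label is not last.
example_order = resolve_columns(["ID", "Y", "X1", "X2"], key="ID")
```

With the column order used in the Examples section (ID, Y, X1, X2), these defaults would treat X2 as the label and Y as a feature, which is why the example calls pass label='Y' and features=['X1', 'X2'] explicitly.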

Returns
OnlineLinearRegression

A fitted object.

predict(data, key=None, features=None)

Predict dependent variable values based on a fitted model.

Parameters
data : DataFrame

Independent variable values to predict for.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

Returns
DataFrame

Predicted values, structured as follows:

  • ID column: with the same name and type as data's ID column.

  • VALUE: type DOUBLE, representing predicted values.

score(data, key=None, features=None, label=None)

Returns the coefficient of determination R² of the prediction.

Parameters
data : DataFrame

Data on which to assess model performance.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

If label is not provided, it defaults to the last column.

Returns
float

The coefficient of determination R² of the prediction.
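
For reference, the coefficient of determination is conventionally defined as 1 - SS_res / SS_tot. The NumPy sketch below implements that textbook definition; the function name r2_score is chosen here for illustration, and whether PAL computes the score in exactly this way is an assumption.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [234.0, 258.0, 298.0, 132.0]                 # Y column of df_2
perfect = r2_score(y_true, y_true)                    # predictions equal the labels
mean_only = r2_score(y_true, [np.mean(y_true)] * 4)   # always predict the mean
```

A perfect predictor scores 1 and a predictor that always returns the mean of the labels scores 0, so the value near 1 in the Examples section indicates a very close fit on df_2.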

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods described above, the OnlineLinearRegression class also inherits methods from the PALBase class. Please refer to PAL Base for more details.