OnlineLinearRegression

class hana_ml.algorithms.pal.linear_model.OnlineLinearRegression(enet_lambda=None, enet_alpha=None, max_iter=None, tol=None)

Online linear regression (stateless) is an online version of linear regression, used when the training data arrive in multiple rounds. Additional data are obtained in each round of training. By combining the currently computed linear model with the data obtained in each round, online linear regression adapts the model to make its predictions as precise as possible.

Note

Only the stateless version of Online Linear Regression, available in SAP HANA Cloud, is currently supported. The stateful version available in SAP HANA SPS05/06 is not yet supported in hana-ml.
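The round-by-round idea can be illustrated without a database. The following is a minimal sketch, not the PAL algorithm: it assumes plain (unregularized) least squares and keeps running sums of X^T X and X^T y as its intermediate state, so each new round only adds to those sums. The helper names `new_state`, `update_state` and `solve` are hypothetical. The data are the first two rounds from the Examples section, which follow Y = 5 + 3*X1 + 4*X2 exactly, so the solution after each round should be close to (5, 3, 4).

```python
# Illustrative only: least squares with accumulated sufficient statistics.
# This is an assumption-based sketch, not the PAL implementation.


def new_state(n_features):
    """Return empty running sums for X^T X and X^T y (intercept included)."""
    d = n_features + 1
    return {"xtx": [[0.0] * d for _ in range(d)], "xty": [0.0] * d}


def update_state(state, rows):
    """Fold one round of (features, target) pairs into the running sums."""
    for features, target in rows:
        x = [1.0] + list(features)  # leading 1.0 models the intercept
        for i, xi in enumerate(x):
            state["xty"][i] += xi * target
            for j, xj in enumerate(x):
                state["xtx"][i][j] += xi * xj
    return state


def solve(state):
    """Solve the normal equations by Gaussian elimination with pivoting."""
    d = len(state["xty"])
    # Augmented matrix [X^T X | X^T y]; copied so the state is not modified.
    a = [row[:] + [rhs] for row, rhs in zip(state["xtx"], state["xty"])]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, d):
            factor = a[r][col] / a[col][col]
            for c in range(col, d + 1):
                a[r][c] -= factor * a[col][c]
    coef = [0.0] * d
    for i in reversed(range(d)):
        tail = sum(a[i][j] * coef[j] for j in range(i + 1, d))
        coef[i] = (a[i][d] - tail) / a[i][i]
    return coef  # [intercept, coefficient of X1, coefficient of X2]


# ((X1, X2), Y) pairs copied from df_1 and df_2 in the Examples section.
round_1 = [((7.0, 26.0), 130.0), ((1.0, 29.0), 124.0),
           ((11.0, 56.0), 262.0), ((11.0, 31.0), 162.0)]
round_2 = [((7.0, 52.0), 234.0), ((11.0, 55.0), 258.0),
           ((3.0, 71.0), 298.0), ((1.0, 31.0), 132.0)]

state = new_state(n_features=2)
for batch in (round_1, round_2):
    state = update_state(state, batch)  # only the sums are carried forward
    coefficients = solve(state)
    print(coefficients)
```

The state carried between rounds plays a role loosely comparable to `intermediate_result_`: earlier raw data do not need to be revisited. The coefficients PAL reports in the Examples section differ slightly from (5, 3, 4) because that example uses a non-zero `enet_lambda`.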

Parameters:

enet_lambda : float, optional

Penalized weight. Value should be greater than or equal to 0.

Defaults to 0.

enet_alpha : float, optional

Elastic net mixing parameter. Ranges from 0 (Ridge penalty) to 1 (LASSO penalty), inclusive.

Defaults to 0.

max_iter : int, optional

Maximum number of iterations. Defaults to 1000.

tol : float, optional

Convergence threshold. Defaults to 1.0e-5.
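To make the roles of `enet_lambda` and `enet_alpha` concrete, here is a small sketch of an elastic net penalty term. It assumes the common glmnet-style form, lambda * (alpha * L1 + (1 - alpha) / 2 * L2); the exact scaling used inside PAL is not stated in this section and may differ. The function name `elastic_net_penalty` is hypothetical.

```python
def elastic_net_penalty(coefficients, enet_lambda=0.0, enet_alpha=0.0):
    """Penalty added to the loss, assuming the glmnet-style elastic net form.

    enet_alpha = 0 gives a pure Ridge (squared L2) penalty,
    enet_alpha = 1 gives a pure LASSO (L1) penalty.
    """
    if enet_lambda < 0:
        raise ValueError("enet_lambda must be greater than or equal to 0")
    if not 0 <= enet_alpha <= 1:
        raise ValueError("enet_alpha must lie in [0, 1]")
    l1 = sum(abs(b) for b in coefficients)
    l2 = sum(b * b for b in coefficients)
    return enet_lambda * (enet_alpha * l1 + 0.5 * (1.0 - enet_alpha) * l2)


# Same coefficients, three different mixes of the two penalties.
betas = [3.0, -4.0]
print(elastic_net_penalty(betas, enet_lambda=0.1, enet_alpha=0.0))  # Ridge only
print(elastic_net_penalty(betas, enet_lambda=0.1, enet_alpha=1.0))  # LASSO only
print(elastic_net_penalty(betas, enet_lambda=0.1, enet_alpha=0.5))  # mixed
```

With `enet_lambda` left at its default of 0 the penalty vanishes and the fit reduces to ordinary least squares, regardless of `enet_alpha`.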

Examples

First, initialize an online linear regression instance:

>>> onlinelr = OnlineLinearRegression(enet_lambda=0.1,
                                      enet_alpha=0.5,
                                      max_iter=1200,
                                      tol=1E-6)

Three rounds of data:

>>> df_1.collect()
  ID      Y    X1    X2
0  1  130.0   7.0  26.0
1  2  124.0   1.0  29.0
2  3  262.0  11.0  56.0
3  4  162.0  11.0  31.0

>>> df_2.collect()
   ID      Y    X1    X2
0   5  234.0   7.0  52.0
1   6  258.0  11.0  55.0
2   7  298.0   3.0  71.0
3   8  132.0   1.0  31.0

>>> df_3.collect()
   ID      Y    X1    X2
0   9  227.0   2.0  54.0
1  10  256.0  21.0  47.0
2  11  168.0   1.0  40.0
3  12  302.0  11.0  66.0
4  13  307.0  10.0  68.0
13  307.0  10.0  68.0

Round 1, invoke partial_fit() for training the model with df_1:

>>> onlinelr.partial_fit(df_1, key='ID', label='Y', features=['X1', 'X2'])

Output:

>>> onlinelr.coefficients_.collect()
      VARIABLE_NAME   COEFFICIENT_VALUE
0  __PAL_INTERCEPT__           5.076245
1                 X1           2.987277
2                 X2           4.000540

>>> onlinelr.intermediate_result_.collect()
  SEQUENCE                                 INTERMEDIATE_MODEL
0        0  {"algorithm":"batch_algorithm","batch_algorith...

Round 2, invoke partial_fit() for training the model with df_2:

>>> onlinelr.partial_fit(df_2, key='ID', label='Y', features=['X1', 'X2'])

Output:

>>> onlinelr.coefficients_.collect()
       VARIABLE_NAME  COEFFICIENT_VALUE
0  __PAL_INTERCEPT__           5.094444
1                 X1           2.988419
2                 X2           3.999563

>>> onlinelr.intermediate_result_.collect()
  SEQUENCE                                 INTERMEDIATE_MODEL
0        0  {"algorithm":"batch_algorithm","batch_algorith...

Round 3, invoke partial_fit() for training the model with df_3:

>>> onlinelr.partial_fit(df_3, key='ID', label='Y', features=['X1', 'X2'])

Output:

>>> onlinelr.coefficients_.collect()
       VARIABLE_NAME  COEFFICIENT_VALUE
0  __PAL_INTERCEPT__           5.073338
1                 X1           2.994118
2                 X2           3.999389

>>> onlinelr.intermediate_result_.collect()
  SEQUENCE                                 INTERMEDIATE_MODEL
0        0  {"algorithm":"batch_algorithm","batch_algorith...

Data for prediction:

>>> df_predict.collect()
   ID    X1    X2
0  14     2    67
1  15     3    51

Invoke predict():

>>> fitted = onlinelr.predict(df_predict, key='ID', features=['X1', 'X2'])
>>> fitted.collect()
  ID       VALUE
0 14  279.020611
1 15  218.024511

Call score():

>>> score = onlinelr.score(df_2, key='ID', label='Y', features=['X1', 'X2'])
>>> score
0.9999997918249237

Attributes:

intermediate_result_ : DataFrame

Intermediate model.

coefficients_ : DataFrame

Fitted regression coefficients.

Methods

`partial_fit`(data[, key, features, label, ...])	Online training based on each round of training data.
`predict`(data[, key, features])	Predict dependent variable values based on a fitted model.
`score`(data[, key, features, label])	Returns the coefficient of determination R^2 of the prediction.

partial_fit(data, key=None, features=None, label=None, thread_ratio=None)

Online training based on each round of training data.

Parameters:

data : DataFrame

Training data.

key : str, optional

Name of the ID column.

If key is not provided, then:

if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

If label is not provided, it defaults to the last column.

thread_ratio : float, optional

Controls the proportion of available threads to use.

The value ranges from 0 to 1, where 0 means using only 1 thread and 1 means using at most all the currently available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.0.

Returns:

OnlineLinearRegression: A fitted object.
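The defaulting rules for key, features and label described above can be sketched as a small helper. This is an illustration of the documented rules only, not hana-ml code: the function `resolve_columns` and its `index` argument (standing in for a single index column of data, if any) are hypothetical, and it assumes that "the last column" means the last non-ID column.

```python
def resolve_columns(columns, key=None, features=None, label=None, index=None):
    """Apply the documented defaults for key, features and label.

    columns: column names of the training data, in order.
    index:   name of the single index column, or None if data is not
             indexed by exactly one column (hypothetical stand-in).
    """
    if key is None:
        key = index  # may remain None: data is assumed to have no ID column
    remaining = [c for c in columns if c != key]
    if label is None:
        label = remaining[-1]  # assumption: last non-ID column
    if features is None:
        features = [c for c in remaining if c != label]
    return key, features, label


columns = ["ID", "Y", "X1", "X2"]

# Relying on the defaults would pick X2 as the label for this layout ...
print(resolve_columns(columns, key="ID"))

# ... which is why the Examples section passes label='Y' explicitly.
print(resolve_columns(columns, key="ID", label="Y"))
```

For tables such as df_1, where the dependent variable is not the last column, passing label (and usually features) explicitly avoids training on the wrong target.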

predict(data, key=None, features=None)

Predict dependent variable values based on a fitted model.

Parameters:

data : DataFrame

Independent variable values to predict for.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

Returns:

DataFrame

Predicted values, structured as follows:

ID column: with the same name and type as the ID column of data.

VALUE: type DOUBLE, representing predicted values.
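For a linear model, each predicted value is the intercept plus the coefficient-weighted sum of the features. The sketch below recomputes the predictions from the Examples section by hand, using the Round 3 coefficients as printed there (rounded to six decimals). The helper `predict_row` is hypothetical and only mirrors the arithmetic, not the PAL procedure call.

```python
# Coefficients copied from coefficients_ after Round 3 in the Examples section.
coefficients = {
    "__PAL_INTERCEPT__": 5.073338,
    "X1": 2.994118,
    "X2": 3.999389,
}


def predict_row(row, coefficients):
    """Return intercept + sum(coefficient * feature value) for one row."""
    value = coefficients["__PAL_INTERCEPT__"]
    for name, coef in coefficients.items():
        if name != "__PAL_INTERCEPT__":
            value += coef * row[name]
    return value


# Rows copied from df_predict in the Examples section.
rows = [
    {"ID": 14, "X1": 2, "X2": 67},
    {"ID": 15, "X1": 3, "X2": 51},
]

predictions = {row["ID"]: predict_row(row, coefficients) for row in rows}
print(predictions)
```

The results are roughly 279.02 and 218.02, matching the VALUE column in the Examples section up to the rounding of the printed coefficients.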

score(data, key=None, features=None, label=None)

Returns the coefficient of determination R^2 of the prediction.

Parameters:

data : DataFrame

Data on which to assess model performance.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

If label is not provided, it defaults to the last column.

Returns:

float: The coefficient of determination R^2 of the prediction.
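As a cross-check on what score() reports, the coefficient of determination can be computed by hand. The sketch below assumes the standard definition R^2 = 1 - SS_res / SS_tot; that PAL uses exactly this definition is an assumption. The function `r2_score` is hypothetical, and the inputs are the df_2 rows and the rounded Round 3 coefficients from the Examples section.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# (X1, X2, Y) rows copied from df_2 in the Examples section.
df_2_rows = [(7.0, 52.0, 234.0), (11.0, 55.0, 258.0),
             (3.0, 71.0, 298.0), (1.0, 31.0, 132.0)]

# Round 3 coefficients as printed in the Examples section (rounded).
intercept, coef_x1, coef_x2 = 5.073338, 2.994118, 3.999389

actual = [y for _, _, y in df_2_rows]
predicted = [intercept + coef_x1 * x1 + coef_x2 * x2 for x1, x2, _ in df_2_rows]

r2 = r2_score(actual, predicted)
print(r2)
```

The value comes out close to the 0.99999979 shown for score() in the Examples section; small differences are expected because the coefficients used here are rounded.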

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods mentioned above, the OnlineLinearRegression class also inherits methods from the PALBase class. Please refer to PAL Base for more details.