OnlineLinearRegression
- class hana_ml.algorithms.pal.linear_model.OnlineLinearRegression(enet_lambda=None, enet_alpha=None, max_iter=None, tol=None)
Online linear regression (stateless) is an online version of linear regression, used when the training data are obtained in multiple rounds. Additional data are obtained in each round of training. By combining the currently computed linear model with the data obtained in each round, online linear regression adapts the model to make its predictions as precise as possible.
Note
We currently support Online Linear Regression (stateless) in SAP HANA Cloud. The stateful version of Online Linear Regression, available in SAP HANA SPS05/06, is not yet supported in hana-ml.
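The round-by-round idea described above can be illustrated outside of SAP HANA with a minimal pure-Python sketch. It assumes plain least squares without the elastic-net penalty, and keeps running sums of X^T X and X^T y so that each round only needs the newly obtained rows. The class RunningLeastSquares and the helper solve are hypothetical names used for illustration only; this is not how PAL implements the algorithm, and because the penalty is omitted, the coefficients differ slightly from those shown in the Examples below.

```python
# Illustrative sketch only, NOT the PAL implementation.
# All names here (solve, RunningLeastSquares, rounds) are hypothetical.

def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


class RunningLeastSquares:
    """Unpenalized least squares that is updated one round of data at a time."""

    def __init__(self, n_features):
        d = n_features + 1  # one extra column for the intercept
        self.xtx = [[0.0] * d for _ in range(d)]  # running X^T X
        self.xty = [0.0] * d                      # running X^T y
        self.coef = None                          # [intercept, w1, w2, ...]

    def partial_fit(self, rows, targets):
        # Fold the new rows into the running sums, then re-solve.
        for row, y in zip(rows, targets):
            z = [1.0] + list(row)
            for i, zi in enumerate(z):
                self.xty[i] += zi * y
                for j, zj in enumerate(z):
                    self.xtx[i][j] += zi * zj
        self.coef = solve(self.xtx, self.xty)
        return self


# The three rounds of (X1, X2) rows and Y values used in the Examples section.
rounds = [
    ([(7, 26), (1, 29), (11, 56), (11, 31)], [130, 124, 262, 162]),
    ([(7, 52), (11, 55), (3, 71), (1, 31)], [234, 258, 298, 132]),
    ([(2, 54), (21, 47), (1, 40), (11, 66), (10, 68)], [227, 256, 168, 302, 307]),
]

model = RunningLeastSquares(n_features=2)
history = []
for rows, targets in rounds:
    model.partial_fit(rows, targets)
    history.append(list(model.coef))  # coefficients after each round
```

Only the running sums are kept between rounds, which mirrors the role of an intermediate model: earlier rows never need to be revisited.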
- Parameters
- enet_lambda : float, optional
Penalized weight. Value should be greater than or equal to 0.
Defaults to 0.
- enet_alpha : float, optional
Elastic net mixing parameter. Ranges from 0 (Ridge penalty) to 1 (LASSO penalty) inclusively.
Defaults to 0.
- max_iter : int, optional
Maximum iterative cycle. Defaults to 1000.
- tol : float, optional
Convergence threshold. Defaults to 1.0e-5.
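To make the roles of enet_lambda and enet_alpha concrete, here is a small sketch of an elastic-net penalty term. It assumes the commonly used formulation lambda * (alpha * L1 + (1 - alpha) / 2 * L2), which may differ in scaling from what PAL uses internally; the function name elastic_net_penalty is hypothetical, and the intercept is assumed to be excluded from the penalty.

```python
# Illustrative sketch only. The exact penalty scaling used by PAL is an
# assumption here; this follows the widely used elastic-net convention.

def elastic_net_penalty(coefs, enet_lambda=0.0, enet_alpha=0.0):
    """Return lambda * (alpha * sum|w| + (1 - alpha) / 2 * sum(w^2)).

    coefs is assumed to hold the feature coefficients without the intercept.
    """
    if enet_lambda < 0:
        raise ValueError("enet_lambda must be greater than or equal to 0")
    if not 0 <= enet_alpha <= 1:
        raise ValueError("enet_alpha must lie in [0, 1]")
    l1 = sum(abs(w) for w in coefs)   # LASSO part
    l2 = sum(w * w for w in coefs)    # Ridge part
    return enet_lambda * (enet_alpha * l1 + 0.5 * (1.0 - enet_alpha) * l2)


weights = [3.0, -4.0]
ridge_only = elastic_net_penalty(weights, enet_lambda=0.1, enet_alpha=0.0)
lasso_only = elastic_net_penalty(weights, enet_lambda=0.1, enet_alpha=1.0)
mixed = elastic_net_penalty(weights, enet_lambda=0.1, enet_alpha=0.5)
```

With enet_alpha at 0 only the squared term contributes, with enet_alpha at 1 only the absolute-value term does, and enet_lambda scales the whole penalty.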
Examples
First, initialize an online linear regression instance:
>>> from hana_ml.algorithms.pal.linear_model import OnlineLinearRegression
>>> onlinelr = OnlineLinearRegression(enet_lambda=0.1, enet_alpha=0.5, max_iter=1200, tol=1E-6)
Three rounds of data:
>>> df_1.collect()
   ID      Y    X1    X2
0   1  130.0   7.0  26.0
1   2  124.0   1.0  29.0
2   3  262.0  11.0  56.0
3   4  162.0  11.0  31.0
>>> df_2.collect()
   ID      Y    X1    X2
0   5  234.0   7.0  52.0
1   6  258.0  11.0  55.0
2   7  298.0   3.0  71.0
3   8  132.0   1.0  31.0
>>> df_3.collect()
   ID      Y    X1    X2
0   9  227.0   2.0  54.0
1  10  256.0  21.0  47.0
2  11  168.0   1.0  40.0
3  12  302.0  11.0  66.0
4  13  307.0  10.0  68.0
Round 1, invoke partial_fit() for training the model with df_1:
>>> onlinelr.partial_fit(df_1, key='ID', label='Y', features=['X1', 'X2'])
Output:
>>> onlinelr.coefficients_.collect()
       VARIABLE_NAME  COEFFICIENT_VALUE
0  __PAL_INTERCEPT__           5.076245
1                 X1           2.987277
2                 X2           4.000540
>>> onlinelr.intermediate_result_.collect()
   SEQUENCE                                 INTERMEDIATE_MODEL
0         0  {"algorithm":"batch_algorithm","batch_algorith...
Round 2, invoke partial_fit() for training the model with df_2:
>>> onlinelr.partial_fit(df_2, key='ID', label='Y', features=['X1', 'X2'])
Output:
>>> onlinelr.coefficients_.collect()
       VARIABLE_NAME  COEFFICIENT_VALUE
0  __PAL_INTERCEPT__           5.094444
1                 X1           2.988419
2                 X2           3.999563
>>> onlinelr.intermediate_result_.collect()
   SEQUENCE                                 INTERMEDIATE_MODEL
0         0  {"algorithm":"batch_algorithm","batch_algorith...
Round 3, invoke partial_fit() for training the model with df_3:
>>> onlinelr.partial_fit(df_3, key='ID', label='Y', features=['X1', 'X2'])
Output:
>>> onlinelr.coefficients_.collect()
       VARIABLE_NAME  COEFFICIENT_VALUE
0  __PAL_INTERCEPT__           5.073338
1                 X1           2.994118
2                 X2           3.999389
>>> onlinelr.intermediate_result_.collect()
   SEQUENCE                                 INTERMEDIATE_MODEL
0         0  {"algorithm":"batch_algorithm","batch_algorith...
Data for prediction:
>>> df_predict.collect()
   ID  X1  X2
0  14   2  67
1  15   3  51
Invoke predict():
>>> fitted = onlinelr.predict(df_predict, key='ID', features=['X1', 'X2'])
>>> fitted.collect()
   ID       VALUE
0  14  279.020611
1  15  218.024511
Call score():
>>> score = onlinelr.score(df_2, key='ID', label='Y', features=['X1', 'X2'])
>>> score
0.9999997918249237
- Attributes
- intermediate_result_ : DataFrame
Intermediate model.
- coefficients_ : DataFrame
Fitted regression coefficients.
Methods
partial_fit(data[, key, features, label, ...]): Online training based on each round of training data.
predict(data[, key, features]): Predict dependent variable values based on a fitted model.
score(data[, key, features, label]): Returns the coefficient of determination R² of the prediction.
- partial_fit(data, key=None, features=None, label=None, thread_ratio=None)
Online training based on each round of training data.
- Parameters
- data : DataFrame
Training data.
- key : str, optional
Name of the ID column.
If key is not provided, then: if data is indexed by a single column, key defaults to that index column; otherwise, it is assumed that data contains no ID column.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable.
If label is not provided, it defaults to the last column.
- thread_ratio : float, optional
Controls the proportion of available threads to use.
The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.0.
- Returns
- OnlineLinearRegression
A fitted object.
- predict(data, key=None, features=None)
Predict dependent variable values based on a fitted model.
- Parameters
- data : DataFrame
Independent variable values to predict for.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- Returns
- DataFrame
Predicted values, structured as follows:
ID column: with the same name and type as data's ID column.
VALUE: type DOUBLE, representing predicted values.
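As a rough illustration of how the VALUE column relates to the fitted coefficients, the sketch below applies the round-3 coefficients from the Examples section to the two df_predict rows by hand. The helper predict_row is a hypothetical name, and the coefficients are the rounded values displayed in the Examples, so the results only approximate what predict() returns.

```python
# Illustrative sketch only: a linear model evaluated by hand.
# Coefficients are the rounded round-3 values shown in the Examples section.
coefficients = {
    "__PAL_INTERCEPT__": 5.073338,
    "X1": 2.994118,
    "X2": 3.999389,
}


def predict_row(row, coefficients, intercept_name="__PAL_INTERCEPT__"):
    """Return intercept + sum(coefficient * feature value) for one row."""
    value = coefficients[intercept_name]
    for name, coef in coefficients.items():
        if name != intercept_name:
            value += coef * row[name]
    return value


# The two rows of df_predict from the Examples section.
rows = [
    {"ID": 14, "X1": 2, "X2": 67},
    {"ID": 15, "X1": 3, "X2": 51},
]

# Map each ID to its hand-computed prediction.
predictions = {row["ID"]: predict_row(row, coefficients) for row in rows}
```

The hand-computed values agree with the documented predictions to within the rounding of the displayed coefficients.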
- score(data, key=None, features=None, label=None)
Returns the coefficient of determination R² of the prediction.
- Parameters
- data : DataFrame
Data on which to assess model performance.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable.
If label is not provided, it defaults to the last column.
- Returns
- float
The coefficient of determination R² of the prediction.
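For readers who want to see what the coefficient of determination measures, the following sketch computes R² as 1 - SS_res / SS_tot in plain Python. It reuses the df_2 rows and the rounded round-3 coefficients from the Examples section; the function name r2_score is a hypothetical helper, not part of hana-ml, and the result only approximates the value returned by score() because the coefficients are rounded.

```python
# Illustrative sketch only: R^2 = 1 - SS_res / SS_tot, computed by hand.

def r2_score(y_true, y_pred):
    """Return the coefficient of determination for paired true/predicted values."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot


# Rounded round-3 coefficients from the Examples section.
intercept, w1, w2 = 5.073338, 2.994118, 3.999389

# (X1, X2) rows and Y values of df_2 from the Examples section.
features = [(7.0, 52.0), (11.0, 55.0), (3.0, 71.0), (1.0, 31.0)]
y_true = [234.0, 258.0, 298.0, 132.0]

y_pred = [intercept + w1 * x1 + w2 * x2 for x1, x2 in features]
r2 = r2_score(y_true, y_pred)
```

A value close to 1 means the predictions explain nearly all of the variance in the label column.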
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.