PolynomialRegression

class hana_ml.algorithms.pal.regression.PolynomialRegression(degree=None, decomposition=None, adjusted_r2=None, pmml_export=None, thread_ratio=0.0, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, degree_values=None, degree_range=None)

Polynomial regression models the relationship between a scalar dependent variable y and a single independent variable X. The data is modeled with a polynomial function of X, and the unknown coefficients of that polynomial are estimated from the data. Such models are called polynomial models.

Parameters:

degree : int

Degree of the polynomial model.

decomposition : {'LU', 'QR', 'SVD', 'Cholesky'}, optional

Matrix factorization type to use. Case-insensitive.

'LU': LU decomposition.

'QR': QR decomposition.

'SVD': singular value decomposition.

'Cholesky': Cholesky (LDLT) decomposition.

Defaults to QR decomposition.

adjusted_r2 : bool, optional

If True, includes the adjusted R2 value in the statistics table.

Defaults to False.
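
For orientation, the sketch below computes adjusted R2 with the conventional textbook formula. It is an assumption that PAL uses this formula, since this page does not state it, and the function name `adjusted_r2` is hypothetical rather than part of hana_ml.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Conventional adjusted R2 (assumed here; not confirmed as PAL's exact formula)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("n_samples must exceed n_predictors + 1")
    # Penalize R2 for the number of predictors relative to the sample size.
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Assumption: for a polynomial of degree d, the predictors are x, x**2, ..., x**d,
# so n_predictors equals d.
example = adjusted_r2(0.9, n_samples=11, n_predictors=2)
```

Adjusted R2 is never larger than R2, and the gap widens as the degree approaches the number of training rows.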

pmml_export : {'no', 'single-row', 'multi-row'}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

'no' or not provided: No PMML model.

'single-row': Exports a PMML model in a maximum of one row. Fails if the model doesn't fit in one row.

'multi-row': Exports a PMML model, splitting it across multiple rows if it doesn't fit in one.

Prediction does not require a PMML model.

thread_ratio : float, optional

Controls the proportion of available threads to use for fitting.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

resampling_method : {'cv', 'bootstrap'}, optional

Specifies the resampling method for model evaluation/parameter selection.

If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.

Must be set together with evaluation_metric.

No default value.

evaluation_metric : {'rmse'}, optional

Specifies the evaluation metric for model evaluation or parameter selection.

Must be set together with resampling_method.

No default value.
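
As a reminder of what the 'rmse' metric measures, here is a minimal sketch of the standard root mean squared error. The helper name `rmse` is hypothetical, and the code illustrates the definition only, not how PAL computes it internally.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Two exact predictions and one that is off by 2.
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Lower values are better, so parameter selection under this metric favors the candidate with the smallest resampled RMSE.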

fold_num : int, optional

Specifies the number of folds for cross validation. Mandatory and valid only when resampling_method is set to 'cv'.

No default value.
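
To make the role of fold_num concrete, the sketch below partitions row positions into folds and yields train/test splits. It assumes a plain, unshuffled split; how PAL actually assigns rows to folds is not described on this page, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_rows, fold_num):
    """Yield (train_idx, test_idx) pairs for a plain, unshuffled k-fold split."""
    if not 2 <= fold_num <= n_rows:
        raise ValueError("fold_num must be between 2 and n_rows")
    # Spread rows as evenly as possible: the first (n_rows % fold_num) folds get one extra row.
    base, extra = divmod(n_rows, fold_num)
    start = 0
    for fold in range(fold_num):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_rows=5, fold_num=2))
```

Each row lands in exactly one test fold, so every observation is used for evaluation once per repetition.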

repeat_times : int, optional

Specifies the number of times resampling is repeated.

Defaults to 1.

search_strategy : {'grid', 'random'}, optional

Specifies the search method used for parameter selection. Setting this parameter activates parameter selection.

No default value.

random_search_times : int, optional

Specifies the number of times to randomly select candidate parameters for selection.

Mandatory and valid when search_strategy is set to 'random'.

No default value.

random_state : int, optional

Specifies the seed for random number generation. The system time is used when 0 is specified.

Defaults to 0.

timeout : int, optional

Specifies maximum running time for model evaluation or parameter selection, in seconds.

No timeout when 0 is specified.

Defaults to 0.

progress_indicator_id : str, optional

Sets the ID of the progress indicator for model evaluation or parameter selection.

No progress indicator is active if no value is provided.

No default value.

degree_values : list of int, optional

Specifies the candidate values of degree for parameter selection.

Only valid when search_strategy is specified.

No default value.

degree_range : list of int, optional

Specifies the range of degree values for parameter selection.

Only valid when search_strategy is specified.

No default value.
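
The dependencies between the model evaluation and parameter selection arguments above can be summarized as a small pre-flight check. This is a sketch of the rules as stated in the parameter descriptions; `check_selection_params` is a hypothetical helper and does not reproduce the validation hana_ml or PAL actually performs.

```python
def check_selection_params(resampling_method=None, evaluation_metric=None,
                           fold_num=None, search_strategy=None,
                           random_search_times=None,
                           degree_values=None, degree_range=None):
    """Return a list of rule violations; an empty list means the combination looks consistent."""
    problems = []
    # resampling_method and evaluation_metric must be set together.
    if (resampling_method is None) != (evaluation_metric is None):
        problems.append("resampling_method and evaluation_metric must be set together")
    # fold_num is mandatory and valid only for cross validation.
    if resampling_method == 'cv' and fold_num is None:
        problems.append("fold_num is mandatory when resampling_method is 'cv'")
    if resampling_method != 'cv' and fold_num is not None:
        problems.append("fold_num is valid only when resampling_method is 'cv'")
    # random_search_times is mandatory for random search.
    if search_strategy == 'random' and random_search_times is None:
        problems.append("random_search_times is mandatory when search_strategy is 'random'")
    # Candidate degrees are only meaningful when a search strategy is given.
    if search_strategy is None and (degree_values is not None or degree_range is not None):
        problems.append("degree_values and degree_range are only valid when search_strategy is specified")
    return problems


issues = check_selection_params(resampling_method='cv')
```

In this example `issues` contains two entries, one for the missing evaluation_metric and one for the missing fold_num.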

Examples

Training data (based on \(y = x^3 - 2x^2 + 3x + 5\), with noise):

>>> df.collect()
   ID    X       Y
0   1  0.0   5.048
1   2  1.0   7.045
2   3  2.0  11.003
3   4  3.0  23.072
4   5  4.0  49.041

Training the model:

>>> pr = PolynomialRegression(degree=3)
>>> pr.fit(data=df, key='ID')

Prediction:

>>> df2.collect()
   ID    X
0   1  0.5
1   2  1.5
2   3  2.5
3   4  3.5
>>> pr.predict(data=df2, key='ID').collect()
   ID      VALUE
0   1   6.157063
1   2   8.401269
2   3  15.668581
3   4  33.928501

Ideal output:

>>> df2.select('ID', ('POWER(X, 3)-2*POWER(X, 2)+3*X+5', 'Y')).collect()
   ID       Y
0   1   6.125
1   2   8.375
2   3  15.625
3   4  33.875
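
The example can be cross-checked without a database connection. The sketch below fits a cubic to the same five training rows by ordinary least squares in plain Python, solving the normal equations with Gaussian elimination. It is an independent illustration, not the PAL implementation, and `fit_polynomial` and `evaluate` are hypothetical helper names.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns coeffs where coeffs[k] multiplies x**k."""
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum(x**(i+j)) and b[i] = sum(y * x**i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    m = [row + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (m[i][n] - tail) / m[i][i]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the polynomial at x using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


# Training rows from the example above.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [5.048, 7.045, 11.003, 23.072, 49.041]
coeffs = fit_polynomial(xs, ys, degree=3)
predictions = [evaluate(coeffs, x) for x in [0.5, 1.5, 2.5, 3.5]]
```

Because the least-squares solution is unique for this data, `predictions` should agree with the VALUE column returned by predict() above up to rounding. Both differ slightly from the ideal output because the training data contains noise.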

Attributes:

coefficients_ (DataFrame): Fitted regression coefficients.
pmml_ (DataFrame): PMML model. Set to None if no PMML model was requested.
fitted_ (DataFrame): Predicted dependent variable values for training data. Set to None if the training data has no row IDs.
statistics_ (DataFrame): Regression-related statistics, such as mean squared error.
optim_param_ (DataFrame): Optimal parameters selected if cross validation is enabled.

Methods

`fit`(data[, key, features, label])	Fit regression model based on training data.
`predict`(data[, key, features, model_format, ...])	Predict dependent variable values based on fitted model.
`score`(data[, key, features, label])	Returns the coefficient of determination R2 of the prediction.

fit(data, key=None, features=None, label=None)

Fit regression model based on training data.

Parameters:

data : DataFrame

Training data.

key : str, optional

Name of the ID column.

If key is not provided, then:

if data is indexed by a single column, then key defaults to that index column;

otherwise, it is assumed that data contains no ID column.

features : list of str, optional

Names of the feature columns.

Since the underlying PAL procedure for the polynomial regression algorithm supports only one feature, this list can contain only one element.

If features is not provided, data must have exactly 1 non-ID, non-label column, and features defaults to that column.

label : str, optional

Name of the dependent variable.

Defaults to the last non-ID column. (This is not the PAL default.)

Returns:

Fitted object.
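
The defaulting rules for key, features, and label in fit() can be sketched as follows. This mirrors the parameter descriptions above and is not the library's actual code. `resolve_fit_columns` and its `index` argument (the index columns of data, if any) are hypothetical.

```python
def resolve_fit_columns(columns, key=None, features=None, label=None, index=None):
    """Apply the documented defaults and return (key, features, label)."""
    # key defaults to the index column only when data is indexed by a single column.
    if key is None and index is not None and len(index) == 1:
        key = index[0]
    non_id = [c for c in columns if c != key]
    if not non_id:
        raise ValueError("data has no non-ID columns")
    # label defaults to the last non-ID column.
    if label is None:
        label = non_id[-1]
    # features defaults to the remaining non-ID, non-label columns.
    if features is None:
        features = [c for c in non_id if c != label]
    if len(features) != 1:
        raise ValueError("polynomial regression supports exactly one feature column")
    return key, features, label


resolved = resolve_fit_columns(['ID', 'X', 'Y'], key='ID')
```

For the training table in the Examples section, this resolves to key 'ID', features ['X'], and label 'Y'. Omitting key on an unindexed table with an ID column raises an error in this sketch, because the ID column would then be counted as a second feature.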

predict(data, key=None, features=None, model_format=None, thread_ratio=0.0)

Predict dependent variable values based on fitted model.

Parameters:

data : DataFrame

Independent variable values used for prediction.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

Since the underlying PAL procedure for polynomial regression only supports one feature, this list can only contain one element.

If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.

model_format : int or str, optional (deprecated)

0 or 'coefficient': uses the coefficient table as the model for prediction.
1 or 'pmml': uses the PMML table as the model for prediction.

Defaults to 'coefficient'.

Deprecated; no longer has any effect.

thread_ratio : float, optional

Controls the proportion of available threads to use for prediction.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

Returns:

DataFrame

Predicted values, structured as follows:

ID column, with same name and type as data's ID column.

VALUE, type DOUBLE, representing predicted values.

Note

predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.

score(data, key=None, features=None, label=None)

Returns the coefficient of determination R2 of the prediction.

Parameters:

data : DataFrame

Data on which to assess model performance.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

Since the underlying PAL procedure for polynomial regression prediction only supports one feature, this list can only contain one element.

If features is not provided, data must have exactly 1 non-ID, non-label column, and features defaults to that column.

label : str, optional

Name of the dependent variable.

Defaults to the last non-ID column. (This is not the PAL default.)

Returns:

float: The coefficient of determination R2 of the prediction on the given data.
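
For reference, the coefficient of determination is conventionally defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below implements that standard definition in plain Python. `r2_score` is a hypothetical helper, and treating the "ideal output" from the Examples section as the observed values is purely illustrative.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all actual values are identical")
    return 1.0 - ss_res / ss_tot


# Illustrative inputs taken from the Examples section.
ideal = [6.125, 8.375, 15.625, 33.875]
predicted = [6.157063, 8.401269, 15.668581, 33.928501]
r2 = r2_score(ideal, predicted)
```

A value close to 1 means the predictions explain nearly all of the variance in the observed values. Values can be negative when the model fits worse than predicting the mean.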

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods described above, the PolynomialRegression class also inherits methods from the PALBase class. Refer to PAL Base for more details.