PolynomialRegression
- class hana_ml.algorithms.pal.regression.PolynomialRegression(degree=None, decomposition=None, adjusted_r2=None, pmml_export=None, thread_ratio=0.0, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, degree_values=None, degree_range=None)
Polynomial regression is an approach to model the relationship between a scalar variable y and a variable denoted X. In polynomial regression, data is modeled using polynomial functions, and unknown model parameters are estimated from the data. Such models are called polynomial models.
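As a sketch of the model family described above, a polynomial model of degree \(d\) in a single variable \(x\) can be written as follows. The symbols \(\beta_j\), \(\varepsilon\) and \(d\) are notation introduced here for illustration and are not taken from the PAL documentation.

```latex
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d + \varepsilon
```

The model is nonlinear in \(x\) but linear in the unknown coefficients \(\beta_j\), which is why linear least-squares techniques apply to it.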
- Parameters:
- degree : int
Degree of the polynomial model.
- decomposition : {'LU', 'QR', 'SVD', 'Cholesky'}, optional
Matrix factorization type to use. Case-insensitive.
'LU': LU decomposition.
'QR': QR decomposition.
'SVD': singular value decomposition.
'Cholesky': Cholesky (LDLT) decomposition.
Defaults to QR decomposition.
- adjusted_r2 : bool, optional
If True, include the adjusted R2 value in the statistics table.
Defaults to False.
- pmml_export : {'no', 'single-row', 'multi-row'}, optional
Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.
'no' or not provided: No PMML model.
'single-row': Exports a PMML model in a maximum of one row. Fails if the model doesn't fit in one row.
'multi-row': Exports a PMML model, splitting it across multiple rows if it doesn't fit in one.
Prediction does not require a PMML model.
- thread_ratio : float, optional (deprecated)
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function determines the number of threads to use heuristically.
Defaults to 0.
- resampling_method : {'cv', 'bootstrap'}, optional
Specifies the resampling method for model evaluation or parameter selection.
If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.
Must be set together with evaluation_metric.
No default value.
- evaluation_metric : {'rmse'}, optional
Specifies the evaluation metric for model evaluation or parameter selection.
Must be set together with resampling_method.
No default value.
- fold_num : int, optional
Specifies the number of folds for cross validation. Mandatory and valid only when resampling_method is set to 'cv'.
No default value.
- repeat_times : int, optional
Specifies the number of times resampling is repeated.
Defaults to 1.
- search_strategy : {'grid', 'random'}, optional
Specifies the method to activate parameter selection.
No default value.
- random_search_times : int, optional
Specifies the number of times to randomly select candidate parameters for selection.
Mandatory and valid when search_strategy is set to 'random'.
No default value.
- random_state : int, optional
Specifies the seed for random number generation. The system time is used when 0 is specified.
Defaults to 0.
- timeout : int, optional
Specifies maximum running time for model evaluation or parameter selection, in seconds.
No timeout when 0 is specified.
Defaults to 0.
- progress_indicator_id : str, optional
Sets the ID of the progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.
No default value.
- degree_values : list of int, optional
Specifies the values of degree to select from. Only valid when search_strategy is specified.
No default value.
- degree_range : list of int, optional
Specifies the range of degree values to select from. Only valid when search_strategy is specified.
No default value.
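The parameter-selection options above (resampling_method='cv', evaluation_metric='rmse', fold_num, search_strategy='grid', degree_values) can be pictured with a small pure-Python sketch. It is not the PAL implementation: the helper names fit_poly, predict_poly, cv_rmse and select_degree are hypothetical, the round-robin fold assignment is an assumption, and the data is synthetic.

```python
import math


def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients in ascending power order: [b0, b1, ..., bd].
    """
    n = degree + 1
    # Augmented matrix [X^T X | X^T y] for the Vandermonde design matrix X.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(n):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * p for v, p in zip(a[row], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def predict_poly(coeffs, x):
    """Evaluate the polynomial with Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def cv_rmse(xs, ys, degree, fold_num):
    """Average RMSE over fold_num folds (assumption: round-robin fold assignment)."""
    rows = list(zip(xs, ys))
    rmses = []
    for fold in range(fold_num):
        train = [r for i, r in enumerate(rows) if i % fold_num != fold]
        test = [r for i, r in enumerate(rows) if i % fold_num == fold]
        coeffs = fit_poly([x for x, _ in train], [y for _, y in train], degree)
        mse = sum((predict_poly(coeffs, x) - y) ** 2 for x, y in test) / len(test)
        rmses.append(math.sqrt(mse))
    return sum(rmses) / len(rmses)


def select_degree(xs, ys, degree_values, fold_num):
    """Grid search: pick the candidate degree with the lowest cross-validated RMSE."""
    return min(degree_values, key=lambda d: cv_rmse(xs, ys, d, fold_num))


# Synthetic, noise-free cubic data (illustration only).
xs = [i * 0.5 for i in range(12)]
ys = [x ** 3 - 2 * x ** 2 + 3 * x + 5 for x in xs]
best_degree = select_degree(xs, ys, degree_values=[1, 2, 3], fold_num=3)
print(best_degree)  # 3 for this noise-free cubic
```

A random search would differ only in how candidates are drawn: instead of trying every entry of degree_values, it would sample random_search_times candidates.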
Examples
Training data (based on \(y = x^3 - 2x^2 + 3x + 5\), with noise):
>>> df.collect()
   ID    X       Y
0   1  0.0   5.048
1   2  1.0   7.045
2   3  2.0  11.003
3   4  3.0  23.072
4   5  4.0  49.041
Training the model:
>>> pr = PolynomialRegression(degree=3)
>>> pr.fit(data=df, key='ID')
Prediction:
>>> df2.collect()
   ID    X
0   1  0.5
1   2  1.5
2   3  2.5
3   4  3.5
>>> pr.predict(data=df2, key='ID').collect()
   ID      VALUE
0   1   6.157063
1   2   8.401269
2   3  15.668581
3   4  33.928501
Ideal output:
>>> df2.select('ID', ('POWER(X, 3)-2*POWER(X, 2)+3*x+5', 'Y')).collect()
   ID       Y
0   1   6.125
1   2   8.375
2   3  15.625
3   4  33.875
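The example can be cross-checked without a database connection. The sketch below fits a cubic to the same five training points by ordinary least squares, using a QR factorization built with modified Gram-Schmidt. It only illustrates the kind of computation that decomposition='QR' suggests; PAL's internal implementation is not described here, and qr_least_squares and predict_poly are hypothetical helper names.

```python
import math


def qr_least_squares(xs, ys, degree):
    """Solve the polynomial least-squares problem with a QR factorization.

    Returns coefficients in ascending power order: [b0, b1, ..., bd].
    """
    n = degree + 1
    # Columns of the Vandermonde design matrix: 1, x, x^2, ..., x^degree.
    cols = [[x ** j for x in xs] for j in range(n)]
    q = []                              # orthonormal columns of Q
    r = [[0.0] * n for _ in range(n)]   # upper-triangular factor R
    for j in range(n):
        v = cols[j][:]
        # Modified Gram-Schmidt: remove components along earlier q vectors.
        for i in range(j):
            r[i][j] = sum(qi * vi for qi, vi in zip(q[i], v))
            v = [vi - r[i][j] * qi for vi, qi in zip(v, q[i])]
        r[j][j] = math.sqrt(sum(vi * vi for vi in v))
        q.append([vi / r[j][j] for vi in v])
    # Solve R b = Q^T y by back substitution.
    qty = [sum(qi * yi for qi, yi in zip(q[i], ys)) for i in range(n)]
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = qty[i] - sum(r[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = s / r[i][i]
    return coeffs


def predict_poly(coeffs, x):
    """Evaluate the polynomial with Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


# Training data and prediction inputs copied from the example above.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [5.048, 7.045, 11.003, 23.072, 49.041]
coeffs = qr_least_squares(xs, ys, degree=3)
predictions = [predict_poly(coeffs, x) for x in [0.5, 1.5, 2.5, 3.5]]
print([round(p, 6) for p in predictions])
```

If PAL solves the same least-squares problem, the printed values should be close to the VALUE column shown in the prediction output.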
- Attributes:
- coefficients_ : DataFrame
Fitted regression coefficients.
- pmml_ : DataFrame
PMML model. Set to None if no PMML model was requested.
- fitted_ : DataFrame
Predicted dependent variable values for training data. Set to None if the training data has no row IDs.
- statistics_ : DataFrame
Regression-related statistics, such as mean squared error.
- optim_param_ : DataFrame
The optimal parameters selected, available if cross validation is enabled.
Methods
fit(data[, key, features, label])    Fit the model to the training dataset.
predict(data[, key, features, model_format, ...])    Predict dependent variable values based on fitted model.
score(data[, key, features, label])    Returns the coefficient of determination R2 of the prediction.
- fit(data, key=None, features=None, label=None)
Fit the model to the training dataset.
- Parameters:
- data : DataFrame
Training data.
- key : str, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features : list of str, optional
Names of the feature columns.
Because the underlying PAL procedure for polynomial regression supports only one feature, this list can contain only one element.
If features is not provided, data must have exactly 1 non-ID, non-label column, and features defaults to that column.
- label : str, optional
Name of the dependent variable.
Defaults to the last non-ID column. (This is not the PAL default.)
- Returns:
- A fitted object of class "PolynomialRegression".
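The defaulting rules for key, features and label described above can be summarized in a short sketch. The function resolve_fit_columns and its index argument are hypothetical, introduced only to illustrate the documented rules; this is not how hana_ml implements them.

```python
def resolve_fit_columns(columns, index=None, key=None, features=None, label=None):
    """Apply the documented defaults for key, label and features.

    columns: column names of the training data, in order.
    index:   name of the single index column, or None when the data is
             not indexed by exactly one column (hypothetical argument).
    """
    if key is None:
        # Falls back to the single index column; stays None if there is
        # none, in which case the data is assumed to have no ID column.
        key = index
    non_id = [c for c in columns if c != key]
    if label is None:
        label = non_id[-1]  # last non-ID column
    if features is None:
        candidates = [c for c in non_id if c != label]
        if len(candidates) != 1:
            raise ValueError("expected exactly 1 non-ID, non-label column")
        features = candidates
    elif len(features) != 1:
        raise ValueError("polynomial regression supports only one feature")
    return key, features, label


print(resolve_fit_columns(['ID', 'X', 'Y'], key='ID'))  # ('ID', ['X'], 'Y')
print(resolve_fit_columns(['X', 'Y']))                  # (None, ['X'], 'Y')
```

Passing more than one feature, or omitting features when several candidate columns remain, raises an error in this sketch, mirroring the single-feature restriction stated above.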
- predict(data, key=None, features=None, model_format=None, thread_ratio=0.0)
Predict dependent variable values based on fitted model.
- Parameters:
- data : DataFrame
Independent variable values used for prediction.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : list of str, optional
Names of the feature columns.
Because the underlying PAL procedure for polynomial regression supports only one feature, this list can contain only one element.
If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.
- model_format : int or str, optional (deprecated)
0 or 'coefficient': use the coefficient table as the model for prediction.
1 or 'pmml': use the PMML table as the model for prediction.
Defaults to 'coefficient'.
Deprecated; no longer has any effect.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function determines the number of threads to use heuristically.
Defaults to 0.
- Returns:
- DataFrame
Predicted values, structured as follows:
ID column, with the same name and type as the ID column of data.
VALUE, type DOUBLE, representing predicted values.
Note
predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.
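When a coefficient-based model is used, prediction amounts to evaluating the fitted polynomial at each input value and returning it next to the row ID. The sketch below assumes the coefficients are available as a plain Python list in ascending power order; the actual layout of the coefficients_ table is not described here, and predict_rows is a hypothetical helper. It uses the exact polynomial from the Examples section, so it reproduces the "Ideal output" values rather than the fitted ones.

```python
def predict_rows(coeffs, rows):
    """Return (ID, VALUE) pairs for (ID, X) input rows.

    coeffs: polynomial coefficients in ascending power order (assumption).
    """
    result = []
    for row_id, x in rows:
        value = 0.0
        # Horner's rule: ((b3 * x + b2) * x + b1) * x + b0
        for c in reversed(coeffs):
            value = value * x + c
        result.append((row_id, value))
    return result


# y = x^3 - 2x^2 + 3x + 5, written as [b0, b1, b2, b3].
ideal_coeffs = [5.0, 3.0, -2.0, 1.0]
rows = [(1, 0.5), (2, 1.5), (3, 2.5), (4, 3.5)]
print(predict_rows(ideal_coeffs, rows))
# [(1, 6.125), (2, 8.375), (3, 15.625), (4, 33.875)]
```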
- score(data, key=None, features=None, label=None)
Returns the coefficient of determination R2 of the prediction.
- Parameters:
- data : DataFrame
Data on which to assess model performance.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or if the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : list of str, optional
Names of the feature columns.
Because the underlying PAL procedure for polynomial regression prediction supports only one feature, this list can contain only one element.
If features is not provided, data must have exactly 1 non-ID, non-label column, and features defaults to that column.
- label : str, optional
Name of the dependent variable.
Defaults to the last non-ID column. (This is not the PAL default.)
- Returns:
- float
The coefficient of determination R2 of the prediction on the given data.
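For reference, the coefficient of determination is commonly defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses that standard definition; whether PAL computes the score in exactly this way is an assumption, and r2_score is a hypothetical helper name. The sample values are the "Ideal output" and predicted VALUE columns from the Examples section.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [6.125, 8.375, 15.625, 33.875]               # ideal values
y_pred = [6.157063, 8.401269, 15.668581, 33.928501]   # predicted values
r2 = r2_score(y_true, y_pred)
print(r2)
```

A value close to 1 means the predictions explain almost all of the variance in the observed values. A model that always predicts the mean of y_true scores 0.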
Inherited Methods from PALBase
In addition to the methods described above, the PolynomialRegression class inherits methods from the PALBase class. Refer to PAL Base for more details.