GLM

class hana_ml.algorithms.pal.regression.GLM(family=None, link=None, solver=None, handle_missing_fit=None, quasilikelihood=None, max_iter=None, tol=None, significance_level=None, output_fitted=None, alpha=None, lamb=None, num_lambda=None, lambda_min_ratio=None, categorical_variable=None, ordering=None, thread_ratio=0.0, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, enet_lambda=None, enet_alpha=None)

Regression by a generalized linear model, based on PAL_GLM. Also supports ordinal regression.

Parameters:
family : {'gaussian', 'normal', 'poisson', 'binomial', 'gamma', 'inversegaussian', 'negativebinomial', 'ordinal'}, optional

The kind of distribution the dependent variable outcomes are assumed to be drawn from.

Defaults to 'gaussian'.

link : str, optional

GLM link function. Determines the relationship between the linear predictor and the predicted response.

Default and allowed values depend on family. 'inverse' is accepted as a synonym of 'reciprocal'.

family            default link    allowed values of link
----------------  --------------  ----------------------------------------
gaussian          identity        identity, log, reciprocal
poisson           log             identity, log
binomial          logit           logit, probit, comploglog, log
gamma             reciprocal      identity, reciprocal, log
inversegaussian   inversesquare   inversesquare, identity, reciprocal, log
negativebinomial  log             identity, log, sqrt
ordinal           logit           logit, probit, comploglog
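To illustrate what the link names mean, the sketch below pairs each name with its textbook definition g(mu) and the corresponding inverse. These are the standard statistical formulas and are assumed, not confirmed, to match PAL's implementation. The LINKS dictionary is a hypothetical helper and not part of hana_ml.

```python
import math
from statistics import NormalDist

_std_normal = NormalDist()

# name -> (link g(mu), inverse link g^-1(eta)).
# Textbook definitions, assumed (not verified) to match PAL.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "reciprocal": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "inversesquare": (lambda mu: 1.0 / mu ** 2,
                      lambda eta: 1.0 / math.sqrt(eta)),
    "sqrt": (math.sqrt, lambda eta: eta ** 2),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "probit": (_std_normal.inv_cdf, _std_normal.cdf),
    "comploglog": (lambda mu: math.log(-math.log(1.0 - mu)),
                   lambda eta: 1.0 - math.exp(-math.exp(eta))),
}

link, inverse = LINKS["log"]
eta = link(2.0)     # linear predictor that corresponds to a mean of 2.0
mu = inverse(eta)   # mapped back to the response scale
```

Applying a link and then its inverse returns the original mean, which is a quick way to check that a pair is consistent.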

solver : {'irls', 'nr', 'cd'}, optional

Optimization algorithm to use.

  • 'irls': Iteratively re-weighted least squares.

  • 'nr': Newton-Raphson.

  • 'cd': Coordinate descent. (Picking coordinate descent activates elastic net regularization.)

Defaults to 'irls', except when family is 'ordinal'.

Ordinal regression requires (and defaults to) 'nr', and Newton-Raphson is not supported for other values of family.

handle_missing_fit : {'skip', 'abort', 'fill_zero'}, optional

How to handle data rows with missing independent variable values during fitting.

  • 'skip': Don't use those rows for fitting.

  • 'abort': Throw an error if missing independent variable values are found.

  • 'fill_zero': Replace missing values with 0.

Defaults to 'skip'.
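The three policies can be pictured with a small client-side sketch. The actual handling happens inside PAL on the database side, so this only illustrates the described behaviour. The function apply_missing_policy is a hypothetical helper, not part of hana_ml.

```python
def apply_missing_policy(rows, policy="skip"):
    """Apply a missing-value policy to rows of feature values.

    rows is a list of lists in which None marks a missing value.
    """
    if policy == "skip":
        # Drop every row that has at least one missing value.
        return [r for r in rows if None not in r]
    if policy == "abort":
        # Refuse to continue as soon as any missing value exists.
        if any(None in r for r in rows):
            raise ValueError("missing independent variable value found")
        return [list(r) for r in rows]
    if policy == "fill_zero":
        # Keep all rows and replace missing values with 0.
        return [[0 if v is None else v for v in r] for r in rows]
    raise ValueError(f"unknown policy: {policy!r}")


rows = [[1.0, 2.0], [None, 3.0], [4.0, 5.0]]
skipped = apply_missing_policy(rows, "skip")
filled = apply_missing_policy(rows, "fill_zero")
```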

quasilikelihood : bool, optional

If True, enables the use of quasi-likelihood to estimate overdispersion.

Defaults to False.

max_iter : int, optional

Maximum number of optimization iterations.

Defaults to 100 for IRLS and Newton-Raphson.

Defaults to 100000 for coordinate descent.

tol : float, optional

Stopping condition for optimization.

Defaults to 1e-8 for IRLS, 1e-6 for Newton-Raphson, and 1e-7 for coordinate descent.

significance_level : float, optional

Significance level for confidence intervals and prediction intervals.

Defaults to 0.05.
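As a rough illustration of how a significance level relates to an interval, the sketch below derives the two-sided critical value for the default of 0.05 and builds a normal-approximation interval. Whether PAL uses a normal or another quantile, and on which scale it builds its intervals, is not stated here, so treat this as an assumption. The estimate and standard error are made-up numbers.

```python
from statistics import NormalDist

significance_level = 0.05
confidence_level = 1.0 - significance_level   # 0.95, i.e. 95% intervals

# Two-sided critical value of the standard normal distribution.
z = NormalDist().inv_cdf(1.0 - significance_level / 2.0)

# Made-up estimate and standard error, for illustration only.
estimate, standard_error = 0.7446, 0.25
ci_lower = estimate - z * standard_error
ci_upper = estimate + z * standard_error
```

A smaller significance_level gives a larger critical value and therefore wider intervals.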

output_fitted : bool, optional

If True, fit creates the fitted_ DataFrame of fitted response values for the training data.

Defaults to False.

alpha : float, optional (deprecated)

Elastic net mixing parameter. Only accepted when using coordinate descent. Should be between 0 and 1 inclusive.

Defaults to 1.0.

Deprecated; use enet_alpha instead.

lamb : float, optional (deprecated)

Coefficient (lambda) value for elastic-net regularization.

Valid only when solver is 'cd'.

No default value.

Deprecated; use enet_lambda instead.

num_lambda : int, optional

The number of lambda values. Only accepted when using coordinate descent.

Defaults to 100.

lambda_min_ratio : float, optional

The smallest value of lambda, expressed as a fraction of lambda_max, where lambda_max is the smallest lambda for which all coefficients are zero. Only accepted when using coordinate descent.

Defaults to 0.01 when the number of observations is smaller than the number of covariates, and 0.0001 otherwise.
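A common way to turn num_lambda and lambda_min_ratio into a sequence of candidate penalties is a log-spaced grid from lambda_max down to lambda_max * lambda_min_ratio. The sketch below follows that convention. The spacing PAL actually uses is not documented here, so the log spacing is an assumption, and lambda_path is a hypothetical helper.

```python
import math


def lambda_path(lambda_max, num_lambda=100, lambda_min_ratio=0.0001):
    """Return num_lambda log-spaced values from lambda_max downwards."""
    if num_lambda == 1:
        return [lambda_max]
    log_max = math.log(lambda_max)
    log_min = math.log(lambda_max * lambda_min_ratio)
    step = (log_min - log_max) / (num_lambda - 1)
    return [math.exp(log_max + i * step) for i in range(num_lambda)]


# Five candidate penalties between 2.0 and 2.0 * 0.01.
path = lambda_path(lambda_max=2.0, num_lambda=5, lambda_min_ratio=0.01)
```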

categorical_variable : list of str, optional

INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.

ordering : list of str or list of int, optional

Specifies the order of categories for ordinal regression.

The default is numeric order for ints and alphabetical order for strings.

thread_ratio : float, optional

Controls the proportion of available threads to use for fitting.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

resampling_method : {'cv', 'bootstrap'}, optional

Specifies the resampling method for model evaluation/parameter selection.

If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.

Must be set together with evaluation_metric.

No default value.

evaluation_metric : {'rmse', 'mae', 'error_rate'}, optional

Specifies the evaluation metric for model evaluation or parameter selection.

Must be set together with resampling_method.

'error_rate' applies only for ordinal regression.

No default value.

fold_num : int, optional

Specifies the number of folds for cross-validation.

Mandatory and valid only when resampling_method is set to 'cv'.

No default value.

repeat_times : int, optional

Specifies the number of times resampling is repeated.

Defaults to 1.

search_strategy : {'grid', 'random'}, optional

Specifies the method to activate parameter selection.

No default value.

random_search_times : int, optional

Specifies the number of times to randomly select candidate parameters for selection.

Mandatory and valid only when search_strategy is set to 'random'.

No default value.

random_state : int, optional

Specifies the seed for random number generation. The system time is used when 0 is specified.

Defaults to 0.

timeout : int, optional

Specifies the maximum running time for model evaluation or parameter selection, in seconds.

No timeout when 0 is specified.

Defaults to 0.

progress_indicator_id : str, optional

Sets the ID of the progress indicator for model evaluation or parameter selection.

No progress indicator is active if no value is provided.

No default value.

param_values : dict or list of tuples, optional

Specifies candidate values for the parameters to be selected.

Valid only when parameter selection is activated.

The parameters that can be specified are link, enet_lambda and enet_alpha.

No default value.

param_range : dict or list of tuples, optional

Specifies candidate ranges for the parameters to be selected.

Valid only when parameter selection is activated.

The parameters that can be specified are enet_lambda and enet_alpha.

No default value.
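The dependencies between the resampling and search arguments described above can be summarised in a small checker. This is only a sketch of the documented rules. The helper check_selection_args is hypothetical, and the dict shape shown for param_values (parameter name mapped to a list of candidates) is an assumption, since this page only states "dict or list of tuples".

```python
def check_selection_args(**kw):
    """Return a list of rule violations for model-selection arguments."""
    problems = []
    resampling = kw.get("resampling_method")
    metric = kw.get("evaluation_metric")
    strategy = kw.get("search_strategy")

    # resampling_method and evaluation_metric must be set together.
    if (resampling is None) != (metric is None):
        problems.append(
            "resampling_method and evaluation_metric must be set together")
    # fold_num is mandatory for 'cv' and only valid for 'cv'.
    if resampling == "cv" and kw.get("fold_num") is None:
        problems.append("fold_num is mandatory when resampling_method is 'cv'")
    if resampling != "cv" and kw.get("fold_num") is not None:
        problems.append("fold_num is valid only when resampling_method is 'cv'")
    # random_search_times is mandatory for random search.
    if strategy == "random" and kw.get("random_search_times") is None:
        problems.append(
            "random_search_times is mandatory when search_strategy is 'random'")
    return problems


selection_args = dict(
    solver="cd",
    resampling_method="cv",
    evaluation_metric="rmse",
    fold_num=5,
    search_strategy="grid",
    param_values={"enet_alpha": [0.1, 0.5, 1.0]},  # assumed shape
)
problems = check_selection_args(**selection_args)
```

With hana_ml installed and a HANA connection available, arguments that pass this check could then be handed to the constructor as GLM(**selection_args).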

enet_alpha : float, optional

Elastic-net regularization mixing parameter. Only accepted when using coordinate descent (i.e. when solver is 'cd').

Should be between 0 and 1 inclusive.

Defaults to 1.0.

enet_lambda : float, optional

Penalty weight for elastic-net regularization.

Valid only when solver is 'cd'.

No default value.
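To make the roles of enet_lambda and enet_alpha concrete, the sketch below evaluates the elastic-net penalty in its widely used glmnet-style form, lambda * (alpha * L1 + (1 - alpha) / 2 * L2). The exact scaling PAL applies is assumed, not confirmed, to follow this form, and elastic_net_penalty is a hypothetical helper.

```python
def elastic_net_penalty(coefficients, enet_lambda, enet_alpha=1.0):
    """Elastic-net penalty for a list of (non-intercept) coefficients."""
    l1 = sum(abs(b) for b in coefficients)   # lasso part
    l2 = sum(b * b for b in coefficients)    # ridge part
    return enet_lambda * (enet_alpha * l1 + 0.5 * (1.0 - enet_alpha) * l2)


coefs = [0.5, -2.0, 0.0]
lasso = elastic_net_penalty(coefs, enet_lambda=0.1, enet_alpha=1.0)  # pure L1
ridge = elastic_net_penalty(coefs, enet_lambda=0.1, enet_alpha=0.0)  # pure L2
mixed = elastic_net_penalty(coefs, enet_lambda=0.1, enet_alpha=0.5)
```

With enet_alpha at its default of 1.0 the penalty reduces to the lasso, and values closer to 0 shift weight towards the ridge term.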

Examples

Training data:

>>> df.collect()
   ID  Y  X
0   1  0 -1
1   2  0 -1
2   3  1  0
3   4  1  0
4   5  1  0
5   6  1  0
6   7  2  1
7   8  2  1
8   9  2  1

Fitting a GLM on that data:

>>> from hana_ml.algorithms.pal.regression import GLM
>>> glm = GLM(solver='irls', family='poisson', link='log')
>>> glm.fit(data=df, key='ID', label='Y')

Performing prediction:

>>> df2.collect()
   ID  X
0   1 -1
1   2  0
2   3  1
3   4  2
>>> glm.predict(data=df2, key='ID')[['ID', 'PREDICTION']].collect()
   ID           PREDICTION
0   1  0.25543735346197155
1   2    0.744562646538029
2   3   2.1702915689746476
3   4     6.32608352871737
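
The predictions above can be cross-checked without a HANA connection. The sketch below fits the same Poisson model with a log link using a textbook IRLS loop in pure Python. It is not PAL's implementation: the function fit_poisson_irls, its starting values and its stopping rule are assumptions made for illustration.

```python
import math

# Training data copied from the example above (columns X and Y).
xs = [-1, -1, 0, 0, 0, 0, 1, 1, 1]
ys = [0, 0, 1, 1, 1, 1, 2, 2, 2]


def fit_poisson_irls(xs, ys, max_iter=100, tol=1e-8):
    """Fit log(mu) = b0 + b1 * x by iteratively re-weighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for x, y in zip(xs, ys):
            eta = b0 + b1 * x
            mu = math.exp(eta)
            w = mu                    # IRLS weight for Poisson with log link
            z = eta + (y - mu) / mu   # working response
            s_w += w
            s_wx += w * x
            s_wxx += w * x * x
            s_wz += w * z
            s_wxz += w * x * z
        # Solve the 2x2 weighted normal equations in closed form.
        det = s_w * s_wxx - s_wx * s_wx
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = max(abs(new_b0 - b0), abs(new_b1 - b1)) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


b0, b1 = fit_poisson_irls(xs, ys)
# Response-scale predictions for X = -1, 0, 1, 2 as in df2.
predictions = [math.exp(b0 + b1 * x) for x in [-1, 0, 1, 2]]
```

The resulting values agree with the PREDICTION column shown above to within floating-point rounding.
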
Attributes:
statistics_ : DataFrame

Training statistics and model information other than the coefficients and covariance matrix.

coef_ : DataFrame

Model coefficients.

covmat_ : DataFrame

Covariance matrix. Set to None for coordinate descent.

fitted_ : DataFrame

Predicted values for the training data. Set to None if output_fitted is False.

Methods

fit(data[, key, features, label, ...])

Fit a generalized linear model based on training data.

predict(data[, key, features, ...])

Predict dependent variable values based on fitted model.

score(data[, key, features, label, ...])

Returns the coefficient of determination R^2 of the prediction.

fit(data, key=None, features=None, label=None, categorical_variable=None, dependent_variable=None, excluded_feature=None)

Fit a generalized linear model based on training data.

Parameters:
data : DataFrame

Training data.

key : str, optional

Name of the ID column.

If key is not provided, then:

  • if data is indexed by a single column, then key defaults to that index column;

  • otherwise, it is assumed that data contains no ID column.

Required when output_fitted was set to True at initialization.

features : list of str or str, optional

Names of the feature columns.

Defaults to all non-ID, non-label columns.

label : str or list of str, optional

Name of the dependent variable.

Defaults to the last non-ID column (this is not the PAL default).

When family is 'binomial', label may be either a single column name or a list of two column names.

categorical_variable : list of str or str, optional

INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.

dependent_variable : str, optional (deprecated and ineffective)

Only used when you need to indicate the dependence.

Use label instead.

excluded_feature : list of str, optional (deprecated and ineffective)

Excludes the indicated feature column.

If necessary, use features instead.

Defaults to None.

Returns:
Fitted object.

predict(data, key=None, features=None, prediction_type=None, significance_level=None, handle_missing=None, thread_ratio=0.0)

Predict dependent variable values based on fitted model.

Parameters:
data : DataFrame

Independent variable values to predict for.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

Defaults to all non-ID columns.

prediction_type : {'response', 'link'}, optional

Specifies whether to output predicted values of the response or the link function.

Defaults to 'response'.

significance_level : float, optional

Significance level for confidence intervals and prediction intervals. If specified, overrides the value passed to the GLM constructor.

handle_missing : {'skip', 'fill_zero'}, optional

How to handle data rows with missing independent variable values.

  • 'skip': Don't perform prediction for those rows.

  • 'fill_zero': Replace missing values with 0.

Defaults to 'skip'.

thread_ratio : float, optional

Controls the proportion of available threads to use for prediction.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Returns:
DataFrame

Predicted values, structured as follows. The following two columns are always populated:

  • ID column, with same name and type as data's ID column.

  • PREDICTION, type NVARCHAR(100), representing predicted values.

The following five columns are only populated for IRLS:

  • SE, type DOUBLE. Standard error, or for ordinal regression, the probability that the data point belongs to the predicted category.

  • CI_LOWER, type DOUBLE. Lower bound of the confidence interval.

  • CI_UPPER, type DOUBLE. Upper bound of the confidence interval.

  • PI_LOWER, type DOUBLE. Lower bound of the prediction interval.

  • PI_UPPER, type DOUBLE. Upper bound of the prediction interval.
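
Because PREDICTION is returned as NVARCHAR(100), numeric predictions arrive as strings and usually need converting before further arithmetic. The sketch below shows one way to do that on plain (ID, PREDICTION) pairs, using the values from the example output earlier on this page. It assumes a non-ordinal model, where the strings represent numbers. For ordinal regression the column presumably holds category labels, and float conversion may not apply.

```python
# (ID, PREDICTION) pairs as strings, copied from the example output.
rows = [
    (1, "0.25543735346197155"),
    (2, "0.744562646538029"),
    (3, "2.1702915689746476"),
    (4, "6.32608352871737"),
]

# Convert the textual predictions to floats, keyed by ID.
predictions = {row_id: float(text) for row_id, text in rows}

# Numeric operations are now possible, e.g. the largest prediction.
largest_id = max(predictions, key=predictions.get)
```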

score(data, key=None, features=None, label=None, prediction_type=None, handle_missing=None)

Returns the coefficient of determination R^2 of the prediction.

Not applicable for ordinal regression.

Parameters:
data : DataFrame

Data on which to assess model performance.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

Defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last non-ID column (this is not the PAL default).

Cannot be two columns, even if the GLM instance was initialized with family set to 'binomial'.

prediction_type : {'response', 'link'}, optional

Specifies whether to predict the value of the response or the link function.

The contents of the label column should match this choice.

Defaults to 'response'.

handle_missing : {'skip', 'fill_zero'}, optional

How to handle data rows with missing independent variable values.

  • 'skip': Don't perform prediction for those rows. Those rows will be left out of the R^2 computation.

  • 'fill_zero': Replace missing values with 0.

Defaults to 'skip'.

Returns:
float

The coefficient of determination R^2 of the prediction on the given data.
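
For reference, the coefficient of determination is conventionally computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses that standard definition. It is assumed, not confirmed, that score computes the same quantity, and r_squared is a hypothetical helper with made-up inputs.

```python
def r_squared(y_true, y_pred):
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up observed and predicted values, for illustration only.
y_true = [0, 1, 1, 2]
y_pred = [0.5, 1.0, 1.0, 1.5]
r2 = r_squared(y_true, y_pred)
```

A perfect prediction gives 1.0, and predicting the mean of the observed values for every row gives 0.0.
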

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods mentioned above, the GLM class also inherits methods from the PALBase class. Refer to PAL Base for more details.