GLM
- class hana_ml.algorithms.pal.regression.GLM(family=None, link=None, solver=None, handle_missing_fit=None, quasilikelihood=None, max_iter=None, tol=None, significance_level=None, output_fitted=None, alpha=None, lamb=None, num_lambda=None, lambda_min_ratio=None, categorical_variable=None, ordering=None, thread_ratio=0.0, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, param_values=None, param_range=None, enet_lambda=None, enet_alpha=None)
Regression by a generalized linear model, based on PAL_GLM. Also supports ordinal regression.
- Parameters:
- family{'gaussian', 'normal', 'poisson', 'binomial', 'gamma', 'inversegaussian', 'negativebinomial', 'ordinal'}, optional
The kind of distribution the dependent variable outcomes are assumed to be drawn from.
Defaults to 'gaussian'.
- linkstr, optional
GLM link function. Determines the relationship between the linear predictor and the predicted response.
Default and allowed values depend on family. 'inverse' is accepted as a synonym of 'reciprocal'.
family              default link     allowed values of link
gaussian            identity         identity, log, reciprocal
poisson             log              identity, log
binomial            logit            logit, probit, comploglog, log
gamma               reciprocal       identity, reciprocal, log
inversegaussian     inversesquare    inversesquare, identity, reciprocal, log
negativebinomial    log              identity, log, sqrt
ordinal             logit            logit, probit, comploglog
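The table can be read as a lookup from family to a default link and a set of allowed links. A minimal plain-Python sketch of that lookup follows. The names LINKS and resolve_link are hypothetical helpers for illustration only; they are not part of hana_ml, which performs its own validation. The 'normal' family is left out because the table does not list it.

```python
# Hypothetical helper mirroring the family/link table; not part of hana_ml.
LINKS = {
    # family: (default link, allowed links)
    "gaussian": ("identity", {"identity", "log", "reciprocal"}),
    "poisson": ("log", {"identity", "log"}),
    "binomial": ("logit", {"logit", "probit", "comploglog", "log"}),
    "gamma": ("reciprocal", {"identity", "reciprocal", "log"}),
    "inversegaussian": ("inversesquare",
                        {"inversesquare", "identity", "reciprocal", "log"}),
    "negativebinomial": ("log", {"identity", "log", "sqrt"}),
    "ordinal": ("logit", {"logit", "probit", "comploglog"}),
}


def resolve_link(family="gaussian", link=None):
    """Return the link to use for a family, or raise ValueError if not allowed."""
    default, allowed = LINKS[family]
    if link is None:
        return default
    if link == "inverse":  # documented synonym of 'reciprocal'
        link = "reciprocal"
    if link not in allowed:
        raise ValueError(f"link {link!r} is not allowed for family {family!r}")
    return link
```

For instance, resolve_link('gamma', 'inverse') returns 'reciprocal', while resolve_link('poisson', 'sqrt') raises ValueError.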
- solver{'irls', 'nr', 'cd'}, optional
Optimization algorithm to use.
'irls': Iteratively re-weighted least squares.
'nr': Newton-Raphson.
'cd': Coordinate descent. (Picking coordinate descent activates elastic net regularization.)
Defaults to 'irls', except when family is 'ordinal'.
Ordinal regression requires (and defaults to) 'nr', and Newton-Raphson is not supported for other values of family.
- handle_missing_fit{'skip', 'abort', 'fill_zero'}, optional
How to handle data rows with missing independent variable values during fitting.
'skip': Don't use those rows for fitting.
'abort': Throw an error if missing independent variable values are found.
'fill_zero': Replace missing values with 0.
Defaults to 'skip'.
- quasilikelihoodbool, optional
If True, enables the use of quasi-likelihood to estimate overdispersion.
Defaults to False.
- max_iterint, optional
Maximum number of optimization iterations.
Defaults to 100 for IRLS and Newton-Raphson.
Defaults to 100000 for coordinate descent.
- tolfloat, optional
Stopping condition for optimization.
Defaults to 1e-8 for IRLS, 1e-6 for Newton-Raphson, and 1e-7 for coordinate descent.
- significance_levelfloat, optional
Significance level for confidence intervals and prediction intervals.
Defaults to 0.05.
- output_fittedbool, optional
If True, create the fitted_ DataFrame of fitted response values for training data in fit.
Defaults to False.
- alphafloat, optional(deprecated)
Elastic net mixing parameter. Only accepted when using coordinate descent. Should be between 0 and 1 inclusive.
Defaults to 1.0.
Deprecated, please use enet_alpha instead.
- lambfloat, optional(deprecated)
Coefficient (lambda) value for elastic-net regularization.
Valid only when solver is 'cd'.
No default value.
Deprecated, please use enet_lambda instead.
- num_lambdaint, optional
The number of lambda values. Only accepted when using coordinate descent.
Defaults to 100.
- lambda_min_ratiofloat, optional
The smallest value of lambda, expressed as a fraction of the maximum lambda (lambda_max), where lambda_max is the smallest value for which all coefficients are zero. Only accepted when using coordinate descent.
Defaults to 0.01 when the number of observations is smaller than the number of covariates, and 0.0001 otherwise.
- categorical_variablelist of str, optional
INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.
- orderinglist of str or list of int, optional
Specifies the order of categories for ordinal regression.
The default is numeric order for ints and alphabetical order for strings.
- thread_ratiofloat, optional
Controls the proportion of available threads to use for fitting.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- resampling_method{'cv', 'bootstrap'}, optional
Specifies the resampling method for model evaluation/parameter selection.
If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.
Must be set together with evaluation_metric.
No default value.
- evaluation_metric{'rmse', 'mae', 'error_rate'}, optional
Specifies the evaluation metric for model evaluation or parameter selection.
Must be set together with resampling_method.
'error_rate' applies only for ordinal regression.
No default value.
- fold_numint, optional
Specifies the fold number for the cross validation method.
Mandatory and valid only when resampling_method is set to 'cv'.
No default value.
- repeat_timesint, optional
Specifies the number of repeat times for resampling.
Defaults to 1.
- search_strategy{'grid', 'random'}, optional
Specifies the method to activate parameter selection.
No default value.
- random_search_timesint, optional
Specifies the number of times to randomly select candidate parameters for selection.
Mandatory and valid when search_strategy is set to 'random'.
No default value.
- random_stateint, optional
Specifies the seed for random generation. Use system time when 0 is specified.
Defaults to 0.
- timeoutint, optional
Specifies the maximum running time for model evaluation or parameter selection, in seconds.
No timeout when 0 is specified.
Defaults to 0.
- progress_indicator_idstr, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.
No default value.
- param_valuesdict or list of tuples, optional
Specifies values of specific parameters to be selected.
Valid only when parameter selection is activated.
Specified parameters could be link, enet_lambda and enet_alpha.
No default value.
- param_rangedict or list of tuples, optional
Specifies range of specific parameters to be selected.
Valid only when parameter selection is activated.
Specified parameters could be enet_lambda and enet_alpha.
No default value.
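The constraints stated for the model-evaluation and parameter-selection arguments can be summarised as a pre-flight check. The sketch below is plain Python; check_selection_args is a hypothetical helper, not a hana_ml function, and the argument values in the example are made up for illustration.

```python
def check_selection_args(resampling_method=None, evaluation_metric=None,
                         fold_num=None, search_strategy=None,
                         random_search_times=None, family="gaussian"):
    """Return a list of violations of the documented constraints (empty if none)."""
    problems = []
    if (resampling_method is None) != (evaluation_metric is None):
        problems.append(
            "resampling_method and evaluation_metric must be set together")
    if resampling_method == "cv" and fold_num is None:
        problems.append("fold_num is mandatory when resampling_method is 'cv'")
    if resampling_method != "cv" and fold_num is not None:
        problems.append("fold_num is valid only when resampling_method is 'cv'")
    if search_strategy == "random" and random_search_times is None:
        problems.append(
            "random_search_times is mandatory when search_strategy is 'random'")
    if evaluation_metric == "error_rate" and family != "ordinal":
        problems.append("'error_rate' applies only for ordinal regression")
    return problems


# Illustrative keyword arguments for a cross-validated grid search.
kwargs = dict(resampling_method="cv", evaluation_metric="rmse", fold_num=5,
              search_strategy="grid")
print(check_selection_args(**kwargs))  # prints []
```

Keyword arguments that pass the check could then be forwarded to the GLM constructor together with param_values or param_range.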
- enet_alphafloat, optional
Elastic-net regularization mixing parameter. Only accepted when using coordinate descent (i.e. when solver is 'cd').
Should be between 0 and 1 inclusive.
Defaults to 1.0.
- enet_lambdafloat, optional
Penalty weight for elastic-net regularization.
Valid only when solver is 'cd'.
No default value.
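To make the roles of enet_alpha, enet_lambda, num_lambda and lambda_min_ratio more concrete, the sketch below uses the common elastic-net convention, penalty = lambda * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2), together with a log-spaced lambda path. Both are assumptions borrowed from widespread elastic-net implementations; this page does not state PAL's exact formula or spacing, and the function names are hypothetical.

```python
import math


def enet_penalty(beta, enet_lambda, enet_alpha=1.0):
    """Elastic-net penalty under the assumed convention.

    enet_alpha=1 gives a pure L1 (lasso) penalty, enet_alpha=0 a pure
    L2 (ridge) penalty.
    """
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return enet_lambda * (enet_alpha * l1 + 0.5 * (1.0 - enet_alpha) * l2)


def lambda_path(lambda_max, num_lambda=100, lambda_min_ratio=0.0001):
    """Log-spaced values from lambda_max down to lambda_max * lambda_min_ratio.

    The log spacing is an assumption, not a documented PAL behaviour.
    """
    if num_lambda == 1:
        return [lambda_max]
    step = math.log(lambda_min_ratio) / (num_lambda - 1)
    return [lambda_max * math.exp(i * step) for i in range(num_lambda)]


path = lambda_path(lambda_max=1.0, num_lambda=5, lambda_min_ratio=0.0001)
print([round(v, 4) for v in path])
```

Under these assumptions, a larger enet_lambda shrinks coefficients more strongly, and enet_alpha moves the penalty between ridge-like and lasso-like behaviour.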
Examples
Training data:
>>> df.collect()
   ID  Y  X
0   1  0 -1
1   2  0 -1
2   3  1  0
3   4  1  0
4   5  1  0
5   6  1  0
6   7  2  1
7   8  2  1
8   9  2  1
Fitting a GLM on that data:
>>> glm = GLM(solver='irls', family='poisson', link='log')
>>> glm.fit(data=df, key='ID', label='Y')
Performing prediction:
>>> df2.collect()
   ID  X
0   1 -1
1   2  0
2   3  1
3   4  2
>>> glm.predict(data=df2, key='ID')[['ID', 'PREDICTION']].collect()
   ID           PREDICTION
0   1  0.25543735346197155
1   2    0.744562646538029
2   3   2.1702915689746476
3   4     6.32608352871737
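The predictions above can be cross-checked without a database connection. The following is a minimal plain-Python sketch of iteratively re-weighted least squares for a Poisson model with log link, fitted to the same training data. It is an illustration of the technique under simple assumptions (one feature plus an intercept, convergence judged by the change in coefficients), not PAL's implementation; fit_poisson_irls is a hypothetical name.

```python
import math

# Training data from the example above.
X = [-1, -1, 0, 0, 0, 0, 1, 1, 1]
Y = [0, 0, 1, 1, 1, 1, 2, 2, 2]


def fit_poisson_irls(x, y, max_iter=100, tol=1e-8):
    """Fit log(mu) = b0 + b1 * x by iteratively re-weighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # For the Poisson family with log link, the IRLS weight is mu and
        # the working response is z = eta + (y - mu) / mu.
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi
            mu = math.exp(eta)
            z = eta + (yi - mu) / mu
            s_w += mu
            s_wx += mu * xi
            s_wxx += mu * xi * xi
            s_wz += mu * z
            s_wxz += mu * xi * z
        # Solve the 2x2 weighted normal equations for the new coefficients.
        det = s_w * s_wxx - s_wx * s_wx
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = max(abs(new_b0 - b0), abs(new_b1 - b1)) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


b0, b1 = fit_poisson_irls(X, Y)
# Predicted response (mean) for X = -1, 0, 1, 2, as in df2.
predictions = [math.exp(b0 + b1 * x) for x in (-1, 0, 1, 2)]
print(predictions)
```

The printed values should agree with the PREDICTION column above up to floating-point tolerance.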
- Attributes:
- statistics_DataFrame
Training statistics and model information other than the coefficients and covariance matrix.
- coef_DataFrame
Model coefficients.
- covmat_DataFrame
Covariance matrix. Set to None for coordinate descent.
- fitted_DataFrame
Predicted values for the training data. Set to None if output_fitted is False.
Methods
fit(data[, key, features, label, ...])  Fit a generalized linear model based on training data.
predict(data[, key, features, ...])  Predict dependent variable values based on fitted model.
score(data[, key, features, label, ...])  Returns the coefficient of determination R^2 of the prediction.
- fit(data, key=None, features=None, label=None, categorical_variable=None, dependent_variable=None, excluded_feature=None)
Fit a generalized linear model based on training data.
- Parameters:
- dataDataFrame
Training data.
- keystr, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
Required when output_fitted is True (in initialization).
- featureslist of str or str, optional
Names of the feature columns.
Defaults to all non-ID, non-label columns.
- labelstr or list of str, optional
Name of the dependent variable.
Defaults to the last non-ID column (this is not the PAL default).
When family is 'binomial', label may be either a single column name or a list of two column names.
- categorical_variablelist of str or str, optional
INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.
- dependent_variablestr, optional(deprecated and ineffective)
Only used when you need to indicate the dependence.
Please use label instead.
- excluded_featurelist of str, optional(deprecated and ineffective)
Excludes the indicated feature column.
If necessary, please use features instead.
Defaults to None.
- Returns:
- Fitted object.
- predict(data, key=None, features=None, prediction_type=None, significance_level=None, handle_missing=None, thread_ratio=0.0)
Predict dependent variable values based on fitted model.
- Parameters:
- dataDataFrame
Independent variable values to predict for.
- keystr, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- featureslist of str, optional
Names of the feature columns.
Defaults to all non-ID columns.
- prediction_type{'response', 'link'}, optional
Specifies whether to output predicted values of the response or the link function.
Defaults to 'response'.
- significance_levelfloat, optional
Significance level for confidence intervals and prediction intervals. If specified, overrides the value passed to the GLM constructor.
- handle_missing{'skip', 'fill_zero'}, optional
How to handle data rows with missing independent variable values.
'skip': Don't perform prediction for those rows.
'fill_zero': Replace missing values with 0.
Defaults to 'skip'.
- thread_ratiofloat, optional
Controls the proportion of available threads to use for prediction.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
- Returns:
- DataFrame
Predicted values, structured as follows. The following two columns are always populated:
ID column, with the same name and type as data's ID column.
PREDICTION, type NVARCHAR(100), representing predicted values.
The following five columns are only populated for IRLS:
SE, type DOUBLE. Standard error, or for ordinal regression, the probability that the data point belongs to the predicted category.
CI_LOWER, type DOUBLE. Lower bound of the confidence interval.
CI_UPPER, type DOUBLE. Upper bound of the confidence interval.
PI_LOWER, type DOUBLE. Lower bound of the prediction interval.
PI_UPPER, type DOUBLE. Upper bound of the prediction interval.
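Because PREDICTION is typed NVARCHAR(100), its values may arrive on the client as text rather than numbers. A minimal sketch of converting them before numeric use follows; the rows list is hand-written to stand in for a collected result and is not real output.

```python
# Illustrative (ID, PREDICTION) rows; PREDICTION is text because the
# column type is NVARCHAR(100).
rows = [(1, "0.25543735346197155"), (2, "0.744562646538029")]

# Convert to float before doing arithmetic. This is meant for numeric
# responses; for ordinal regression the prediction is presumably a
# category label and would be kept as text.
numeric = {row_id: float(pred) for row_id, pred in rows}
total = sum(numeric.values())
```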
- score(data, key=None, features=None, label=None, prediction_type=None, handle_missing=None)
Returns the coefficient of determination R^2 of the prediction.
Not applicable for ordinal regression.
- Parameters:
- dataDataFrame
Data on which to assess model performance.
- keystr, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- featureslist of str, optional
Names of the feature columns.
Defaults to all non-ID, non-label columns.
- labelstr, optional
Name of the dependent variable.
Defaults to the last non-ID column(this is not the PAL default).
Cannot be two columns, even if family was set to 'binomial' when initializing the GLM class instance.
- prediction_type{'response', 'link'}, optional
Specifies whether to predict the value of the response or the link function.
The contents of the label column should match this choice.
Defaults to 'response'.
- handle_missing{'skip', 'fill_zero'}, optional
How to handle data rows with missing independent variable values.
'skip': Don't perform prediction for those rows. Those rows will be left out of the R^2 computation.
'fill_zero': Replace missing values with 0.
Defaults to 'skip'.
- Returns:
- float
The coefficient of determination R^2 of the prediction on the given data.
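For reference, the coefficient of determination is conventionally defined as 1 - SS_res / SS_tot. The plain-Python sketch below implements that textbook definition; that PAL computes exactly this quantity is an assumption, and r2_score here is a local illustrative function rather than a hana_ml API.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


# Rows skipped by handle_missing='skip' would be absent from both lists.
print(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # prints 1.0
```

A model that always predicts the mean of the labels scores 0 under this definition, and worse models can score below 0.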
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
Inherited Methods from PALBase
In addition to the methods described above, the GLM class inherits methods from the PALBase class; refer to PAL Base for more details.