LogisticRegression

class hana_ml.algorithms.pal.linear_model.LogisticRegression(multi_class=False, max_iter=None, pmml_export=None, categorical_variable=None, standardize=True, stat_inf=False, solver=None, enet_alpha=None, enet_lambda=None, tol=None, epsilon=None, thread_ratio=None, max_pass_number=None, sgd_batch_number=None, precompute=None, handle_missing=None, resampling_method=None, metric=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, lbfgs_m=None, class_map0=None, class_map1=None, progress_indicator_id=None, param_values=None, param_range=None, json_export=None, resource=None, max_resource=None, min_resource_rate=None, reduction_rate=None, aggressive_elimination=None, ps_verbose=None, onehot_min_frequency=None, onehot_max_categories=None)

Logistic regression models the relationship between a dichotomous dependent variable (also known as explained variable) and one or more continuous or categorical independent variables (also known as explanatory variables). It models the log odds of the dependent variable as a linear combination of the independent variables. LogisticRegression handles both binary-class and multi-class classification problems.
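The log-odds formulation described above can be illustrated with a minimal sketch. This is not PAL's implementation; the helper names `log_odds` and `probability` and the example coefficients are made up for illustration.

```python
import math

def log_odds(intercept, coefs, x):
    """Log odds of the positive class: intercept + sum(coef_i * x_i)."""
    return intercept + sum(c * v for c, v in zip(coefs, x))

def probability(intercept, coefs, x):
    """Logistic (sigmoid) transform of the log odds."""
    return 1.0 / (1.0 + math.exp(-log_odds(intercept, coefs, x)))

# Hypothetical coefficients and a single observation with two features.
p = probability(-1.0, [0.5, 2.0], [2.0, 0.0])
```

A log odds of zero corresponds to a probability of 0.5, which is the usual decision boundary for a binary classifier.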

Parameters:
multi_classbool, optional

If True, perform multi-class classification. Otherwise, there must be only two classes.

Defaults to False.

max_iterint, optional

Maximum number of iterations taken for the solvers to converge. If convergence is not reached after this number, an error will be generated.

When solver is 'newton' or 'lbfgs', the convex optimizer may return suboptimal results after the maximum number of iterations. When solver is 'cyclical', if convergence is not reached after the maximum number of passes over training data, an error will be generated.

  • multi-class: Defaults to 100.

  • binary-class: Defaults to 100000 when solver is 'cyclical', 1000 when solver is 'proximal', otherwise 100.
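The defaults listed above can be summarized in a small helper. This is only a sketch of the documented rules; `default_max_iter` is a hypothetical name and is not part of hana_ml. When solver is 'auto' or omitted, the solver actually used is chosen by the system, so the effective default may differ from what this helper returns for that case.

```python
def default_max_iter(multi_class=False, solver=None):
    """Return the documented default for max_iter.

    Assumption: 'auto' and None fall into the "otherwise" branch here,
    although the system may pick a solver with a different default.
    """
    if multi_class:
        return 100
    if solver == 'cyclical':
        return 100000
    if solver == 'proximal':
        return 1000
    return 100
```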

pmml_export{'no', 'single-row', 'multi-row'}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

multi-class

  • 'no' or not provided: No PMML model.

  • 'multi-row': Exports logistic regression model in PMML.

binary-class

  • 'no' or not provided: No PMML model.

  • 'single-row': Exports a PMML model in a maximum of one row. Fails if the model doesn't fit in one row.

  • 'multi-row': Exports a PMML model, splitting it across multiple rows if it doesn't fit in one.

In multi-class, both PMML and JSON format model can be exported. JSON format is preferred if both formats are to be exported.

Defaults to 'no'.

categorical_variablestr or a list of str, optional(deprecated)

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

standardizebool, optional

If True, standardize the data to have zero mean and unit variance.

Defaults to True.

stat_infbool, optional

If True, perform statistical inference.

Defaults to False.

solver{'auto', 'newton', 'cyclical', 'lbfgs', 'stochastic', 'proximal'}, optional

Optimization algorithm.

  • 'auto' : automatically determined by system based on input data and parameters.

  • 'newton': Newton iteration method, can only solve ridge regression problems.

  • 'cyclical': Cyclical coordinate descent method to fit elastic net regularized logistic regression.

  • 'lbfgs': LBFGS method (recommended when having many independent variables, can only solve ridge regression problems when multi_class is True).

  • 'stochastic': Stochastic gradient descent method (recommended when dealing with very large datasets), can only solve ridge regression problems.

  • 'proximal': Proximal gradient descent method to fit elastic net regularized logistic regression.

When multi_class is True, only 'auto', 'lbfgs' and 'cyclical' are valid solvers.

Defaults to 'auto'.

Note

If the elastic net regularization term contains a LASSO penalty but the specified solver can only solve ridge regression problems, the specified solver is ignored and the default value is used instead. Check the statistics table to see which solver was actually adopted.
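The solver rules and the note above can be sketched as a pre-flight check. The names `has_lasso_penalty` and `solver_is_honored` are hypothetical, and the sketch assumes that a LASSO term is present only when both enet_alpha and enet_lambda are greater than 0. It reads the 'lbfgs' description as "ridge only when multi_class is True".

```python
# Solvers documented as limited to ridge (pure L2) problems in the binary case.
RIDGE_ONLY_BINARY = {'newton', 'stochastic'}
# Solvers documented as valid when multi_class is True.
MULTI_CLASS_SOLVERS = {'auto', 'lbfgs', 'cyclical'}

def has_lasso_penalty(enet_alpha=1.0, enet_lambda=0.0):
    """Assumption: an L1 term is present only if both values are positive."""
    return enet_lambda > 0 and enet_alpha > 0

def solver_is_honored(solver, multi_class=False, enet_alpha=1.0, enet_lambda=0.0):
    """Return False if the documented rule says the solver would be ignored."""
    if multi_class and solver not in MULTI_CLASS_SOLVERS:
        raise ValueError(f"{solver!r} is not a valid solver when multi_class is True")
    ridge_only = solver in RIDGE_ONLY_BINARY or (multi_class and solver == 'lbfgs')
    return not (ridge_only and has_lasso_penalty(enet_alpha, enet_lambda))
```

A check like this only mirrors the documentation; the statistics table of a fitted model remains the authoritative source for the solver that was used.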

enet_alphafloat, optional

The elastic net mixing parameter. The valid range is from 0 to 1 inclusive (0: Ridge penalty, 1: LASSO penalty).

Defaults to 1.0.

enet_lambdafloat, optional

Penalized weight. The value should be equal to or greater than 0.

Defaults to 0.0.
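To show how enet_alpha and enet_lambda interact, the sketch below evaluates an elastic net penalty in the common glmnet-style form, lambda * ((1 - alpha) / 2 * ||b||_2^2 + alpha * ||b||_1). That form is an assumption; the exact scaling used by PAL is not stated here and may differ. `elastic_net_penalty` is a hypothetical helper.

```python
def elastic_net_penalty(coefs, enet_alpha=1.0, enet_lambda=0.0):
    """Elastic net penalty in the assumed glmnet-style parameterization."""
    l1 = sum(abs(b) for b in coefs)   # LASSO part
    l2 = sum(b * b for b in coefs)    # Ridge part
    return enet_lambda * ((1.0 - enet_alpha) / 2.0 * l2 + enet_alpha * l1)
```

With the documented defaults (enet_lambda of 0.0) the penalty vanishes, so no regularization is applied regardless of enet_alpha.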

tolfloat, optional

Convergence threshold for exiting iterations.

Defaults to 1.0e-7 when solver is 'cyclical', 1.0e-6 otherwise.

epsilonfloat, optional

Determines the accuracy with which the solution is to be found.

When solver is 'lbfgs', the condition is: \(\|g\| < \text{epsilon} \cdot \max\{1, \|x\|\}\), where g is the gradient of the objective function, x is the solution at the current iteration, and \(\|\cdot\|\) denotes the L2 norm.

When solver is 'newton', the condition is: \(\|x - x'\| < \text{epsilon} \cdot \sqrt{n}\), where x is the solution at the current iteration, x' is the solution at the previous iteration, and n is the number of features.

Only valid when multi_class is False and the solver is 'newton' or 'lbfgs'.

Defaults to 1.0e-6 when solver is 'newton', 1.0e-5 when solver is 'lbfgs'.
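The two stopping conditions above can be written out directly. This is a sketch of the stated inequalities, not of the optimizers themselves; the function names are hypothetical, and n is taken to be the length of the solution vector, which is an assumption if the vector also carries an intercept.

```python
import math

def l2_norm(v):
    """Euclidean (L2) norm of a sequence of floats."""
    return math.sqrt(sum(c * c for c in v))

def lbfgs_converged(grad, x, epsilon=1.0e-5):
    """||g|| < epsilon * max(1, ||x||)."""
    return l2_norm(grad) < epsilon * max(1.0, l2_norm(x))

def newton_converged(x, x_prev, epsilon=1.0e-6):
    """||x - x'|| < epsilon * sqrt(n), with n assumed to be len(x)."""
    diff = [a - b for a, b in zip(x, x_prev)]
    return l2_norm(diff) < epsilon * math.sqrt(len(x))
```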

thread_ratiofloat, optional

Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 1.0.

max_pass_numberint, optional

The maximum number of passes over the data.

Only valid when multi_class is False and (actual) solver is 'stochastic'.

Defaults to 1.

sgd_batch_numberint, optional

The batch number of Stochastic gradient descent.

Only valid when multi_class is False and (actual) solver is 'stochastic'.

Defaults to 1.

precomputebool, optional

Whether to pre-compute the Gram matrix.

Only valid when multi_class is False and (actual) solver is 'cyclical'.

Defaults to True.

handle_missingbool, optional
  • True : handle missing values.

  • False : do not handle missing values.

Only valid when multi_class is False.

Defaults to True.

categorical_variablestr or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

lbfgs_mint, optional

Number of previous updates to keep.

Only applicable when multi_class is False and solver is 'lbfgs'.

Defaults to 6.

resampling_methodstr, optional

Specifies the resampling method for model evaluation or parameter selection.

Valid resampling methods are listed as follows: 'cv', 'cv_sha', 'cv_hyperband', 'stratified_cv', 'stratified_cv_sha', 'stratified_cv_hyperband', 'bootstrap', 'bootstrap_sha', 'bootstrap_hyperband', 'stratified_bootstrap', 'stratified_bootstrap_sha', 'stratified_bootstrap_hyperband'.

Resampling methods with suffix 'sha' or 'hyperband' are only applicable to parameter selection, and currently these methods cannot be specified when multi_class is not True.

If no value specified, neither model evaluation nor parameter selection is activated.

metric{'accuracy', 'f1_score', 'auc', 'nll'}, optional(deprecated)

The evaluation metric used for model evaluation/parameter selection.

Deprecated, please use evaluation_metric instead.

evaluation_metric{'accuracy', 'f1_score', 'auc', 'nll'}, optional

The evaluation metric used for model evaluation/parameter selection.

Must be specified together with resampling_method to activate model-evaluation/parameter-selection.

fold_numint, optional

The number of folds for cross-validation.

Mandatory and valid only when resampling_method is cross-validation based (its name contains 'cv', e.g. 'cv', 'stratified_cv_sha').

repeat_timesint, optional

The number of repeat times for resampling.

Defaults to 1.

search_strategy{'grid', 'random'}, optional

The search method for parameter selection.

random_search_timesint, optional

The number of times to randomly select candidate parameters for selection.

Mandatory and valid when search_strategy is 'random'.

random_stateint, optional

The seed for random generation. 0 indicates using system time as seed.

Defaults to 0.

progress_indicator_idstr, optional

The ID of progress indicator for model evaluation/parameter selection.

Progress indicator deactivated if no value provided.

param_valuesdict or list of tuples, optional

Specifies values of specific parameters to be selected.

Valid only when resampling_method and search_strategy are specified.

Specific parameters can be enet_lambda, enet_alpha.

No default value.

param_rangedict or list of tuples, optional

Specifies range of specific parameters to be selected.

Valid only when resampling_method and search_strategy are specified.

Specific parameters can be enet_lambda, enet_alpha.

No default value.
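The difference between the 'grid' and 'random' search strategies can be sketched conceptually with candidate lists built from a param_values dictionary. This only illustrates the general idea; how PAL enumerates or samples candidates internally is not described here. The helper names are hypothetical, and the example values are made up.

```python
import itertools
import random

def grid_candidates(param_values):
    """Every combination of the listed values (exhaustive grid)."""
    names = sorted(param_values)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(param_values[n] for n in names))]

def random_candidates(param_values, random_search_times, random_state=0):
    """Draw random_search_times candidates, one value per parameter."""
    # Mirrors the documented meaning of random_state: 0 uses the system time.
    rng = random.Random(None if random_state == 0 else random_state)
    names = sorted(param_values)
    return [{n: rng.choice(param_values[n]) for n in names}
            for _ in range(random_search_times)]

# Example values for the two tunable parameters named above.
param_values = {'enet_lambda': [0.01, 0.1, 1.0], 'enet_alpha': [0.0, 0.5, 1.0]}
grid = grid_candidates(param_values)
sampled = random_candidates(param_values, random_search_times=4, random_state=7)
```

A grid grows multiplicatively with each added parameter, which is the usual reason to prefer a fixed number of random draws.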

class_map0str, optional (deprecated)

Categorical label to map to 0.

class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid when multi_class is False.

class_map1str, optional (deprecated)

Categorical label to map to 1.

class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid when multi_class is False.

json_exportbool, optional
  • False : Does not export the multi-class logistic regression model in JSON.

  • True : Exports the multi-class logistic regression model in JSON.

Only valid when multi_class is True.

Currently either PMML or JSON format model can be exported. JSON format is preferred if both formats are to be exported.

Defaults to False.

resourcestr, optional

Specifies the resource type used in successive-halving and hyperband algorithm for parameter selection:

  • 'max_iter'

  • 'max_pass_number'

Mandatory and valid only when resampling_method is specified with suffix 'sha' or 'hyperband'.

If multi_class is set as True, then currently only 'max_iter' is valid; otherwise if multi_class is False, then

  • 'max_pass_number' is valid only when the actual solver is 'stochastic'

  • 'max_iter' is valid for other solvers

max_resourceint, optional

Maximum allowed resource budget for single hyper-parameter candidate, must be greater than 0.

Mandatory and valid only when resource is set.

min_resource_ratefloat, optional

Specifies the minimum required resource budget compared to maximum resource for single hyper-parameter candidate. Valid value should be greater than or equal to 0, but less than 1.

Valid only when resource is set.

Defaults to 0.

reduction_ratefloat, optional

Specifies the reduction rate of the number of available hyper-parameter candidates. In each round, the number of available candidates is divided by the value of this parameter, so the value must be greater than 1.0.

Valid only when resource is set.

Defaults to 3.0.
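The interplay of max_resource, min_resource_rate and reduction_rate can be sketched with a textbook successive-halving schedule. This is an assumption-laden illustration: the exact schedule PAL uses is not specified here, and `sha_schedule` is a hypothetical helper. The sketch divides the candidate count by reduction_rate each round and gives the last round the full budget.

```python
import math

def sha_schedule(n_candidates, max_resource, min_resource_rate=0.0, reduction_rate=3.0):
    """Return a list of (candidates, resource per candidate), one pair per round."""
    if n_candidates < 1 or max_resource <= 0:
        raise ValueError("n_candidates and max_resource must be positive")
    if not 0 <= min_resource_rate < 1:
        raise ValueError("min_resource_rate must be in [0, 1)")
    if reduction_rate <= 1:
        raise ValueError("reduction_rate must be greater than 1")
    counts = [n_candidates]
    while counts[-1] > 1:
        n = counts[-1]
        # Divide by reduction_rate, but always drop at least one candidate.
        counts.append(max(1, min(n - 1, math.ceil(n / reduction_rate))))
    floor = max(1, math.ceil(min_resource_rate * max_resource))
    last = len(counts) - 1
    return [(n, min(max_resource,
                    max(floor, math.ceil(max_resource / reduction_rate ** (last - i)))))
            for i, n in enumerate(counts)]

schedule = sha_schedule(27, 81)
```

For 27 candidates, a budget of 81 and the default reduction rate of 3.0, the rounds in this sketch evaluate 27, 9, 3 and 1 candidates with 3, 9, 27 and 81 units of resource respectively.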

aggressive_eliminationbool, optional

Specifies whether to perform aggressive elimination behavior for successive-halving algorithm or not.

When set to True, it will eliminate more parameter candidates than expected (as defined via reduction_rate). This can improve run-time performance but could result in a sub-optimal hyper-parameter candidate.

Valid only when resampling_method is specified with suffix 'sha'.

Defaults to False.

ps_verbosebool, optional

Specifies whether to output optimal hyper-parameter and all evaluation statistics of related hyper-parameter candidates in attribute statistics_ or not.

Defaults to True.

onehot_min_frequencyint, optional

Specifies the minimum frequency below which a category will be considered infrequent. Only available for multiclass.

Defaults to 1.

onehot_max_categoriesint, optional

Specifies an upper limit to the number of output features for each input feature. It includes the feature that combines infrequent categories. Only available for multiclass.

Defaults to 0.
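The two one-hot options can be illustrated with a sketch modeled on the similarly named min_frequency and max_categories options of scikit-learn's OneHotEncoder. PAL's actual behavior may differ. The sketch assumes that a value of 0 for onehot_max_categories means "no limit" and that categories are strings; `onehot_categories` is a hypothetical helper.

```python
from collections import Counter

def onehot_categories(values, min_frequency=1, max_categories=0):
    """Split categories into (kept, infrequent) lists.

    All infrequent categories would share a single combined output feature,
    which counts toward max_categories.
    """
    counts = Counter(values)
    # Most frequent first; ties broken by name so the result is deterministic.
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    kept = [c for c in ordered if counts[c] >= min_frequency]
    infrequent = [c for c in ordered if counts[c] < min_frequency]
    # Assumption: 0 disables the upper limit.
    if max_categories > 0 and len(kept) + bool(infrequent) > max_categories:
        keep = max_categories - 1  # reserve one slot for the combined feature
        infrequent = kept[keep:] + infrequent
        kept = kept[:keep]
    return kept, infrequent

values = ['a'] * 5 + ['b'] * 3 + ['c'] * 2 + ['d']
```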

Examples

>>> from hana_ml.algorithms.pal import linear_model
>>> lr = linear_model.LogisticRegression(solver='newton', max_iter=1000,
...                                      pmml_export='single-row', stat_inf=True, tol=0.000001)
>>> lr.fit(data=df, features=['V1', 'V2', 'V3'], label='CATEGORY')
>>> lr.coef_.collect()

Perform predict():

>>> result = lr.predict(data=df_predict, key='ID')
>>> result.collect()

Perform score():

>>> lr.score(data=df_score, key='ID')

Attributes:
coef_DataFrame

Values of the coefficients.

result_DataFrame

Model content.

optim_param_DataFrame

The optimal parameter set selected via cross-validation. Empty if cross-validation is not activated.

stat_DataFrame

Statistics.

pmml_DataFrame

PMML model. Set to None if no PMML model was requested. In multi-class logistic regression, please use semistructured_result_ shown below to get the model in PMML or JSON format.

semistructured_result_DataFrame

A multi-class logistic regression model in PMML or JSON format.

Methods

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

fit(data[, key, features, label, ...])

Fit the LR model when given training dataset.

get_model_metrics()

Get the model metrics.

get_score_metrics()

Get the score metrics.

predict(data[, key, features, ...])

Predict dependent variable values based on a fitted model.

score(data[, key, features, label, ...])

Return the mean accuracy on the given test data and labels.

set_model_state(state)

Set the model state by state information.

fit(data, key=None, features=None, label=None, categorical_variable=None, class_map0=None, class_map1=None)

Fit the LR model when given training dataset.

Parameters:
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of the ID column.

If key is not provided, then:

  • if data is indexed by a single column, then key defaults to that index column;

  • otherwise, it is assumed that data contains no ID column.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the label column.

If label is not provided, it defaults to the last column.

categorical_variablestr or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

class_map0str, optional

Categorical label to map to 0.

class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

class_map1str, optional

Categorical label to map to 1.

class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

Returns:
A fitted object of class "LogisticRegression".
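The role of class_map0 and class_map1 for string labels can be sketched as a plain mapping to 0 and 1. This only illustrates the idea; the mapping itself happens inside the PAL procedure, and `map_labels` is a hypothetical helper.

```python
def map_labels(labels, class_map0, class_map1):
    """Map two categorical labels to 0 and 1, rejecting anything else."""
    mapping = {class_map0: 0, class_map1: 1}
    unknown = set(labels) - set(mapping)
    if unknown:
        raise ValueError(f"labels not covered by class_map0/class_map1: {sorted(unknown)}")
    return [mapping[label] for label in labels]

encoded = map_labels(['no', 'yes', 'yes', 'no'], class_map0='no', class_map1='yes')
```
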
predict(data, key=None, features=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None, verbose=False, ignore_unknown_category=None, verbose_top_n=None)

Predict dependent variable values based on a fitted model.

Parameters:
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

verbosebool, optional

If True, output scoring probabilities for each class.

It is only applicable for multi-class logistic regression.

Defaults to False.

categorical_variablestr or a list of str, optional (deprecated)

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

thread_ratiofloat, optional

Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

class_map0str, optional

Categorical label to map to 0.

class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

class_map1str, optional

Categorical label to map to 1.

class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

ignore_unknown_categorybool, optional

Specifies whether or not to ignore unknown category value.

  • False : Report error if unknown category value is found.

  • True : Ignore unknown category value if there is any.

Valid only for multi-class logistic regression.

Defaults to True.

verbose_top_nint, optional

Specifies the number of top n classes to present after sorting with confidences. It cannot exceed the number of classes in label of the training data, and it can be 0, which means to output the confidences of all classes.

Effective only when verbose is set as True and only for multi-class logistic regression.

Defaults to 0.
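The effect of verbose_top_n can be sketched as a sort-and-truncate over per-class confidences. The helper `top_n_classes` and the example confidences are hypothetical; the real selection is performed by the prediction procedure.

```python
def top_n_classes(confidences, verbose_top_n=0):
    """Return (class, confidence) pairs sorted by confidence, highest first.

    A value of 0 returns all classes, as documented above.
    """
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    if verbose_top_n < 0 or verbose_top_n > len(ranked):
        raise ValueError("verbose_top_n must be between 0 and the number of classes")
    return ranked if verbose_top_n == 0 else ranked[:verbose_top_n]

# Made-up confidences for one observation.
scores = {'A': 0.2, 'B': 0.7, 'C': 0.1}
```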

Returns:
DataFrame

Predicted result, structured as follows:

  • Column 1: ID

  • Column 2: Predicted class label

  • Column 3: PROBABILITY, type DOUBLE

    • for multi-class: probability of being predicted as the predicted class.

    • for binary-class: probability of being predicted as the positive class.
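Because the PROBABILITY column means different things in the two modes, a binary-class result may need a small conversion before it is comparable to a multi-class one. The sketch below assumes the positive class is the label mapped to 1; `predicted_class_probability` is a hypothetical helper operating on plain Python values rather than on a hana_ml DataFrame.

```python
def predicted_class_probability(predicted_label, positive_probability, positive_label):
    """Convert P(positive class) into P(predicted class) for a binary result."""
    if predicted_label == positive_label:
        return positive_probability
    return 1.0 - positive_probability
```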

score(data, key=None, features=None, label=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None)

Return the mean accuracy on the given test data and labels.

Parameters:
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the label column.

If label is not provided, it defaults to the last column.

categorical_variablestr or a list of str, optional (deprecated)

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

thread_ratiofloat, optional

Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

class_map0str, optional

Categorical label to map to 0.

class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

class_map1str, optional

Categorical label to map to 1.

class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

Returns:
float

Scalar accuracy value after comparing the predicted label and original label.
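The returned value corresponds to the usual definition of mean accuracy, sketched here on plain Python lists. `mean_accuracy` is a hypothetical helper, not the code path used by score().

```python
def mean_accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the original label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

accuracy = mean_accuracy(['a', 'b', 'b', 'a'], ['a', 'b', 'a', 'a'])
```
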

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters:
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

A placeholder parameter, not effective for LogisticRegression.

pal_funcnameint or str, optional

PAL function name. Must be a valid PAL procedure that supports model state.

Defaults to self.pal_funcname.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True it will delete the existing state.

Defaults to False.

set_model_state(state)

Set the model state by state information.

Parameters:
stateDataFrame or dict

If state is DataFrame, it has the following structure:

  • NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.

  • VALUE: VARCHAR(1000), the values according to NAME.

If state is dict, its keys must include STATE_ID, HINT, HOST and PORT.
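The dictionary form can be checked before it is passed on. This is a minimal sketch: `missing_state_keys` is a hypothetical helper and the values shown are placeholders, not real state information.

```python
REQUIRED_STATE_KEYS = ('STATE_ID', 'HINT', 'HOST', 'PORT')

def missing_state_keys(state):
    """Return the required keys that are absent from a state dictionary."""
    return [key for key in REQUIRED_STATE_KEYS if key not in state]

# Placeholder values; real ones come from a previously created model state.
state = {'STATE_ID': '<state id>', 'HINT': '<hint>', 'HOST': '<host>', 'PORT': '<port>'}
```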

delete_model_state(state=None)

Delete PAL model state.

Parameters:
stateDataFrame, optional

Specifies the state.

Defaults to self.state.

get_model_metrics()

Get the model metrics.

Returns:
DataFrame

The model metrics.

get_score_metrics()

Get the score metrics.

Returns:
DataFrame

The score metrics.

Inherited Methods from PALBase

Besides the methods mentioned above, the LogisticRegression class also inherits methods from the PALBase class. Please refer to PAL Base for more details.