LogisticRegression

class hana_ml.algorithms.pal.linear_model.LogisticRegression(multi_class=False, max_iter=None, pmml_export=None, categorical_variable=None, standardize=True, stat_inf=False, solver=None, enet_alpha=None, enet_lambda=None, tol=None, epsilon=None, thread_ratio=None, max_pass_number=None, sgd_batch_number=None, precompute=None, handle_missing=None, resampling_method=None, metric=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, lbfgs_m=None, class_map0=None, class_map1=None, progress_indicator_id=None, param_values=None, param_range=None, json_export=None, resource=None, max_resource=None, min_resource_rate=None, reduction_rate=None, aggressive_elimination=None, ps_verbose=None)

Logistic regression model that handles binary-class and multi-class classification problems.

Parameters

multi_classbool, optional

If True, perform multi-class classification. Otherwise, there must be only two classes.

Defaults to False.

max_iterint, optional

Maximum number of iterations taken for the solvers to converge. If convergence is not reached after this number, an error will be generated.

When solver is 'newton' or 'lbfgs', the convex optimizer may return suboptimal results after the maximum number of iterations. When solver is 'cyclical', if convergence is not reached after the maximum number of passes over training data, an error will be generated.

multi-class: Defaults to 100.
binary-class: Defaults to 100000 when solver is 'cyclical', 1000 when solver is 'proximal', otherwise 100.
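The defaults listed above can be summarized as a small lookup. This is only a sketch: `default_max_iter` is a hypothetical helper, not part of hana_ml, and it assumes that `solver` is the solver actually in use (with 'auto', the system resolves the solver first).

```python
def default_max_iter(multi_class, solver):
    """Return the documented default of max_iter (hypothetical helper).

    `solver` is assumed to be the solver that is actually used.
    """
    if multi_class:
        return 100
    if solver == 'cyclical':
        return 100000
    if solver == 'proximal':
        return 1000
    return 100


print(default_max_iter(False, 'cyclical'))  # 100000
print(default_max_iter(True, 'cyclical'))   # 100
```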

pmml_export{'no', 'single-row', 'multi-row'}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

multi-class

'no' or not provided: No PMML model.

'multi-row': Exports logistic regression model in PMML.

binary-class

'no' or not provided: No PMML model.

'single-row': Exports a PMML model in a maximum of one row. Fails if the model doesn't fit in one row.

'multi-row': Exports a PMML model, splitting it across multiple rows if it doesn't fit in one.

In multi-class, both PMML and JSON format model can be exported. JSON format is preferred if both formats are to be exported.

Defaults to 'no'.

categorical_variablestr or list of str, optional(deprecated)

Specifies INTEGER column(s) in the data that should be treated as categorical variables.

standardizebool, optional

If True, standardize the data to have zero mean and unit variance.

Defaults to True.

stat_infbool, optional

If True, proceed with statistical inference.

Defaults to False.

solver{'auto', 'newton', 'cyclical', 'lbfgs', 'stochastic', 'proximal'}, optional

Optimization algorithm.

'auto' : automatically determined by system based on input data and parameters.
'newton': Newton iteration method, can only solve ridge regression problems.
'cyclical': Cyclical coordinate descent method to fit elastic net regularized logistic regression.
'lbfgs': LBFGS method (recommended when there are many independent variables; can only solve ridge regression problems when multi_class is True).
'stochastic': Stochastic gradient descent method (recommended for very large datasets); can only solve ridge regression problems.
'proximal': Proximal gradient descent method to fit elastic net regularized logistic regression.

When multi_class is True, only 'auto', 'lbfgs' and 'cyclical' are valid solvers.

Defaults to 'auto'.

Note

If the elastic net regularization term includes a LASSO penalty while the specified solver can only solve ridge regression problems, the specified solver is ignored and the default is used instead. Check the statistics table to see which solver was actually adopted.
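The rule in the note can be sketched as a pre-flight check. The helper names below are hypothetical and not part of hana_ml. The sketch reads the solver list literally ('newton' and 'stochastic' are ridge-only, 'lbfgs' is ridge-only when multi_class is True) and assumes that a LASSO term is present whenever both enet_lambda and enet_alpha are greater than 0.

```python
# Ridge-only solvers, read literally from the solver descriptions above.
RIDGE_ONLY_BINARY = {'newton', 'stochastic'}
RIDGE_ONLY_MULTI = {'lbfgs'}


def has_lasso_term(enet_alpha, enet_lambda):
    """Assumption: the penalty has an L1 part when both values are positive."""
    return enet_lambda > 0 and enet_alpha > 0


def solver_may_be_ignored(solver, multi_class, enet_alpha=1.0, enet_lambda=0.0):
    """Return True if the requested solver conflicts with a LASSO penalty."""
    ridge_only = RIDGE_ONLY_MULTI if multi_class else RIDGE_ONLY_BINARY
    return solver in ridge_only and has_lasso_term(enet_alpha, enet_lambda)


print(solver_may_be_ignored('newton', False, enet_alpha=0.5, enet_lambda=0.1))    # True
print(solver_may_be_ignored('cyclical', False, enet_alpha=0.5, enet_lambda=0.1))  # False
```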

enet_alphafloat, optional

The elastic net mixing parameter. The valid range is from 0 to 1 inclusive (0: Ridge penalty, 1: LASSO penalty).

Defaults to 1.0.

enet_lambdafloat, optional

Penalized weight. The value should be equal to or greater than 0.

Defaults to 0.0.
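To illustrate how enet_alpha and enet_lambda interact, the sketch below evaluates an elastic net penalty in the commonly used form lambda * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2). This form is an assumption: this section does not spell out the exact expression that PAL uses, and `elastic_net_penalty` is a hypothetical helper.

```python
def elastic_net_penalty(coefficients, enet_alpha=1.0, enet_lambda=0.0):
    """Elastic net penalty in the common glmnet-style form (assumed)."""
    if not 0.0 <= enet_alpha <= 1.0:
        raise ValueError("enet_alpha must be between 0 and 1 inclusive")
    if enet_lambda < 0:
        raise ValueError("enet_lambda must be greater than or equal to 0")
    l1 = sum(abs(b) for b in coefficients)       # LASSO part
    l2_squared = sum(b * b for b in coefficients)  # Ridge part
    return enet_lambda * (enet_alpha * l1 + 0.5 * (1.0 - enet_alpha) * l2_squared)


coefs = [1.0, -2.0]
print(elastic_net_penalty(coefs, enet_alpha=1.0, enet_lambda=0.5))  # 1.5 (pure LASSO)
print(elastic_net_penalty(coefs, enet_alpha=0.0, enet_lambda=0.5))  # 1.25 (pure Ridge)
```

With the documented defaults (enet_lambda of 0.0), the penalty is zero regardless of enet_alpha.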

tolfloat, optional

Convergence threshold for exiting iterations.

Defaults to 1.0e-7 when solver is 'cyclical', 1.0e-6 otherwise.

epsilonfloat, optional

Determines the accuracy with which the solution is to be found.

When solver is 'lbfgs', the condition is \(\|g\| < \text{epsilon} \cdot \max\{1, \|x\|\}\), where g is the gradient of the objective function, x is the solution at the current iteration, and \(\|\cdot\|\) denotes the L2 norm.

When solver is 'newton', the condition is \(\|x - x'\| < \text{epsilon} \cdot \sqrt{n}\), where x is the solution at the current iteration, x' is the solution at the previous iteration, and n is the number of features.

Only valid when multi_class is False and the solver is 'newton' or 'lbfgs'.

Defaults to 1.0e-6 when solver is 'newton', 1.0e-5 when solver is 'lbfgs'.
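The two stopping conditions can be written out directly. The functions below are hypothetical helpers that only restate the inequalities above in plain Python; they are not how PAL implements the check internally.

```python
import math


def l2_norm(vector):
    """Euclidean (L2) norm of a sequence of floats."""
    return math.sqrt(sum(c * c for c in vector))


def lbfgs_converged(gradient, x, epsilon=1.0e-5):
    """||g|| < epsilon * max{1, ||x||}"""
    return l2_norm(gradient) < epsilon * max(1.0, l2_norm(x))


def newton_converged(x, x_prev, n_features, epsilon=1.0e-6):
    """||x - x'|| < epsilon * sqrt(n)"""
    step = [a - b for a, b in zip(x, x_prev)]
    return l2_norm(step) < epsilon * math.sqrt(n_features)


print(lbfgs_converged([1e-6, 0.0], [3.0, 4.0]))              # True
print(newton_converged([1.0, 2.0], [1.0, 2.1], n_features=2))  # False
```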

thread_ratiofloat, optional

Controls the proportion of available threads to use for fit() method.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 1.0.

max_pass_numberint, optional

The maximum number of passes over the data.

Only valid when multi_class is False and (actual) solver is 'stochastic'.

Defaults to 1.

sgd_batch_numberint, optional

The batch number of stochastic gradient descent.

Only valid when multi_class is False and (actual) solver is 'stochastic'.

Defaults to 1.

precomputebool, optional

Whether to pre-compute the Gram matrix.

Only valid when multi_class is False and (actual) solver is 'cyclical'.

Defaults to True.

handle_missingbool, optional

True : handle missing values.
False : do not handle missing values.

Only valid when multi_class is False.

Defaults to True.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) in the data that should be treated as categorical.

By default, string is categorical, while int and double are numerical.

lbfgs_mint, optional

Number of previous updates to keep.

Only applicable when multi_class is False and solver is 'lbfgs'.

Defaults to 6.

resampling_methodstr, optional

Specifies the resampling method for model evaluation or parameter selection.

Valid resampling methods are listed as follows: 'cv', 'cv_sha', 'cv_hyperband', 'stratified_cv', 'stratified_cv_sha', 'stratified_cv_hyperband', 'bootstrap', 'bootstrap_sha', 'bootstrap_hyperband', 'stratified_bootstrap', 'stratified_bootstrap_sha', 'stratified_bootstrap_hyperband'.

Resampling methods with suffix 'sha' or 'hyperband' are only applicable to parameter selection; currently they can only be specified when multi_class is True.

If no value specified, neither model evaluation nor parameter selection is activated.

metric{'accuracy', 'f1_score', 'auc', 'nll'}, optional(deprecated)

The evaluation metric used for model evaluation/parameter selection.

Deprecated, please use evaluation_metric instead.

evaluation_metric{'accuracy', 'f1_score', 'auc', 'nll'}, optional

The evaluation metric used for model evaluation/parameter selection.

Must be specified together with resampling_method to activate model-evaluation/parameter-selection.

fold_numint, optional

The number of folds for cross-validation.

Mandatory and valid only when resampling_method is cross-validation based (its name contains 'cv', e.g. 'cv', 'stratified_cv_sha').

repeat_timesint, optional

The number of repeat times for resampling.

Defaults to 1.

search_strategy{'grid', 'random'}, optional

The search method for parameter selection.

random_search_timesint, optional

The number of times to randomly select candidate parameters for selection.

Mandatory and valid when search_strategy is 'random'.

random_stateint, optional

The seed for random generation. 0 indicates using system time as seed.

Defaults to 0.

progress_indicator_idstr, optional

The ID of progress indicator for model evaluation/parameter selection.

Progress indicator deactivated if no value provided.

param_valuesdict or list of tuples, optional

Specifies values of specific parameters to be selected.

Valid only when resampling_method and search_strategy are specified.

Specific parameters can be enet_lambda, enet_alpha.

No default value.

param_rangedict or list of tuples, optional

Specifies range of specific parameters to be selected.

Valid only when resampling_method and search_strategy are specified.

Specific parameters can be enet_lambda, enet_alpha.

No default value.
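As a rough illustration of how the parameter-selection arguments fit together, the sketch below assembles a set of keyword arguments for a grid search and enumerates the candidate combinations. The concrete values are made up for illustration, and the enumeration assumes that a grid search evaluates every combination of the listed values. Only the dict form of param_values is shown, since the exact layout of param_range entries is not given in this section.

```python
from itertools import product

# Illustrative candidate values (not recommendations).
param_values = {
    'enet_lambda': [0.01, 0.1, 1.0],
    'enet_alpha': [0.0, 0.5, 1.0],
}

# Keyword arguments that could be passed to the LogisticRegression constructor.
search_kwargs = dict(
    resampling_method='cv',        # cross-validation based, so fold_num is required
    evaluation_metric='accuracy',
    fold_num=5,
    search_strategy='grid',
    param_values=param_values,
)

# Enumerate the grid: one candidate per combination of values.
names = list(param_values)
candidates = [dict(zip(names, combo)) for combo in product(*param_values.values())]
print(len(candidates))  # 9
print(candidates[0])    # {'enet_lambda': 0.01, 'enet_alpha': 0.0}
```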

class_map0str, optional (deprecated)

Categorical label to map to 0.

class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid when multi_class is False.

class_map1str, optional (deprecated)

Categorical label to map to 1.

class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid when multi_class is False.

json_exportbool, optional

False : Does not export the multi-class logistic regression model in JSON.
True : Exports the multi-class logistic regression model in JSON.

Only valid when multi_class is True.

Currently either PMML or JSON format model can be exported. JSON format is preferred if both formats are to be exported.

Defaults to False.

resourcestr, optional

Specifies the resource type used in successive-halving and hyperband algorithm for parameter selection:

'max_iter'

'max_pass_number'

Mandatory and valid only when resampling_method is specified with suffix 'sha' or 'hyperband'.

If multi_class is set as True, then currently only 'max_iter' is valid; otherwise if multi_class is False, then

'max_pass_number' is valid only when the actual solver is 'stochastic'

'max_iter' is valid for other solvers
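The dependencies between resampling_method, fold_num and resource can be expressed as a few small checks. These helpers are hypothetical (not part of hana_ml) and only encode the rules stated in this section.

```python
def needs_fold_num(resampling_method):
    """fold_num is mandatory for cross-validation based methods."""
    return 'cv' in resampling_method


def needs_resource(resampling_method):
    """resource is mandatory for methods with suffix 'sha' or 'hyperband'."""
    return resampling_method.endswith(('_sha', '_hyperband'))


def valid_resource(multi_class, actual_solver):
    """Return the resource type that is valid for the given setting."""
    if multi_class:
        return 'max_iter'
    return 'max_pass_number' if actual_solver == 'stochastic' else 'max_iter'


print(needs_fold_num('stratified_cv_sha'))   # True
print(needs_resource('bootstrap'))           # False
print(valid_resource(False, 'stochastic'))   # max_pass_number
```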

max_resourceint, optional

Maximum allowed resource budget for single hyper-parameter candidate, must be greater than 0.

Mandatory and valid only when resource is set.

min_resource_ratefloat, optional

Specifies the minimum required resource budget compared to maximum resource for single hyper-parameter candidate. Valid value should be greater than or equal to 0, but less than 1.

Valid only when resource is set.

Defaults to 0.

reduction_ratefloat, optional

Specifies the reduction rate of the available size of hyper-parameter candidates. For each round, the number of available parameter candidates is divided by the value of this parameter. Thus the value must be greater than 1.0.

Valid only when resource is set.

Defaults to 3.0.
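The effect of reduction_rate can be seen by tracking how many candidates survive each round. The sketch below is a hypothetical helper; it assumes that the candidate count is rounded down after each division and that at least one candidate is always kept, neither of which is stated in this section.

```python
def candidates_per_round(n_candidates, reduction_rate=3.0):
    """List the number of surviving candidates per round (assumed rounding)."""
    if reduction_rate <= 1.0:
        raise ValueError("reduction_rate must be greater than 1.0")
    sizes = [n_candidates]
    while sizes[-1] > 1:
        # Divide by the reduction rate, round down, keep at least one candidate.
        sizes.append(max(1, int(sizes[-1] / reduction_rate)))
    return sizes


print(candidates_per_round(27))                      # [27, 9, 3, 1]
print(candidates_per_round(10, reduction_rate=2.0))  # [10, 5, 2, 1]
```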

aggressive_eliminationbool, optional

Specifies whether to perform aggressive elimination behavior for successive-halving algorithm or not.

When set to True, more parameter candidates are eliminated than expected (as defined via reduction_rate). This can improve run-time performance but may result in a sub-optimal hyper-parameter candidate.

Valid only when resampling_method is specified with suffix 'sha'.

Defaults to False.

ps_verbosebool, optional

Specifies whether to output optimal hyper-parameter and all evaluation statistics of related hyper-parameter candidates in attribute statistics_ or not.

Defaults to True.

Examples

Training data:

>>> df.collect()
V1     V2  V3  CATEGORY
 B  2.620   0         1
 B  2.875   0         1
 A  2.320   1         1
 A  3.215   2         0
 B  3.440   3         0
 B  3.460   0         0
 A  3.570   1         0
 B  3.190   2         0
 A  3.150   3         0
 B  3.440   0         0
 B  3.440   1         0
 A  4.070   3         0
 A  3.730   1         0
 B  3.780   2         0
 B  5.250   2         0
 A  5.424   3         0
 A  5.345   0         0
 B  2.200   1         1
 B  1.615   2         1
 A  1.835   0         1
 B  2.465   3         0
 A  3.520   1         0
 A  3.435   0         0
 B  3.840   2         0
 B  3.845   3         0
 A  1.935   1         1
 B  2.140   0         1
 B  1.513   1         1
 A  3.170   3         1
 B  2.770   0         1
 B  3.570   0         1
 A  2.780   3         1

Create LogisticRegression instance and call fit:

>>> from hana_ml.algorithms.pal import linear_model
>>> lr = linear_model.LogisticRegression(solver='newton',
...                                      thread_ratio=0.1, max_iter=1000,
...                                      pmml_export='single-row',
...                                      stat_inf=True, tol=0.000001)
>>> lr.fit(data=df, features=['V1', 'V2', 'V3'],
...        label='CATEGORY', categorical_variable=['V3'])
>>> lr.coef_.collect()
                                       VARIABLE_NAME  COEFFICIENT
                                __PAL_INTERCEPT__    17.044785
                               V1__PAL_DELIMIT__A     0.000000
                               V1__PAL_DELIMIT__B    -1.464903
                                               V2    -4.819740
                               V3__PAL_DELIMIT__0     0.000000
                               V3__PAL_DELIMIT__1    -2.794139
                               V3__PAL_DELIMIT__2    -4.807858
                               V3__PAL_DELIMIT__3    -2.780918
{"CONTENT":"{\"impute_model\":{\"column_statis...          NaN
>>> pred_df.collect()
ID V1     V2  V3
 0  B  2.620   0
 1  B  2.875   0
 2  A  2.320   1
 3  A  3.215   2
 4  B  3.440   3
 5  B  3.460   0
 6  A  3.570   1
 7  B  3.190   2
 8  A  3.150   3
 9  B  3.440   0
10  B  3.440   1
11  A  4.070   3
12  A  3.730   1
13  B  3.780   2
14  B  5.250   2
15  A  5.424   3
16  A  5.345   0
17  B  2.200   1

Call predict():

>>> result = lr.predict(data=pred_df,
...                     key='ID',
...                     categorical_variable=['V3'],
...                     thread_ratio=0.1)
>>> result.collect()
ID CLASS   PROBABILITY
 0     1  9.503618e-01
 1     1  8.485210e-01
 2     1  9.555861e-01
 3     0  3.701858e-02
 4     0  2.229129e-02
 5     0  2.503962e-01
 6     0  4.945832e-02
 7     0  9.922085e-03
 8     0  2.852859e-01
 9     0  2.689207e-01
10     0  2.200498e-02
11     0  4.713726e-03
12     0  2.349803e-02
13     0  5.830425e-04
14     0  4.886177e-07
15     0  6.938072e-06
16     0  1.637820e-04
17     1  8.986435e-01
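The PROBABILITY column can be related back to the coef_ table shown earlier. The sketch below assumes the standard logistic (sigmoid) link and that the reported coefficients apply to the unscaled feature values, with categorical levels encoded through the `__PAL_DELIMIT__` names. Under those assumptions it reproduces the first and fourth rows of the result to about four decimal places. `positive_class_probability` is a hypothetical helper, not a hana_ml function.

```python
import math

# Coefficients copied from the lr.coef_.collect() output above.
coef = {
    '__PAL_INTERCEPT__': 17.044785,
    'V1__PAL_DELIMIT__A': 0.000000,
    'V1__PAL_DELIMIT__B': -1.464903,
    'V2': -4.819740,
    'V3__PAL_DELIMIT__0': 0.000000,
    'V3__PAL_DELIMIT__1': -2.794139,
    'V3__PAL_DELIMIT__2': -4.807858,
    'V3__PAL_DELIMIT__3': -2.780918,
}


def positive_class_probability(v1, v2, v3):
    """Sigmoid of the linear predictor built from the coefficient table."""
    z = (coef['__PAL_INTERCEPT__']
         + coef['V1__PAL_DELIMIT__' + v1]
         + coef['V2'] * v2
         + coef['V3__PAL_DELIMIT__' + str(v3)])
    return 1.0 / (1.0 + math.exp(-z))


print(round(positive_class_probability('B', 2.620, 0), 4))  # 0.9504 (ID 0)
print(round(positive_class_probability('A', 3.215, 2), 4))  # 0.037  (ID 3)
```

In the example output, rows with a probability above 0.5 carry CLASS 1 and the rest carry CLASS 0, which is consistent with a 0.5 decision threshold.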

Input data for score():

>>> df_score.collect()
ID V1     V2  V3  CATEGORY
 0  B  2.620   0         1
 1  B  2.875   0         1
 2  A  2.320   1         1
 3  A  3.215   2         0
 4  B  3.440   3         0
 5  B  3.460   0         0
 6  A  3.570   1         1
 7  B  3.190   2         0
 8  A  3.150   3         0
 9  B  3.440   0         0
10  B  3.440   1         0
11  A  4.070   3         0
12  A  3.730   1         0
13  B  3.780   2         0
14  B  5.250   2         0
15  A  5.424   3         0
16  A  5.345   0         0
17  B  2.200   1         1

Call score():

>>> lr.score(data=df_score,
...          key='ID',
...          categorical_variable=['V3'],
...          thread_ratio=0.1)
0.944444
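The score is the mean accuracy, which can be checked by hand. The sketch below copies the CLASS column from the predict() output and the CATEGORY column from the score() input (both tables contain the same 18 feature rows) and compares them; only ID 6 differs.

```python
# CLASS column of the predict() result, IDs 0..17.
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# CATEGORY column of df_score, IDs 0..17.
actual = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Fraction of rows where the predicted label equals the true label.
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
print(round(accuracy, 6))  # 0.944444
```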

Attributes

coef_DataFrame

Values of the coefficients.

result_DataFrame

Model content.

optim_param_DataFrame

The optimal parameter set selected via cross-validation. Empty if cross-validation is not activated.

stat_DataFrame

Statistics info for the trained model, structured as follows:

1st column: 'STAT_NAME', NVARCHAR(256)

2nd column: 'STAT_VALUE', NVARCHAR(1000)

pmml_DataFrame

PMML model. Set to None if no PMML model was requested. For multi-class logistic regression, use semistructured_result_ (described below) to get the model in PMML or JSON format.

semistructured_result_DataFrame

A multi-class logistic regression model in PMML or JSON format.

Methods

`create_model_state`([model, function, ...])	Create PAL model state.
`delete_model_state`([state])	Delete PAL model state.
`fit`(data[, key, features, label, ...])	Fit the LR model when given training dataset.
`predict`(data[, key, features, ...])	Predict with the dataset using the trained model.
`score`(data[, key, features, label, ...])	Return the mean accuracy on the given test data and labels.
`set_model_state`(state)	Set the model state by state information.

fit(data, key=None, features=None, label=None, categorical_variable=None, class_map0=None, class_map1=None)

Fit the LR model when given training dataset.

Parameters

dataDataFrame

DataFrame containing the data.

keystr, optional

Name of the ID column.

If key is not provided, then:

if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the label column.

If label is not provided, it defaults to the last column.

categorical_variablestr or list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Otherwise, all INTEGER columns are treated as numerical.

class_map0str, optional

Categorical label to map to 0.

class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

class_map1str, optional

Categorical label to map to 1.

class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

Returns

LogisticRegression: A fitted object.

predict(data, key=None, features=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None, verbose=False, ignore_unknown_category=None)

Predict with the dataset using the trained model.

Parameters

dataDataFrame

DataFrame containing the data.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

verbosebool, optional

If True, output scoring probabilities for each class.

It is only applicable for multi-class case.

Defaults to False.

categorical_variablestr or list of str, optional (deprecated)

Specifies INTEGER column(s) that should be treated as categorical.

Otherwise all integer columns are treated as numerical.

Mandatory if training data of the prediction model contains such data columns.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

class_map0str, optional

Categorical label to map to 0.

class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

class_map1str, optional

Categorical label to map to 1.

class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

ignore_unknown_categorybool, optional

Specifies whether or not to ignore unknown category value.

False : Report error if unknown category value is found.

True : Ignore unknown category value if there is any.

Valid only for multi-class logistic regression.

Defaults to True.

Returns

DataFrame

Predicted result, structured as follows:

1: ID column.

2: CLASS, predicted class name.

3: PROBABILITY, type DOUBLE

multi-class: probability of being predicted as the predicted class.

binary-class: probability of being predicted as the positive class.

score(data, key=None, features=None, label=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None)

Return the mean accuracy on the given test data and labels.

Parameters

dataDataFrame

DataFrame containing the data.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

labelstr, optional

Name of the label column.

If label is not provided, it defaults to the last column.

categorical_variablestr or list of str, optional (deprecated)

Specifies INTEGER columns that should be treated as categorical, otherwise all integer columns are treated as numerical.

Mandatory if training data of the prediction model contains such data columns.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.

class_map0str, optional

Categorical label to map to 0.

class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

class_map1str, optional

Categorical label to map to 1.

class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Only valid if multi_class is not set to True when initializing the class instance.

Returns

float: Scalar accuracy value after comparing the predicted label and original label.

create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)

Create PAL model state.

Parameters

modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

A placeholder parameter, not effective for logistic regression.

pal_funcnameint or str, optional

PAL function name. Must be a valid PAL procedure that supports model state.

Defaults to self.pal_funcname.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True it will delete the existing state.

Defaults to False.

set_model_state(state)

Set the model state by state information.

Parameters

state: DataFrame or dict

If state is DataFrame, it has the following structure:

NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.

VALUE: VARCHAR(1000), the values according to NAME.

If state is dict, the key must have STATE_ID, HINT, HOST and PORT.
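When passing a dict, it can help to verify the required keys first. The helper below is a hypothetical convenience check, not part of hana_ml, and the values in the example dict are placeholders rather than real state information.

```python
REQUIRED_STATE_KEYS = ('STATE_ID', 'HINT', 'HOST', 'PORT')


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [key for key in REQUIRED_STATE_KEYS if key not in state]


# Placeholder values for illustration only.
state = {'STATE_ID': '<state-id>', 'HINT': '<hint>', 'HOST': '<host>'}
print(missing_state_keys(state))  # ['PORT']
```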

delete_model_state(state=None)

Delete PAL model state.

Parameters

stateDataFrame, optional

Specifies the state.

Defaults to self.state.

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods mentioned above, the LogisticRegression class also inherits methods from the PALBase class. Please refer to PAL Base for more details.