LogisticRegression
- class hana_ml.algorithms.pal.linear_model.LogisticRegression(multi_class=False, max_iter=None, pmml_export=None, categorical_variable=None, standardize=True, stat_inf=False, solver=None, enet_alpha=None, enet_lambda=None, tol=None, epsilon=None, thread_ratio=None, max_pass_number=None, sgd_batch_number=None, precompute=None, handle_missing=None, resampling_method=None, metric=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, lbfgs_m=None, class_map0=None, class_map1=None, progress_indicator_id=None, param_values=None, param_range=None, json_export=None, resource=None, max_resource=None, min_resource_rate=None, reduction_rate=None, aggressive_elimination=None, ps_verbose=None, onehot_min_frequency=None, onehot_max_categories=None)
Logistic regression models the relationship between a dichotomous dependent variable (also known as explained variable) and one or more continuous or categorical independent variables (also known as explanatory variables). It models the log odds of the dependent variable as a linear combination of the independent variables. LogisticRegression handles both binary-class and multi-class classification problems.
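As a minimal sketch of the log-odds relationship described above (not the PAL implementation), the positive-class probability can be written as the logistic function of a linear combination of the features. The function names, coefficients and intercept below are made up for illustration.

```python
import math

def log_odds(features, coefficients, intercept):
    """Linear combination of the features: the modeled log odds."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

def positive_class_probability(features, coefficients, intercept):
    """Map the log odds to a probability with the logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-log_odds(features, coefficients, intercept)))

# Hypothetical coefficients: the log odds are 0.5*1.0 - 0.25*2.0 = 0.0,
# so both classes are equally likely.
p = positive_class_probability([1.0, 2.0], [0.5, -0.25], 0.0)
print(p)  # 0.5
```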
- Parameters:
- multi_classbool, optional
If True, perform multi-class classification. Otherwise, there must be only two classes.
Defaults to False.
- max_iterint, optional
Maximum number of iterations taken for the solvers to converge. If convergence is not reached after this number, an error will be generated.
When solver is 'newton' or 'lbfgs', the convex optimizer may return suboptimal results after the maximum number of iterations.
When solver is 'cyclical', if convergence is not reached after the maximum number of passes over training data, an error will be generated.
multi-class: Defaults to 100.
binary-class: Defaults to 100000 when solver is 'cyclical', 1000 when solver is 'proximal', otherwise 100.
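The default rules for max_iter can be summarized in a few lines. The helper below is hypothetical (it is not part of hana_ml) and only restates the defaults listed above.

```python
def default_max_iter(multi_class=False, solver=None):
    """Return the documented default of max_iter (hypothetical helper)."""
    if multi_class:
        return 100          # multi-class: always 100
    if solver == 'cyclical':
        return 100000       # binary-class, cyclical coordinate descent
    if solver == 'proximal':
        return 1000         # binary-class, proximal gradient descent
    return 100              # binary-class, any other solver

print(default_max_iter(solver='cyclical'))  # 100000
```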
- pmml_export{'no', 'single-row', 'multi-row'}, optional
Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.
multi-class
'no' or not provided: No PMML model.
'multi-row': Exports logistic regression model in PMML.
binary-class
'no' or not provided: No PMML model.
'single-row': Exports a PMML model in a maximum of one row. Fails if the model doesn't fit in one row.
'multi-row': Exports a PMML model, splitting it across multiple rows if it doesn't fit in one.
In multi-class, both PMML and JSON format model can be exported. JSON format is preferred if both formats are to be exported.
Defaults to 'no'.
- categorical_variablestr or a list of str, optional (deprecated)
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- standardizebool, optional
If true, standardize the data to have zero mean and unit variance.
Defaults to True.
- stat_infbool, optional
If true, proceed with statistical inference.
Defaults to False.
- solver{'auto', 'newton', 'cyclical', 'lbfgs', 'stochastic', 'proximal'}, optional
Optimization algorithm.
'auto': Automatically determined by the system based on input data and parameters.
'newton': Newton iteration method, can only solve ridge regression problems.
'cyclical': Cyclical coordinate descent method to fit elastic net regularized logistic regression.
'lbfgs': LBFGS method (recommended when having many independent variables; can only solve ridge regression problems when multi_class is True).
'stochastic': Stochastic gradient descent method (recommended when dealing with very large datasets), can only solve ridge regression problems.
'proximal': Proximal gradient descent method to fit elastic net regularized logistic regression.
When multi_class is True, only 'auto', 'lbfgs' and 'cyclical' are valid solvers.
Defaults to 'auto'.
Note
If the elastic net regularization term contains a LASSO penalty while a solver that can only solve ridge regression problems is specified, the specified solver is ignored (hence the default value is used). Users can check the statistics table for the solver that was finally adopted.
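The solver restrictions above can be expressed as a small check. This is a sketch with hypothetical helper names, and it assumes that the penalty "contains a LASSO penalty" exactly when both enet_lambda and enet_alpha are greater than 0, which the text does not state explicitly.

```python
MULTI_CLASS_SOLVERS = {'auto', 'lbfgs', 'cyclical'}

def is_valid_solver(solver, multi_class=False):
    """Check the solver name against the documented lists (hypothetical helper)."""
    if multi_class:
        return solver in MULTI_CLASS_SOLVERS
    return solver in {'auto', 'newton', 'cyclical', 'lbfgs', 'stochastic', 'proximal'}

def solver_is_ignored(solver, multi_class=False, enet_alpha=1.0, enet_lambda=0.0):
    """True if a ridge-only solver is combined with a LASSO penalty."""
    # Assumption: a LASSO term is present only if both values are positive.
    has_lasso = enet_lambda > 0 and enet_alpha > 0
    # 'newton' and 'stochastic' are ridge-only; 'lbfgs' is ridge-only
    # when multi_class is True.
    ridge_only = solver in {'newton', 'stochastic'} or (solver == 'lbfgs' and multi_class)
    return has_lasso and ridge_only

print(solver_is_ignored('newton', enet_alpha=0.5, enet_lambda=0.1))  # True
```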
- enet_alphafloat, optional
The elastic net mixing parameter. The valid range is from 0 to 1 inclusive (0: Ridge penalty, 1: LASSO penalty).
Defaults to 1.0.
- enet_lambdafloat, optional
Penalized weight. The value should be equal to or greater than 0.
Defaults to 0.0.
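To illustrate how enet_alpha and enet_lambda interact, here is a sketch of an elastic net penalty. It assumes the common glmnet-style parameterization; the exact formula used by PAL is not given in this section, so treat the expression as an assumption.

```python
def elastic_net_penalty(coefficients, enet_alpha=1.0, enet_lambda=0.0):
    """Assumed form: lambda * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2)."""
    l1 = sum(abs(c) for c in coefficients)          # LASSO part
    l2_squared = sum(c * c for c in coefficients)   # Ridge part
    return enet_lambda * (enet_alpha * l1 + 0.5 * (1.0 - enet_alpha) * l2_squared)

coefs = [3.0, -4.0]  # made-up coefficients
print(elastic_net_penalty(coefs, enet_alpha=0.0, enet_lambda=2.0))  # pure ridge: 25.0
print(elastic_net_penalty(coefs, enet_alpha=1.0, enet_lambda=0.0))  # no penalty: 0.0
```

With the defaults (enet_alpha=1.0, enet_lambda=0.0) the penalty vanishes, so no regularization is applied unless enet_lambda is set.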
- tolfloat, optional
Convergence threshold for exiting iterations.
Defaults to 1.0e-7 when solver is 'cyclical', 1.0e-6 otherwise.
- epsilonfloat, optional
Determines the accuracy with which the solution is to be found.
When solver is 'lbfgs', the condition is: \(\|g\| < \text{epsilon} \cdot \max\{1, \|x\|\}\), where g is the gradient of the objective function, x is the solution at the current iteration, and \(\|\cdot\|\) denotes the L2 norm.
When solver is 'newton', the condition is: \(\|x - x'\| < \text{epsilon} \cdot \sqrt{n}\), where x is the solution at the current iteration, x' is the solution at the previous iteration, and n is the number of features.
Only valid when multi_class is False and the solver is 'newton' or 'lbfgs'.
Defaults to 1.0e-6 when solver is 'newton', 1.0e-5 when solver is 'lbfgs'.
- thread_ratiofloat, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 1.0.
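The two epsilon stopping conditions given for solver 'lbfgs' and 'newton' can be sketched as follows. The function names are hypothetical, and the sketch assumes the solution vector has one entry per feature (n = len(x)).

```python
import math

def l2_norm(vector):
    """Euclidean (L2) norm of a sequence of floats."""
    return math.sqrt(sum(v * v for v in vector))

def lbfgs_converged(gradient, x, epsilon=1.0e-5):
    """'lbfgs' condition: ||g|| < epsilon * max(1, ||x||)."""
    return l2_norm(gradient) < epsilon * max(1.0, l2_norm(x))

def newton_converged(x, x_previous, epsilon=1.0e-6):
    """'newton' condition: ||x - x'|| < epsilon * sqrt(n)."""
    difference = [a - b for a, b in zip(x, x_previous)]
    # Assumption: n (number of features) equals the length of x.
    return l2_norm(difference) < epsilon * math.sqrt(len(x))

print(lbfgs_converged([1.0e-6, 0.0], [3.0, 4.0]))   # True: 1e-6 < 1e-5 * 5
print(newton_converged([1.0, 2.0], [0.0, 2.0]))     # False: the step is still large
```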
- max_pass_numberint, optional
The maximum number of passes over the data.
Only valid when multi_class is False and the (actual) solver is 'stochastic'.
Defaults to 1.
- sgd_batch_numberint, optional
The batch number of Stochastic gradient descent.
Only valid when multi_class is False and the (actual) solver is 'stochastic'.
Defaults to 1.
- precomputebool, optional
Whether to pre-compute the Gram matrix.
Only valid when multi_class is False and the (actual) solver is 'cyclical'.
Defaults to True.
- handle_missingbool, optional
True : handle missing values.
False : do not handle missing values.
Only valid when multi_class is False.
Defaults to True.
- categorical_variablestr or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- lbfgs_mint, optional
Number of previous updates to keep.
Only applicable when multi_class is False and solver is 'lbfgs'.
Defaults to 6.
- resampling_methodstr, optional
Specifies the resampling method for model evaluation or parameter selection.
Valid resampling methods are listed as follows: 'cv', 'cv_sha', 'cv_hyperband', 'stratified_cv', 'stratified_cv_sha', 'stratified_cv_hyperband', 'bootstrap', 'bootstrap_sha', 'bootstrap_hyperband', 'stratified_bootstrap', 'stratified_bootstrap_sha', 'stratified_bootstrap_hyperband'.
Resampling methods with suffix 'sha' or 'hyperband' are only applicable to parameter selection, and currently these methods cannot be specified when multi_class is not True.
If no value is specified, neither model evaluation nor parameter selection is activated.
- metric{'accuracy', 'f1_score', 'auc', 'nll'}, optional(deprecated)
The evaluation metric used for model evaluation/parameter selection.
Deprecated, please use evaluation_metric instead.
- evaluation_metric{'accuracy', 'f1_score', 'auc', 'nll'}, optional
The evaluation metric used for model evaluation/parameter selection.
Must be specified together with resampling_method to activate model evaluation/parameter selection.
- fold_numint, optional
The number of folds for cross-validation.
Mandatory and valid only when resampling_method is cross-validation based (contains 'cv', e.g. 'cv', 'stratified_cv_sha').
- repeat_timesint, optional
The number of repeat times for resampling.
Defaults to 1.
- search_strategy{'grid', 'random'}, optional
The search method for parameter selection.
- random_search_timesint, optional
The number of times to randomly select candidate parameters for selection.
Mandatory and valid when search_strategy is 'random'.
- random_stateint, optional
The seed for random generation. 0 indicates using system time as seed.
Defaults to 0.
- progress_indicator_idstr, optional
The ID of progress indicator for model evaluation/parameter selection.
Progress indicator deactivated if no value provided.
- param_valuesdict or list of tuples, optional
Specifies values of specific parameters to be selected.
Valid only when resampling_method and search_strategy are specified.
Specific parameters can be enet_lambda, enet_alpha.
No default value.
- param_rangedict or list of tuples, optional
Specifies range of specific parameters to be selected.
Valid only when resampling_method and search_strategy are specified.
Specific parameters can be enet_lambda, enet_alpha.
No default value.
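The dependencies between the evaluation and parameter-selection settings above can be checked before building a model. The helper below is a hypothetical sketch, not a hana_ml function; it only encodes the rules stated for resampling_method, evaluation_metric, fold_num, search_strategy, random_search_times, param_values and param_range. The candidate values in the example are arbitrary.

```python
def check_selection_settings(resampling_method=None, evaluation_metric=None,
                             fold_num=None, search_strategy=None,
                             random_search_times=None,
                             param_values=None, param_range=None):
    """Return a list of rule violations; an empty list means consistent."""
    problems = []
    has_candidates = param_values is not None or param_range is not None
    if resampling_method is None:
        # Neither model evaluation nor parameter selection is activated.
        if has_candidates:
            problems.append('param_values/param_range need resampling_method')
        return problems
    if evaluation_metric is None:
        problems.append('evaluation_metric must accompany resampling_method')
    if 'cv' in resampling_method and fold_num is None:
        problems.append('fold_num is mandatory for cv-based resampling')
    if search_strategy == 'random' and random_search_times is None:
        problems.append('random_search_times is mandatory for random search')
    if search_strategy is None and has_candidates:
        problems.append('param_values/param_range need search_strategy')
    return problems

# Example settings for a grid search over the two tunable parameters.
settings = {
    'resampling_method': 'stratified_cv',
    'evaluation_metric': 'auc',
    'fold_num': 5,
    'search_strategy': 'grid',
    'param_values': {'enet_lambda': [0.01, 0.1, 1.0], 'enet_alpha': [0.0, 0.5, 1.0]},
}
print(check_selection_settings(**settings))  # []
```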
- class_map0str, optional (deprecated)
Categorical label to map to 0.
class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.
Only valid when multi_class is False.
- class_map1str, optional (deprecated)
Categorical label to map to 1.
class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.
Only valid when multi_class is False.
- json_exportbool, optional
False : Does not export the multi-class logistic regression model in JSON.
True : Exports the multi-class logistic regression model in JSON.
Only valid when multi_class is True.
Currently either PMML or JSON format model can be exported. JSON format is preferred if both formats are to be exported.
Defaults to False.
- resourcestr, optional
Specifies the resource type used in successive-halving and hyperband algorithm for parameter selection:
'max_iter'
'max_pass_number'
Mandatory and valid only when resampling_method is specified with suffix 'sha' or 'hyperband'.
If multi_class is set as True, then currently only 'max_iter' is valid; otherwise, if multi_class is False, then:
'max_pass_number' is valid only when the actual solver is 'stochastic';
'max_iter' is valid for other solvers.
- max_resourceint, optional
Maximum allowed resource budget for single hyper-parameter candidate, must be greater than 0.
Mandatory and valid only when resource is set.
- min_resource_ratefloat, optional
Specifies the minimum required resource budget compared to maximum resource for single hyper-parameter candidate. Valid value should be greater than or equal to 0, but less than 1.
Valid only when resource is set.
Defaults to 0.
- reduction_ratefloat, optional
Specifies the reduction rate of the number of available hyper-parameter candidates. For each round, the number of available candidates is divided by the value of this parameter, so a valid value must be greater than 1.0.
Valid only when resource is set.
Defaults to 3.0.
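The effect of reduction_rate can be sketched as a schedule of candidate counts per round. This is an illustration only: the function name is hypothetical, and rounding down after each division is an assumption, since the text does not say how fractional counts are handled.

```python
import math

def candidate_schedule(n_candidates, reduction_rate=3.0):
    """Number of surviving hyper-parameter candidates in each round."""
    if reduction_rate <= 1.0:
        raise ValueError('reduction_rate must be greater than 1.0')
    sizes = [n_candidates]
    while sizes[-1] > 1:
        # Assumption: fractional counts are rounded down, never below 1.
        sizes.append(max(1, math.floor(sizes[-1] / reduction_rate)))
    return sizes

print(candidate_schedule(27))        # [27, 9, 3, 1]
print(candidate_schedule(10, 3.0))   # [10, 3, 1]
```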
- aggressive_eliminationbool, optional
Specifies whether to perform aggressive elimination behavior for successive-halving algorithm or not.
When set to True, it will eliminate more parameter candidates than expected (defined via reduction_rate). This can enhance the run-time performance but could result in a sub-optimal hyper-parameter candidate.
Valid only when resampling_method is specified with suffix 'sha'.
Defaults to False.
- ps_verbosebool, optional
Specifies whether to output optimal hyper-parameter and all evaluation statistics of related hyper-parameter candidates in attribute statistics_ or not.
Defaults to True.
- onehot_min_frequencyint, optional
Specifies the minimum frequency below which a category will be considered infrequent. Only available for multiclass.
Defaults to 1.
- onehot_max_categoriesint, optional
Specifies an upper limit to the number of output features for each input feature. It includes the feature that combines infrequent categories. Only available for multiclass.
Defaults to 0.
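One plausible reading of onehot_min_frequency and onehot_max_categories is sketched below. The grouping logic is an assumption modeled on how similar options behave in other libraries, not a description of PAL internals; the helper name and the placeholder label are hypothetical, and a value of 0 for onehot_max_categories is assumed to mean "no limit".

```python
from collections import Counter

def group_infrequent(values, onehot_min_frequency=1, onehot_max_categories=0,
                     other='__INFREQUENT__'):
    """Replace infrequent categories by a single combined category."""
    counts = Counter(values)
    # Categories seen at least onehot_min_frequency times, most frequent first.
    frequent = [c for c, n in counts.most_common() if n >= onehot_min_frequency]
    if onehot_max_categories > 0 and len(counts) > onehot_max_categories:
        # Reserve one output feature for the combined infrequent category.
        frequent = frequent[:onehot_max_categories - 1]
    kept = set(frequent)
    return [v if v in kept else other for v in values]

data = ['a', 'a', 'a', 'b', 'b', 'c', 'd']
print(group_infrequent(data, onehot_min_frequency=2))
# ['a', 'a', 'a', 'b', 'b', '__INFREQUENT__', '__INFREQUENT__']
```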
Examples
>>> from hana_ml.algorithms.pal import linear_model
>>> lr = linear_model.LogisticRegression(solver='newton', max_iter=1000, pmml_export='single-row', stat_inf=True, tol=0.000001)
>>> lr.fit(data=df, features=['V1', 'V2', 'V3'], label='CATEGORY')
>>> lr.coef_.collect()
Perform predict():
>>> result = lr.predict(data=df_predict, key='ID')
>>> result.collect()
Perform score():
>>> lr.score(data=df_score, key='ID')
- Attributes:
- coef_DataFrame
Values of the coefficients.
- result_DataFrame
Model content.
- optim_param_DataFrame
The optimal parameter set selected via cross-validation. Empty if cross-validation is not activated.
- stat_DataFrame
Statistics.
- pmml_DataFrame
PMML model. Set to None if no PMML model was requested. In multi-class logistic regression, please use semistructured_result_ shown below to get the model in PMML or JSON format.
- semistructured_result_DataFrame
A multi-class logistic regression model in PMML or JSON format.
Methods
create_model_state([model, function, ...])  Create PAL model state.
delete_model_state([state])  Delete PAL model state.
fit(data[, key, features, label, ...])  Fit the LR model when given training dataset.
get_model_metrics()  Get the model metrics.
get_score_metrics()  Get the score metrics.
predict(data[, key, features, ...])  Predict dependent variable values based on a fitted model.
score(data[, key, features, label, ...])  Return the mean accuracy on the given test data and labels.
set_model_state(state)  Set the model state by state information.
- fit(data, key=None, features=None, label=None, categorical_variable=None, class_map0=None, class_map1=None)
Fit the LR model when given training dataset.
- Parameters:
- dataDataFrame
DataFrame containing the data.
- keystr, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- featuresa list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- labelstr, optional
Name of the label column.
If label is not provided, it defaults to the last column.
- categorical_variablestr or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- class_map0str, optional
Categorical label to map to 0.
class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.
Only valid if multi_class is not set to True when initializing the class instance.
- class_map1str, optional
Categorical label to map to 1.
class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.
Only valid if multi_class is not set to True when initializing the class instance.
- Returns:
- A fitted object of class "LogisticRegression".
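The role of class_map0 and class_map1 for string labels can be pictured as a simple two-entry mapping. The helper below is a hypothetical illustration of that idea in plain Python; it does not reproduce what fit() does internally, and the labels 'no' and 'yes' are made up.

```python
def encode_binary_labels(labels, class_map0, class_map1):
    """Map the two categorical labels to 0 and 1 (hypothetical helper)."""
    mapping = {class_map0: 0, class_map1: 1}
    encoded = []
    for label in labels:
        if label not in mapping:
            # A binary model expects exactly the two mapped labels.
            raise ValueError('unexpected label: %r' % (label,))
        encoded.append(mapping[label])
    return encoded

print(encode_binary_labels(['yes', 'no', 'yes'], class_map0='no', class_map1='yes'))
# [1, 0, 1]
```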
- predict(data, key=None, features=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None, verbose=False, ignore_unknown_category=None, verbose_top_n=None)
Predict dependent variable values based on a fitted model.
- Parameters:
- dataDataFrame
DataFrame containing the data.
- keystr, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- featuresa list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- verbosebool, optional
If True, output scoring probabilities for each class.
It is only applicable for multi-class logistic regression.
Defaults to False.
- categorical_variablestr or a list of str, optional (deprecated)
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- thread_ratiofloat, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
- class_map0str, optional
Categorical label to map to 0.
class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.
Only valid if multi_class is not set to True when initializing the class instance.
- class_map1str, optional
Categorical label to map to 1.
class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.
Only valid if multi_class is not set to True when initializing the class instance.
- ignore_unknown_categorybool, optional
Specifies whether or not to ignore unknown category value.
False : Report error if unknown category value is found.
True : Ignore unknown category value if there is any.
Valid only for multi-class logistic regression.
Defaults to True.
- verbose_top_nint, optional
Specifies the number of top n classes to present after sorting by confidence. It cannot exceed the number of classes in the label of the training data, and it can be 0, which means to output the confidences of all classes.
Effective only when verbose is set as True and only for multi-class logistic regression.
Defaults to 0.
- Returns:
- DataFrame
Predicted result, structured as follows:
Column 1: ID
Column 2: Predicted class label
Column 3: PROBABILITY, type DOUBLE
for multi-class: probability of being predicted as the predicted class.
for binary-class: probability of being predicted as the positive class.
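Because the PROBABILITY column has a different meaning in the binary case, a sketch like the following can convert binary results into the confidence of the predicted class. It works on plain Python tuples in the column order listed above (ID, predicted label, PROBABILITY); the function name and sample rows are made up, and which label is the positive class is an input you have to supply.

```python
def predicted_class_confidence(rows, positive_label):
    """Turn positive-class probabilities into predicted-class confidences."""
    confidences = {}
    for row_id, predicted_label, probability in rows:
        if predicted_label == positive_label:
            confidences[row_id] = probability
        else:
            # The row was predicted as the negative class.
            confidences[row_id] = 1.0 - probability
    return confidences

# Made-up binary-class prediction rows: (ID, predicted label, PROBABILITY).
rows = [(1, 'yes', 0.9), (2, 'no', 0.25)]
print(predicted_class_confidence(rows, positive_label='yes'))
# {1: 0.9, 2: 0.75}
```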
- score(data, key=None, features=None, label=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None)
Return the mean accuracy on the given test data and labels.
- Parameters:
- dataDataFrame
DataFrame containing the data.
- keystr, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- featuresa list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- labelstr, optional
Name of the label column.
If label is not provided, it defaults to the last column.
- categorical_variablestr or a list of str, optional (deprecated)
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- thread_ratiofloat, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
- class_map0str, optional
Categorical label to map to 0.
class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.
Only valid if multi_class is not set to True when initializing the class instance.
- class_map1str, optional
Categorical label to map to 1.
class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.
Only valid if multi_class is not set to True when initializing the class instance.
- Returns:
- float
Scalar accuracy value after comparing the predicted label and original label.
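The accuracy returned by score() is the fraction of rows whose predicted label equals the original label. A minimal plain-Python sketch of that computation, with a hypothetical function name and made-up labels:

```python
def mean_accuracy(predicted, actual):
    """Fraction of positions where the predicted label matches the actual one."""
    if len(predicted) != len(actual):
        raise ValueError('predicted and actual must have the same length')
    if not actual:
        raise ValueError('at least one label is required')
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)

print(mean_accuracy(['a', 'b', 'a', 'b'], ['a', 'b', 'b', 'b']))  # 0.75
```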
- create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)
Create PAL model state.
- Parameters:
- modelDataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- functionstr, optional
Specify the function in the unified API.
A placeholder parameter, not effective for LogisticRegression.
- pal_funcnameint or str, optional
PAL function name. Must be a valid PAL procedure that supports model state.
Defaults to self.pal_funcname.
- state_descriptionstr, optional
Description of the state as model container.
Defaults to None.
- forcebool, optional
If True, the existing state will be deleted.
Defaults to False.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values according to NAME.
If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
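For the dict form, a quick check of the required keys might look like this. The helper is hypothetical and all values are placeholders; real values come from an existing model state.

```python
REQUIRED_STATE_KEYS = ('STATE_ID', 'HINT', 'HOST', 'PORT')

def missing_state_keys(state):
    """Return the required keys that are absent from the state dict."""
    return [key for key in REQUIRED_STATE_KEYS if key not in state]

# Placeholder values for illustration only.
state = {
    'STATE_ID': 'example-state-id',
    'HINT': 'example-hint',
    'HOST': 'example-host',
    'PORT': '30003',
}
print(missing_state_keys(state))  # []
```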
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- stateDataFrame, optional
Specifies the state.
Defaults to self.state.
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
Besides the methods mentioned above, the LogisticRegression class also inherits methods from the PALBase class. Please refer to PAL Base for more details.