LogisticRegression
- class hana_ml.algorithms.pal.linear_model.LogisticRegression(multi_class=False, max_iter=None, pmml_export=None, categorical_variable=None, standardize=True, stat_inf=False, solver=None, enet_alpha=None, enet_lambda=None, tol=None, epsilon=None, thread_ratio=None, max_pass_number=None, sgd_batch_number=None, precompute=None, handle_missing=None, resampling_method=None, metric=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, lbfgs_m=None, class_map0=None, class_map1=None, progress_indicator_id=None, param_values=None, param_range=None, json_export=None, resource=None, max_resource=None, min_resource_rate=None, reduction_rate=None, aggressive_elimination=None, ps_verbose=None)
Logistic regression model that handles binary-class and multi-class classification problems.
- Parameters
- multi_class : bool, optional
If True, perform multi-class classification. Otherwise, there must be only two classes.
Defaults to False.
- max_iter : int, optional
Maximum number of iterations taken for the solvers to converge. If convergence is not reached after this number, an error will be generated.
When solver is 'newton' or 'lbfgs', the convex optimizer may return suboptimal results after the maximum number of iterations.
When solver is 'cyclical', if convergence is not reached after the maximum number of passes over training data, an error will be generated.
multi-class: Defaults to 100.
binary-class: Defaults to 100000 when solver is 'cyclical', 1000 when solver is 'proximal', otherwise 100.
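The default rules for max_iter can be summarized in a few lines of Python. This is only an illustrative sketch: `default_max_iter` is a hypothetical helper, not part of hana_ml, and it simply encodes the defaults listed above, with `solver` standing for the solver that is actually used.

```python
def default_max_iter(multi_class=False, solver=None):
    """Return the documented default of max_iter (hypothetical helper)."""
    if multi_class:
        # multi-class: a single default regardless of solver
        return 100
    if solver == 'cyclical':
        return 100000
    if solver == 'proximal':
        return 1000
    # binary-class with any other solver
    return 100


print(default_max_iter(multi_class=True))
print(default_max_iter(solver='cyclical'))
print(default_max_iter(solver='newton'))
```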
- pmml_export : {'no', 'single-row', 'multi-row'}, optional
Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.
multi-class:
'no' or not provided: No PMML model.
'multi-row': Exports logistic regression model in PMML.
binary-class:
'no' or not provided: No PMML model.
'single-row': Exports a PMML model in a maximum of one row. Fails if the model doesn't fit in one row.
'multi-row': Exports a PMML model, splitting it across multiple rows if it doesn't fit in one.
In the multi-class case, the model can be exported in both PMML and JSON format. The JSON format is preferred if both formats are to be exported.
Defaults to 'no'.
- categorical_variable : str or list of str, optional (deprecated)
Specifies INTEGER column(s) in the data that should be treated as categorical.
- standardize : bool, optional
If True, standardize the data to have zero mean and unit variance.
Defaults to True.
- stat_inf : bool, optional
If True, proceed with statistical inference.
Defaults to False.
- solver : {'auto', 'newton', 'cyclical', 'lbfgs', 'stochastic', 'proximal'}, optional
Optimization algorithm.
'auto': Automatically determined by the system based on input data and parameters.
'newton': Newton iteration method; can only solve ridge regression problems.
'cyclical': Cyclical coordinate descent method to fit elastic net regularized logistic regression.
'lbfgs': LBFGS method (recommended when there are many independent variables); can only solve ridge regression problems when multi_class is True.
'stochastic': Stochastic gradient descent method (recommended for very large datasets); can only solve ridge regression problems.
'proximal': Proximal gradient descent method to fit elastic net regularized logistic regression.
When multi_class is True, only 'auto', 'lbfgs' and 'cyclical' are valid solvers.
Defaults to 'auto'.
Note
If the elastic net regularization term contains a LASSO penalty but the specified solver can only solve ridge regression problems, the specified solver is ignored and the default value is used instead. The solver that was finally adopted can be checked in the statistics table.
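The solver restrictions above can be checked before constructing the estimator. The following is a minimal sketch in plain Python, not part of hana_ml: the helper names are hypothetical, and treating "contains a LASSO penalty" as `enet_lambda > 0 and enet_alpha > 0` is an assumption of this sketch.

```python
MULTI_CLASS_SOLVERS = {'auto', 'lbfgs', 'cyclical'}
ALL_SOLVERS = MULTI_CLASS_SOLVERS | {'newton', 'stochastic', 'proximal'}


def solver_is_valid(solver, multi_class=False):
    """Check the solver name against the lists documented above."""
    allowed = MULTI_CLASS_SOLVERS if multi_class else ALL_SOLVERS
    return solver in allowed


def solver_may_be_ignored(solver, multi_class=False,
                          enet_alpha=1.0, enet_lambda=0.0):
    """True if a ridge-only solver is combined with a LASSO penalty.

    Assumption: a LASSO term is present when both enet_lambda and
    enet_alpha are strictly positive.
    """
    has_lasso = enet_lambda > 0 and enet_alpha > 0
    ridge_only = solver in ('newton', 'stochastic') or (
        solver == 'lbfgs' and multi_class)
    return has_lasso and ridge_only


print(solver_is_valid('newton', multi_class=True))
print(solver_may_be_ignored('newton', enet_alpha=0.5, enet_lambda=0.1))
```

If `solver_may_be_ignored` returns True, the note above applies, and the statistics table shows which solver was actually used.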
- enet_alpha : float, optional
The elastic net mixing parameter. The valid range is from 0 to 1 inclusive (0: Ridge penalty, 1: LASSO penalty).
Defaults to 1.0.
- enet_lambda : float, optional
Penalized weight. The value should be equal to or greater than 0.
Defaults to 0.0.
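To illustrate how enet_alpha and enet_lambda interact, the sketch below evaluates an elastic net penalty for a coefficient vector. The exact scaling used by PAL is not stated in this section, so the widely used form lambda * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2) is assumed here, and `elastic_net_penalty` is a hypothetical helper.

```python
def elastic_net_penalty(coefs, enet_alpha=1.0, enet_lambda=0.0):
    """Elastic net penalty under the ASSUMED glmnet-style scaling.

    enet_alpha = 0 leaves only the ridge (L2) term,
    enet_alpha = 1 leaves only the LASSO (L1) term.
    """
    l1 = sum(abs(c) for c in coefs)
    l2_squared = sum(c * c for c in coefs)
    return enet_lambda * (enet_alpha * l1
                          + 0.5 * (1.0 - enet_alpha) * l2_squared)


coefs = [1.0, -2.0]
print(elastic_net_penalty(coefs, enet_alpha=1.0, enet_lambda=0.5))  # LASSO only
print(elastic_net_penalty(coefs, enet_alpha=0.0, enet_lambda=0.5))  # ridge only
print(elastic_net_penalty(coefs))  # defaults: enet_lambda = 0, no penalty
```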
- tol : float, optional
Convergence threshold for exiting iterations.
Defaults to 1.0e-7 when solver is 'cyclical', 1.0e-6 otherwise.
- epsilon : float, optional
Determines the accuracy with which the solution is to be found.
When solver is 'lbfgs', the condition is \(\|g\| < \text{epsilon} \cdot \max\{1, \|x\|\}\), where g is the gradient of the objective function, x is the solution at the current iteration, and \(\|\cdot\|\) denotes the L2 norm.
When solver is 'newton', the condition is \(\|x - x'\| < \text{epsilon} \cdot \sqrt{n}\), where x is the solution at the current iteration, x' is the solution at the previous iteration, and n is the number of features.
Only valid when multi_class is False and solver is 'newton' or 'lbfgs'.
Defaults to 1.0e-6 when solver is 'newton', 1.0e-5 when solver is 'lbfgs'.
- thread_ratio : float, optional
Controls the proportion of available threads to use for the fit() method.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 1.0.
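The two stopping conditions given for epsilon (for the 'lbfgs' and 'newton' solvers) can be written out in plain Python. This is a sketch of the stated inequalities only, not PAL's implementation; the function names are hypothetical, and taking n as the length of the solution vector is an assumption.

```python
import math


def l2_norm(vector):
    """Euclidean (L2) norm of a sequence of floats."""
    return math.sqrt(sum(c * c for c in vector))


def lbfgs_converged(gradient, x, epsilon=1.0e-5):
    """||g|| < epsilon * max{1, ||x||}"""
    return l2_norm(gradient) < epsilon * max(1.0, l2_norm(x))


def newton_converged(x, x_prev, epsilon=1.0e-6):
    """||x - x'|| < epsilon * sqrt(n), with n assumed to be len(x)."""
    diff = [a - b for a, b in zip(x, x_prev)]
    return l2_norm(diff) < epsilon * math.sqrt(len(x))


print(lbfgs_converged(gradient=[1.0e-6, 0.0], x=[3.0, 4.0]))
print(newton_converged(x=[1.0, 2.0], x_prev=[1.0, 2.1]))
```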
- max_pass_number : int, optional
The maximum number of passes over the data.
Only valid when multi_class is False and the (actual) solver is 'stochastic'.
Defaults to 1.
- sgd_batch_number : int, optional
The batch number of stochastic gradient descent.
Only valid when multi_class is False and the (actual) solver is 'stochastic'.
Defaults to 1.
- precompute : bool, optional
Whether to pre-compute the Gram matrix.
Only valid when multi_class is False and the (actual) solver is 'cyclical'.
Defaults to True.
- handle_missing : bool, optional
True : handle missing values.
False : do not handle missing values.
Only valid when multi_class is False.
Defaults to True.
- categorical_variable : str or list of str, optional
Specifies INTEGER column(s) in the data that should be treated as categorical.
By default, string is categorical, while int and double are numerical.
- lbfgs_m : int, optional
Number of previous updates to keep.
Only applicable when multi_class is False and solver is 'lbfgs'.
Defaults to 6.
- resampling_method : str, optional
Specifies the resampling method for model evaluation or parameter selection.
Valid resampling methods are listed as follows: 'cv', 'cv_sha', 'cv_hyperband', 'stratified_cv', 'stratified_cv_sha', 'stratified_cv_hyperband', 'bootstrap', 'bootstrap_sha', 'bootstrap_hyperband', 'stratified_bootstrap', 'stratified_bootstrap_sha', 'stratified_bootstrap_hyperband'.
Resampling methods with suffix 'sha' or 'hyperband' are only applicable to parameter selection, and currently these methods cannot be specified when multi_class is not True.
If no value is specified, neither model evaluation nor parameter selection is activated.
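The naming scheme of the resampling methods encodes which other parameters apply. The sketch below, in plain Python with hypothetical helper names, rebuilds the twelve valid names from their parts and derives two facts stated in this section: names containing 'cv' require fold_num, and names ending in 'sha' or 'hyperband' are for parameter selection only.

```python
from itertools import product

# The twelve documented names: optional 'stratified_' prefix,
# a base method, and an optional '_sha' / '_hyperband' suffix.
VALID_RESAMPLING_METHODS = {
    prefix + base + suffix
    for prefix, base, suffix in product(
        ('', 'stratified_'), ('cv', 'bootstrap'), ('', '_sha', '_hyperband'))
}


def describe_resampling(method):
    """Summarize what a resampling_method value implies (hypothetical helper)."""
    if method not in VALID_RESAMPLING_METHODS:
        raise ValueError('unknown resampling method: %r' % (method,))
    return {
        'needs_fold_num': 'cv' in method,
        'parameter_selection_only': method.endswith(('_sha', '_hyperband')),
    }


print(describe_resampling('stratified_cv_sha'))
print(describe_resampling('bootstrap'))
```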
- metric : {'accuracy', 'f1_score', 'auc', 'nll'}, optional (deprecated)
The evaluation metric used for model evaluation/parameter selection.
Deprecated, please use evaluation_metric instead.
- evaluation_metric : {'accuracy', 'f1_score', 'auc', 'nll'}, optional
The evaluation metric used for model evaluation/parameter selection.
Must be specified together with resampling_method to activate model evaluation/parameter selection.
- fold_num : int, optional
The number of folds for cross-validation.
Mandatory and valid only when resampling_method is cross-validation based (contains 'cv' in part, e.g. 'cv', 'stratified_cv_sha').
- repeat_times : int, optional
The number of repeat times for resampling.
Defaults to 1.
- search_strategy : {'grid', 'random'}, optional
The search method for parameter selection.
- random_search_times : int, optional
The number of times to randomly select candidate parameters for selection.
Mandatory and valid when search_strategy is 'random'.
- random_state : int, optional
The seed for random generation. 0 indicates using system time as seed.
Defaults to 0.
- progress_indicator_id : str, optional
The ID of progress indicator for model evaluation/parameter selection.
Progress indicator deactivated if no value provided.
- param_values : dict or list of tuples, optional
Specifies values of specific parameters to be selected.
Valid only when resampling_method and search_strategy are specified.
Specific parameters can be enet_lambda, enet_alpha.
No default value.
- param_range : dict or list of tuples, optional
Specifies range of specific parameters to be selected.
Valid only when resampling_method and search_strategy are specified.
Specific parameters can be enet_lambda, enet_alpha.
No default value.
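As a rough illustration of what a 'grid' search over param_values covers, the sketch below enumerates every combination of candidate values. It assumes param_values maps each parameter name to a list of candidate values (as a dict or as a list of (name, values) tuples); the candidate numbers and the `grid_candidates` helper are hypothetical, and the enumeration is done locally in Python rather than by PAL.

```python
from itertools import product


def grid_candidates(param_values):
    """Expand param_values into the list of all parameter combinations."""
    items = list(dict(param_values).items())  # accepts dict or list of tuples
    names = [name for name, _ in items]
    value_lists = [values for _, values in items]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Hypothetical candidate values for the two selectable parameters.
param_values = {'enet_lambda': [0.01, 0.1, 1.0],
                'enet_alpha': [0.0, 0.5, 1.0]}

candidates = grid_candidates(param_values)
print(len(candidates))
print(candidates[0])
```

With three values per parameter, the grid contains nine candidates, each evaluated with the chosen resampling_method and evaluation_metric.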
- class_map0 : str, optional (deprecated)
Categorical label to map to 0.
class_map0 is mandatory when the label column type is VARCHAR or NVARCHAR during binary class fit and score.
Only valid when multi_class is False.
- class_map1 : str, optional (deprecated)
Categorical label to map to 1.
class_map1 is mandatory when the label column type is VARCHAR or NVARCHAR during binary class fit and score.
Only valid when multi_class is False.
- json_export : bool, optional
False : Does not export the multi-class logistic regression model in JSON.
True : Exports the multi-class logistic regression model in JSON.
Only valid when multi_class is True.
Currently either PMML or JSON format model can be exported. JSON format is preferred if both formats are to be exported.
Defaults to False.
- resource : str, optional
Specifies the resource type used in successive-halving and hyperband algorithm for parameter selection:
'max_iter'
'max_pass_number'
Mandatory and valid only when resampling_method is specified with suffix 'sha' or 'hyperband'.
If multi_class is set as True, then currently only 'max_iter' is valid; otherwise, if multi_class is False, then:
'max_pass_number' is valid only when the actual solver is 'stochastic';
'max_iter' is valid for other solvers.
- max_resource : int, optional
Maximum allowed resource budget for a single hyper-parameter candidate; must be greater than 0.
Mandatory and valid only when resource is set.
- min_resource_rate : float, optional
Specifies the minimum required resource budget compared to maximum resource for a single hyper-parameter candidate. Valid value should be greater than or equal to 0, but less than 1.
Valid only when resource is set.
Defaults to 0.
- reduction_rate : float, optional
Specifies the reduction rate of available size of hyper-parameter candidates. For each round, the available parameter candidate size will be divided by the value of this parameter. Thus the valid value for this parameter must be greater than 1.0.
Valid only when resource is set.
Defaults to 3.0.
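The effect of reduction_rate on the number of surviving candidates can be sketched as follows. This is a generic successive-halving illustration, not PAL's implementation: rounding down after each division and stopping at a single candidate are assumptions of this sketch, and `candidates_per_round` is a hypothetical helper.

```python
def candidates_per_round(n_candidates, reduction_rate=3.0):
    """Candidate counts per round when dividing by reduction_rate each time."""
    if reduction_rate <= 1.0:
        raise ValueError('reduction_rate must be greater than 1.0')
    sizes = [n_candidates]
    while sizes[-1] > 1:
        # Assumption: fractional sizes are rounded down, never below 1.
        sizes.append(max(1, int(sizes[-1] / reduction_rate)))
    return sizes


print(candidates_per_round(27))        # default reduction_rate of 3.0
print(candidates_per_round(10, 2.0))
```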
- aggressive_elimination : bool, optional
Specifies whether to perform aggressive elimination behavior for successive-halving algorithm or not.
When set to True, it will eliminate more parameter candidates than expected (defined via reduction_rate). This can enhance the run-time performance but could result in a sub-optimal hyper-parameter candidate.
Valid only when resampling_method is specified with suffix 'sha'.
Defaults to False.
- ps_verbose : bool, optional
Specifies whether to output optimal hyper-parameter and all evaluation statistics of related hyper-parameter candidates in attribute statistics_ or not.
Defaults to True.
Examples
Training data:
>>> df.collect()
   V1     V2  V3  CATEGORY
0   B  2.620   0         1
1   B  2.875   0         1
2   A  2.320   1         1
3   A  3.215   2         0
4   B  3.440   3         0
5   B  3.460   0         0
6   A  3.570   1         0
7   B  3.190   2         0
8   A  3.150   3         0
9   B  3.440   0         0
10  B  3.440   1         0
11  A  4.070   3         0
12  A  3.730   1         0
13  B  3.780   2         0
14  B  5.250   2         0
15  A  5.424   3         0
16  A  5.345   0         0
17  B  2.200   1         1
18  B  1.615   2         1
19  A  1.835   0         1
20  B  2.465   3         0
21  A  3.520   1         0
22  A  3.435   0         0
23  B  3.840   2         0
24  B  3.845   3         0
25  A  1.935   1         1
26  B  2.140   0         1
27  B  1.513   1         1
28  A  3.170   3         1
29  B  2.770   0         1
30  B  3.570   0         1
31  A  2.780   3         1
Create LogisticRegression instance and call fit:
>>> from hana_ml.algorithms.pal import linear_model
>>> lr = linear_model.LogisticRegression(solver='newton',
...                                      thread_ratio=0.1, max_iter=1000,
...                                      pmml_export='single-row',
...                                      stat_inf=True, tol=0.000001)
>>> lr.fit(data=df, features=['V1', 'V2', 'V3'],
...        label='CATEGORY', categorical_variable=['V3'])
>>> lr.coef_.collect()
                                       VARIABLE_NAME  COEFFICIENT
0                                  __PAL_INTERCEPT__    17.044785
1                                 V1__PAL_DELIMIT__A     0.000000
2                                 V1__PAL_DELIMIT__B    -1.464903
3                                                 V2    -4.819740
4                                 V3__PAL_DELIMIT__0     0.000000
5                                 V3__PAL_DELIMIT__1    -2.794139
6                                 V3__PAL_DELIMIT__2    -4.807858
7                                 V3__PAL_DELIMIT__3    -2.780918
8  {"CONTENT":"{\"impute_model\":{\"column_statis...          NaN
>>> pred_df.collect()
    ID V1     V2  V3
0    0  B  2.620   0
1    1  B  2.875   0
2    2  A  2.320   1
3    3  A  3.215   2
4    4  B  3.440   3
5    5  B  3.460   0
6    6  A  3.570   1
7    7  B  3.190   2
8    8  A  3.150   3
9    9  B  3.440   0
10  10  B  3.440   1
11  11  A  4.070   3
12  12  A  3.730   1
13  13  B  3.780   2
14  14  B  5.250   2
15  15  A  5.424   3
16  16  A  5.345   0
17  17  B  2.200   1
Call predict():
>>> result = lr.predict(data=pred_df,
...                     key='ID',
...                     categorical_variable=['V3'],
...                     thread_ratio=0.1)
>>> result.collect()
    ID CLASS   PROBABILITY
0    0     1  9.503618e-01
1    1     1  8.485210e-01
2    2     1  9.555861e-01
3    3     0  3.701858e-02
4    4     0  2.229129e-02
5    5     0  2.503962e-01
6    6     0  4.945832e-02
7    7     0  9.922085e-03
8    8     0  2.852859e-01
9    9     0  2.689207e-01
10  10     0  2.200498e-02
11  11     0  4.713726e-03
12  12     0  2.349803e-02
13  13     0  5.830425e-04
14  14     0  4.886177e-07
15  15     0  6.938072e-06
16  16     0  1.637820e-04
17  17     1  8.986435e-01
Input data for score():
>>> df_score.collect()
    ID V1     V2  V3  CATEGORY
0    0  B  2.620   0         1
1    1  B  2.875   0         1
2    2  A  2.320   1         1
3    3  A  3.215   2         0
4    4  B  3.440   3         0
5    5  B  3.460   0         0
6    6  A  3.570   1         1
7    7  B  3.190   2         0
8    8  A  3.150   3         0
9    9  B  3.440   0         0
10  10  B  3.440   1         0
11  11  A  4.070   3         0
12  12  A  3.730   1         0
13  13  B  3.780   2         0
14  14  B  5.250   2         0
15  15  A  5.424   3         0
16  16  A  5.345   0         0
17  17  B  2.200   1         1
Call score():
>>> lr.score(data=df_score,
...          key='ID',
...          categorical_variable=['V3'],
...          thread_ratio=0.1)
0.944444
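The score value can be reproduced by hand from the tables in this example. The sketch below uses plain Python lists copied from the CLASS column of the prediction result and the CATEGORY column of the scoring data (IDs 0 to 17); it only illustrates what "mean accuracy" means and does not call PAL.

```python
# CLASS column of the predict() result above, IDs 0..17.
predicted = [1] * 3 + [0] * 14 + [1]
# CATEGORY column of df_score above, IDs 0..17 (ID 6 is labeled 1).
actual = [1, 1, 1, 0, 0, 0, 1] + [0] * 10 + [1]

correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)

print(correct, len(actual))   # 17 of 18 rows agree
print(round(accuracy, 6))     # 0.944444
```

The single disagreement is ID 6, which is labeled 1 in the scoring data but predicted as 0.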
- Attributes
- coef_ : DataFrame
Values of the coefficients.
- result_ : DataFrame
Model content.
- optim_param_ : DataFrame
The optimal parameter set selected via cross-validation. Empty if cross-validation is not activated.
- stat_ : DataFrame
Statistics info for the trained model, structured as follows:
1st column: 'STAT_NAME', NVARCHAR(256)
2nd column: 'STAT_VALUE', NVARCHAR(1000)
- pmml_ : DataFrame
PMML model. Set to None if no PMML model was requested. For multi-class logistic regression, use semistructured_result_ (described below) to get the model in PMML or JSON format.
- semistructured_result_ : DataFrame
A multi-class logistic regression model in PMML or JSON format.
Methods
create_model_state([model, function, ...]): Create PAL model state.
delete_model_state([state]): Delete PAL model state.
fit(data[, key, features, label, ...]): Fit the LR model when given training dataset.
predict(data[, key, features, ...]): Predict with the dataset using the trained model.
score(data[, key, features, label, ...]): Return the mean accuracy on the given test data and labels.
set_model_state(state): Set the model state by state information.
- fit(data, key=None, features=None, label=None, categorical_variable=None, class_map0=None, class_map1=None)
Fit the LR model when given training dataset.
- Parameters
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the label column.
If label is not provided, it defaults to the last column.
- categorical_variable : str or list of str, optional
Specifies INTEGER column(s) that should be treated as categorical.
Otherwise, all INTEGER columns are treated as numerical.
- class_map0 : str, optional
Categorical label to map to 0.
class_map0 is mandatory when the label column type is VARCHAR or NVARCHAR during binary class fit and score.
Only valid if multi_class is not set to True when initializing the class instance.
- class_map1 : str, optional
Categorical label to map to 1.
class_map1 is mandatory when the label column type is VARCHAR or NVARCHAR during binary class fit and score.
Only valid if multi_class is not set to True when initializing the class instance.
- Returns
- LogisticRegression
A fitted object.
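The role of class_map0 and class_map1 for string labels can be pictured with a small stand-alone sketch. It is not hana_ml code: `map_labels` is a hypothetical helper and the label values 'no' and 'yes' are made up, but it shows the mapping that the two parameters describe for a binary VARCHAR/NVARCHAR label column.

```python
def map_labels(labels, class_map0, class_map1):
    """Map two categorical label values to 0 and 1 (hypothetical helper)."""
    mapping = {class_map0: 0, class_map1: 1}
    unknown = sorted(set(labels) - set(mapping))
    if unknown:
        raise ValueError('labels not covered by class_map0/class_map1: %r'
                         % (unknown,))
    return [mapping[label] for label in labels]


# Hypothetical string labels for a binary problem.
labels = ['yes', 'no', 'no', 'yes']
print(map_labels(labels, class_map0='no', class_map1='yes'))
```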
- predict(data, key=None, features=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None, verbose=False, ignore_unknown_category=None)
Predict with the dataset using the trained model.
- Parameters
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- verbose : bool, optional
If True, output scoring probabilities for each class.
Only applicable to the multi-class case.
Defaults to False.
- categorical_variable : str or list of str, optional (deprecated)
Specifies INTEGER column(s) that should be treated as categorical.
Otherwise, all INTEGER columns are treated as numerical.
Mandatory if training data of the prediction model contains such data columns.
- thread_ratio : float, optional
Controls the proportion of available threads to use.
The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- class_map0 : str, optional
Categorical label to map to 0.
class_map0 is mandatory when the label column type is VARCHAR or NVARCHAR during binary class fit and score.
Only valid if multi_class is not set to True when initializing the class instance.
- class_map1 : str, optional
Categorical label to map to 1.
class_map1 is mandatory when the label column type is VARCHAR or NVARCHAR during binary class fit and score.
Only valid if multi_class is not set to True when initializing the class instance.
- ignore_unknown_category : bool, optional
Specifies whether or not to ignore unknown category value.
False : Report error if unknown category value is found.
True : Ignore unknown category value if there is any.
Valid only for multi-class logistic regression.
Defaults to True.
- Returns
- DataFrame
Predicted result, structured as follows:
1: ID column.
2: CLASS, the predicted class name.
3: PROBABILITY, type DOUBLE
multi-class: probability of being predicted as the predicted class.
binary-class: probability of being predicted as the positive class.
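Because the binary-class PROBABILITY column refers to the positive class rather than to the predicted class, a row predicted as the negative class carries a small value. The sketch below converts such a value into the probability of the predicted class; it assumes the positive class is the one mapped to 1, and `predicted_class_probability` is a hypothetical helper. The sample rows are taken from the prediction result in the Examples section.

```python
def predicted_class_probability(predicted_class, positive_probability):
    """Probability of the predicted class, given P(positive class).

    Assumption: the positive class is the class mapped to 1.
    """
    if predicted_class == 1:
        return positive_probability
    return 1.0 - positive_probability


# (ID, CLASS, PROBABILITY) rows from the example output.
rows = [(0, 1, 9.503618e-01), (3, 0, 3.701858e-02)]
for row_id, cls, prob in rows:
    print(row_id, cls, predicted_class_probability(cls, prob))
```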
- score(data, key=None, features=None, label=None, categorical_variable=None, thread_ratio=None, class_map0=None, class_map1=None)
Return the mean accuracy on the given test data and labels.
- Parameters
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the label column.
If label is not provided, it defaults to the last column.
- categorical_variable : str or list of str, optional (deprecated)
Specifies INTEGER column(s) that should be treated as categorical.
Otherwise, all INTEGER columns are treated as numerical.
Mandatory if training data of the prediction model contains such data columns.
- thread_ratio : float, optional
Controls the proportion of available threads to use.
The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to 0.
- class_map0 : str, optional
Categorical label to map to 0.
class_map0 is mandatory when the label column type is VARCHAR or NVARCHAR during binary class fit and score.
Only valid if multi_class is not set to True when initializing the class instance.
- class_map1 : str, optional
Categorical label to map to 1.
class_map1 is mandatory when the label column type is VARCHAR or NVARCHAR during binary class fit and score.
Only valid if multi_class is not set to True when initializing the class instance.
- Returns
- float
Scalar accuracy value after comparing the predicted label and original label.
- create_model_state(model=None, function=None, pal_funcname=None, state_description=None, force=False)
Create PAL model state.
- Parameters
- model : DataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- function : str, optional
Specify the function in the unified API.
A placeholder parameter, not effective for CRF.
- pal_funcname : int or str, optional
PAL function name. Must be a valid PAL procedure that supports model state.
Defaults to self.pal_funcname.
- state_description : str, optional
Description of the state as model container.
Defaults to None.
- force : bool, optional
If True, it will delete the existing state.
Defaults to False.
- set_model_state(state)
Set the model state by state information.
- Parameters
- state : DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values according to NAME.
If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
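When passing a dict, a quick check for the four required keys can catch mistakes before the call. The sketch below is plain Python and not part of hana_ml; `missing_state_keys` is a hypothetical helper, and the placeholder values stand in for real state information that is not shown in this section.

```python
REQUIRED_STATE_KEYS = ('STATE_ID', 'HINT', 'HOST', 'PORT')


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [key for key in REQUIRED_STATE_KEYS if key not in state]


# Placeholder values only; real values come from an existing model state.
state = {'STATE_ID': '<state-id>', 'HINT': '<hint>',
         'HOST': '<host>', 'PORT': '<port>'}

print(missing_state_keys(state))
print(missing_state_keys({'STATE_ID': '<state-id>'}))
```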
- delete_model_state(state=None)
Delete PAL model state.
- Parameters
- state : DataFrame, optional
Specifies the state.
Defaults to self.state.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
Inherited Methods from PALBase
Besides the methods mentioned above, the LogisticRegression class also inherits methods from the PALBase class. Please refer to PAL Base for more details.