NaiveBayes
- class hana_ml.algorithms.pal.naive_bayes.NaiveBayes(alpha=None, discretization=None, model_format=None, thread_ratio=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, alpha_range=None, alpha_values=None, reduction_rate=None, aggressive_elimination=None)
Naive Bayes is a classification algorithm based on Bayes' theorem. It estimates the class-conditional probability by assuming that the attributes are conditionally independent of one another given the class.
- Parameters:
- alphafloat, optional
Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter.
Set value 0 to disable Laplace smoothing.
Defaults to 0.
- discretization{'no', 'supervised'}, optional
Discretize continuous attributes. Case-insensitive.
'no' or not provided: disable discretization.
'supervised': use supervised discretization on all the continuous attributes.
Defaults to 'no'.
- model_format{'json', 'pmml'}, optional
Controls whether to output the model in JSON format or PMML format. Case-insensitive.
'json' or not provided: JSON format.
'pmml': PMML format.
Defaults to 'json'.
- thread_ratiofloat, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
- resampling_methodstr, optional
Specifies the resampling method for model evaluation or parameter selection.
Valid options include: 'cv', 'stratified_cv', 'bootstrap', 'stratified_bootstrap', 'cv_sha', 'stratified_cv_sha', 'bootstrap_sha', 'stratified_bootstrap_sha', 'cv_hyperband', 'stratified_cv_hyperband', 'bootstrap_hyperband', 'stratified_bootstrap_hyperband'.
If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.
No default value.
Note
Resampling method with suffix 'sha' or 'hyperband' is used for parameter selection only, not for model evaluation.
- evaluation_metric{'accuracy', 'f1_score', 'auc'}, optional
Specifies the evaluation metric for model evaluation or parameter selection.
Mandatory if model evaluation or parameter selection is expected.
No default value.
- fold_numint, optional
Specifies the fold number for the cross validation method.
Mandatory and valid only when resampling_method is set to 'cv' or 'stratified_cv'.
No default value.
- repeat_timesint, optional
Specifies the number of repeat times for resampling.
Defaults to 1.
- search_strategy{'grid', 'random'}, optional
Specifies the parameter search method.
No default value.
- random_search_timesint, optional
Specifies the number of times to randomly select candidate parameters for selection.
Mandatory and valid when search_strategy is set to 'random'.
No default value.
- random_stateint, optional
Specifies the seed for random generation.
Use system time when 0 is specified.
Defaults to 0.
- timeoutint, optional
Specifies the maximum running time for model evaluation or parameter selection, in seconds.
No timeout when 0 is specified.
Defaults to 0.
- progress_indicator_idstr, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.
No default value.
- alpha_rangelist of numeric values, optional
Specifies the range for candidate alpha values for parameter selection.
Only valid when search_strategy is specified.
No default value.
- alpha_valueslist of numeric values, optional
Specifies candidate alpha values for parameter selection.
Only valid when search_strategy is specified.
No default value.
- reduction_ratefloat, optional
Specifies the reduction rate for the SHA or Hyperband method.
In each round, the number of available parameter candidates is divided by this value, so a valid value must be greater than 1.0.
Valid only when resampling_method is specified with suffix 'sha' or 'hyperband' (e.g. 'cv_sha', 'stratified_bootstrap_hyperband').
Defaults to 3.0.
- aggressive_eliminationbool, optional
Specifies whether to apply aggressive elimination while using SHA method.
Aggressive elimination applies when the data size and the number of parameter candidates do not match, so that many candidates remain to be searched when the data size reaches its upper limit. If aggressive elimination is applied, the lower bound of the data size is used multiple times first to reduce the number of candidates.
Valid only when resampling_method is specified with suffix 'sha'.
Defaults to False.
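The effect of reduction_rate on the candidate pool can be illustrated with a small stand-alone calculation. This is only a sketch: the helper name candidates_per_round is hypothetical, and rounding down after each division is an assumption, since this section does not state how PAL rounds or how it schedules the data size per round.

```python
import math


def candidates_per_round(n_candidates, reduction_rate=3.0):
    """Return the number of surviving candidates in each elimination round.

    Assumption: the count is rounded down after each division and the
    search stops once a single candidate remains.
    """
    if reduction_rate <= 1.0:
        raise ValueError("reduction_rate must be greater than 1.0")
    sizes = [n_candidates]
    while sizes[-1] > 1:
        sizes.append(max(1, math.floor(sizes[-1] / reduction_rate)))
    return sizes


print(candidates_per_round(27, reduction_rate=3.0))  # [27, 9, 3, 1]
```

A value of 1.0 or less would never shrink the pool, which is why the helper rejects it.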
Examples
>>> nb = NaiveBayes(alpha=1.0, model_format='pmml')
>>> nb.fit(data=df_train)
>>> nb.predict(data=df_predict, key='ID', alpha=1.0, verbose=True).collect()
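The parameter selection options described above might be combined as in the following sketch. It uses only constructor arguments documented in this section; the helper names build_search_kwargs and fit_with_search and all the values are illustrative assumptions, and the fit step needs a live SAP HANA connection and a training DataFrame.

```python
def build_search_kwargs():
    """Constructor arguments for a 5-fold grid search over alpha (illustrative values)."""
    return {
        "resampling_method": "cv",
        "fold_num": 5,                    # mandatory for 'cv' / 'stratified_cv'
        "evaluation_metric": "accuracy",  # mandatory for parameter selection
        "search_strategy": "grid",
        "alpha_values": [0.1, 0.5, 1.0],  # candidate Laplace values
        "random_state": 1,
    }


def fit_with_search(df_train, key=None, label=None):
    """Fit NaiveBayes with parameter selection; df_train is a hana_ml DataFrame."""
    # Imported lazily so the configuration above can be inspected without hana_ml.
    from hana_ml.algorithms.pal.naive_bayes import NaiveBayes

    nb = NaiveBayes(**build_search_kwargs())
    nb.fit(data=df_train, key=key, label=label)
    return nb  # nb.optim_param_ holds the selected parameters
```

Keeping the arguments in a plain dict makes it easy to switch to search_strategy='random', which additionally requires random_search_times.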
- Attributes:
- model_DataFrame
Model content.
Note
The Laplace value (alpha) is only stored by JSON format models. If the PMML format is chosen, you may need to set the Laplace value (alpha) again in predict() and score().
- stats_DataFrame
Statistics.
- optim_param_DataFrame
Selected optimal parameters content.
Methods
create_model_state([model, function, ...]) Create PAL model state.
delete_model_state([state]) Delete PAL model state.
fit(data[, key, features, label, ...]) Fit the model to the training dataset.
get_model_metrics() Get the model metrics.
get_score_metrics() Get the score metrics.
predict(data[, key, features, alpha, ...]) Predict dependent variable values based on a fitted model.
score(data[, key, features, label, alpha]) Returns the mean accuracy on the given test data and labels.
set_model_state(state) Set the model state by state information.
- fit(data, key=None, features=None, label=None, categorical_variable=None)
Fit the model to the training dataset.
- Parameters:
- dataDataFrame
Training data.
- keystr, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- featuresa list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- labelstr, optional
Name of the dependent variable.
Defaults to the last column.
- categorical_variablestr or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- Returns:
- A fitted object of class "NaiveBayes".
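The documented defaults for label and features can be mimicked with a few lines of plain Python. The helper resolve_columns and the column names are hypothetical, and the sketch does not model the case where key is derived from the index of data.

```python
def resolve_columns(columns, key=None, features=None, label=None):
    """Mimic the documented defaults for label and features.

    columns: column names of the training data, in table order.
    key:     name of the ID column, or None if there is none.
    """
    if label is None:
        label = columns[-1]  # label defaults to the last column
    if features is None:
        # features default to every column that is neither ID nor label
        features = [c for c in columns if c != key and c != label]
    return features, label


cols = ["ID", "OUTLOOK", "TEMP", "HUMIDITY", "CLASS"]
print(resolve_columns(cols, key="ID"))
# (['OUTLOOK', 'TEMP', 'HUMIDITY'], 'CLASS')
```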
- predict(data, key=None, features=None, alpha=None, verbose=None, verbose_top_n=None)
Predict dependent variable values based on a fitted model.
- Parameters:
- dataDataFrame
Independent variable values to predict for.
- keystr, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- featuresa list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID columns.
- alphafloat, optional
Laplace smoothing value.
Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter.
Set value 0 to disable Laplace smoothing.
Defaults to the alpha value in the JSON model, if there is one, or 0 otherwise.
- verbosebool, optional
If true, output all classes and the corresponding confidences for each data point.
Defaults to False.
- verbose_top_nint, optional
Specifies the number of top classes to output after sorting by confidence. It cannot exceed the number of classes in the label of the training data. A value of 0 outputs the confidences of all classes.
Effective only when verbose is set to True.
Defaults to 0.
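The meaning of verbose_top_n for a single data point can be sketched in plain Python. The helper top_classes and the confidence values are hypothetical; the actual layout of the verbose prediction result is not described in this section.

```python
def top_classes(confidences, verbose_top_n=0):
    """Return (class, confidence) pairs sorted by descending confidence.

    verbose_top_n == 0 keeps every class, matching the documented meaning.
    """
    if verbose_top_n < 0 or verbose_top_n > len(confidences):
        raise ValueError("verbose_top_n must be between 0 and the number of classes")
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    return ranked if verbose_top_n == 0 else ranked[:verbose_top_n]


scores = {"A": 0.2, "B": 0.7, "C": 0.1}
print(top_classes(scores, verbose_top_n=2))  # [('B', 0.7), ('A', 0.2)]
```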
- Returns:
- DataFrame
Predicted result.
Note
A non-zero Laplace value (alpha) is required if there exist discrete category values that only occur in the test set. It can be read from JSON models or from the parameter alpha in predict(). The Laplace value you set here takes precedence over the values read from JSON models.
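Why an unseen category value needs a non-zero alpha can be seen from the textbook form of additive (Laplace) smoothing. The sketch below assumes that form; this section does not give PAL's exact formula, and the helper name and counts are made up for illustration.

```python
def smoothed_probability(count, class_total, n_categories, alpha=0.0):
    """Estimate P(value | class) with additive (Laplace) smoothing.

    count:        training rows of this class that have the value
    class_total:  training rows of this class
    n_categories: number of distinct values the attribute can take
    """
    return (count + alpha) / (class_total + alpha * n_categories)


# A category value never seen with this class during training:
print(smoothed_probability(0, 10, 3, alpha=0.0))  # 0.0
print(smoothed_probability(0, 10, 3, alpha=1.0))  # 1/13, about 0.0769
```

With alpha set to 0 the estimate is exactly zero, which drives the whole product of conditional probabilities for that class to zero. Any positive alpha keeps it strictly positive.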
- score(data, key=None, features=None, label=None, alpha=None)
Returns the mean accuracy on the given test data and labels.
- Parameters:
- dataDataFrame
Data on which to assess model performance.
- keystr, optional
Name of the ID column.
Mandatory if data is not indexed, or the index of data contains multiple columns.
Defaults to the single index column of data if not provided.
- featuresstr or a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- labelstr, optional
Name of the dependent variable.
Defaults to the last column.
- alphafloat, optional
Laplace smoothing value.
Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter.
Set value 0 to disable Laplace smoothing.
Defaults to the alpha value in the JSON model, if there is one, or 0 otherwise.
- Returns:
- float
Mean accuracy on the given test data and labels.
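Mean accuracy is the fraction of rows whose predicted label equals the true label. A minimal stand-alone sketch, with a hypothetical helper name and made-up labels:

```python
def mean_accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one row is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


print(mean_accuracy(["yes", "no", "yes", "yes"], ["yes", "no", "no", "yes"]))  # 0.75
```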
- create_model_state(model=None, function=None, pal_funcname='PAL_NAIVE_BAYES', state_description=None, force=False)
Create PAL model state.
- Parameters:
- modelDataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- functionstr, optional
Specify the function in the unified API.
A placeholder parameter, not effective for Naive Bayes.
- pal_funcnameint or str, optional
PAL function name.
Defaults to 'PAL_NAIVE_BAYES'.
- state_descriptionstr, optional
Description of the state as model container.
Defaults to None.
- forcebool, optional
If True, the existing state is deleted.
Defaults to False.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100), which must include STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values corresponding to NAME.
If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
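A dict passed as state could be checked for the four required keys before calling set_model_state. The helper below is hypothetical and not part of hana_ml, and the values are placeholders rather than real state information.

```python
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [k for k in REQUIRED_STATE_KEYS if k not in state]


# Placeholder values only; real ones come from a previously created model state.
state = {"STATE_ID": "<state id>", "HINT": "<hint>", "HOST": "<host>", "PORT": "<port>"}
print(missing_state_keys(state))              # []
print(missing_state_keys({"STATE_ID": "x"}))  # ['HINT', 'HOST', 'PORT']
```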
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- stateDataFrame, optional
Specifies the state.
Defaults to self.state.
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
Besides the methods described above, the NaiveBayes class also inherits methods from the PALBase class. Refer to PAL Base for more details.