NaiveBayes

class hana_ml.algorithms.pal.naive_bayes.NaiveBayes(alpha=None, discretization=None, model_format=None, thread_ratio=None, resampling_method=None, evaluation_metric=None, fold_num=None, repeat_times=None, search_strategy=None, random_search_times=None, random_state=None, timeout=None, progress_indicator_id=None, alpha_range=None, alpha_values=None, reduction_rate=None, aggressive_elimination=None)

A classification model based on Bayes' theorem.

Parameters:
alpha : float, optional

Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter.

Set value 0 to disable Laplace smoothing.

Defaults to 0.
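
As an illustration of what the smoothing value does, the sketch below applies the standard additive (Laplace) smoothing formula to a categorical count. The function name `smoothed_probability` is hypothetical, and the formula is the textbook one; this page does not state how PAL computes the estimate internally.

```python
def smoothed_probability(count, class_total, n_categories, alpha=0.0):
    """Additive-smoothing estimate of P(category value | class).

    count        -- rows of the class that carry this category value
    class_total  -- all rows of the class
    n_categories -- number of distinct values of the categorical feature
    alpha        -- smoothing value; 0 disables smoothing
    """
    # ASSUMPTION: textbook additive smoothing, not PAL's confirmed internals.
    return (count + alpha) / (class_total + alpha * n_categories)


# In the sample training data in Examples, none of the 3 'YES' rows is
# 'Married', and MaritalStatus takes 3 distinct values.
print(smoothed_probability(0, 3, 3, alpha=0.0))  # 0.0
print(smoothed_probability(0, 3, 3, alpha=1.0))  # about 0.1667
```

With alpha set to 0 the unseen combination gets probability zero, which is why a positive alpha matters for rare category values.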

discretization : {'no', 'supervised'}, optional

Discretize continuous attributes. Case-insensitive.

  • 'no' or not provided: disable discretization.

  • 'supervised': use supervised discretization on all the continuous attributes.

Defaults to 'no'.

model_format : {'json', 'pmml'}, optional

Controls whether to output the model in JSON format or PMML format. Case-insensitive.

  • 'json' or not provided: JSON format.

  • 'pmml': PMML format.

Defaults to 'json'.

thread_ratio : float, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to 0.
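
The rules above can be read as a small mapping from ratio to thread count. The sketch below is only an illustration: `threads_for_ratio` is a hypothetical helper, and the rounding of fractional results is an assumption, since the source does not specify it.

```python
def threads_for_ratio(thread_ratio, available_threads):
    """Return the thread count implied by thread_ratio.

    Returns None when the ratio lies outside [0, 1], the case in which
    PAL decides the number of threads heuristically.
    """
    if not 0 <= thread_ratio <= 1:
        return None  # out of range: left to PAL's heuristic
    if thread_ratio == 0:
        return 1  # 0 means a single thread
    # ASSUMPTION: round down, but never below one thread.
    return max(1, int(thread_ratio * available_threads))


print(threads_for_ratio(0, 8))    # 1
print(threads_for_ratio(0.5, 8))  # 4
print(threads_for_ratio(1, 8))    # 8
print(threads_for_ratio(2, 8))    # None
```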

resampling_method : str, optional

Specifies the resampling method for model evaluation or parameter selection.

Valid options include: 'cv', 'stratified_cv', 'bootstrap', 'stratified_bootstrap', 'cv_sha', 'stratified_cv_sha', 'bootstrap_sha', 'stratified_bootstrap_sha', 'cv_hyperband', 'stratified_cv_hyperband', 'bootstrap_hyperband', 'stratified_bootstrap_hyperband'.

If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.

No default value.

Note

Resampling methods with the suffix 'sha' or 'hyperband' are used for parameter selection only, not for model evaluation.

evaluation_metric : {'accuracy', 'f1_score', 'auc'}, optional

Specifies the evaluation metric for model evaluation or parameter selection.

Mandatory if model evaluation or parameter selection is expected.

No default value.

fold_num : int, optional

Specifies the number of folds for cross-validation.

Mandatory and valid only when resampling_method is set to 'cv' or 'stratified_cv'.

No default value.

repeat_times : int, optional

Specifies the number of times resampling is repeated.

Defaults to 1.

search_strategy : {'grid', 'random'}, optional

Specifies the parameter search method.

No default value.

random_search_times : int, optional

Specifies the number of times to randomly select candidate parameters for selection.

Mandatory and valid when search_strategy is set to 'random'.

No default value.

random_state : int, optional

Specifies the seed for random number generation.

The system time is used when 0 is specified.

Defaults to 0.

timeout : int, optional

Specifies the maximum running time for model evaluation or parameter selection, in seconds.

No timeout is applied when 0 is specified.

Defaults to 0.

progress_indicator_id : str, optional

Sets the ID of the progress indicator for model evaluation or parameter selection.

No progress indicator is active if no value is provided.

No default value.

alpha_range : list of numeric values, optional

Specifies the range for candidate alpha values for parameter selection.

Only valid when search_strategy is specified.

No default value.

alpha_values : list of numeric values, optional

Specifies candidate alpha values for parameter selection.

Only valid when search_strategy is specified.

No default value.

reduction_rate : float, optional

Specifies the reduction rate in the SHA or Hyperband method.

In each round, the number of available parameter candidates is divided by the value of this parameter. The value must therefore be greater than 1.0.

Valid only when resampling_method is specified with the suffix 'sha' or 'hyperband' (e.g. 'cv_sha', 'stratified_bootstrap_hyperband').

Defaults to 3.0.
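
To make the effect of the reduction rate concrete, the sketch below lists how many candidates would remain after each round. It is an illustration only: `candidate_schedule` is a hypothetical helper, and rounding down to a whole number of candidates is an assumption not stated in the source.

```python
import math


def candidate_schedule(n_candidates, reduction_rate=3.0):
    """List the number of parameter candidates left after each round."""
    if reduction_rate <= 1.0:
        raise ValueError("reduction_rate must be greater than 1.0")
    sizes = [n_candidates]
    while sizes[-1] > 1:
        # ASSUMPTION: round down, keeping at least one candidate.
        sizes.append(max(1, math.floor(sizes[-1] / reduction_rate)))
    return sizes


print(candidate_schedule(27))       # [27, 9, 3, 1]
print(candidate_schedule(10, 2.0))  # [10, 5, 2, 1]
```

A rate of 1.0 or less would never shrink the candidate set, which is why such values are rejected.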

aggressive_elimination : bool, optional

Specifies whether to apply aggressive elimination when using the SHA method.

Aggressive elimination applies when the data size and the number of candidate parameters are mismatched, so that many candidates remain to be searched after the data size has reached its upper limit. When aggressive elimination is enabled, the lower limit of the data size is used for multiple rounds first to reduce the number of candidates.

Valid only when resampling_method is specified with suffix 'sha'.

Defaults to False.
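
Several of the parameters above depend on one another. The following sketch collects those documented constraints in a single check; `check_search_settings` is a hypothetical helper written for illustration, not part of hana_ml, and it covers only the rules stated in this parameter list.

```python
def check_search_settings(**params):
    """Return a list of violated constraints (empty if none are found)."""
    problems = []
    resampling = params.get("resampling_method")
    strategy = params.get("search_strategy")

    if resampling is not None and params.get("evaluation_metric") is None:
        problems.append("evaluation_metric is mandatory")
    if resampling in ("cv", "stratified_cv") and params.get("fold_num") is None:
        problems.append("fold_num is mandatory for 'cv' and 'stratified_cv'")
    if strategy == "random" and params.get("random_search_times") is None:
        problems.append("random_search_times is mandatory for 'random'")
    if strategy is None and (params.get("alpha_range") or params.get("alpha_values")):
        problems.append("alpha_range and alpha_values require search_strategy")

    rate = params.get("reduction_rate")
    if rate is not None:
        if resampling is None or not resampling.endswith(("_sha", "_hyperband")):
            problems.append("reduction_rate requires a 'sha' or 'hyperband' method")
        if rate <= 1.0:
            problems.append("reduction_rate must be greater than 1.0")
    return problems


settings = dict(resampling_method="cv", evaluation_metric="accuracy",
                fold_num=5, search_strategy="grid",
                alpha_values=[0.1, 0.5, 1.0])
print(check_search_settings(**settings))  # []
```

With hana_ml installed, the same keyword arguments would be passed to the constructor as `NaiveBayes(**settings)`.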

Examples

Training data:

>>> df1.collect()
  HomeOwner MaritalStatus  AnnualIncome DefaultedBorrower
0       YES        Single         125.0                NO
1        NO       Married         100.0                NO
2        NO        Single          70.0                NO
3       YES       Married         120.0                NO
4        NO      Divorced          95.0               YES
5        NO       Married          60.0                NO
6       YES      Divorced         220.0                NO
7        NO        Single          85.0               YES
8        NO       Married          75.0                NO
9        NO        Single          90.0               YES

Training the model:

>>> from hana_ml.algorithms.pal.naive_bayes import NaiveBayes
>>> nb = NaiveBayes(alpha=1.0, model_format='pmml')
>>> nb.fit(df1)

Prediction:

>>> df2.collect()
   ID HomeOwner MaritalStatus  AnnualIncome
0   0        NO       Married         120.0
1   1       YES       Married         180.0
2   2        NO        Single          90.0
>>> nb.predict(df2, 'ID', alpha=1.0, verbose=True).collect()
   ID CLASS  CONFIDENCE
0   0    NO   -6.572353
1   0   YES  -23.747252
2   1    NO   -7.602221
3   1   YES -169.133547
4   2    NO   -7.133599
5   2   YES   -4.648640

Attributes:
model_ : DataFrame

Trained model content.

Note

The Laplace value (alpha) is stored only in JSON-format models. If the PMML format is chosen, you may need to set the Laplace value (alpha) again in predict() and score().

stats_ : DataFrame

Trained statistics content.

optim_param_ : DataFrame

Selected optimal parameters content.

Methods

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

fit(data[, key, features, label, ...])

Fit classification model based on training data.

predict(data[, key, features, alpha, verbose])

Predict based on fitted model.

score(data[, key, features, label, alpha])

Returns the mean accuracy on the given test data and labels.

set_model_state(state)

Set the model state by state information.

fit(data, key=None, features=None, label=None, categorical_variable=None)

Fit classification model based on training data.

Parameters:
data : DataFrame

Training data.

key : str, optional

Name of the ID column.

If key is not provided, then:

  • if data is indexed by a single column, then key defaults to that index column;

  • otherwise, it is assumed that data contains no ID column.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column.

categorical_variable : str or ListOfStrings, optional

Specifies INTEGER columns that should be treated as categorical.

Other INTEGER columns will be treated as continuous.
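
The column defaults described above can be sketched in plain Python. `resolve_fit_columns` is a hypothetical helper for illustration only; it works on a list of column names and does not model the case where key is taken from the index of data.

```python
def resolve_fit_columns(columns, key=None, features=None, label=None):
    """Apply the documented defaults for label and features."""
    if label is None:
        label = columns[-1]  # label defaults to the last column
    if features is None:
        # features default to all non-ID, non-label columns
        features = [c for c in columns if c != key and c != label]
    return features, label


columns = ["HomeOwner", "MaritalStatus", "AnnualIncome", "DefaultedBorrower"]
features, label = resolve_fit_columns(columns)
print(features)  # ['HomeOwner', 'MaritalStatus', 'AnnualIncome']
print(label)     # DefaultedBorrower
```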

Returns:
NaiveBayes

A fitted object.

predict(data, key=None, features=None, alpha=None, verbose=None)

Predict based on fitted model.

Parameters:
data : DataFrame

Independent variable values to predict for.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID columns.

alpha : float, optional

Laplace smoothing value.

Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter.

Set value 0 to disable Laplace smoothing.

Defaults to the alpha value in the JSON model, if there is one, or 0 otherwise.

verbose : bool, optional

If True, outputs all classes and the corresponding confidences for each data point.

Defaults to False.

Returns:
DataFrame

Predicted result, structured as follows:

  • ID column, with the same name and type as data's ID column.

  • CLASS, type NVARCHAR, predicted class name.

  • CONFIDENCE, type DOUBLE, confidence for the prediction of the sample, given as the logarithm of the posterior probability.
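
When verbose output is requested, each ID has one row per class. The sketch below shows one way to reduce such rows to a single class per ID, using the values from the prediction example in Examples. The helper `summarize` is hypothetical, and the normalized probability it reports rests on the assumption that CONFIDENCE is a natural-log score, which this page does not confirm.

```python
import math

# Rows as shown in the verbose prediction output: (ID, CLASS, CONFIDENCE).
rows = [
    (0, "NO", -6.572353), (0, "YES", -23.747252),
    (1, "NO", -7.602221), (1, "YES", -169.133547),
    (2, "NO", -7.133599), (2, "YES", -4.648640),
]


def summarize(rows):
    """Map each ID to (most likely class, normalized probability)."""
    scores_by_id = {}
    for row_id, cls, confidence in rows:
        scores_by_id.setdefault(row_id, {})[cls] = confidence
    summary = {}
    for row_id, scores in scores_by_id.items():
        best = max(scores, key=scores.get)
        # ASSUMPTION: natural-log scores. Subtracting the maximum before
        # exponentiating avoids underflow for very negative values.
        total = sum(math.exp(s - scores[best]) for s in scores.values())
        summary[row_id] = (best, 1.0 / total)
    return summary


summary = summarize(rows)
print({row_id: cls for row_id, (cls, _) in summary.items()})
# {0: 'NO', 1: 'NO', 2: 'YES'}
```

Choosing the class with the largest confidence does not depend on the logarithm base; only the normalized probability does.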

Note

A non-zero Laplace value (alpha) is required if there exist discrete category values that only occur in the test set. It can be read from JSON models or from the parameter alpha in predict(). The Laplace value you set here takes precedence over the values read from JSON models.
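
The precedence described in this note can be written as a short rule. `effective_alpha` is a hypothetical helper for illustration; for a PMML model, which does not store alpha, `json_model_alpha` would be None.

```python
def effective_alpha(predict_alpha=None, json_model_alpha=None):
    """Pick the Laplace value used at prediction time."""
    if predict_alpha is not None:
        return predict_alpha       # value passed to predict() wins
    if json_model_alpha is not None:
        return json_model_alpha    # otherwise the value stored in a JSON model
    return 0.0                     # otherwise smoothing is disabled


print(effective_alpha(predict_alpha=0.5, json_model_alpha=1.0))  # 0.5
print(effective_alpha(json_model_alpha=1.0))                     # 1.0
print(effective_alpha())                                         # 0.0
```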

score(data, key=None, features=None, label=None, alpha=None)

Returns the mean accuracy on the given test data and labels.

Parameters:
data : DataFrame

Data on which to assess model performance.

key : str, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable.

Defaults to the last column.

alpha : float, optional

Laplace smoothing value.

Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter.

Set value 0 to disable Laplace smoothing.

Defaults to the alpha value in the JSON model, if there is one, or 0 otherwise.

Returns:
float

Mean accuracy on the given test data and labels.
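
For reference, mean accuracy is the fraction of rows whose predicted class equals the true label. The sketch below computes it on plain Python lists; `mean_accuracy` is a hypothetical helper, not the code path hana_ml uses.

```python
def mean_accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one label is required")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


print(mean_accuracy(["NO", "NO", "YES", "YES"], ["NO", "NO", "NO", "YES"]))  # 0.75
```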

create_model_state(model=None, function=None, pal_funcname='PAL_NAIVE_BAYES', state_description=None, force=False)

Create PAL model state.

Parameters:
model : DataFrame, optional

Specifies the model for the AFL state.

Defaults to self.model_.

function : str, optional

Specifies the function in the unified API.

A placeholder parameter, not effective for Naive Bayes.

pal_funcname : int or str, optional

PAL function name.

Defaults to 'PAL_NAIVE_BAYES'.

state_description : str, optional

Description of the state as model container.

Defaults to None.

force : bool, optional

If True, deletes the existing state.

Defaults to False.

set_model_state(state)

Set the model state by state information.

Parameters:
state : DataFrame or dict

If state is a DataFrame, it has the following structure:

  • NAME: VARCHAR(100), which must include STATE_ID, HINT, HOST and PORT.

  • VALUE: VARCHAR(1000), the values corresponding to NAME.

If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
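
A dict passed as state can be checked for the required keys before use. The sketch below is a hypothetical helper, and the example values are placeholders rather than real state information.

```python
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [key for key in REQUIRED_STATE_KEYS if key not in state]


# Placeholder values for illustration only.
state = {"STATE_ID": "<state-id>", "HINT": "<hint>", "HOST": "<host>"}
print(missing_state_keys(state))  # ['PORT']
```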

delete_model_state(state=None)

Delete PAL model state.

Parameters:
state : DataFrame, optional

Specifies the state.

Defaults to self.state.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods described above, the NaiveBayes class also inherits methods from the PALBase class. Refer to PAL Base for more details.