Naive Bayes

hanaml.NaiveBayes is an R wrapper for SAP HANA PAL Naive Bayes.

hanaml.NaiveBayes(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  formula = NULL,
  alpha = NULL,
  discretization = NULL,
  model.format = NULL,
  categorical.variable = NULL,
  thread.ratio = NULL,
  resampling.method = NULL,
  evaluation.metric = NULL,
  fold.num = NULL,
  repeat.times = NULL,
  param.search.strategy = NULL,
  random.search.times = NULL,
  random.state = NULL,
  timeout = NULL,
  progress.indicator.id = NULL,
  parameter.range = NULL,
  parameter.values = NULL,
  reduction.rate = NULL,
  aggressive.elimination = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.

formula

formula type, optional
Formula to be used for model generation, in the form label~<feature_list>, e.g. formula=CATEGORY~V1+V2+V3.
Provide either a formula or a features and label combination, but not both.
Defaults to NULL.

alpha

double, optional
Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set value 0 to disable Laplace smoothing.
Defaults to 0.
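
As a minimal sketch of what the smoothing value does, the base R snippet below applies the textbook Laplace (additive) smoothing formula, (count + alpha) / (class size + alpha * number of categories), to the MARITALSTATUS column of the data shown in the Examples section. The helper smoothed_conditional is hypothetical and runs locally without a HANA connection; the exact computation inside PAL may differ in detail.

```r
# MARITALSTATUS and DEFAULTEDBORROWER columns of the example data.
marital <- factor(c("Single", "Married", "Single", "Married", "Divorced",
                    "Married", "Divorced", "Single", "Married", "Single"))
defaulted <- c("NO", "NO", "NO", "NO", "YES", "NO", "NO", "YES", "NO", "YES")

# Hypothetical helper: estimate P(category | class) with additive smoothing.
smoothed_conditional <- function(x, y, class, alpha = 0) {
  # x is a factor, so table() keeps categories with zero count in this class.
  counts <- table(x[y == class])
  probs <- (as.vector(counts) + alpha) / (sum(counts) + alpha * nlevels(x))
  setNames(probs, names(counts))
}

p.raw      <- smoothed_conditional(marital, defaulted, "YES", alpha = 0)
p.smoothed <- smoothed_conditional(marital, defaulted, "YES", alpha = 1)
print(p.raw)
print(p.smoothed)
```

Without smoothing, a category that never occurs together with a class (here "Married" with "YES") gets probability 0, which forces the whole product of conditional probabilities to 0 for that class. A positive alpha keeps every estimate strictly positive.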

discretization

c("no", "supervised"), optional
Discretize continuous attributes.

"no": disable discretization.
"supervised": use supervised discretization on all the continuous attributes.

Defaults to "no".

model.format

c("json", "pmml"), optional
Controls whether to output the model in JSON format or PMML format.

"json": JSON format.
"pmml": PMML format.

Defaults to "json".

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
By default, the treatment depends on the column type of the input:

"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.

resampling.method

character, optional
Specifies the resampling method for model evaluation or parameter selection.
Valid options are listed as follows:
"cv", "stratified_cv", "bootstrap", "stratified_bootstrap", "cv_sha", "stratified_cv_sha", "bootstrap_sha", "stratified_bootstrap_sha", "cv_hyperband", "stratified_cv_hyperband", "bootstrap_hyperband", "stratified_bootstrap_hyperband".
Note that resampling methods with suffix "sha" or "hyperband" are only applicable to parameter selection, not model evaluation.
If no value is specified, neither model evaluation nor parameter selection is activated.
No default value.

evaluation.metric

character, optional
Specifies the evaluation metric for model evaluation or parameter selection.
Must be specified together with resampling.method to activate model evaluation or parameter selection.
Currently valid evaluation metrics include: "accuracy", "f1_score", "auc".
No default value.

fold.num

integer, optional
Specifies the number of folds for cross-validation (cv).
Mandatory and valid only when resampling.method is specified and contains "cv" as substring, e.g. "stratified_cv", "cv_hyperband".

repeat.times

numeric, optional
Specifies the number of repeat times for resampling.
Defaults to 1.

param.search.strategy

c("grid", "random"), optional
Specifies the method to activate parameter selection.
If not specified, model parameter selection shall not be triggered.
Defaults to "random" and cannot be changed if resampling.method is set as one of the following: "cv_hyperband", "bootstrap_hyperband", "stratified_cv_hyperband", "stratified_bootstrap_hyperband"; otherwise no default value.

random.search.times

integer, optional
Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid only when param.search.strategy is set as "random", or when resampling.method is set as one of the following: "cv_hyperband", "bootstrap_hyperband", "stratified_cv_hyperband", "stratified_bootstrap_hyperband".

random.state

numeric, optional
Specifies the seed for random generation.
Use system time when 0 is specified.

timeout

integer, optional
Specifies maximum running time for model evaluation or parameter selection in seconds.
No timeout when 0 is specified.

progress.indicator.id

character, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.

parameter.range

list, optional
Specifies range of the following parameter for parameter selection:
alpha.
Parameter range should be specified by 3 numbers in the form of c(start, step, end).
Examples:

parameter.range <- list(alpha = c(0.01, 0.01, 0.1))

This means taking alpha values from 0.01 to 0.1 with a step size of 0.01, i.e. 0.01, 0.02, 0.03, ..., 0.09, 0.1.
If param.search.strategy is "random", the middle term (step) has no effect and can be omitted.
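
The c(start, step, end) convention can be illustrated locally in base R. The snippet below is only a sketch: expand_range is a hypothetical helper, not part of hana.ml.r, and it assumes the grid is the plain arithmetic sequence described above.

```r
# Hypothetical helper: expand c(start, step, end) into the candidate grid.
expand_range <- function(range) {
  stopifnot(is.numeric(range), length(range) == 3, range[2] > 0)
  seq(from = range[1], to = range[3], by = range[2])
}

parameter.range <- list(alpha = c(0.01, 0.01, 0.1))
alpha.grid <- expand_range(parameter.range$alpha)
print(alpha.grid)
```

Listing the grid this way is a quick check on how many candidates a "grid" search would have to evaluate before submitting the job.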

parameter.values

list, optional
Specifies values of the following parameter for parameter selection:
alpha.
Example:

parameter.values <- list(alpha = c(0.001, 0.003, 0.007, 0.01))

reduction.rate

numeric, optional
Specifies the reduction rate of available size of hyper-parameter candidates.
In each round, the number of available parameter candidates is divided by the value of this parameter, so valid values must be greater than 1.0.
Defaults to 3.0.
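
To make the effect of the reduction rate concrete, the following base R sketch lists how many candidates would remain after each round. The helper sha_schedule is hypothetical, and rounding down with a minimum of one candidate is an assumption made for illustration; PAL may round differently.

```r
# Hypothetical helper: candidate counts per round of successive halving.
sha_schedule <- function(n.candidates, reduction.rate = 3.0) {
  stopifnot(reduction.rate > 1.0, n.candidates >= 1)
  sizes <- n.candidates
  while (tail(sizes, 1) > 1) {
    # Assumed rounding rule: floor, but never fewer than one candidate.
    next.size <- max(1, floor(tail(sizes, 1) / reduction.rate))
    sizes <- c(sizes, next.size)
  }
  sizes
}

schedule <- sha_schedule(27, reduction.rate = 3.0)
print(schedule)
```

A larger reduction rate shrinks the candidate pool faster and therefore needs fewer rounds, at the cost of discarding candidates on less evidence.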

aggressive.elimination

logical, optional
Specifies whether to perform aggressive elimination behavior for successive-halving algorithm or not.
When set to TRUE, it will eliminate more parameter candidates than expected (as defined via reduction.rate).
This can enhance the run-time performance but could result in sub-optimal hyper-parameter candidate.
Valid only when resampling.method is specified with suffix "sha". Defaults to FALSE.

Value

Returns an R6 object of class "NaiveBayes", with the following attributes and methods:
Attributes

model: DataFrame
Naive Bayes model information.
statistics: DataFrame
Statistics information.
optim.param: DataFrame
Selected optimal parameters.

Methods

CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)

Usage:


   > nb <- hanaml.NaiveBayes(data=df)
   > nb$CreateModelState()

Arguments:

model: DataFrame
DataFrame containing the model for parsing.
Defaults to self$model.
algorithm: character
Specifies the PAL algorithm associated with model.
Defaults to self$pal.algorithm.
func: character
Specifies the functionality for Unified Classification/Regression.
Valid only for object instance of R6Class "UnifiedClassification" or "UnifiedRegression".
Defaults to self$func.
state.description: character
A summary string for the generated model state.
Defaults to "ModelState".
force: logical
Specifies whether or not to replace the existing state for the model.
Defaults to FALSE.

After calling this method, an attribute state that contains the parsed info for model shall be assigned to the corresponding R6 object.

DeleteModelState(state=NULL)

Usage:
Assuming we have trained a hanaml model and created its model state, like the following:


   > nb <- hanaml.NaiveBayes(data=df)
   > nb$CreateModelState()

After using the model state for real-time scoring, we can delete the state by calling:


   > nb$DeleteModelState()

Arguments:

state: DataFrame
DataFrame containing the state info.
Defaults to self$state.

After calling this method, the specified model state shall be cleaned up and associated memory be released.

Details

Naive Bayes is a classification algorithm based on Bayes' theorem. It estimates the class-conditional probability by assuming that the attributes are conditionally independent of one another given the class.
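
The independence assumption means the posterior of a class is proportional to its prior multiplied by one conditional probability per attribute. The base R sketch below works this out for the two categorical columns of the data shown in the Examples section. It is an illustration only: naive_bayes_posterior is a hypothetical helper that runs locally, it uses textbook Laplace smoothing with alpha = 1, and it is not the PAL implementation.

```r
# Categorical columns of the example data, rebuilt as a local data.frame.
train <- data.frame(
  HOMEOWNER = factor(c("YES", "NO", "NO", "YES", "NO",
                       "NO", "YES", "NO", "NO", "NO")),
  MARITALSTATUS = factor(c("Single", "Married", "Single", "Married", "Divorced",
                           "Married", "Divorced", "Single", "Married", "Single")),
  DEFAULTEDBORROWER = factor(c("NO", "NO", "NO", "NO", "YES",
                               "NO", "NO", "YES", "NO", "YES"))
)

# Hypothetical helper: posterior class probabilities for one new record.
naive_bayes_posterior <- function(train, label, new.row, alpha = 1) {
  features <- setdiff(names(train), label)
  classes <- levels(train[[label]])
  scores <- sapply(classes, function(cl) {
    in.class <- train[[label]] == cl
    score <- mean(in.class)  # class prior
    for (f in features) {
      n.match <- sum(train[[f]][in.class] == new.row[[f]])
      # Smoothed estimate of P(feature value | class).
      score <- score * (n.match + alpha) /
        (sum(in.class) + alpha * nlevels(train[[f]]))
    }
    score
  })
  scores / sum(scores)  # normalize so the posteriors sum to 1
}

posterior <- naive_bayes_posterior(
  train, "DEFAULTEDBORROWER",
  list(HOMEOWNER = "NO", MARITALSTATUS = "Single")
)
print(posterior)
```

Continuous attributes such as ANNUALINCOME are left out of this sketch; they need either a density estimate per class or discretization (see the discretization argument).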

Note

The Laplace value (alpha) is only stored by JSON format models. If the PMML format is chosen, you may need to set the Laplace value (alpha) again in predict() and score().

Examples

Input DataFrame df:


> df$Collect()
   ID  HOMEOWNER MARITALSTATUS  ANNUALINCOME DEFAULTEDBORROWER
1   0        YES        Single         125.0                NO
2   1         NO       Married         100.0                NO
3   2         NO        Single          70.0                NO
4   3        YES       Married         120.0                NO
5   4         NO      Divorced          95.0               YES
6   5         NO       Married          60.0                NO
7   6        YES      Divorced         220.0                NO
8   7         NO        Single          85.0               YES
9   8         NO       Married          75.0                NO
10  9         NO        Single          90.0               YES

Call the function:


> nb <- hanaml.NaiveBayes(data = df, alpha = 1.0,
                          model.format = "pmml", thread.ratio = 0.2,
                          features = list("HOMEOWNER", "MARITALSTATUS", "ANNUALINCOME"),
                          label = "DEFAULTEDBORROWER")
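
Parameter selection is driven entirely by the arguments documented above. The sketch below assembles one consistent set of them for a 5-fold cross-validated grid search over alpha. The values are illustrative assumptions, and the actual call is left as a comment because it needs a live SAP HANA connection and the DataFrame df shown above.

```r
# Illustrative argument set for a cross-validated grid search over alpha.
search.args <- list(
  resampling.method     = "cv",
  evaluation.metric     = "accuracy",
  fold.num              = 5,       # mandatory because the method contains "cv"
  param.search.strategy = "grid",
  parameter.values      = list(alpha = c(0.001, 0.003, 0.007, 0.01)),
  random.state          = 1
)

# With a live connection, the arguments could be passed on like this:
# nb.cv <- do.call(hanaml.NaiveBayes,
#                  c(list(data = df,
#                         features = list("HOMEOWNER", "MARITALSTATUS", "ANNUALINCOME"),
#                         label = "DEFAULTEDBORROWER"),
#                    search.args))
# nb.cv$optim.param$Collect()   # selected optimal parameters
str(search.args)
```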
