Naive Bayes

hanaml.NaiveBayes is an R wrapper for SAP HANA PAL Naive Bayes.

hanaml.NaiveBayes(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  formula = NULL,
  alpha = NULL,
  discretization = NULL,
  model.format = NULL,
  categorical.variable = NULL,
  thread.ratio = NULL,
  resampling.method = NULL,
  evaluation.metric = NULL,
  fold.num = NULL,
  repeat.times = NULL,
  param.search.strategy = NULL,
  random.search.times = NULL,
  random.state = NULL,
  timeout = NULL,
  progress.indicator.id = NULL,
  parameter.range = NULL,
  parameter.values = NULL
)

Arguments

data	`DataFrame` DataFrame containing the data.
key	`character, optional` Name of the ID column. If not provided, the data is assumed to have no ID column. No default value.
features	`character or list of characters, optional` Names of the feature columns. If not provided, defaults to all non-key, non-label columns of data.
label	`character, optional` Name of the column which specifies the dependent variable. Defaults to the last column of data if not provided.
formula	`formula type, optional` Formula to be used for model generation, in the form label~<feature_list>, e.g. formula=CATEGORY~V1+V2+V3. Provide either the formula or a features and label combination, but not both. Defaults to NULL.
alpha	`double, optional` Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set value 0 to disable Laplace smoothing. Defaults to 0.
discretization	`c('no', 'supervised'), optional` Discretize continuous attributes. 'no': disable discretization. 'supervised': use supervised discretization on all the continuous attributes. Defaults to 'no'.
model.format	`c('json', 'pmml'), optional` Controls whether to output the model in JSON or PMML format. 'json': JSON format. 'pmml': PMML format. Defaults to 'json'.
categorical.variable	`character or list/vector of characters, optional` Indicates which features should be treated as categorical variables. The default behavior depends on the column type: "VARCHAR" and "NVARCHAR" columns are categorical; "INTEGER" and "DOUBLE" columns are continuous. Valid only for columns of "INTEGER" type; ignored otherwise. No default value.
thread.ratio	`double, optional` Controls the proportion of available threads that can be used by this function. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored. Defaults to 0.
resampling.method	`character, optional` Specifies the resampling method for model evaluation or parameter selection. Valid resampling methods are: "cv", "stratified_cv", "bootstrap", "stratified_bootstrap". If no value is specified, neither model evaluation nor parameter selection is activated.
evaluation.metric	`character, optional` Specifies the evaluation metric for model evaluation or parameter selection. Currently valid evaluation metrics include: "accuracy", "f1_score", "auc". Mandatory for activating model evaluation/parameter selection.
fold.num	`integer, optional` Specifies the number of folds for cross-validation. Mandatory and valid only when `resampling.method` is "cv" or "stratified_cv".
repeat.times	`numeric, optional` Specifies the number of repeat times for resampling. Defaults to 1.
param.search.strategy	`c("grid", "random"), optional` Specifies the method to activate parameter selection. If not specified, model parameter selection shall not be triggered.
random.search.times	`integer, optional` Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid only when `param.search.strategy` is "random".
random.state	`numeric, optional` Specifies the seed for random generation. Use system time when 0 is specified.
timeout	`integer, optional` Specifies maximum running time for model evaluation or parameter selection in seconds. No timeout when 0 is specified.
progress.indicator.id	`character, optional` Sets an ID of progress indicator for model evaluation or parameter selection. No progress indicator is active if no value is provided.
parameter.range	`list, optional` Specifies range of the following parameter for parameter selection: `alpha`. Parameter range should be specified by 3 numbers in the form of c(start, step, end). Examples: parameter.range <- list(alpha = c(0.01, 0.01, 0.1)), which means taking `alpha` values from 0.01 to 0.1 with 0.01 being the step size, i.e. 0.01, 0.02, 0.03, ..., 0.09, 0.1. If `param.search.strategy` is 'random', then the middle term, i.e. step has no effect and thus can be omitted.
parameter.values	`list, optional` Specifies values of the following parameter for parameter selection: `alpha`. Example: parameter.values <- list(alpha = c(0.001, 0.003, 0.007, 0.01))
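
As a quick check of the c(start, step, end) convention, the candidate values implied by the parameter.range example can be reproduced in base R. This sketch only illustrates the arithmetic described above; the variable names are illustrative, and how PAL enumerates the grid internally is not shown here.

```r
# Range in the documented c(start, step, end) form.
alpha.range <- c(0.01, 0.01, 0.1)

# Expand into explicit candidate values, rounding away floating-point noise.
alpha.grid <- round(seq(from = alpha.range[1], to = alpha.range[3],
                        by = alpha.range[2]), 2)

# The same candidates written out explicitly, in the shape parameter.values expects.
parameter.values <- list(alpha = alpha.grid)
print(parameter.values$alpha)
```

Writing the grid out this way is useful when the candidates are not evenly spaced, since parameter.values accepts an arbitrary vector.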

Value

Returns a "NaiveBayes" object with the following values:

model: DataFrame
Naive Bayes model information.
statistics: DataFrame
Statistics information.
optim.param: DataFrame
Selected optimal parameters.

Details

Naive Bayes is a classification algorithm based on Bayes' theorem. It estimates the class-conditional probability by assuming that the attributes are conditionally independent of one another.
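
The independence assumption can be written out as follows. This is the standard textbook formulation, with notation chosen here for illustration (c for a class, x_1, ..., x_d for the attribute values of one record); it is not taken from the PAL documentation.

```latex
P(c \mid x_1, \dots, x_d) \;\propto\; P(c) \prod_{j=1}^{d} P(x_j \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{j=1}^{d} P(x_j \mid c)
```

The product over attributes is what the conditional-independence assumption buys: each factor P(x_j | c) can be estimated from one column at a time.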

Note

The Laplace value (alpha) is only stored by JSON format models. If the PMML format is chosen, you may need to set the Laplace value (alpha) again in predict() and score().
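
To give a feel for what the Laplace value does, the sketch below applies the common additive-smoothing formula, (count + alpha) / (class total + alpha * number of categories), to the HOMEOWNER column of the example data from the Examples section. The formula is an assumption used for illustration; PAL's exact computation may differ.

```r
# HOMEOWNER and DEFAULTEDBORROWER columns copied from the example DataFrame.
homeowner <- c("YES", "NO", "NO", "YES", "NO", "NO", "YES", "NO", "NO", "NO")
defaulted <- c("NO", "NO", "NO", "NO", "YES", "NO", "NO", "YES", "NO", "YES")

alpha <- 1  # Laplace value; 0 would disable smoothing

# Rows: class (DEFAULTEDBORROWER); columns: category (HOMEOWNER).
counts <- table(defaulted, homeowner)

# Assumed additive smoothing: add alpha to every cell and renormalise per class.
smoothed <- (counts + alpha) / (rowSums(counts) + alpha * ncol(counts))
print(smoothed)
```

Without smoothing, the estimate for HOMEOWNER = YES among defaulted borrowers would be 0/3; with alpha = 1 it becomes 1/5, so a category that never occurs within a class no longer forces the whole product of conditional probabilities to zero.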

Examples

Input DataFrame df:

> df$Collect()
   ID  HOMEOWNER MARITALSTATUS  ANNUALINCOME DEFAULTEDBORROWER
1   0        YES        Single         125.0                NO
2   1         NO       Married         100.0                NO
3   2         NO        Single          70.0                NO
4   3        YES       Married         120.0                NO
5   4         NO      Divorced          95.0               YES
6   5         NO       Married          60.0                NO
7   6        YES      Divorced         220.0                NO
8   7         NO        Single          85.0               YES
9   8         NO       Married          75.0                NO
10  9         NO        Single          90.0               YES

Call the function:

> nb <- hanaml.NaiveBayes(data = df, alpha = 1.0,
                          model.format = "pmml", thread.ratio = 0.2,
                          features = list("HOMEOWNER", "MARITALSTATUS", "ANNUALINCOME"),
                          label = "DEFAULTEDBORROWER")
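
The resampling and parameter-selection arguments can be combined along the following lines. This is a minimal sketch, not output from a real session: it uses only argument names from the Arguments list, the specific settings (5 folds, accuracy, grid search) and the helper name fit.with.cv are hypothetical, and it has not been run against a HANA instance.

```r
# Hypothetical settings for cross-validated selection of alpha.
cv.args <- list(
  features = list("HOMEOWNER", "MARITALSTATUS", "ANNUALINCOME"),
  label = "DEFAULTEDBORROWER",
  resampling.method = "cv",        # activates model evaluation
  evaluation.metric = "accuracy",  # mandatory once resampling is active
  fold.num = 5L,                   # mandatory for "cv" / "stratified_cv"
  param.search.strategy = "grid",  # activates parameter selection
  parameter.range = list(alpha = c(0.01, 0.01, 0.1))
)

# Hypothetical helper: fits the model on a hanaml DataFrame with the settings
# above. hanaml.NaiveBayes is only looked up when the helper is called.
fit.with.cv <- function(data) {
  do.call(hanaml.NaiveBayes, c(list(data = data), cv.args))
}
```

With a connected session, calling fit.with.cv(df) would be expected to return a "NaiveBayes" object whose optim.param DataFrame holds the selected alpha, as described under Value.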
