hanaml.NaiveBayes is an R wrapper for SAP HANA PAL Naive Bayes.
hanaml.NaiveBayes(
data = NULL,
key = NULL,
features = NULL,
label = NULL,
formula = NULL,
alpha = NULL,
discretization = NULL,
model.format = NULL,
categorical.variable = NULL,
thread.ratio = NULL,
resampling.method = NULL,
evaluation.metric = NULL,
fold.num = NULL,
repeat.times = NULL,
param.search.strategy = NULL,
random.search.times = NULL,
random.state = NULL,
timeout = NULL,
progress.indicator.id = NULL,
parameter.range = NULL,
parameter.values = NULL
)
Arguments
| data |
DataFrame
DataFrame containing the data.
|
| key |
character, optional
Name of the ID column.
If not provided, the data is assumed to have no ID column.
No default value.
|
| features |
character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.
|
| label |
character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.
|
| formula |
formula type, optional
Formula to be used for model generation.
format = label~<feature_list>
e.g.: formula=CATEGORY~V1+V2+V3
You can either give the formula,
or a feature and label combination, but do not provide both.
Defaults to NULL.
|
| alpha |
double, optional
Laplace smoothing value. Set a positive value to enable Laplace smoothing
for categorical variables and use that value as the smoothing parameter.
Set value 0 to disable Laplace smoothing.
Defaults to 0.
|
| discretization |
c('no', 'supervised'), optional
Discretize continuous attributes.
Defaults to 'no'. |
| model.format |
c('json', 'pmml'), optional
Controls whether to output the model in JSON format or PMML format.
Defaults to 'json'. |
| categorical.variable |
character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default treatment of a feature depends on its column type.
This argument is valid only for columns of "INTEGER" type and is ignored otherwise.
No default value. |
| thread.ratio |
double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads.
Values between 0 and 1 will use up to
that percentage of available threads. Values outside this
range are ignored.
Defaults to 0.
|
| resampling.method |
character, optional
Specifies the resampling method, chosen from the list below.
Valid resampling methods include:
"cv", "stratified_cv", "bootstrap", "stratified_bootstrap".
If no value is specified, neither model evaluation
nor parameter selection is activated.
|
| evaluation.metric |
character, optional
Specifies the evaluation metric for model evaluation or parameter selection.
Currently valid evaluation metrics include: "accuracy", "f1_score", "auc".
Mandatory for activating model evaluation/parameter selection.
|
| fold.num |
integer, optional
Specifies the number of folds for cross-validation (cv).
Mandatory and valid only when resampling.method is "cv" or "stratified_cv".
|
| repeat.times |
numeric, optional
Specifies the number of repeat times for resampling.
Defaults to 1.
|
| param.search.strategy |
c("grid", "random"), optional
Specifies the method to activate parameter selection.
If not specified, model parameter selection shall not be triggered.
|
| random.search.times |
integer, optional
Specifies the number of times to randomly select candidate parameters for selection.
Mandatory and valid only when param.search.strategy is "random".
|
| random.state |
numeric, optional
Specifies the seed for random generation.
Use system time when 0 is specified.
|
| timeout |
integer, optional
Specifies maximum running time for model evaluation or parameter selection in seconds.
No timeout when 0 is specified.
|
| progress.indicator.id |
character, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.
|
| parameter.range |
list, optional
Specifies range of the following parameter for parameter selection:
alpha.
Parameter range should be specified by 3 numbers in the form of c(start, step, end).
Examples:
parameter.range <- list(alpha = c(0.01, 0.01, 0.1)), which means taking
alpha values from 0.01 to 0.1 with 0.01 being the step size, i.e.
0.01, 0.02, 0.03, ..., 0.09, 0.1.
If param.search.strategy is 'random', then the middle term,
i.e. step, has no effect and thus can be omitted.
|
| parameter.values |
list, optional
Specifies values of the following parameter for parameter selection:
alpha.
Example: parameter.values <- list(alpha = c(0.001, 0.003, 0.007, 0.01))
|
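The c(start, step, end) convention described for parameter.range can be illustrated in base R. This is only a sketch of the candidate values the description implies for a grid search; how PAL enumerates them internally is an assumption, and expand.range and search.args are hypothetical helper names, not part of hanaml. The commented call at the end uses only arguments documented above and needs a HANA-backed DataFrame df, which is not created here.

```r
# Hypothetical helper: expand c(start, step, end) into candidate values.
expand.range <- function(rng) {
  seq(from = rng[1], to = rng[3], by = rng[2])
}

parameter.range <- list(alpha = c(0.01, 0.01, 0.1))
alpha.grid <- expand.range(parameter.range$alpha)
print(alpha.grid)

# Arguments that, per the descriptions above, must be combined to
# activate a grid search with cross-validation.
search.args <- list(
  resampling.method     = "cv",
  evaluation.metric     = "accuracy",
  fold.num              = 5,
  param.search.strategy = "grid",
  parameter.range       = parameter.range
)

# With a HANA-backed DataFrame `df` available, the call might look like:
# nb <- do.call(hanaml.NaiveBayes,
#               c(list(data = df, label = "DEFAULTEDBORROWER"), search.args))
```

For a fixed, irregular set of candidates, parameter.values takes the list of values directly and no expansion is needed.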
Value
Returns a "NaiveBayes" object with the following values:
model: DataFrame
Naive Bayes model information.
statistics: DataFrame
Statistics information.
optim.param: DataFrame
Selected optimal parameters.
optim.param: DataFrame
Selected optimal parameters.
Details
Naive Bayes is a classification algorithm based on Bayes' theorem.
It estimates the class-conditional probability by assuming that
the attributes are conditionally independent of one another.
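To make the independence assumption and the role of alpha concrete, here is a minimal base R sketch of a categorical-only naive Bayes classifier on the two categorical columns of the example table from the Examples section. It uses the textbook Laplace smoothing formula, (count + alpha) / (class count + alpha * number of levels); whether PAL applies exactly this formula is an assumption, and cond.prob and posterior are hypothetical names. It does not call HANA.

```r
# Categorical columns of the example table, held locally.
train <- data.frame(
  HOMEOWNER = c("YES", "NO", "NO", "YES", "NO", "NO", "YES", "NO", "NO", "NO"),
  MARITALSTATUS = c("Single", "Married", "Single", "Married", "Divorced",
                    "Married", "Divorced", "Single", "Married", "Single"),
  DEFAULTEDBORROWER = c("NO", "NO", "NO", "NO", "YES",
                        "NO", "NO", "YES", "NO", "YES"),
  stringsAsFactors = TRUE
)

# P(feature = value | class) with Laplace smoothing (textbook form).
cond.prob <- function(feature, value, class, alpha) {
  in.class <- train$DEFAULTEDBORROWER == class
  count <- sum(train[[feature]][in.class] == value)
  (count + alpha) / (sum(in.class) + alpha * nlevels(train[[feature]]))
}

# Posterior over classes: prior times the product of per-feature
# conditionals (the conditional independence assumption), normalized.
posterior <- function(homeowner, marital, alpha = 1) {
  classes <- levels(train$DEFAULTEDBORROWER)
  score <- sapply(classes, function(cl) {
    prior <- mean(train$DEFAULTEDBORROWER == cl)
    prior *
      cond.prob("HOMEOWNER", homeowner, cl, alpha) *
      cond.prob("MARITALSTATUS", marital, cl, alpha)
  })
  score / sum(score)
}

post <- posterior("NO", "Married", alpha = 1)
print(post)
```

In this table no defaulting borrower is married, so with alpha = 0 the probability of "YES" for a married applicant collapses to zero; a positive alpha keeps it small but non-zero.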
Note
The Laplace value (alpha) is only stored by JSON format models.
If the PMML format is chosen, you may need to set the Laplace value (alpha)
again in predict() and score().
Examples
Input DataFrame df:
> df$Collect()
   ID HOMEOWNER MARITALSTATUS ANNUALINCOME DEFAULTEDBORROWER
1   0       YES        Single        125.0                NO
2   1        NO       Married        100.0                NO
3   2        NO        Single         70.0                NO
4   3       YES       Married        120.0                NO
5   4        NO      Divorced         95.0               YES
6   5        NO       Married         60.0                NO
7   6       YES      Divorced        220.0                NO
8   7        NO        Single         85.0               YES
9   8        NO       Married         75.0                NO
10  9        NO        Single         90.0               YES
Call the function:
> nb <- hanaml.NaiveBayes(data = df, alpha = 1.0,
model.format = "pmml", thread.ratio = 0.2,
features = list("HOMEOWNER", "MARITALSTATUS", "ANNUALINCOME"),
label = "DEFAULTEDBORROWER")
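According to the formula argument, the same label and feature selection can be given as a formula instead of features and label. The base R sketch below shows which label and feature names such a formula encodes; how hanaml.NaiveBayes parses the formula internally is an assumption, and the commented call needs the HANA-backed DataFrame df from above.

```r
# Formula equivalent of the features/label pair used in the call above.
f <- DEFAULTEDBORROWER ~ HOMEOWNER + MARITALSTATUS + ANNUALINCOME

# Left-hand side is the label, right-hand side lists the features.
label <- all.vars(f[[2]])
features <- all.vars(f[[3]])
print(label)
print(features)

# With `df` available, give the formula but not features/label:
# nb <- hanaml.NaiveBayes(data = df, formula = f, alpha = 1.0,
#                         model.format = "pmml", thread.ratio = 0.2)
```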
See also