hanaml.NaiveBayes {hana.ml.r}

R Documentation

Naive Bayes

Description

hanaml.NaiveBayes is an R wrapper for the PAL Naive Bayes algorithm.

Usage

hanaml.NaiveBayes(conn.context,
                 data = NULL,
                 key = NULL,
                 features = NULL,
                 formula = NULL,
                 label = NULL,
                 alpha = NULL,
                 discretization = NULL,
                 model.format = NULL,
                 categorical.variable = NULL,
                 thread.ratio = NULL)

Arguments

`conn.context`	`ConnectionContext` The connection to the SAP HANA system.
`data`	`DataFrame` DataFrame containing the data.
`key`	`character, optional` Name of the ID column of data. If not specified, then data should have no ID column.
`features`	`list of character, optional` Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
`formula`	`formula type, optional` Formula to use instead of features and label. Cannot be used together with features and label.
`label`	`character, optional` Name of the column in data that specifies the dependent variable. If not specified, it defaults to the last non-ID column.
`alpha`	`double, optional` Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set to 0 to disable Laplace smoothing. Defaults to 0.
`discretization`	`c('no', 'supervised'), optional` Whether to discretize continuous attributes. Case-insensitive. 'no' or not provided: disable discretization. 'supervised': use supervised discretization on all continuous attributes. Defaults to 'no'.
`model.format`	`c('json', 'pmml'), optional` Controls whether the model is output in JSON or PMML format. 'json' or not provided: JSON format. 'pmml': PMML format. Defaults to 'json'.
`categorical.variable`	`list of character, optional` INTEGER columns specified in this list are treated as categorical data. Other INTEGER columns are treated as continuous.
`thread.ratio`	`double, optional` Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.
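
To make the effect of alpha concrete, the following base R sketch applies the textbook form of additive (Laplace) smoothing to the MARITALSTATUS column of the example data shown under Examples. It does not call PAL or need a HANA connection. The helper name smoothed_conditional is hypothetical, and the assumption that PAL uses exactly this formula internally is not confirmed by this page.

```r
# Toy data copied from the Examples section (MARITALSTATUS and the label only).
marital <- c("Single", "Married", "Single", "Married", "Divorced",
             "Married", "Divorced", "Single", "Married", "Single")
defaulted <- c("NO", "NO", "NO", "NO", "YES", "NO", "NO", "YES", "NO", "YES")

# Hypothetical helper (not part of hana.ml.r): estimate P(feature | class)
# with additive smoothing. alpha = 0 gives the plain relative frequencies.
smoothed_conditional <- function(feature, label, alpha = 0) {
  counts <- table(label, feature)   # rows: classes, columns: categories
  k <- ncol(counts)                 # number of distinct categories
  (counts + alpha) / (rowSums(counts) + alpha * k)
}

p0 <- smoothed_conditional(marital, defaulted, alpha = 0)
p1 <- smoothed_conditional(marital, defaulted, alpha = 1)

# No defaulter in the sample is married, so the unsmoothed estimate is 0,
# which would zero out the whole product for class YES.
print(p0["YES", "Married"])   # 0
# With alpha = 1 the estimate becomes (0 + 1) / (3 + 1 * 3) = 1/6.
print(p1["YES", "Married"])
```

Under this formula, each row of the smoothed table still sums to 1, so the estimates remain valid conditional distributions.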

Format

R6Class object.

Details

Naive Bayes is a classification algorithm based on Bayes' theorem. It estimates the class-conditional probability by assuming that the attributes are conditionally independent of one another given the class.
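
The independence assumption can be written out as the standard Naive Bayes factorization. The notation is introduced here for illustration and is not taken from the PAL documentation: c is a class label and x_1, ..., x_n are the attribute values of one record.

```latex
P(C = c \mid x_1, \dots, x_n) \;\propto\; P(C = c) \prod_{i=1}^{n} P(x_i \mid C = c),
\qquad
\hat{c} = \arg\max_{c} \; P(C = c) \prod_{i=1}^{n} P(x_i \mid C = c)
```

The predicted class is the one that maximizes the product of the class prior and the per-attribute conditional probabilities.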

Value

Returns a "NaiveBayes" object with the following values:

model: DataFrame
Naive Bayes model information.
statistics: DataFrame
Statistics information.

Note

The Laplace value (alpha) is stored only in JSON-format models. If the PMML format is chosen, you may need to set the Laplace value (alpha) again in predict() and score().

Examples

## Not run: 
Input DataFrame df for training the model:

> df$collect()
  ID HOMEOWNER MARITALSTATUS ANNUALINCOME DEFAULTEDBORROWER
   0       YES        Single        125.0                NO
   1        NO       Married        100.0                NO
   2        NO        Single         70.0                NO
   3       YES       Married        120.0                NO
   4        NO      Divorced         95.0               YES
   5        NO       Married         60.0                NO
   6       YES      Divorced        220.0                NO
   7        NO        Single         85.0               YES
   8        NO       Married         75.0                NO
   9        NO        Single         90.0               YES

Training the model:

> nb <- hanaml.NaiveBayes(conn.context = conn, data = df, alpha = 1.0,
                         model.format = "pmml", thread.ratio = 0.2,
                         features = list('HOMEOWNER', 'MARITALSTATUS', 'ANNUALINCOME'),
                         label = "DEFAULTEDBORROWER")

The mean accuracy on given test data and labels can be computed
with the score function (df1 is the test DataFrame):

> nb$score(nb, df1, "ID", alpha = 1.0, verbose = TRUE)

Output:
0.875 (double value): mean accuracy on the given test data and labels.

## End(Not run)

[Package hana.ml.r version 1.0.8 Index]