R: Discretize

hanaml.Discretize {hana.ml.r}

R Documentation

Discretize

Description

An enhanced version of the binning function that can be applied to a table with multiple columns. It partitions the table rows into segments called bins, then applies a smoothing method within each bin, column by column.

Usage

Discretize(conn.context, data = NULL,
           key = NULL,  features = NULL,
           binning.variable = NULL,  strategy = NULL,
           smoothing = NULL, col.smoothing = NULL, n.bins = NULL,
           bin.size = NULL, n.sd = NULL, categorical.variable = NULL,
           save.model = NULL)

Arguments

`conn.context`	`ConnectionContext` The connection to the SAP HANA system.
`data`	`DataFrame` DataFrame containing the data.
`key`	`character` Name of the ID column.
`features`	`list of character, optional` Names of the feature columns. Defaults to all non-ID columns.
`binning.variable`	`list of character` Name of the attribute to which the binning operation is applied.
`strategy`	`character` Binning method:
- `"uniform.number"`: equal widths based on the number of bins
- `"uniform.size"`: equal widths based on the bin width
- `"quantile"`: equal number of records per bin
- `"sd"`: mean/standard deviation bin boundaries
`smoothing`	`character, optional` Default smoothing method applied to all columns:
- `"no"`: no smoothing
- `"bin.means"`: smoothing by bin means
- `"bin.medians"`: smoothing by bin medians
- `"bin.boundaries"`: smoothing by bin boundaries
No default value.
`col.smoothing`	`list of characters, optional` Specifies the smoothing method per column, overriding the default smoothing method. Each element must be a valid smoothing method whose name identifies a column in data. For example, if data has two columns, ATT1 and ATT2, their smoothing methods can be set with `col.smoothing = list("ATT1" = "bin.means", "ATT2" = "bin.boundaries")` or, equivalently, `col.smoothing = c(ATT1 = "bin.means", ATT2 = "bin.boundaries")`. Applies only to numerical attributes. No default value.
`n.bins`	`integer, optional` Number of bins. Defaults to 2.
`bin.size`	`double, optional` Specifies the bin width. Only valid when strategy is `"uniform.size"`. Defaults to 10.
`n.sd`	`integer, optional` Specifies the number of standard deviations on each side of the mean. Defaults to 1.
`categorical.variable`	`character, optional` Indicates which columns actually hold categorical data and should be treated as categorical variables, even though their data type is INTEGER.
`save.model`	`logical, optional` Indicates whether the model is saved:
- `FALSE`: do not save
- `TRUE`: save
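
The four `strategy` values can be illustrated locally in base R. This is a minimal sketch, not the SAP HANA implementation: the helper names (`uniform_number_breaks`, `uniform_size_breaks`, `quantile_breaks`, `sd_breaks`, `assign_bins`) are hypothetical, and the interval conventions (left-closed bins, open-ended outer bins for `"sd"`, R's default quantile type) are assumptions that may differ from what the server-side procedure does.

```r
# Local base-R sketch of the binning strategies; runs without a HANA connection.
att1 <- c(10.0, 10.1, 10.2, 10.4, 10.3,
          40.0, 40.1, 40.2, 40.4, 40.3,
          90.0, 90.1, 90.2, 90.4, 90.3)

# "uniform.number": n.bins equal-width intervals spanning the data range.
uniform_number_breaks <- function(x, n.bins) {
  seq(min(x), max(x), length.out = n.bins + 1)
}

# "uniform.size": intervals of fixed width bin.size, starting at the minimum.
uniform_size_breaks <- function(x, bin.size) {
  n <- max(1, ceiling((max(x) - min(x)) / bin.size))
  min(x) + bin.size * (0:n)
}

# "quantile": boundaries chosen so each bin holds about the same number of rows.
quantile_breaks <- function(x, n.bins) {
  unname(quantile(x, probs = seq(0, 1, length.out = n.bins + 1)))
}

# "sd": boundaries at mean + k * sd for k = -n.sd..n.sd, with open-ended
# outer bins (assumed) so that every value receives a bin.
sd_breaks <- function(x, n.sd) {
  c(-Inf, mean(x) + (-n.sd:n.sd) * sd(x), Inf)
}

# Map each value to a 1-based bin index; bins are left-closed (assumed).
assign_bins <- function(x, breaks) {
  cut(x, breaks = breaks, include.lowest = TRUE, right = FALSE, labels = FALSE)
}

bins_number   <- assign_bins(att1, uniform_number_breaks(att1, 3))
bins_size     <- assign_bins(att1, uniform_size_breaks(att1, 10))
bins_quantile <- assign_bins(att1, quantile_breaks(att1, 3))
bins_sd       <- assign_bins(att1, sd_breaks(att1, 1))

print(bins_number)  # 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
```

With `"uniform.number"` and three bins, the boundaries fall at 10.0, 36.8, 63.6 and 90.4, so the three clusters in ATT1 land in separate bins.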

Format

R6Class object.

Value

A "Discretize" object with the following attributes:

result: DataFrame
Discretize results, structured as follows:
- ID: data ID, name as shown in the input DataFrame.
- FEATURES: feature columns, with values smoothed within each bin.
assignment: DataFrame
Assignment results, structured as follows:
- ID: data ID, name as shown in input DataFrame.
- BIN_INDEX: bin index.
model: DataFrame
Model results, structured as follows:
- ROW_INDEX: row index.
- MODEL_CONTENT: model contents.
statistics: DataFrame
Statistic results, structured as follows:
- STAT_NAME: statistic name.
- STAT_VALUE: statistic value.

Examples

## Not run: 
Input DataFrame data for training:
      ID ATT1 ATT2 ATT3 ATT4
   1   1 10.0  100    1    A
   2   2 10.1  101    1    A
   3   3 10.2  100    1    A
   4   4 10.4  103    1    A
   5   5 10.3  100    1    A
   6   6 40.0  400    4    C
   7   7 40.1  402    4    B
   8   8 40.2  400    4    B
   9   9 40.4  402    4    B
   10 10 40.3  400    4    A
   11 11 90.0  900    2    C
   12 12 90.1  903    1    B
   13 13 90.2  901    2    B
   14 14 90.4  900    1    B
   15 15 90.3  900    1    B

Model training; a "Discretize" object named discretize is returned:
>  discretize <- hanaml.Discretize(conn, data, key = "ID",
                                   features = c("ATT1", "ATT2", "ATT3",
                                                "ATT4"),
                                   binning.variable = "ATT1",
                                   strategy = "uniform.number",
                                   smoothing = "bin.medians",
                                   col.smoothing = list(ATT2  = "bin.means"),
                                   n.bins = 3, categorical.variable = "ATT3")

Expected output:
> discretize$result$Collect()
     ID ATT1  ATT2 ATT3 ATT4
  1   1 10.2 100.8    1    A
  2   2 10.2 100.8    1    A
  3   3 10.2 100.8    1    A
  4   4 10.2 100.8    1    A
  5   5 10.2 100.8    1    A
  6   6 40.2 400.8    4    C
  7   7 40.2 400.8    4    B
  8   8 40.2 400.8    4    B
  9   9 40.2 400.8    4    B
  10 10 40.2 400.8    4    A
  11 11 90.2 900.8    2    C
  12 12 90.2 900.8    1    B
  13 13 90.2 900.8    2    B
  14 14 90.2 900.8    1    B
  15 15 90.2 900.8    1    B

## End(Not run)
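
The smoothing step can be checked locally against the example table. The following base-R sketch is an illustration under stated assumptions, not the server-side algorithm: the helper names (`smooth_means`, `smooth_medians`, `smooth_boundaries`) are hypothetical, bins are taken as three equal-width, left-closed intervals on ATT1, and "bin boundaries" is read as replacing each value with the nearer of its bin's minimum and maximum.

```r
# ATT1 and ATT2 columns of the example input table.
att1 <- c(10.0, 10.1, 10.2, 10.4, 10.3,
          40.0, 40.1, 40.2, 40.4, 40.3,
          90.0, 90.1, 90.2, 90.4, 90.3)
att2 <- c(100, 101, 100, 103, 100,
          400, 402, 400, 402, 400,
          900, 903, 901, 900, 900)

# Rows are partitioned by ATT1 into three equal-width bins
# (binning.variable = "ATT1", strategy = "uniform.number", n.bins = 3).
breaks <- seq(min(att1), max(att1), length.out = 4)
bin <- cut(att1, breaks = breaks, include.lowest = TRUE,
           right = FALSE, labels = FALSE)

# "bin.means" / "bin.medians": replace each value with its bin's statistic.
smooth_means   <- function(x, bin) ave(x, bin, FUN = mean)
smooth_medians <- function(x, bin) ave(x, bin, FUN = median)

# "bin.boundaries" (assumed reading): replace each value with the nearer
# of its bin's minimum and maximum.
smooth_boundaries <- function(x, bin) {
  lo <- ave(x, bin, FUN = min)
  hi <- ave(x, bin, FUN = max)
  ifelse(x - lo <= hi - x, lo, hi)
}

att1_medians    <- smooth_medians(att1, bin)
att2_means      <- smooth_means(att2, bin)
att1_boundaries <- smooth_boundaries(att1, bin)

print(unique(att1_medians))  # 10.2 40.2 90.2
print(unique(att2_means))    # 100.8 400.8 900.8
```

In this sketch the per-bin means of ATT2 (100.8, 400.8, 900.8) match the ATT2 column of the expected output, and the ATT1 column (10.2, 40.2, 90.2) coincides with the per-bin medians of ATT1.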

[Package hana.ml.r version 1.0.8 Index]