hanaml.Discretize {hana.ml.r}    R Documentation

Discretize

Description

An enhanced version of the binning function that can be applied to a table with multiple columns. It partitions the table rows into multiple segments called bins, then applies a smoothing method within each bin of each column.

Usage

hanaml.Discretize(conn.context, data = NULL,
                  key = NULL, features = NULL,
                  binning.variable = NULL, strategy = NULL,
                  smoothing = NULL, col.smoothing = NULL, n.bins = NULL,
                  bin.size = NULL, n.sd = NULL, categorical.variable = NULL,
                  save.model = NULL)

Arguments

conn.context

ConnectionContext
The connection to the SAP HANA system.

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

list of character, optional
Names of the feature columns.
Defaults to all non-ID columns.

binning.variable

list of character
Names of the attributes to which the binning operation is applied.

strategy

character

Binning method:

  • "uniform.number": equal-width bins, with the width derived from the number of bins

  • "uniform.size": equal-width bins, with the width given by bin.size

  • "quantile": equal number of records per bin

  • "sd": bin boundaries at the mean and at multiples of the standard deviation around it
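As a rough illustration of how the four strategies differ, the sketch below computes bin boundaries locally in base R for the ATT1 values used in the Examples section. It is not the SAP HANA implementation; the exact boundary and tie-handling rules used on the server are not stated here, so the formulas are assumptions based on the descriptions above.

```r
# Local base-R illustration only; not the SAP HANA implementation.
x <- c(10.0, 10.1, 10.2, 10.4, 10.3, 40.0, 40.1, 40.2, 40.4, 40.3,
       90.0, 90.1, 90.2, 90.4, 90.3)  # ATT1 from the Examples section

# "uniform.number": n.bins equal-width intervals spanning the data range
n.bins <- 3
breaks.number <- seq(min(x), max(x), length.out = n.bins + 1)

# "uniform.size": intervals of fixed width bin.size, starting at the minimum
bin.size <- 10
n.needed <- ceiling((max(x) - min(x)) / bin.size)
breaks.size <- min(x) + bin.size * (0:n.needed)

# "quantile": boundaries chosen so each bin holds about the same number of rows
breaks.quantile <- quantile(x, probs = seq(0, 1, length.out = n.bins + 1),
                            names = FALSE)

# "sd": boundaries at the mean and n.sd standard deviations on each side
n.sd <- 1
breaks.sd <- mean(x) + sd(x) * (-n.sd:n.sd)

# Assign each value to a bin using the "uniform.number" boundaries
bin.id <- cut(x, breaks = breaks.number, include.lowest = TRUE, labels = FALSE)
table(bin.id)
```

With these data the "uniform.number" boundaries place five rows in each of the three bins.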

smoothing

character, optional

Default smoothing method applied to all columns:

  • "no": no smoothing

  • "bin.means": smoothing by bin means

  • "bin.medians": smoothing by bin medians

  • "bin.boundaries": smoothing by bin boundaries

No default value.
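The smoothing methods listed above can be sketched locally in base R as follows. The bin assignment is hypothetical, and the rule for "bin.boundaries" (replace each value with the nearer of its bin's minimum and maximum, with ties going to the minimum) is an assumption for illustration, not a statement of the server-side behavior.

```r
# Local base-R illustration only; not the SAP HANA implementation.
x   <- c(10.0, 10.1, 10.2, 10.4, 10.3, 40.0, 40.1, 40.2, 40.4, 40.3)
bin <- rep(c(1, 2), each = 5)  # hypothetical bin assignment

# "bin.means": every value is replaced by the mean of its bin
by.means <- ave(x, bin, FUN = mean)

# "bin.medians": every value is replaced by the median of its bin
by.medians <- ave(x, bin, FUN = median)

# "bin.boundaries": every value is replaced by the nearer of its bin's
# minimum and maximum (assumed rule; ties go to the minimum)
nearest.boundary <- function(v) {
  lo <- min(v)
  hi <- max(v)
  ifelse(v - lo <= hi - v, lo, hi)
}
by.boundaries <- ave(x, bin, FUN = nearest.boundary)

data.frame(x, bin, by.means, by.medians, by.boundaries)
```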

col.smoothing

list of character, optional
Specifies the smoothing method for individual columns, overriding the default method given by smoothing. Each element must be a valid smoothing method, named after the column in data to which it applies.
For example, if data has two columns, ATT1 and ATT2, their smoothing methods can be set as follows:

col.smoothing = list("ATT1" = "bin.means", "ATT2" = "bin.boundaries")

or equivalently:

col.smoothing = c(ATT1 = "bin.means", ATT2 = "bin.boundaries")

Only applies to numerical attributes.
No default value.
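To make the override behavior concrete, the sketch below resolves the effective smoothing method per column from a default and a set of named overrides. The helper resolve.smoothing is hypothetical and exists only for this illustration; it is not part of hana.ml.r.

```r
# Hypothetical helper for illustration; not part of hana.ml.r.
resolve.smoothing <- function(columns, smoothing, col.smoothing = NULL) {
  # unlist() lets the overrides be given as list(...) or as c(...)
  overrides <- unlist(col.smoothing)
  # Start with the default method for every column
  methods <- setNames(rep(smoothing, length(columns)), columns)
  # Apply each override whose name matches a known column
  for (col in intersect(names(overrides), columns)) {
    methods[[col]] <- overrides[[col]]
  }
  methods
}

resolve.smoothing(c("ATT1", "ATT2"), "bin.boundaries",
                  list(ATT2 = "bin.means"))
```

Here ATT1 keeps the default "bin.boundaries" while ATT2 uses "bin.means". Passing the overrides as c(ATT2 = "bin.means") gives the same result.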

n.bins

integer, optional
Number of bins to create.
Defaults to 2.

bin.size

double, optional
Specifies the bin width, that is, the distance between consecutive bin boundaries.
Only valid when strategy is 'uniform.size'.
Defaults to 10.

n.sd

integer, optional
Specifies the number of standard deviations on each side of the mean.
Defaults to 1.

categorical.variable

character, optional
Names of columns to be treated as categorical variables even though their data type is INTEGER.

save.model

logical, optional

Indicates whether the model is saved.

  • FALSE: do not save the model

  • TRUE: save the model

Format

R6Class object.

Value

A "Discretize" object with the following attributes:

See Also

transform.Discretize

Examples

## Not run: 
Input DataFrame data for training:
       ID ATT1 ATT2 ATT3 ATT4
   1   1 10.0  100    1    A
   2   2 10.1  101    1    A
   3   3 10.2  100    1    A
   4   4 10.4  103    1    A
   5   5 10.3  100    1    A
   6   6 40.0  400    4    C
   7   7 40.1  402    4    B
   8   8 40.2  400    4    B
   9   9 40.4  402    4    B
   10 10 40.3  400    4    A
   11 11 90.0  900    2    C
   12 12 90.1  903    1    B
   13 13 90.2  901    2    B
   14 14 90.4  900    1    B
   15 15 90.3  900    1    B

Model training; a "Discretize" object named discretize is returned:
> discretize <- hanaml.Discretize(conn, data, key = "ID",
                                  features = c("ATT1", "ATT2", "ATT3",
                                               "ATT4"),
                                  binning.variable = "ATT1",
                                  strategy = "uniform.number",
                                  smoothing = "bin.boundaries",
                                  col.smoothing = list(ATT2 = "bin.means"),
                                  n.bins = 3, categorical.variable = "ATT3")

Expected output:
> discretize$result$Collect()
      ID ATT1  ATT2 ATT3 ATT4
  1   1 10.2 100.8    1    A
  2   2 10.2 100.8    1    A
  3   3 10.2 100.8    1    A
  4   4 10.2 100.8    1    A
  5   5 10.2 100.8    1    A
  6   6 40.2 400.8    4    C
  7   7 40.2 400.8    4    B
  8   8 40.2 400.8    4    B
  9   9 40.2 400.8    4    B
  10 10 40.2 400.8    4    A
  11 11 90.2 900.8    2    C
  12 12 90.2 900.8    1    B
  13 13 90.2 900.8    2    B
  14 14 90.2 900.8    1    B
  15 15 90.2 900.8    1    B

## End(Not run)
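The ATT2 column of the expected output can be checked locally without a HANA connection. The sketch below assumes the rows are grouped into three equal-width bins (computed here on ATT2 itself, which is an assumption) and then smoothed by bin means. It only reproduces the arithmetic and does not call hana.ml.r.

```r
# Local base-R check only; does not call hana.ml.r or SAP HANA.
att2 <- c(100, 101, 100, 103, 100, 400, 402, 400, 402, 400,
          900, 903, 901, 900, 900)  # ATT2 from the input data above

# Three equal-width bins over the range of ATT2 (assumed grouping)
bin <- cut(att2, breaks = seq(min(att2), max(att2), length.out = 4),
           include.lowest = TRUE, labels = FALSE)

# Smoothing by bin means, as requested through col.smoothing for ATT2
smoothed <- ave(att2, bin, FUN = mean)
unique(smoothed)
```

The three distinct values are 100.8, 400.8 and 900.8 up to floating-point rounding, matching the ATT2 column shown in the expected output.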

[Package hana.ml.r version 1.0.8 Index]