hanaml.Discretize is an R wrapper for SAP HANA PAL Discretize.

hanaml.Discretize(
  data = NULL,
  key = NULL,
  features = NULL,
  binning.variable = NULL,
  strategy = NULL,
  smoothing = NULL,
  col.smoothing = NULL,
  n.bins = NULL,
  bin.size = NULL,
  n.sd = NULL,
  categorical.variable = NULL,
  save.model = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

character or list of characters, optional
Names of the feature columns.
If not provided, defaults to all non-key, non-label columns of data.

binning.variable

character
Name of the attribute to which the binning operation is applied.

strategy

character

  • "uniform.number": equal-width bins, with the width determined by the number of bins (n.bins)

  • "uniform.size": equal-width bins, with the width given directly (bin.size)

  • "quantile": equal number of records per bin

  • "sd": bin boundaries based on the mean and standard deviation
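
As a rough illustration of how the four strategies differ, the following base R sketch computes bin edges for one numeric vector locally. It does not call PAL. The helper name bin_edges is hypothetical, and the boundary conventions used here (which end of an interval is closed, the quantile type, open-ended outer bins for "sd") are assumptions made for illustration; PAL's own conventions may differ.

```r
# Local illustration only: bin_edges() is a hypothetical helper, not part of hana.ml.r.
bin_edges <- function(x, strategy, n.bins = 2, bin.size = 10, n.sd = 1) {
  switch(strategy,
    # n.bins intervals of equal width spanning the range of x
    "uniform.number" = seq(min(x), max(x), length.out = n.bins + 1),
    # intervals of fixed width bin.size, starting at min(x)
    "uniform.size" = {
      n <- max(1, ceiling((max(x) - min(x)) / bin.size))
      min(x) + bin.size * (0:n)
    },
    # edges at equally spaced quantiles, so bins hold similar record counts
    "quantile" = unname(quantile(x, probs = seq(0, 1, length.out = n.bins + 1))),
    # edges at mean + k * sd for k = -n.sd, ..., n.sd (assumed: outer bins open-ended)
    "sd" = c(-Inf, mean(x) + sd(x) * seq(-n.sd, n.sd), Inf),
    stop("unknown strategy: ", strategy)
  )
}

x <- c(1, 2, 4, 7, 11, 16, 22, 29, 37, 46)

edges <- bin_edges(x, "uniform.number", n.bins = 3)
# assign each value to a bin index; intervals are (lo, hi], lowest edge included
bins <- cut(x, breaks = edges, include.lowest = TRUE, labels = FALSE)
print(edges)
print(bins)
```

Swapping the strategy string (and the matching n.bins, bin.size or n.sd argument) in the call to bin_edges shows how the same values fall into different bins under each strategy.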

smoothing

character, optional

  • "no": no smoothing

  • "bin.means": smoothing by bin means

  • "bin.medians": smoothing by bin medians

  • "bin.boundaries": smoothing by bin boundaries

Only applies to non-categorical attributes that are not assigned a smoothing method by the parameter col.smoothing.
No default value.
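
To make the smoothing options concrete, here is a small base R sketch applied to the values of a single bin. The helper smooth_bin is hypothetical and runs locally, not in PAL. In particular, the "bin.boundaries" rule used here (each value is replaced by the nearer of the bin's minimum and maximum, with ties going to the minimum) is the common textbook definition and is an assumption, not a statement of PAL's behavior.

```r
# Local illustration only: smooth_bin() is a hypothetical helper, not part of hana.ml.r.
smooth_bin <- function(v, smoothing = "no") {
  switch(smoothing,
    "no"             = v,
    "bin.means"      = rep(mean(v), length(v)),
    "bin.medians"    = rep(median(v), length(v)),
    # assumed rule: snap each value to the closer of min(v) and max(v)
    "bin.boundaries" = ifelse(v - min(v) <= max(v) - v, min(v), max(v)),
    stop("unknown smoothing method: ", smoothing)
  )
}

v <- c(100, 101, 100, 103, 100)   # the values that fell into one bin

by_means      <- smooth_bin(v, "bin.means")       # every value becomes 100.8
by_medians    <- smooth_bin(v, "bin.medians")     # every value becomes 100
by_boundaries <- smooth_bin(v, "bin.boundaries")  # 100 100 100 103 100
```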

col.smoothing

list of characters, optional
Specifies the smoothing method for individual columns, overriding the default method given by smoothing. Each element must be a valid smoothing method, and its name must be a column in data.
For example, if data has two columns, ATT1 and ATT2, the smoothing method for each can be set as follows:

  • col.smoothing = list("ATT1" = "bin.means", "ATT2" = "bin.boundaries")

or equivalently

  • col.smoothing = c(ATT1 = "bin.means", ATT2 = "bin.boundaries")

Only applies for numerical attributes.
No default value.
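
The following base R sketch shows why the two spellings are interchangeable, and how a per-column setting could take precedence over the default. The helper resolve_smoothing is hypothetical and only mirrors the precedence described above; it is not how the wrapper is implemented.

```r
# Both spellings carry the same column-to-method mapping.
list_form   <- list("ATT1" = "bin.means", "ATT2" = "bin.boundaries")
vector_form <- c(ATT1 = "bin.means", ATT2 = "bin.boundaries")
same_mapping <- identical(as.list(vector_form), list_form)

# Hypothetical helper: start from the default method, then apply per-column overrides.
resolve_smoothing <- function(columns, smoothing = "no", col.smoothing = NULL) {
  methods <- setNames(rep(smoothing, length(columns)), columns)
  for (col in intersect(names(col.smoothing), columns)) {
    methods[[col]] <- col.smoothing[[col]]
  }
  methods
}

resolved <- resolve_smoothing(c("ATT1", "ATT2"),
                              smoothing = "bin.boundaries",
                              col.smoothing = list(ATT2 = "bin.means"))
print(resolved)   # ATT1 keeps the default, ATT2 uses its own method
```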

n.bins

integer, optional
Number of needed bins.
Only valid when strategy is "uniform.number" or "quantile".
Defaults to 2.

bin.size

double, optional
Specifies the width of each bin.
Only valid when strategy is "uniform.size".
Defaults to 10.

n.sd

integer, optional
Specifies the number of standard deviations on each side of the mean.
Only valid when strategy is "sd".
Defaults to 1.

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the column type:

  • "VARCHAR" and "NVARCHAR": categorical

  • "INTEGER" and "DOUBLE": continuous

Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.
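
The rule above can be sketched in base R as follows. The helper is_categorical is hypothetical, and the column types in the usage are assumed for illustration; the sketch only restates the documented rule and does not query a HANA table.

```r
# Local illustration only: is_categorical() is a hypothetical helper.
# sql.types is a named character vector mapping column name to SQL type.
is_categorical <- function(sql.types, categorical.variable = NULL) {
  # default: string columns are categorical, numeric columns are continuous
  by.type <- sql.types %in% c("VARCHAR", "NVARCHAR")
  # the override is honoured only for INTEGER columns and ignored otherwise
  override <- names(sql.types) %in% categorical.variable & sql.types == "INTEGER"
  setNames(by.type | override, names(sql.types))
}

# Assumed column types, for illustration.
types <- c(ATT1 = "DOUBLE", ATT2 = "INTEGER", ATT3 = "INTEGER", ATT4 = "VARCHAR")

# ATT3 becomes categorical; the request for ATT1 (a DOUBLE column) has no effect.
result <- is_categorical(types, categorical.variable = c("ATT3", "ATT1"))
print(result)
```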

save.model

logical, optional

  • FALSE: do not save the model

  • TRUE: save the model

Value

A "Discretize" object with the following attributes:

  • result: DataFrame
    Discretize results, structured as follows:

    • ID : name as shown in input DataFrame

    • FEATURES : feature data, smoothed within each bin

  • assignment: DataFrame
    Assignment results, structured as follows:

    • ID : data ID, name as shown in input DataFrame.

    • BIN_INDEX : bin index.

  • model: DataFrame
    Model results, structured as follows:

    • ROW_INDEX : row index.

    • MODEL_CONTENT : model contents.

  • statistics: DataFrame
    Statistic results, structured as follows:

    • STAT_NAME : statistic name.

    • STAT_VALUE : statistic value.

Details

This is an enhanced version of the binning function that can be applied to a table with multiple columns. It partitions the table rows into multiple segments called bins, then applies a smoothing method within each bin to each column separately.

Examples

Input DataFrame data:


> data$Collect()
   ID ATT1 ATT2 ATT3 ATT4
1   1 10.0  100    1    A
2   2 10.1  101    1    A
3   3 10.2  100    1    A
4   4 10.4  103    1    A
5   5 10.3  100    1    A
6   6 40.0  400    4    C
7   7 40.1  402    4    B
8   8 40.2  400    4    B
9   9 40.4  402    4    B
10 10 40.3  400    4    A
11 11 90.0  900    2    C
12 12 90.1  903    1    B
13 13 90.2  901    2    B
14 14 90.4  900    1    B
15 15 90.3  900    1    B

Call the function and a "Discretize" object discretize is returned:


> discretize <- hanaml.Discretize(data,
                                  key = "ID",
                                  features = c("ATT1", "ATT2", "ATT3", "ATT4"),
                                  binning.variable = "ATT1",
                                  strategy = "uniform.number",
                                  smoothing = "bin.boundaries",
                                  col.smoothing = list(ATT2 = "bin.means"),
                                  n.bins = 3,
                                  categorical.variable = "ATT3")

Expected output:


> discretize$result$Collect()
   ID ATT1  ATT2 ATT3 ATT4
1   1 10.2 100.8    1    A
2   2 10.2 100.8    1    A
3   3 10.2 100.8    1    A
4   4 10.2 100.8    1    A
5   5 10.2 100.8    1    A
6   6 40.2 400.8    4    C
7   7 40.2 400.8    4    B
8   8 40.2 400.8    4    B
9   9 40.2 400.8    4    B
10 10 40.2 400.8    4    A
11 11 90.2 900.8    2    C
12 12 90.2 900.8    1    B
13 13 90.2 900.8    2    B
14 14 90.2 900.8    1    B
15 15 90.2 900.8    1    B
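
The numeric columns of this output can be cross-checked locally without a HANA connection. The sketch below is plain base R, not a PAL call: it assigns the rows to three equal-width bins on ATT1 and computes the per-bin means of ATT1 and ATT2, which coincide with the values displayed above. The variable names are illustrative, and the sketch checks only this arithmetic; it makes no claim about how PAL evaluates each smoothing method internally.

```r
# Local copy of the numeric input columns shown in the example.
local_data <- data.frame(
  ID   = 1:15,
  ATT1 = c(10.0, 10.1, 10.2, 10.4, 10.3, 40.0, 40.1, 40.2, 40.4, 40.3,
           90.0, 90.1, 90.2, 90.4, 90.3),
  ATT2 = c(100, 101, 100, 103, 100, 400, 402, 400, 402, 400,
           900, 903, 901, 900, 900)
)

# Three equal-width bins over the range of ATT1 (as with "uniform.number", n.bins = 3).
edges <- seq(min(local_data$ATT1), max(local_data$ATT1), length.out = 4)
bin   <- cut(local_data$ATT1, breaks = edges, include.lowest = TRUE, labels = FALSE)

# ave() replaces each value with the mean of its group, here its bin.
smoothed <- data.frame(
  ID   = local_data$ID,
  ATT1 = ave(local_data$ATT1, bin),
  ATT2 = ave(local_data$ATT2, bin)
)

# One row per bin: the distinct smoothed values.
print(unique(round(smoothed[, c("ATT1", "ATT2")], 1)))
```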