Discretize

hanaml.Discretize is an R wrapper for SAP HANA PAL Discretize.

hanaml.Discretize(
  data = NULL,
  key = NULL,
  features = NULL,
  binning.variable = NULL,
  strategy = NULL,
  smoothing = NULL,
  col.smoothing = NULL,
  n.bins = NULL,
  bin.size = NULL,
  n.sd = NULL,
  categorical.variable = NULL,
  save.model = NULL
)

Arguments

data	`DataFrame` DataFrame containing the data.
key	`character` Name of the ID column.
features	`character or list of characters, optional` Names of the feature columns. If not provided, it defaults to all non-key, non-label columns of data.
binning.variable	`character or list of characters` Name(s) of the attribute(s) to which the binning operation is applied.
strategy	`character` Binning strategy. `"uniform.number"`: equal-width bins, determined by the number of bins; `"uniform.size"`: equal-width bins, determined by the bin width; `"quantile"`: equal number of records per bin; `"sd"`: bin boundaries based on the mean and standard deviation.
smoothing	`character, optional` Default smoothing method. `"no"`: no smoothing; `"bin.means"`: smoothing by bin means; `"bin.medians"`: smoothing by bin medians; `"bin.boundaries"`: smoothing by bin boundaries. Applies only to non-categorical attributes for which col.smoothing does not specify a smoothing method. No default value.
col.smoothing	`list of characters, optional` Specifies the smoothing method for individual columns, overriding the default smoothing method. Each element must be a valid smoothing method whose name is a column in data. For example, if data has two columns, ATT1 and ATT2, their smoothing methods can be set with `col.smoothing = list("ATT1" = "bin.means", "ATT2" = "bin.boundaries")` or, equivalently, `col.smoothing = c(ATT1 = "bin.means", ATT2 = "bin.boundaries")`. Applies only to numerical attributes. No default value.
n.bins	`integer, optional` Number of bins. Only valid when strategy is `"uniform.number"` or `"quantile"`. Defaults to 2.
bin.size	`double, optional` Specifies the bin width (the distance between adjacent bin boundaries). Only valid when strategy is `"uniform.size"`. Defaults to 10.
n.sd	`integer, optional` Specifies the number of standard deviations on each side of the mean. Only valid when strategy is `"sd"`. Defaults to 1.
categorical.variable	`character or list/vector of characters, optional` Names of features that should be treated as categorical variables. By default, "VARCHAR" and "NVARCHAR" columns are treated as categorical, and "INTEGER" and "DOUBLE" columns as continuous. Only valid for "INTEGER" columns; ignored otherwise. No default value.
save.model	`logical, optional` Whether to save the model. `FALSE`: do not save; `TRUE`: save.
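
As a rough illustration of how the four strategy values place bin boundaries, the base R sketch below computes boundaries for the ATT1 column used in the Examples section. It does not call hanaml.Discretize or PAL; the variable names (edges.number, edges.size, edges.quantile, edges.sd) are hypothetical, and PAL's exact boundary conventions may differ. In particular, treating only mean plus or minus k standard deviations (and not the mean itself) as boundaries for "sd" is an assumption.

```r
# Illustrative only: base R approximations of the four binning strategies.
x <- c(10.0, 10.1, 10.2, 10.4, 10.3,
       40.0, 40.1, 40.2, 40.4, 40.3,
       90.0, 90.1, 90.2, 90.4, 90.3)   # ATT1 from the example data

# "uniform.number": n.bins equal-width bins spanning the data range.
n.bins <- 3
edges.number <- seq(min(x), max(x), length.out = n.bins + 1)
table(cut(x, edges.number, include.lowest = TRUE))   # 5 rows in each bin

# "uniform.size": equal-width bins of width bin.size, starting at the minimum.
bin.size <- 10
edges.size <- seq(min(x), max(x) + bin.size, by = bin.size)

# "quantile": boundaries chosen so each bin holds about the same number of rows.
edges.quantile <- quantile(x, probs = seq(0, 1, length.out = n.bins + 1),
                           names = FALSE)

# "sd": boundaries at mean +/- k standard deviations for k = 1..n.sd
# (assumption: the mean itself is not used as a boundary).
n.sd <- 1
k <- seq_len(n.sd)
edges.sd <- sort(c(mean(x) - k * sd(x), mean(x) + k * sd(x)))
```

With this data the uniform-width and quantile boundaries both separate the three clusters (around 10, 40, and 90), which is why n.bins = 3 is a natural choice in the example.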

Value

A "Discretize" object with the following attributes:

result: DataFrame
Discretize results, structured as follows:
- ID : data ID, name as shown in input DataFrame.
- FEATURES : feature data, smoothed within each bin.
assignment: DataFrame
Assignment results, structured as follows:
- ID : data ID, name as shown in input DataFrame.
- BIN_INDEX : bin index.
model: DataFrame
Model results, structured as follows:
- ROW_INDEX : row index.
- MODEL_CONTENT : model contents.
statistics: DataFrame
Statistic results, structured as follows:
- STAT_NAME : statistic name.
- STAT_VALUE : statistic value.

Details

Discretize is an enhanced version of the binning function that can be applied to a table with multiple columns. It partitions the table rows into multiple segments, called bins, and then applies a smoothing method within each bin of each column.
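
The partition-then-smooth idea can be sketched in base R. This is a minimal local illustration, not PAL's implementation: it assumes equal-width bins on ATT1 (as with strategy = "uniform.number" and n.bins = 3) and then smooths ATT2 by bin means and by bin medians. The data is the ATT1 and ATT2 columns from the Examples section; all variable names are hypothetical.

```r
# Illustrative only: bin rows on ATT1, then smooth ATT2 within each bin.
att1 <- c(10.0, 10.1, 10.2, 10.4, 10.3,
          40.0, 40.1, 40.2, 40.4, 40.3,
          90.0, 90.1, 90.2, 90.4, 90.3)
att2 <- c(100, 101, 100, 103, 100,
          400, 402, 400, 402, 400,
          900, 903, 901, 900, 900)

# Step 1: partition rows into 3 equal-width bins based on ATT1.
n.bins <- 3
edges <- seq(min(att1), max(att1), length.out = n.bins + 1)
bin.index <- cut(att1, breaks = edges, include.lowest = TRUE, labels = FALSE)

# Step 2: replace each ATT2 value with a summary of its bin.
att2.means   <- ave(att2, bin.index, FUN = mean)     # like "bin.means"
att2.medians <- ave(att2, bin.index, FUN = median)   # like "bin.medians"

unique(att2.means)     # 100.8 400.8 900.8
unique(att2.medians)   # 100 400 900
```

The bin means computed here agree with the ATT2 column of the expected output in the Examples section, where ATT2 is smoothed with "bin.means".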

Examples

Input DataFrame data:

   ID ATT1 ATT2 ATT3 ATT4
1   1 10.0  100    1    A
2   2 10.1  101    1    A
3   3 10.2  100    1    A
4   4 10.4  103    1    A
5   5 10.3  100    1    A
6   6 40.0  400    4    C
7   7 40.1  402    4    B
8   8 40.2  400    4    B
9   9 40.4  402    4    B
10 10 40.3  400    4    A
11 11 90.0  900    2    C
12 12 90.1  903    1    B
13 13 90.2  901    2    B
14 14 90.4  900    1    B
15 15 90.3  900    1    B

Call the function, which returns a "Discretize" object, discretize:

> discretize <- hanaml.Discretize(data,
                                  key = "ID",
                                  features = c("ATT1", "ATT2", "ATT3", "ATT4"),
                                  binning.variable = "ATT1",
                                  strategy = "uniform.number",
                                  smoothing = "bin.boundaries",
                                  col.smoothing = list(ATT2 = "bin.means"),
                                  n.bins = 3,
                                  categorical.variable = "ATT3")

Expected output:

> discretize$result$Collect()
   ID ATT1  ATT2 ATT3 ATT4
1   1 10.2 100.8    1    A
2   2 10.2 100.8    1    A
3   3 10.2 100.8    1    A
4   4 10.2 100.8    1    A
5   5 10.2 100.8    1    A
6   6 40.2 400.8    4    C
7   7 40.2 400.8    4    B
8   8 40.2 400.8    4    B
9   9 40.2 400.8    4    B
10 10 40.2 400.8    4    A
11 11 90.2 900.8    2    C
12 12 90.2 900.8    1    B
13 13 90.2 900.8    2    B
14 14 90.2 900.8    1    B
15 15 90.2 900.8    1    B
