hanaml.Discretize.Rd
hanaml.Discretize is an R wrapper for SAP HANA PAL Discretize.
hanaml.Discretize(
data = NULL,
key = NULL,
features = NULL,
binning.variable = NULL,
strategy = NULL,
smoothing = NULL,
col.smoothing = NULL,
n.bins = NULL,
bin.size = NULL,
n.sd = NULL,
categorical.variable = NULL,
save.model = NULL
)
DataFrame
DataFrame containing the data.
character
Name of the ID column.
character or list of characters, optional
Name of feature columns.
If not provided, defaults to all non-key, non-label columns of data.
character
Attribute name, to which binning operation is applied.
character
"uniform.number"
: equal widths based on the number of bins
"uniform.size"
: equal widths based on the bin width
"quantile"
: equal number of records per bin
"sd"
: bin boundaries based on the mean and standard deviation
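The four strategies can be sketched in base R as follows. This is only an illustration of the textbook definitions, not the PAL implementation: bin_edges is a hypothetical helper that is not part of hanaml, and the exact edge conventions (in particular, placing the "sd" boundaries at the mean plus or minus whole multiples of the standard deviation) are assumptions.

```r
# Hypothetical helper, not part of hanaml: bin edges for one numeric vector.
bin_edges <- function(x, strategy, n.bins = 2, bin.size = 10, n.sd = 1) {
  switch(strategy,
    # Equal-width bins, width derived from the requested number of bins.
    "uniform.number" = seq(min(x), max(x), length.out = n.bins + 1),
    # Equal-width bins, number of bins derived from the requested width.
    "uniform.size" = {
      n <- max(ceiling((max(x) - min(x)) / bin.size), 1)
      min(x) + bin.size * (0:n)
    },
    # Edges at evenly spaced quantiles, so bins hold roughly equal counts.
    "quantile" = unname(quantile(x, probs = seq(0, 1, length.out = n.bins + 1))),
    # Assumed convention: edges at mean + k * sd for k = -n.sd, ..., n.sd.
    "sd" = mean(x) + sd(x) * seq(-n.sd, n.sd),
    stop("unknown strategy: ", strategy)
  )
}

x <- c(10.0, 10.1, 10.2, 10.4, 10.3, 40.0, 40.1, 40.2, 40.4, 40.3,
       90.0, 90.1, 90.2, 90.4, 90.3)
bin_edges(x, "uniform.number", n.bins = 3)  # edges at 10.0, 36.8, 63.6, 90.4
bin_edges(x, "uniform.size", bin.size = 10) # edges at 10, 20, ..., 100
```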
character, optional
"no"
: no smoothing
"bin.means"
: smoothing by bin means
"bin.medians"
: smoothing by bin medians
"bin.boundaries"
: smoothing by bin boundaries
Applies only to non-categorical attributes for which no smoothing method
is specified via the parameter col.smoothing.
No default value.
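As a rough illustration of what each smoothing method does to the values of a single bin, here is a base R sketch. smooth_bin is a hypothetical helper, not part of hanaml, and the "bin.boundaries" branch follows the common textbook definition (each value is replaced by the closer of the bin's minimum and maximum); PAL may define the boundaries differently.

```r
# Hypothetical helper, not part of hanaml: smooth the values of one bin.
smooth_bin <- function(v, method = c("no", "bin.means", "bin.medians",
                                     "bin.boundaries")) {
  method <- match.arg(method)
  switch(method,
    "no" = v,
    "bin.means" = rep(mean(v), length(v)),
    "bin.medians" = rep(median(v), length(v)),
    # Assumed textbook rule: snap to the nearer of min and max (ties go to min).
    "bin.boundaries" = ifelse(v - min(v) <= max(v) - v, min(v), max(v))
  )
}

bin <- c(100, 101, 100, 103, 100)
smooth_bin(bin, "bin.means")       # 100.8 for every value
smooth_bin(bin, "bin.medians")     # 100 for every value
smooth_bin(bin, "bin.boundaries")  # 100 100 100 103 100
```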
list of characters, optional
Specifies the smoothing method for individual columns, overriding the default smoothing method.
Each element must be a valid smoothing method, and its name must specify a column in
data.
Suppose data has two columns: ATT1 and ATT2, then we can set the smoothing method
for these two columns as follows:
col.smoothing = list("ATT1" = "bin.means", "ATT2" = "bin.boundaries")
or equivalently
col.smoothing = c(ATT1 = "bin.means", ATT2 = "bin.boundaries")
Only applies for numerical attributes.
No default value.
integer, optional
Number of bins to create.
Only valid when strategy is "uniform.number" or "quantile".
Defaults to 2.
double, optional
Specifies the distance for binning.
Only valid when strategy is "uniform.size".
Defaults to 10.
integer, optional
Specifies the number of standard deviations on each side of the mean.
Only valid when strategy is "sd".
Defaults to 1.
character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the column type:
"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.
Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.
logical, optional
FALSE
: do not save the model
TRUE
: save the model
A "Discretize" object with the following attributes:
result: DataFrame
Discretize results, structured as follows:
ID : name as shown in input DataFrame
FEATURES : data smoothed within each bin
assignment: DataFrame
Assignment results, structured as follows:
ID : data ID, name as shown in input DataFrame.
BIN_INDEX : bin index.
model: DataFrame
Model results, structured as follows:
ROW_INDEX : row index.
MODEL_CONTENT : model contents.
statistics: DataFrame
Statistic results, structured as follows:
STAT_NAME : statistic name.
STAT_VALUE : statistic value.
This is an enhanced version of the binning function that can be applied to a table with multiple columns. It partitions the table rows into multiple segments called bins, then applies a smoothing method within each bin of each column.
Input DataFrame data:
> data$Collect()
   ID ATT1 ATT2 ATT3 ATT4
1   1 10.0  100    1    A
2   2 10.1  101    1    A
3   3 10.2  100    1    A
4   4 10.4  103    1    A
5   5 10.3  100    1    A
6   6 40.0  400    4    C
7   7 40.1  402    4    B
8   8 40.2  400    4    B
9   9 40.4  402    4    B
10 10 40.3  400    4    A
11 11 90.0  900    2    C
12 12 90.1  903    1    B
13 13 90.2  901    2    B
14 14 90.4  900    1    B
15 15 90.3  900    1    B
Call the function to obtain a "Discretize" object, discretize:
> discretize <- hanaml.Discretize(data,
                                  key = "ID",
                                  features = c("ATT1", "ATT2", "ATT3", "ATT4"),
                                  binning.variable = "ATT1",
                                  strategy = "uniform.number",
                                  smoothing = "bin.boundaries",
                                  col.smoothing = list(ATT2 = "bin.means"),
                                  n.bins = 3,
                                  categorical.variable = "ATT3")
Expected output:
> discretize$result$Collect()
   ID ATT1  ATT2 ATT3 ATT4
1   1 10.2 100.8    1    A
2   2 10.2 100.8    1    A
3   3 10.2 100.8    1    A
4   4 10.2 100.8    1    A
5   5 10.2 100.8    1    A
6   6 40.2 400.8    4    C
7   7 40.2 400.8    4    B
8   8 40.2 400.8    4    B
9   9 40.2 400.8    4    B
10 10 40.2 400.8    4    A
11 11 90.2 900.8    2    C
12 12 90.2 900.8    1    B
13 13 90.2 900.8    2    B
14 14 90.2 900.8    1    B
15 15 90.2 900.8    1    B
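
The ATT2 column of this output can be cross-checked locally without a HANA connection. The base R sketch below assumes that "uniform.number" splits the range of the binning variable ATT1 into n.bins equal-width intervals and that "bin.means" replaces each ATT2 value with the mean of its bin; the variable names are illustrative and the code is not the PAL implementation.

```r
# Values copied from the input DataFrame shown above.
att1 <- c(10.0, 10.1, 10.2, 10.4, 10.3, 40.0, 40.1, 40.2, 40.4, 40.3,
          90.0, 90.1, 90.2, 90.4, 90.3)
att2 <- c(100, 101, 100, 103, 100, 400, 402, 400, 402, 400,
          900, 903, 901, 900, 900)

n_bins <- 3

# Equal-width edges over the range of the binning variable (ATT1).
edges <- seq(min(att1), max(att1), length.out = n_bins + 1)

# Bin index per row; the clamp keeps the maximum value in the last bin.
bin_index <- pmin(findInterval(att1, edges, rightmost.closed = TRUE), n_bins)

# Smooth ATT2 by the mean of each bin.
att2_smoothed <- ave(att2, bin_index, FUN = mean)

unique(att2_smoothed)  # 100.8 400.8 900.8
```

Each group of five rows falls into its own bin, so the three bin means match the ATT2 values listed in the expected output.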