hanaml.Partition.Rd
hanaml.Partition is an R wrapper for the SAP HANA PAL Partition algorithm.
hanaml.Partition(
  data,
  key,
  features = NULL,
  random.state = NULL,
  thread.ratio = NULL,
  method = NULL,
  stratified.column = NULL,
  split.ratio = NULL,
  split.size = NULL
)
data: DataFrame
DataFrame containing the data.
key: character
Name of the ID column.
features: character or list of characters, optional
Names of the feature columns.
If not provided, defaults to all non-key, non-label columns of data.
random.state: integer, optional
Seed used to initialize the random number generator.
0: Uses the system time.
Not 0: Uses the specified seed.
Defaults to 0.
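The effect of a fixed seed can be illustrated locally in base R. This is only an analogy that uses R's own random number generator, not the PAL generator, and `shuffle_ids` is a hypothetical helper that is not part of the package.

```r
# Local analogy only: uses R's RNG, not the PAL random number generator.
# shuffle_ids is a hypothetical helper, not part of the package.
shuffle_ids <- function(ids, seed) {
  if (seed != 0) {
    set.seed(seed)  # non-zero seed: the ordering is reproducible
  }
  # seed == 0: no reseeding, so the result depends on the current RNG state
  ids[sample.int(length(ids))]
}

ids <- 0:29
first  <- shuffle_ids(ids, seed = 23)
second <- shuffle_ids(ids, seed = 23)
identical(first, second)  # TRUE: same seed, same ordering
```

With a non-zero random.state, repeated calls on the same data should likewise be expected to yield the same partitions, while 0 leaves the outcome dependent on the system time.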
thread.ratio: double, optional
Controls the proportion of available threads that this function can use.
The value range is from 0 to 1, where 0 indicates a single thread
and 1 indicates all available threads.
Values between 0 and 1 use up to that percentage of available threads.
Values outside this range are ignored.
Defaults to 0.
method: character, optional
Partition method used to split the dataset into training, testing and validation sets:
"random": random partition
"stratified": stratified partition
Defaults to "random".
stratified.column: character, optional
Column used for stratification in the partition process.
Required and valid only when method is set to "stratified".
No default value.
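To make the idea of stratification concrete, the following base R sketch splits each stratum separately so that every partition keeps roughly the same class proportions. It is a local illustration under stated assumptions, not the PAL implementation: `stratified_split` is a hypothetical helper, the rounding rule is an assumption, and the data frame is made up for the example.

```r
# Local illustration in base R; not the PAL implementation.
# stratified_split is a hypothetical helper; the rounding rule is an assumption.
stratified_split <- function(df, stratum, ratio = c(0.8, 0.1, 0.1), seed = 1) {
  set.seed(seed)
  sets <- c("train", "test", "validation")
  assignment <- character(nrow(df))
  # Handle each stratum (distinct value of the stratification column) on its own
  for (rows in split(seq_len(nrow(df)), df[[stratum]])) {
    rows <- rows[sample.int(length(rows))]  # shuffle within the stratum
    n_train <- round(ratio[1] * length(rows))
    n_test  <- min(round(ratio[2] * length(rows)), length(rows) - n_train)
    n_valid <- length(rows) - n_train - n_test
    assignment[rows] <- rep(sets, times = c(n_train, n_test, n_valid))
  }
  split(df, factor(assignment, levels = sets))
}

# Made-up data: 10 rows per class
df <- data.frame(
  ID = 0:19,
  DefaultedBorrower = rep(c("NO", "YES"), times = c(10, 10))
)
parts <- stratified_split(df, "DefaultedBorrower", ratio = c(0.6, 0.2, 0.2))
table(parts$train$DefaultedBorrower)  # 6 NO and 6 YES
```

A purely random partition gives no such guarantee: with a rare class, one of the three sets could end up with few or no rows of that class.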
split.ratio: list of double, optional
List of 3 numbers that specify the proportions of data used for training, testing
and validation, respectively.
If both split.ratio and split.size are specified, split.ratio takes precedence.
If not provided, defaults to c(0.8, 0.1, 0.1), i.e. 80 percent of the data is used for training,
10 percent for testing and 10 percent for validation.
split.size: list of integers, optional
List of 3 integers that specify the number of rows in data used for training, testing
and validation, respectively.
If both split.ratio and split.size are specified, split.ratio takes precedence.
No default value.
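The precedence between split.ratio and split.size can be sketched as a small base R function. `resolve_split_counts` is a hypothetical helper written only for illustration. Two points are assumptions rather than documented behaviour: that ratios are turned into row counts by rounding, and that split.size is honoured when it is the only one supplied.

```r
# Hypothetical helper illustrating the documented precedence.
# Assumptions: row counts are obtained by rounding, and split.size is
# used when it is the only argument supplied. PAL may behave differently.
resolve_split_counts <- function(n, split.ratio = NULL, split.size = NULL) {
  if (is.null(split.ratio) && !is.null(split.size)) {
    counts <- as.integer(split.size)  # only sizes given: use them as row counts
  } else {
    if (is.null(split.ratio)) {
      split.ratio <- c(0.8, 0.1, 0.1)  # documented default
    }
    # split.ratio takes precedence even when split.size is also set
    counts <- as.integer(round(split.ratio * n))
  }
  stopifnot(length(counts) == 3)
  names(counts) <- c("train", "test", "validation")
  counts
}

resolve_split_counts(30, split.ratio = c(0.6, 0.2, 0.2))
resolve_split_counts(30, split.ratio = c(0.6, 0.2, 0.2), split.size = c(10, 10, 10))
resolve_split_counts(30, split.size = c(20, 5, 5))
resolve_split_counts(30)
```

For 30 rows, the first two calls both give 18, 6 and 6 rows, because the ratio wins over the sizes; the third gives 20, 5 and 5; the last falls back to the default ratio and gives 24, 3 and 3.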
List of DataFrames
DataFrames for training, testing and validation, arranged in the following order:
DataFrame 1: training,
DataFrame 2: testing,
DataFrame 3: validation.
Input DataFrame data:
> data$Collect()
   ID HomeOwner MaritalStatus AnnualIncome DefaultedBorrower
1   0       YES        Single          125                NO
2   1        NO       Married          100                NO
3   2        NO        Single           70                NO
4   3       YES       Married          120                NO
5   4        NO      Divorced           95               YES
...
28 27        NO        Single           85               YES
29 28        NO       Married           75               YES
30 29        NO        Single           90               YES
Call the function:
> partition <- hanaml.Partition(data,
                                key = "ID",
                                random.state = 23,
                                method = "random",
                                split.ratio = c(0.6, 0.2, 0.2))
Output:
> partition[[1]]$Collect()
   ID HomeOwner MaritalStatus AnnualIncome DefaultedBorrower
1   0       YES        Single          125                NO
2   1        NO       Married          100                NO
3   3       YES       Married          120                NO
4   5        NO       Married           60                NO
5   7        NO        Single           85               YES
6  10       YES        Single          125                NO
7  12        NO        Single           70                NO
8  13       YES       Married          120                NO
9  17        NO        Single           85               YES
10 18        NO       Married           75                NO
11 21        NO       Married          100                NO
12 22        NO        Single           70                NO
13 23       YES       Married          120                NO
14 24        NO      Divorced           95               YES
15 25        NO       Married           60                NO
16 27        NO        Single           85               YES
17 28        NO       Married           75               YES
18 29        NO        Single           90               YES
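As a quick sanity check, the training partition printed above can be compared against the requested split.ratio in plain R. The IDs are copied from that output, and the total of 30 rows comes from the input DataFrame (IDs 0 through 29). Nothing here calls SAP HANA.

```r
# IDs copied from the training partition shown in the output above
train_ids <- c(0, 1, 3, 5, 7, 10, 12, 13, 17, 18,
               21, 22, 23, 24, 25, 27, 28, 29)
n_total <- 30  # the input DataFrame holds IDs 0 through 29

length(train_ids)            # 18 rows in the training set
length(train_ids) / n_total  # 0.6, the first element of split.ratio
anyDuplicated(train_ids)     # 0: no ID appears twice
```

The remaining 12 rows would be expected to be shared between partition[[2]] and partition[[3]], 6 rows each for a ratio of 0.2.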