Partition — hanaml.Partition • hana.ml.r

hanaml.Partition is an R wrapper for the SAP HANA PAL Partition algorithm.

hanaml.Partition(
  data,
  key,
  features = NULL,
  random.state = NULL,
  thread.ratio = NULL,
  method = NULL,
  stratified.column = NULL,
  split.ratio = NULL,
  split.size = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

random.state

integer, optional
Indicates the seed used to initialize the random number generator.

0: Uses the system time
Not 0: Uses the specified seed

Defaults to 0.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.

method

character, optional
Partition method used for splitting the dataset into training, testing and validation sets:

"random": random partitions
"stratified": stratified partition

Defaults to "random".

stratified.column

character, optional
Indicates which column is used for stratification in the partition process.
Required and valid only when method is set to "stratified" (stratified partition).
No default value.
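
The idea behind stratification can be sketched in plain R, without a HANA connection. This is only a conceptual illustration and not the PAL implementation: the toy data, the helper assign.sets and the ratio c(0.5, 0.25, 0.25) are all hypothetical. Rows are shuffled and divided within each level of the stratification column, so every resulting set keeps roughly the class balance of the full data.

```r
# Conceptual sketch in base R; the PAL algorithm's internals may differ.
set.seed(23)

# Hypothetical toy data: 14 rows labelled "NO" and 6 labelled "YES".
toy <- data.frame(
  ID = 0:19,
  DefaultedBorrower = rep(c("NO", "YES"), times = c(14, 6))
)

# Shuffle the IDs of one stratum and cut them into three sets
# according to the cumulative ratio (hypothetical helper).
assign.sets <- function(ids, ratio) {
  n <- length(ids)
  shuffled <- ids[sample.int(n)]
  cuts <- round(cumsum(ratio) * n)
  set <- cut(seq_len(n), breaks = c(0, cuts),
             labels = c("training", "testing", "validation"))
  data.frame(ID = shuffled, set = set)
}

ratio <- c(0.5, 0.25, 0.25)

# Partition each stratum separately, then stack the results.
strata <- split(toy$ID, toy$DefaultedBorrower)
assigned <- do.call(rbind, lapply(names(strata), function(level) {
  out <- assign.sets(strata[[level]], ratio)
  out$DefaultedBorrower <- level
  out
}))

# Rows per set and class.
counts <- table(assigned$set, assigned$DefaultedBorrower)
print(counts)
```

With these hypothetical numbers the training set receives 7 of the 14 "NO" rows and 3 of the 6 "YES" rows, whichever seed is used.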

split.ratio

list of double, optional
List of 3 numbers that specify the proportions of data used for training, testing and validation, respectively.
If both split.ratio and split.size are specified, split.ratio takes precedence.
If not provided, defaults to c(0.8, 0.1, 0.1), i.e. 80 percent of the data is used for training, 10 percent for testing and 10 percent for validation.

split.size

list of integers, optional
List of 3 integers that specify the number of rows of data used for training, testing and validation, respectively.
If both split.ratio and split.size are specified, split.ratio takes precedence.
No default value.
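
The interaction between split.ratio and split.size described above can be sketched in plain R. The helper resolve.sizes is hypothetical and is not part of hana.ml.r; in particular, rounding the row counts is an assumption, and PAL may round differently.

```r
# Hypothetical helper mirroring the documented precedence rules;
# not the hana.ml.r or PAL implementation.
resolve.sizes <- function(n.rows, split.ratio = NULL, split.size = NULL) {
  if (!is.null(split.ratio)) {
    # split.ratio takes precedence whenever it is supplied.
    return(round(n.rows * split.ratio))
  }
  if (!is.null(split.size)) {
    return(split.size)
  }
  # Neither supplied: fall back to the documented default ratio.
  round(n.rows * c(0.8, 0.1, 0.1))
}

# Both given: the ratio is used and split.size is ignored.
both <- resolve.sizes(30, split.ratio = c(0.6, 0.2, 0.2),
                      split.size = c(10, 10, 10))

# Only split.size given: the sizes are used as they are.
size.only <- resolve.sizes(30, split.size = c(20, 5, 5))

# Neither given: the default c(0.8, 0.1, 0.1) applies.
default <- resolve.sizes(30)

print(both)
print(size.only)
print(default)
```

For a 30-row table, a ratio of c(0.6, 0.2, 0.2) corresponds to 18 training rows under this rounding assumption.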

Value

List of DataFrames
DataFrames for training, testing and validation, arranged in the following order:

DataFrame 1: training,
DataFrame 2: testing,
DataFrame 3: validation.

Examples

Input DataFrame data:


> data$Collect()
   ID HomeOwner MaritalStatus AnnualIncome DefaultedBorrower
1   0       YES        Single          125                NO
2   1        NO       Married          100                NO
3   2        NO        Single           70                NO
4   3       YES       Married          120                NO
5   4        NO      Divorced           95               YES
...
28 27        NO        Single           85               YES
29 28        NO       Married           75               YES
30 29        NO        Single           90               YES

Call the function:


> partition <- hanaml.Partition(data,
                                key = "ID",
                                random.state = 23,
                                method = "random",
                                split.ratio = c(0.6, 0.2, 0.2))

Output:


> partition[[1]]$Collect()
    ID HomeOwner MaritalStatus AnnualIncome DefaultedBorrower
 1   0       YES        Single          125                NO
 2   1        NO       Married          100                NO
 3   3       YES       Married          120                NO
 4   5        NO       Married           60                NO
 5   7        NO        Single           85               YES
 6  10       YES        Single          125                NO
 7  12        NO        Single           70                NO
 8  13       YES       Married          120                NO
 9  17        NO        Single           85               YES
 10 18        NO       Married           75                NO
 11 21        NO       Married          100                NO
 12 22        NO        Single           70                NO
 13 23       YES       Married          120                NO
 14 24        NO      Divorced           95               YES
 15 25        NO       Married           60                NO
 16 27        NO        Single           85               YES
 17 28        NO       Married           75               YES
 18 29        NO        Single           90               YES