Partition — hanaml.Partition • hana.ml.r

hanaml.Partition is an R wrapper for the SAP HANA PAL Partition algorithm.

hanaml.Partition(
  data,
  key,
  features = NULL,
  random.state = NULL,
  thread.ratio = NULL,
  method = NULL,
  stratified.column = NULL,
  split.ratio = NULL,
  split.size = NULL
)

Arguments

data	`DataFrame` DataFrame containing the data.
key	`character` Name of the ID column.
features	`character or list of characters, optional` Names of the feature columns. If not provided, it defaults to all non-key, non-label columns of data.
random.state	`integer, optional` Seed used to initialize the random number generator. `0`: uses the system time. `Not 0`: uses the specified seed. Defaults to 0.
thread.ratio	`double, optional` Controls the proportion of available threads that can be used by this function. The value range is from 0 to 1, where 0 indicates a single thread and 1 indicates all available threads. Values between 0 and 1 use up to that proportion of available threads. Values outside this range are ignored. Defaults to 0.
method	`character, optional` Partition method used for splitting the dataset into training, testing and validation sets. `"random"`: random partition. `"stratified"`: stratified partition. Defaults to "random".
stratified.column	`character, optional` Indicates which column is used for stratification in the partition process. Required and valid only when `method` is set to "stratified" (stratified partition). No default value.
split.ratio	`list of double, optional` List of 3 numbers that specify the proportions of data used for training, testing and validation respectively. If both `split.ratio` and `split.size` are specified, `split.ratio` takes precedence. If not provided, defaults to c(0.8, 0.1, 0.1), i.e. 80 percent of the data is used for training, 10 percent for testing and 10 percent for validation.
split.size	`list of integers, optional` List of 3 integers that specify the number of rows in data used for training, testing and validation respectively. If both `split.ratio` and `split.size` are specified, `split.ratio` takes precedence. No default value.
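
To make the interplay of `random.state`, `method`, `stratified.column` and `split.ratio` more concrete, the following base R sketch mimics the idea on a local data.frame. It is not the PAL implementation: `local_partition` and its argument names are hypothetical, the rounding of the set sizes is an assumption, and the rows selected for a given seed will not match what PAL returns.

```r
# Hypothetical local helper (not part of hana.ml.r): splits a data.frame
# into training, testing and validation sets according to `ratio`.
local_partition <- function(df, ratio = c(0.8, 0.1, 0.1), seed = 0,
                            strat.col = NULL) {
  stopifnot(length(ratio) == 3, abs(sum(ratio) - 1) < 1e-8)
  # Mirror the documented seed rule: 0 leaves the generator unseeded.
  if (seed != 0) set.seed(seed)
  groups <- c("train", "test", "validation")

  # Shuffle a vector of row indices and cut it into three pieces.
  split_rows <- function(rows) {
    n <- length(rows)
    rows <- rows[sample.int(n)]
    n.train <- round(n * ratio[1])                  # assumed rounding rule
    n.test <- min(round(n * ratio[2]), n - n.train)
    label <- rep(groups, c(n.train, n.test, n - n.train - n.test))
    split(rows, factor(label, levels = groups))
  }

  # Random partition: one stratum holding all rows.
  # Stratified partition: one stratum per value of `strat.col`.
  idx <- seq_len(nrow(df))
  strata <- if (is.null(strat.col)) list(idx) else split(idx, df[[strat.col]])
  pieces <- lapply(strata, split_rows)

  # Recombine the strata and return three data.frames in the documented order.
  parts <- lapply(groups, function(g) {
    rows <- sort(unlist(lapply(pieces, `[[`, g), use.names = FALSE))
    df[rows, , drop = FALSE]
  })
  setNames(parts, groups)
}

# Stand-in data: 30 rows, 20 labelled "NO" and 10 labelled "YES".
df <- data.frame(ID = 0:29, label = rep(c("NO", "YES"), c(20, 10)))

parts <- local_partition(df, ratio = c(0.6, 0.2, 0.2), seed = 23)
sapply(parts, nrow)        # 18, 6 and 6 rows

strat <- local_partition(df, ratio = c(0.6, 0.2, 0.2), seed = 23,
                         strat.col = "label")
table(strat$train$label)   # 12 "NO" and 6 "YES": the 2:1 class ratio is kept
```

With the stratified variant each set keeps the class proportions of the chosen column, which is the reason to prefer it when the label is imbalanced.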

Value

List of DataFrames
DataFrames for training, testing and validation, arranged in the following order:

DataFrame 1: training,
DataFrame 2: testing,
DataFrame 3: validation.
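
Because the result is a positional list, it can be convenient to attach names before passing the pieces on. The sketch below assumes only that the result is an ordinary R list of length 3; `name_partitions` is a hypothetical helper, and the character strings stand in for the DataFrames a real call would return.

```r
# Hypothetical helper: label the three elements in the documented order.
name_partitions <- function(parts) {
  stopifnot(is.list(parts), length(parts) == 3)
  names(parts) <- c("train", "test", "validation")
  parts
}

# Stand-in list; with a live connection this would be the value
# returned by hanaml.Partition().
parts <- name_partitions(list("df1", "df2", "df3"))
parts$validation   # same element as parts[[3]]
```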

Examples

Input DataFrame data:

> data$Collect()
   ID HomeOwner MaritalStatus AnnualIncome DefaultedBorrower
1   0       YES        Single          125                NO
2   1        NO       Married          100                NO
3   2        NO        Single           70                NO
4   3       YES       Married          120                NO
5   4        NO      Divorced           95               YES
...
28 27        NO        Single           85               YES
29 28        NO       Married           75               YES
30 29        NO        Single           90               YES

Call the function:

> partition <- hanaml.Partition(data,
                                key = "ID",
                                random.state = 23,
                                method = "random",
                                split.ratio = c(0.6, 0.2, 0.2))

Output:

> partition[[1]]$Collect()
    ID HomeOwner MaritalStatus AnnualIncome DefaultedBorrower
 1   0       YES        Single          125                NO
 2   1        NO       Married          100                NO
 3   3       YES       Married          120                NO
 4   5        NO       Married           60                NO
 5   7        NO        Single           85               YES
 6  10       YES        Single          125                NO
 7  12        NO        Single           70                NO
 8  13       YES       Married          120                NO
 9  17        NO        Single           85               YES
 10 18        NO       Married           75                NO
 11 21        NO       Married          100                NO
 12 22        NO        Single           70                NO
 13 23       YES       Married          120                NO
 14 24        NO      Divorced           95               YES
 15 25        NO       Married           60                NO
 16 27        NO        Single           85               YES
 17 28        NO       Married           75               YES
 18 29        NO        Single           90               YES
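
The training set above holds 18 of the 30 input rows, matching the requested 0.6 share. Once all three sets have been collected into local data.frames, a check along the following lines can confirm that the sets do not overlap and that their sizes match the requested ratio. `check_partition` is a hypothetical helper, and the three data.frames are stand-ins rather than real output.

```r
# Hypothetical helper: verifies that the key values are disjoint across
# the three sets and reports the share of rows in each set.
check_partition <- function(train, test, validation, key) {
  ids <- c(train[[key]], test[[key]], validation[[key]])
  sizes <- c(train = nrow(train), test = nrow(test),
             validation = nrow(validation))
  list(disjoint = anyDuplicated(ids) == 0,
       share = sizes / sum(sizes))
}

# Stand-in data.frames with 18, 6 and 6 rows.
train <- data.frame(ID = 1:18)
test <- data.frame(ID = 19:24)
validation <- data.frame(ID = 25:30)

res <- check_partition(train, test, validation, key = "ID")
res$disjoint   # TRUE
res$share      # 0.6, 0.2, 0.2
```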