R: Partition

hanaml.Partition {hana.ml.r}

R Documentation

Partition

Description

hanaml.Partition is an R wrapper for the PAL Partition algorithm.

Usage

hanaml.Partition(conn.context,
                 data,
                 key,
                 features = NULL,
                 random.state = NULL,
                 thread.ratio = NULL,
                 method = NULL,
                 stratified.column = NULL,
                 split.ratio = NULL,
                 split.size = NULL)

Arguments

`conn.context`	`ConnectionContext` Database connection object
`data`	`DataFrame` Dataset to be partitioned.
`key`	`character` Name of the ID column.
`features`	`list of characters, optional` Names of the feature columns. If not provided, defaults to all non-ID columns.
`random.state`	`integer, optional` Seed used to initialize the random number generator. `0`: uses the system time. Any nonzero value: uses the specified seed.
`thread.ratio`	`double, optional` Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
`method`	`character, optional` Partition method used for splitting the dataset into training, testing and validation sets: `"random"` (random partition) or `"stratified"` (stratified partition). Defaults to `"random"`.
`stratified.column`	`character, optional` Name of the column used for stratification in the partition process. Required and valid only when `method` is set to `"stratified"`. No default value.
`split.ratio`	`list of numeric, optional` List of 3 numeric values that specify the proportions of data used for training, testing and validation, respectively. If not provided, defaults to c(0.8, 0.1, 0.1), i.e. 80 percent of the data is used for training, 10 percent for testing and 10 percent for validation.
`split.size`	`list of integers, optional` List of 3 integers that specify the number of rows of data used for training, testing and validation, respectively.
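
The difference between the `"random"` and `"stratified"` methods can be illustrated on a local data.frame. The following is a minimal base-R sketch, not the PAL implementation: `partition_local` is a hypothetical helper, the rounding of row counts is an assumption, and the data is synthetic.

```r
# Local illustration only; hanaml.Partition runs inside SAP HANA and may
# round row counts differently. partition_local() is NOT part of hana.ml.r.
partition_local <- function(df, split.ratio = c(0.8, 0.1, 0.1),
                            stratified.column = NULL, seed = 1) {
  stopifnot(length(split.ratio) == 3, abs(sum(split.ratio) - 1) < 1e-8)
  set.seed(seed)
  part.names <- c("train", "test", "validation")

  # Randomly assign the row indices in idx to the three partitions.
  assign_rows <- function(idx) {
    n <- length(idx)
    n.train <- round(split.ratio[1] * n)
    n.test <- min(round(split.ratio[2] * n), n - n.train)
    labels <- rep(part.names, times = c(n.train, n.test, n - n.train - n.test))
    data.frame(row = idx, part = labels[sample.int(n)],
               stringsAsFactors = FALSE)
  }

  # Random: one group holding all rows.
  # Stratified: one group per distinct value of the stratification column.
  groups <- if (is.null(stratified.column)) {
    list(seq_len(nrow(df)))
  } else {
    split(seq_len(nrow(df)), df[[stratified.column]])
  }

  assignment <- do.call(rbind, lapply(groups, assign_rows))
  part <- factor(assignment$part[order(assignment$row)], levels = part.names)
  split(df, part)
}

# Synthetic data: 30 rows, classes in a 2:1 ratio.
toy <- data.frame(ID = 0:29, Class = rep(c("NO", "NO", "YES"), times = 10))

random.parts <- partition_local(toy, split.ratio = c(0.6, 0.2, 0.2))
sapply(random.parts, nrow)        # 18, 6 and 6 rows

strat.parts <- partition_local(toy, split.ratio = c(0.6, 0.2, 0.2),
                               stratified.column = "Class")
table(strat.parts$train$Class)    # 12 "NO" and 6 "YES", the same 2:1 ratio
```

In the stratified case the split ratio is applied within each class, so every partition keeps the class proportions of the full dataset. In the random case the proportions can drift, especially on small datasets.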

Format

R6Class object.

Value

List of DataFrame
DataFrames for training, testing and validation, arranged in the following order:
- element 1: DataFrame for training,
- element 2: DataFrame for testing,
- element 3: DataFrame for validation.
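
Because the result is positional, it can help to attach names to the three elements before using them. The sketch below assumes only that the return value is a list of length 3. `name_partitions` is a hypothetical helper, and a plain list of strings stands in for the DataFrames so that the code runs without a database connection.

```r
# Hypothetical helper, not part of hana.ml.r: name the three partitions.
name_partitions <- function(parts) {
  stopifnot(is.list(parts), length(parts) == 3)
  names(parts) <- c("train", "test", "validation")
  parts
}

# Stand-in for the list returned by hanaml.Partition().
parts <- name_partitions(list("training rows", "testing rows", "validation rows"))
parts$test          # same element as parts[[2]]
```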

Examples

## Not run: 
## Input DataFrame for preprocessing:
> data$Collect()
   ID HomeOwner MaritalStatus AnnualIncome DefaultedBorrower
1   0       YES        Single          125                NO
2   1        NO       Married          100                NO
3   2        NO        Single           70                NO
4   3       YES       Married          120                NO
5   4        NO      Divorced           95               YES
...
28 27        NO        Single           85               YES
29 28        NO       Married           75               YES
30 29        NO        Single           90               YES

## Partition the data into training, testing and validation sets:
 > partition <- hanaml.Partition(conn, data, key = "ID",
                                 random.state = 23,
                                 method = "random",
                                 split.ratio = c(0.6, 0.2, 0.2))
## Expected output (training set):

 > partition[[1]]$Collect()
    ID HomeOwner MaritalStatus AnnualIncome DefaultedBorrower
 1   0       YES        Single          125                NO
 2   1        NO       Married          100                NO
 3   3       YES       Married          120                NO
 4   5        NO       Married           60                NO
 5   7        NO        Single           85               YES
 6  10       YES        Single          125                NO
 7  12        NO        Single           70                NO
 8  13       YES       Married          120                NO
 9  17        NO        Single           85               YES
 10 18        NO       Married           75                NO
 11 21        NO       Married          100                NO
 12 22        NO        Single           70                NO
 13 23       YES       Married          120                NO
 14 24        NO      Divorced           95               YES
 15 25        NO       Married           60                NO
 16 27        NO        Single           85               YES
 17 28        NO       Married           75               YES
 18 29        NO        Single           90               YES

## End(Not run)
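
After partitioning, a quick sanity check is to confirm that no ID appears in more than one partition and that the partition sizes match the requested ratio. The sketch below works on plain ID vectors. `check_partition_ids` is a hypothetical helper and the IDs are synthetic; with real results the vectors would presumably come from the collected partitions, e.g. `partition[[1]]$Collect()$ID`.

```r
# Hypothetical helper, not part of hana.ml.r: check ID vectors of 3 partitions.
# Assumes IDs are unique within each partition.
check_partition_ids <- function(train.ids, test.ids, validation.ids) {
  combined <- c(train.ids, test.ids, validation.ids)
  sizes <- c(train = length(train.ids),
             test = length(test.ids),
             validation = length(validation.ids))
  list(disjoint = anyDuplicated(combined) == 0,  # TRUE if no ID is shared
       sizes = sizes,
       shares = sizes / length(combined))        # observed split ratio
}

# Synthetic IDs for illustration: a 6 / 2 / 2 split of 10 rows.
check <- check_partition_ids(1:6, 7:8, 9:10)
check$disjoint   # TRUE
check$shares     # 0.6, 0.2, 0.2
```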

[Package hana.ml.r version 1.0.8 Index]