hanaml.Sampling is an R wrapper
for SAP HANA PAL sampling.
hanaml.Sampling(
data,
method,
interval = NULL,
features = NULL,
sampling.size = NULL,
random.state = NULL,
percentage = NULL,
stratified.columns = NULL
)
Arguments
| data |
DataFrame
DataFrame containing the data.
|
| method |
character
"first_n": The first n records.
"middle_n": The middle n records.
"last_n": The last n records.
"every_nth": Samples records at the given interval.
"simple_random_with_replacement"
"simple_random_without_replacement"
"systematic"
"stratified_with_replacement"
"stratified_without_replacement"
For the random methods, the system time is used for the seed.
|
| interval |
integer, optional
The interval between two samples.
Only valid when method is "every_nth".
If this parameter is not specified, then sampling.size parameter will
be used.
|
| features |
vector/list of character, optional (deprecated)
The columns used for stratified sampling.
Only required when method is "stratified_with_replacement"
or "stratified_without_replacement".
Will be replaced by the stratified.columns parameter in a future release.
|
| sampling.size |
integer, optional
Number of the samples.
Not effective when method is "every_nth" and interval
is specified, or when percentage is specified.
Defaults to 1.
|
| random.state |
integer, optional
Defaults to 0.
|
| percentage |
double, optional
Percentage of the samples.
Use this parameter when sampling.size is not set.
If both sampling.size and percentage are specified,
percentage takes precedence.
Defaults to 0.1.
|
| stratified.columns |
vector/list of character, optional
Specifies the set of columns that are used to do the stratified sampling.
Only required when method is "stratified_with_replacement",
or "stratified_without_replacement".
If both features and stratified.columns are specified,
stratified.columns takes precedence.
|
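The precedence rules stated for sampling.size, interval, percentage, features, and stratified.columns can be summarized in a short sketch. The two helper functions below are hypothetical and are not part of hanaml; they only restate the rules in the argument descriptions above and say nothing about how PAL resolves combinations the documentation does not cover.

```r
# Hypothetical helpers (not part of hanaml) that restate the documented rules.

# sampling.size is not effective when method is "every_nth" and interval is
# specified, or when percentage is specified.
sampling_size_effective <- function(method, interval = NULL, percentage = NULL) {
  overridden_by_interval <- identical(method, "every_nth") && !is.null(interval)
  overridden_by_percentage <- !is.null(percentage)
  !(overridden_by_interval || overridden_by_percentage)
}

# If both features and stratified.columns are specified,
# stratified.columns takes precedence.
strata_columns <- function(features = NULL, stratified.columns = NULL) {
  if (!is.null(stratified.columns)) stratified.columns else features
}

sampling_size_effective("every_nth", interval = 5)                  # FALSE
sampling_size_effective("first_n")                                  # TRUE
strata_columns(features = "GENDER", stratified.columns = "INCOME")  # "INCOME"
```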
Value
DataFrame
The same column structure (number of columns, column names, and column
types) as the input data.
Details
This function is used to choose a small portion of the records
as representatives.
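To make the deterministic methods concrete, the following base R sketch mimics "first_n", "last_n", and "every_nth" on a local data.frame. It is an illustration only and does not call PAL. The data are the eight rows shown in the example output of this page, the function names are hypothetical, and the starting row for "every_nth" is an assumption, since the documentation does not state it.

```r
# Local illustration only; none of these functions belong to hanaml.
df <- data.frame(
  EMPNO  = 1:8,
  GENDER = c("male", "male", "female", "male",
             "female", "male", "male", "female"),
  INCOME = c(4000.5, 5000.7, 5100.8, 5400.9,
             5500.2, 5540.4, 4500.9, 6000.8)
)

# "first_n": keep the first n records.
first_n <- function(d, n) head(d, n)

# "last_n": keep the last n records.
last_n <- function(d, n) tail(d, n)

# "every_nth": keep one record per interval.
# Assumption: the sample starts at row number `interval`.
every_nth <- function(d, interval) {
  keep <- seq_len(nrow(d)) %% interval == 0
  d[keep, , drop = FALSE]
}

first_n(df, 3)$EMPNO    # 1 2 3
last_n(df, 2)$EMPNO     # 7 8
every_nth(df, 3)$EMPNO  # 3 6
```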
Examples
Input DataFrame data:
> data$Collect()
EMPNO GENDER INCOME
1 1 male 4000.5
2 2 male 5000.7
3 3 female 5100.8
4 4 male 5400.9
5 5 female 5500.2
....
23 23 male 8576.9
24 24 male 9560.9
25 25 female 8794.9
Call Sampling function:
> sampling <- hanaml.Sampling(data, method = "first_n",
                              sampling.size = 8)
Output:
> sampling$Collect()
EMPNO GENDER INCOME
1 1 male 4000.5
2 2 male 5000.7
3 3 female 5100.8
4 4 male 5400.9
5 5 female 5500.2
6 6 male 5540.4
7 7 male 4500.9
8 8 female 6000.8
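For the stratified methods, the idea of sampling within each group of a column can be sketched locally in base R. This is a minimal sketch of stratified sampling without replacement, not the PAL algorithm. The function name, the seed argument, the rounding of the per-stratum size, and the minimum of one record per stratum are all assumptions made for illustration. The data are the eight rows of the output above.

```r
# Local illustration only; stratified_sample is a hypothetical helper.
df <- data.frame(
  EMPNO  = 1:8,
  GENDER = c("male", "male", "female", "male",
             "female", "male", "male", "female"),
  INCOME = c(4000.5, 5000.7, 5100.8, 5400.9,
             5500.2, 5540.4, 4500.9, 6000.8)
)

stratified_sample <- function(d, column, percentage, seed = 1) {
  set.seed(seed)  # fixed seed so the sketch is reproducible
  # Row indices grouped by the value of the stratification column.
  groups <- split(seq_len(nrow(d)), d[[column]])
  picked <- unlist(lapply(groups, function(rows) {
    # Assumption: round the per-stratum size, keeping at least one record.
    k <- max(1, round(length(rows) * percentage))
    # sample.int draws k distinct positions, so no record repeats.
    rows[sample.int(length(rows), k)]
  }))
  d[sort(picked), , drop = FALSE]
}

s <- stratified_sample(df, "GENDER", percentage = 0.4)
table(s$GENDER)  # 1 female, 2 male
```

With 5 male and 3 female records, a percentage of 0.4 yields 2 and 1 records respectively under the rounding rule assumed here, so the sample keeps roughly the same gender proportions as the input.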