hanaml.Sampling.Rd
hanaml.Sampling is an R wrapper for SAP HANA PAL sampling.
hanaml.Sampling(
data,
method,
interval = NULL,
features = NULL,
sampling.size = NULL,
random.state = NULL,
percentage = NULL,
stratified.columns = NULL
)
DataFrame
DataFrame containing the data.
character
"first_n"
: The first n records.
"middle_n"
: The middle n records.
"last_n"
: The last n records.
"every_nth"
: Every n-th record, where n is given by interval.
"simple_random_with_replacement"
: Simple random sampling with replacement. The system time is used as the seed unless random.state is nonzero.
"simple_random_without_replacement"
: Simple random sampling without replacement. The system time is used as the seed unless random.state is nonzero.
"systematic"
: Systematic sampling. The system time is used as the seed unless random.state is nonzero.
"stratified_with_replacement"
"stratified_without_replacement"
integer, optional
The interval between two samples.
Only valid when method is "every_nth".
If this parameter is not specified, the sampling.size parameter is used instead.
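The positional methods can be illustrated in plain R on row numbers alone. This is only a sketch of the selection logic, not a call into PAL; in particular, the assumption that "every_nth" starts at the n-th row is not stated in this page, and "middle_n" is left out because its exact window is not described here.

```r
# Toy stand-in for the HANA table: 25 rows, identified by row number.
n_rows <- 25
rows <- seq_len(n_rows)

# "first_n" and "last_n" with a sample size of 8.
first_n <- head(rows, 8)
last_n  <- tail(rows, 8)

# "every_nth" with interval = 5.
# Assumption: sampling starts at the 5th row; PAL may start elsewhere.
interval <- 5
every_nth <- rows[seq(interval, n_rows, by = interval)]

print(first_n)
print(last_n)
print(every_nth)
```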
vector/list of characters, optional (deprecated)
The column used for stratified sampling.
Only required when method is "stratified_with_replacement" or "stratified_without_replacement".
Will be replaced by the stratified.columns parameter in a future release.
integer, optional
Number of samples.
Not effective when method is "every_nth" and interval is specified, or when percentage is specified.
Defaults to 1.
integer, optional
Seed for the random number generator.
0
: Uses the system time.
Not 0
: Uses the specified value as the seed.
Defaults to 0.
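The effect of a fixed seed, and the difference between sampling with and without replacement, can be sketched in base R. The helper draw_rows is hypothetical and uses R's own random number generator, so the rows it picks will not match what PAL returns for the same seed.

```r
# Hypothetical helper: draw row indices the way a seeded random method might.
draw_rows <- function(n_rows, size, replace, random.state = 0) {
  # Mirror the documented rule: 0 leaves the seed unfixed,
  # any other value is used as the seed.
  if (random.state != 0) set.seed(random.state)
  sample.int(n_rows, size = size, replace = replace)
}

# Same nonzero seed, same rows.
a <- draw_rows(25, 8, replace = FALSE, random.state = 42)
b <- draw_rows(25, 8, replace = FALSE, random.state = 42)
print(identical(a, b))

# With replacement, 8 draws from only 5 rows must repeat at least one row.
w <- draw_rows(5, 8, replace = TRUE, random.state = 42)
print(anyDuplicated(w) > 0)
```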
double, optional
Percentage of the samples.
Use this parameter when sampling.size is not set.
If both sampling.size and percentage are specified,
percentage takes precedence.
Defaults to 0.1.
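The precedence between sampling.size and percentage can be sketched as a small base R function. The helper resolve_sample_size is hypothetical; rounding down and the fallback used when neither argument is given are assumptions, since this page does not specify them.

```r
# Hypothetical helper, not part of the package.
resolve_sample_size <- function(n_rows, sampling.size = NULL, percentage = NULL) {
  if (!is.null(percentage)) {
    # percentage takes precedence whenever it is given.
    # Assumption: the row count is rounded down; PAL may round differently.
    return(floor(n_rows * percentage))
  }
  if (!is.null(sampling.size)) {
    return(sampling.size)
  }
  # Assumption: fall back to the documented sampling.size default.
  1
}

print(resolve_sample_size(25, sampling.size = 8))
print(resolve_sample_size(25, sampling.size = 8, percentage = 0.4))
```

With 25 rows, the first call resolves to 8 samples and the second to 10, because percentage overrides sampling.size.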
vector/list of characters, optional
Specifies the columns used for stratified sampling.
Only required when method is "stratified_with_replacement" or "stratified_without_replacement".
If both features and stratified.columns are specified, stratified.columns takes precedence.
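Stratified sampling without replacement can be sketched in base R as follows. The helper stratified_sample and the toy data are hypothetical, and proportional allocation with at least one row per stratum is an assumption; PAL may allocate samples across strata differently.

```r
# Toy data (illustrative only, not the example table from this page).
df <- data.frame(
  EMPNO  = 1:10,
  GENDER = rep(c("male", "female"), times = c(6, 4))
)

# Hypothetical helper: sample a fixed fraction of rows from every stratum.
stratified_sample <- function(df, column, fraction, seed = 1) {
  set.seed(seed)
  groups <- split(df, df[[column]])
  picked <- lapply(groups, function(g) {
    # Assumption: proportional allocation, at least one row per stratum.
    k <- max(1, floor(nrow(g) * fraction))
    g[sample.int(nrow(g), k), , drop = FALSE]
  })
  do.call(rbind, picked)
}

s <- stratified_sample(df, "GENDER", fraction = 0.5)
print(table(s$GENDER))
```

Each stratum contributes rows in proportion to its size, so the 6 male and 4 female rows yield 3 and 2 sampled rows respectively.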
DataFrame
The sampled records, with the same column structure (number of columns, column
names, and column types) as the input data.
This function is used to choose a small portion of the records as representatives.
Input DataFrame data:
> data$Collect()
   EMPNO GENDER INCOME
1      1   male 4000.5
2      2   male 5000.7
3      3 female 5100.8
4      4   male 5400.9
5      5 female 5500.2
....
23    23   male 8576.9
24    24   male 9560.9
25    25 female 8794.9
Call Sampling function:
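> # interval and features apply to other methods; with "first_n",
> # sampling.size determines the result.
> sampling <- hanaml.Sampling(data, method = "first_n",
                              sampling.size = 8,
                              interval = 5,
                              features = "GENDER")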
Output:
> sampling$Collect()
  EMPNO GENDER INCOME
1     1   male 4000.5
2     2   male 5000.7
3     3 female 5100.8
4     4   male 5400.9
5     5 female 5500.2
6     6   male 5540.4
7     7   male 4500.9
8     8 female 6000.8