Sampling — hanaml.Sampling • hana.ml.r

hanaml.Sampling is an R wrapper for SAP HANA PAL sampling.

hanaml.Sampling(
  data,
  method,
  interval = NULL,
  features = NULL,
  sampling.size = NULL,
  random.state = NULL,
  percentage = NULL,
  stratified.columns = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

method

character

"first_n": The first n records.
"middle_n": The middle n records.
"last_n": The last n records.
"every_nth": Samples records at a fixed interval (see interval).
"simple_random_with_replacement": Simple random sampling with replacement. By default, the system time is used as the seed (see random.state).
"simple_random_without_replacement": Simple random sampling without replacement. By default, the system time is used as the seed (see random.state).
"systematic": Systematic sampling. By default, the system time is used as the seed (see random.state).
"stratified_with_replacement"
"stratified_without_replacement"

interval

integer, optional
The interval between two samples.
Only valid when method is "every_nth".
If this parameter is not specified, the sampling.size parameter is used instead.
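
As a rough illustration of what interval controls, the following base R sketch picks every fifth row of a toy data frame. It does not call SAP HANA PAL. The toy data frame `df` and the choice to start at the interval-th row are assumptions made for illustration, and PAL may use a different starting offset.

```r
# Toy stand-in for a 25-row table; not a hana.ml.r DataFrame.
df <- data.frame(EMPNO = 1:25)

interval <- 5

# Assumption: the first sampled row is the interval-th row.
idx <- seq(from = interval, to = nrow(df), by = interval)
every_nth <- df[idx, , drop = FALSE]

print(every_nth$EMPNO)
```

With 25 rows and an interval of 5, this sketch keeps five rows, which is why sampling.size is not needed once interval is given.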

features

vector/list of characters, optional (deprecated)
The columns used for stratified sampling. Only required when method is "stratified_with_replacement" or "stratified_without_replacement".
Will be replaced by the stratified.columns parameter in a future release.

sampling.size

integer, optional
Number of samples.
Not effective when method is "every_nth" and interval is specified, or when percentage is specified.
Defaults to 1.

random.state

integer, optional

Seed for the random sampling methods.
0: Uses the system time.
Not 0: Uses the specified seed.

Defaults to 0.
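
The effect of a fixed, nonzero seed can be illustrated in base R, as a loose analogy only: `set.seed` and `sample` are base R functions, not part of hanaml.Sampling, and PAL uses its own random number generator. The row count and seed value below are arbitrary.

```r
n_rows <- 25  # arbitrary toy row count

# Fixing the seed before each draw makes the draw repeatable.
set.seed(42)
first_draw <- sample(n_rows, size = 8)

set.seed(42)
second_draw <- sample(n_rows, size = 8)

# Both draws select the same row indices.
print(identical(first_draw, second_draw))
```

By the same idea, passing a nonzero random.state is what makes a random sampling call repeatable, whereas the default of 0 seeds from the system time and so can differ between runs.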

percentage

double, optional
Percentage of the samples. Use this parameter when sampling.size is not set. If both sampling.size and percentage are specified, percentage takes precedence.
Defaults to 0.1.

stratified.columns

vector/list of characters, optional
Specifies the set of columns that are used to do the stratified sampling.
Only required when method is "stratified_with_replacement" or "stratified_without_replacement". If both features and stratified.columns are specified, stratified.columns takes precedence.
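
To show what stratifying on a column means, here is a minimal base R sketch of stratified sampling without replacement. It is not the PAL algorithm: the toy data frame, the `fraction` variable, and the proportional, rounded allocation per stratum are all assumptions made for illustration.

```r
# Toy data: 6 "male" rows and 4 "female" rows (made-up values).
df <- data.frame(
  EMPNO  = 1:10,
  GENDER = rep(c("male", "female"), times = c(6, 4))
)

fraction <- 0.5  # hypothetical share of each stratum to keep
set.seed(1)      # fixed seed so the sketch is repeatable

# Split the rows by the stratification column, then sample inside each group.
strata <- split(df, df$GENDER)
sampled <- do.call(rbind, lapply(strata, function(s) {
  k <- max(1, round(nrow(s) * fraction))  # assumed allocation rule
  s[sample(nrow(s), k), , drop = FALSE]   # draw k distinct rows
}))

print(table(sampled$GENDER))
```

Because each group is sampled separately, the result keeps the 3:2 ratio of the two GENDER values, which a simple random sample of five rows would not guarantee.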

Value

DataFrame
The sampled records, with the same column structure (number of columns, column names, and column types) as the input data.

Details

This function is used to choose a small portion of the records as representatives.

Examples

Input DataFrame data:


 > data$Collect()
    EMPNO GENDER INCOME
 1      1   male 4000.5
 2      2   male 5000.7
 3      3 female 5100.8
 4      4   male 5400.9
 5      5 female 5500.2
 ....
 23    23   male 8576.9
 24    24   male 9560.9
 25    25 female 8794.9

Call Sampling function:


> sampling <- hanaml.Sampling(data, method = "first_n",
                               sampling.size = 8,
                               interval = 5,
                               features = "GENDER")

Output:


> sampling$Collect()
  EMPNO GENDER INCOME
1     1   male 4000.5
2     2   male 5000.7
3     3 female 5100.8
4     4   male 5400.9
5     5 female 5500.2
6     6   male 5540.4
7     7   male 4500.9
8     8 female 6000.8
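
For comparison, the positional methods can be mimicked locally in base R. This is only a sketch of the selection logic, not a call to SAP HANA: `df` is a toy data frame holding just an EMPNO column numbered 1 to 25, and `head` and `tail` are base R functions rather than part of hana.ml.r.

```r
# Toy stand-in for the 25-row input; only EMPNO is modelled here.
df <- data.frame(EMPNO = 1:25)

n <- 8

first_n <- head(df, n)  # analogous to method = "first_n"
last_n  <- tail(df, n)  # analogous to method = "last_n"

print(first_n$EMPNO)
print(last_n$EMPNO)
```

The first eight EMPNO values match the rows returned in the output above.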