Sampling — hanaml.Sampling • hana.ml.r

hanaml.Sampling is an R wrapper for SAP HANA PAL sampling.

hanaml.Sampling(
  data,
  method,
  interval = NULL,
  features = NULL,
  sampling.size = NULL,
  random.state = NULL,
  percentage = NULL,
  stratified.columns = NULL
)

Arguments

data	`DataFrame` DataFrame containing the data.
method	`character` The sampling method. `"first_n"`: the first n rows. `"middle_n"`: the middle n rows. `"last_n"`: the last n rows. `"every_nth"`: samples rows at the spacing given by `interval`. `"simple_random_with_replacement"`, `"simple_random_without_replacement"`, `"systematic"`: random methods; the system time is used as the seed unless `random.state` is set to a non-zero value. `"stratified_with_replacement"`, `"stratified_without_replacement"`: stratified sampling over the columns given in `stratified.columns`.
interval	`integer, optional` The interval between two samples. Only valid when `method` is "every_nth". If this parameter is not specified, the `sampling.size` parameter is used instead.
features	`vector/list of character, optional(deprecated)` The columns used for stratified sampling. Only required when `method` is "stratified_with_replacement" or "stratified_without_replacement". Will be replaced by the `stratified.columns` parameter in a future release.
sampling.size	`integer, optional` Number of samples. Not effective when `method` is "every_nth" and `interval` is specified, or when `percentage` is specified. Defaults to 1.
random.state	`integer, optional` `0`: uses the system time as the seed. Non-zero: uses the specified value as the seed. Defaults to 0.
percentage	`double, optional` Percentage of the samples. Use this parameter when `sampling.size` is not set. If both `sampling.size` and `percentage` are specified, `percentage` takes precedence. Defaults to 0.1.
stratified.columns	`vector/list of character, optional` Specifies the set of columns that are used to do the stratified sampling. Only required when `method` is "stratified_with_replacement", or "stratified_without_replacement". If both `features` and `stratified.columns` are specified, `stratified.columns` takes precedence.

Value

DataFrame
The sampled rows, with the same column structure (number of columns, column names, and column types) as the input `data`.

Details

This function is used to choose a small portion of the records as representatives.
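
As a rough local illustration of what some of the `method` values select, the base-R sketch below mimics "first_n", "last_n", simple random sampling without replacement, and stratified sampling without replacement on an in-memory data frame. It does not call SAP HANA PAL. The data frame, the seed, and the choice to draw the same fraction from every stratum (proportional allocation) are assumptions of this sketch, not statements about how PAL implements these methods.

```r
# Local base-R illustration only; this does not call SAP HANA PAL.
# The data frame below is made up for the sketch.
df <- data.frame(
  EMPNO  = 1:10,
  GENDER = rep(c("male", "female"), times = c(6, 4)),
  INCOME = seq(4000, 8500, by = 500)
)

set.seed(1)  # fixed seed so the random draws are repeatable

# "first_n"-style and "last_n"-style selection with n = 3
first_n <- head(df, 3)
last_n  <- tail(df, 3)

# Simple random sampling without replacement: each row appears at most once
srs <- df[sample(nrow(df), size = 4), ]

# Stratified sampling without replacement (assumed proportional allocation):
# draw the same fraction from each GENDER group
fraction <- 0.5
strata <- split(df, df$GENDER)
stratified <- do.call(rbind, lapply(strata, function(g) {
  g[sample(nrow(g), size = round(nrow(g) * fraction)), , drop = FALSE]
}))
```

With 6 male and 4 female rows and a fraction of 0.5, the stratified draw keeps 3 male and 2 female rows, so the group proportions of the input are preserved in the sample.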

Examples

Input DataFrame data:

 > data$Collect()
    EMPNO GENDER INCOME
 1      1   male 4000.5
 2      2   male 5000.7
 3      3 female 5100.8
 4      4   male 5400.9
 5      5 female 5500.2
 ....
 23    23   male 8576.9
 24    24   male 9560.9
 25    25 female 8794.9

Call Sampling function:

> sampling <- hanaml.Sampling(data, method = "first_n",
                               sampling.size = 8,
                               interval = 5,
                               features = "GENDER")

Output:

> sampling$Collect()
  EMPNO GENDER INCOME
1     1   male 4000.5
2     2   male 5000.7
3     3 female 5100.8
4     4   male 5400.9
5     5 female 5500.2
6     6   male 5540.4
7     7   male 4500.9
8     8 female 6000.8
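
Other combinations of the documented arguments:

The sketch below shows how the remaining arguments might be combined, using only the parameters listed under Arguments. It assumes `data` is the same DataFrame as above, backed by a live SAP HANA connection. The wrapper name `sampling_variants` is hypothetical and exists only so the block can be sourced without a connection. `percentage = 0.2` is read here as a fraction, in line with the documented 0.1 default; that reading is an assumption. No output is shown, since results depend on the data and, for the random methods, on the seed.

```r
# Hypothetical helper (not part of hana.ml.r): groups three calls so the
# block can be sourced without a live connection.
# `data` is assumed to be a hana.ml.r DataFrame.
sampling_variants <- function(data) {
  list(
    # Sample at a fixed spacing; sampling.size is not effective
    # when interval is given with "every_nth"
    every_5th = hanaml.Sampling(data,
                                method = "every_nth",
                                interval = 5),
    # Random draw with a non-zero random.state so the seed is fixed
    random_pct = hanaml.Sampling(data,
                                 method = "simple_random_without_replacement",
                                 percentage = 0.2,
                                 random.state = 42),
    # Stratified by GENDER, using stratified.columns rather than the
    # deprecated features argument
    stratified = hanaml.Sampling(data,
                                 method = "stratified_without_replacement",
                                 sampling.size = 8,
                                 stratified.columns = "GENDER",
                                 random.state = 42)
  )
}
```

Each element of the returned list would be a DataFrame that can be inspected with `$Collect()`, as in the example above.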