R: Sampling

hanaml.Sampling {hana.ml.r}

R Documentation

Sampling

Description

hanaml.Sampling is an R wrapper for the PAL sampling algorithm.

Usage

hanaml.Sampling(conn.context, data, method, interval = NULL,
                features = NULL, sampling.size = NULL,
                random.state = NULL, percentage = NULL)

Arguments

`conn.context`	`ConnectionContext` The connection to the SAP HANA system.
`data`	`DataFrame` DataFrame containing the data.
`method`	`character, optional` The sampling method. Valid values: `'first_n'`: the first n rows. `'middle_n'`: the middle n rows. `'last_n'`: the last n rows. `'every_nth'`: every n-th row, as set by `interval`. `'simple_random_with_replacement'`, `'simple_random_without_replacement'`, `'systematic'`, `'stratified_with_replacement'`, `'stratified_without_replacement'`. For the random methods, the system time is used as the seed unless `random.state` is set to a nonzero value.
`interval`	`integer, optional` The interval between two samples. Only required when `method` is `'every_nth'`. If this parameter is not specified, the `sampling.size` parameter is used instead.
`features`	`character or list of character, optional` The column(s) used for stratified sampling. Only required when `method` is `'stratified_with_replacement'` or `'stratified_without_replacement'`.
`sampling.size`	`integer, optional` Number of samples. Defaults to 1.
`random.state`	`integer, optional` The seed used to initialize the random number generator. It can be set to 0 or a positive value. `0`: uses the system time. Nonzero: uses the specified seed. Defaults to 0.
`percentage`	`double, optional` Percentage of the samples. Use this parameter when `sampling.size` is not set. If both `sampling.size` and `percentage` are specified, `percentage` takes precedence. Defaults to 0.1.
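
The position-based methods can be pictured with a small base R sketch on a local data.frame. This is only an illustration, not the PAL implementation: the helper `sample_rows` is hypothetical, and the exact rows chosen for `'middle_n'` (a centered window) and `'every_nth'` (rows whose position is a multiple of the interval) are assumptions. The fallback from `interval` to the sample size follows the `interval` entry above.

```r
# Local illustration only: base R on an in-memory data.frame.
# sample_rows() is a hypothetical helper, not part of hana.ml.r.
sample_rows <- function(df, method, n = 1L, interval = NULL) {
  total <- nrow(df)
  rows <- seq_len(total)
  k <- min(n, total)
  idx <- switch(method,
    first_n = head(rows, k),
    last_n = tail(rows, k),
    middle_n = {
      # Assumption: the window of k rows is centered in the data.
      start <- (total - k) %/% 2 + 1
      start + seq_len(k) - 1
    },
    every_nth = {
      # If no interval is given, fall back to the sample size.
      step <- if (is.null(interval)) n else interval
      # Assumption: keep rows whose position is a multiple of step.
      rows[rows %% step == 0]
    },
    stop("unsupported method: ", method)
  )
  df[idx, , drop = FALSE]
}

# Made-up data: 25 rows identified by EMPNO.
employees <- data.frame(EMPNO = 1:25)

first_eight <- sample_rows(employees, "first_n", n = 8L)
middle_five <- sample_rows(employees, "middle_n", n = 5L)
last_three <- sample_rows(employees, "last_n", n = 3L)
every_fifth <- sample_rows(employees, "every_nth", interval = 5L)
```

Under these assumptions, `first_eight` holds EMPNO 1 to 8 and `every_fifth` holds EMPNO 5, 10, 15, 20 and 25.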

Details

This function is used to choose a small portion of the records as representatives.
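
As a rough picture of what stratified sampling without replacement does, the following base R sketch draws a fixed fraction of rows from each group of a local data.frame. It is a sketch under stated assumptions, not the PAL algorithm: `stratified_sample` is a hypothetical helper, the per-stratum count is rounded down with a minimum of one row, and a `seed` of 0 simply leaves the random number generator unseeded.

```r
# Local illustration only: stratified sampling without replacement in base R.
# stratified_sample() is a hypothetical helper, not part of hana.ml.r.
stratified_sample <- function(df, strata, percentage, seed = 0L) {
  # A nonzero seed makes the draw reproducible; 0 leaves the RNG unseeded.
  if (seed != 0L) set.seed(seed)
  # Row positions grouped by the value of the stratification column.
  groups <- split(seq_len(nrow(df)), df[[strata]])
  picked <- lapply(groups, function(rows) {
    # Assumption: round down, but take at least one row per stratum.
    k <- max(1L, floor(length(rows) * percentage))
    rows[sample.int(length(rows), k)]
  })
  df[sort(unlist(picked, use.names = FALSE)), , drop = FALSE]
}

# Made-up data: 6 male and 4 female records.
staff <- data.frame(
  EMPNO = 1:10,
  GENDER = rep(c("male", "female"), times = c(6, 4)),
  stringsAsFactors = FALSE
)

half_a <- stratified_sample(staff, "GENDER", percentage = 0.5, seed = 42L)
half_b <- stratified_sample(staff, "GENDER", percentage = 0.5, seed = 42L)
```

With half of each group requested, the result keeps the 6:4 proportion of the input (3 male and 2 female rows), and repeating the call with the same nonzero seed returns the same rows.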

Value

DataFrame
The sampled rows, with the same column structure (number of columns, column names, and column types) as the input data.

Examples

## Not run: 
 Input DataFrame data for sampling:
 > data$Collect()
      EMPNO GENDER INCOME
  1      1   male 4000.5
  2      2   male 5000.7
  3      3 female 5100.8
  4      4   male 5400.9
  5      5 female 5500.2
  ....
  23    23   male 8576.9
  24    24   male 9560.9
  25    25 female 8794.9

 Call Sampling function:
 >  sampling <- hanaml.Sampling(conn, data, method = 'first_n',
                                sampling.size = 8, interval = 5,
                                features = "GENDER")
 Expected output:
 > sampling$Collect()
   EMPNO GENDER INCOME
 1     1   male 4000.5
 2     2   male 5000.7
 3     3 female 5100.8
 4     4   male 5400.9
 5     5 female 5500.2
 6     6   male 5540.4
 7     7   male 4500.9
 8     8 female 6000.8

## End(Not run)

[Package hana.ml.r version 1.0.8 Index]