hanaml.Sampling {hana.ml.r}    R Documentation

Sampling

Description

hanaml.Sampling is an R wrapper for the PAL sampling algorithm.

Usage

hanaml.Sampling (conn.context, data, method, interval = NULL,
                        features = NULL, sampling.size = NULL,
                        random.state = NULL,  percentage = NULL)

Arguments

conn.context

ConnectionContext
The connection to the SAP HANA system.

data

DataFrame
DataFrame containing the data.

method

character
The sampling method. Valid options:

  • 'first_n': The first n records.

  • 'middle_n': The middle n records.

  • 'last_n': The last n records.

  • 'every_nth': Every nth record, where n is set by interval.

  • 'simple_random_with_replacement'

  • 'simple_random_without_replacement'

  • 'systematic'

  • 'stratified_with_replacement'

  • 'stratified_without_replacement'

For the random methods, the system time is used as the seed unless random.state is set to a nonzero value.
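The positional methods can be pictured with plain base R. The following is only an illustrative sketch on a local data.frame, not the PAL implementation: the table emp is a hypothetical stand-in, and the centred window used for 'middle_n' is an assumption about how the middle records are chosen.

```r
# Illustrative base R only; hanaml.Sampling runs inside SAP HANA instead.
emp <- data.frame(EMPNO = 1:25)   # hypothetical stand-in for the input table
n <- 8                            # plays the role of sampling.size

first_n <- head(emp, n)           # rows 1 to 8
last_n  <- tail(emp, n)           # rows 18 to 25

# Assumption: "middle n" is a window of n rows centred on the table.
start    <- floor((nrow(emp) - n) / 2) + 1
middle_n <- emp[start:(start + n - 1), , drop = FALSE]
```

With 25 rows and n = 8, this window covers rows 9 to 16; PAL may place the window differently.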

interval

integer, optional
The interval between two samples.
Only required when method is 'every_nth'.
If this parameter is not specified, the sampling.size parameter will be used.
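As a rough local illustration of interval-based selection, the base R sketch below picks every fifth row of a hypothetical table. It assumes counting starts at the first row; the row at which PAL actually starts is not stated here and may differ.

```r
# Illustrative base R only; not the PAL implementation.
emp <- data.frame(EMPNO = 1:25)   # hypothetical stand-in for the input table
interval <- 5

# Assumption: selection starts at row 1 and then steps by interval.
idx       <- seq(from = 1, to = nrow(emp), by = interval)
every_nth <- emp[idx, , drop = FALSE]

every_nth$EMPNO   # 1 6 11 16 21
```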

features

character or list of character, optional
The column or columns used to define the strata for stratified sampling. Only required when method is 'stratified_with_replacement' or 'stratified_without_replacement'.
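The idea of stratifying on a column can be sketched locally in base R: split the rows by the stratum column, sample within each group, and recombine. This is a minimal sketch under assumptions, not the PAL algorithm. The table emp, the per-stratum fraction frac, and the rounding rule are all hypothetical choices made for illustration.

```r
# Illustrative base R only; not the PAL implementation.
set.seed(42)
emp <- data.frame(
  EMPNO  = 1:10,
  GENDER = rep(c("male", "female"), times = c(6, 4))
)
frac <- 0.5   # hypothetical fraction drawn from each stratum

# Split by the stratum column, sample without replacement in each group.
strata  <- split(emp, emp$GENDER)
sampled <- do.call(rbind, lapply(strata, function(s) {
  k <- max(1, round(nrow(s) * frac))
  s[sample(nrow(s), k), , drop = FALSE]
}))

table(sampled$GENDER)   # 2 female, 3 male
```

Sampling within each group keeps the proportions of the stratum column in the sample close to those in the full table.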

sampling.size

integer, optional
Number of records to sample.
Defaults to 1.

random.state

integer, optional

Indicates the seed used to initialize the random number generator. It can be set to 0 or a positive value.

  • 0: Uses the system time

  • Not 0: Uses the specified seed

Defaults to 0.
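The effect of a fixed, nonzero seed can be shown with base R's own random number generator. This sketch is only an analogy: the helper draw is hypothetical, and R's set.seed(0) is itself a fixed seed, unlike the convention above where 0 means the system time is used.

```r
# Illustrative base R only; draw() is a hypothetical helper.
draw <- function(n, size, seed, replace = FALSE) {
  set.seed(seed)                          # fix the generator state
  sample.int(n, size, replace = replace)  # row indices to keep
}

a <- draw(25, 8, seed = 123)
b <- draw(25, 8, seed = 123)
identical(a, b)   # TRUE: the same seed gives the same sample

# With replacement, rows can repeat and size may exceed n.
with_repl <- draw(25, 40, seed = 123, replace = TRUE)
```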

percentage

double, optional
Percentage of records to sample. Use this parameter when sampling.size is not set. If both sampling.size and percentage are specified, percentage takes precedence.
Defaults to 0.1.
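The precedence rule between sampling.size and percentage can be written out as a small helper. This is a hedged sketch: resolve_size is a hypothetical function, it assumes percentage is expressed as a fraction between 0 and 1, and rounding up with ceiling is an arbitrary choice that PAL may not share.

```r
# Hypothetical helper; not part of hana.ml.r.
resolve_size <- function(n.rows, sampling.size = NULL, percentage = NULL) {
  if (!is.null(percentage)) {
    # percentage wins whenever it is given (assumed to be a fraction)
    return(ceiling(n.rows * percentage))
  }
  if (!is.null(sampling.size)) {
    return(sampling.size)
  }
  stop("supply sampling.size or percentage")
}

resolve_size(25, sampling.size = 8)                    # 8
resolve_size(25, sampling.size = 8, percentage = 0.1)  # 3
```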

Details

This function selects a small portion of the records to serve as a representative sample of the data.

Value

DataFrame
DataFrame containing the sampled records.

Examples

## Not run: 
 Input DataFrame data for sampling:
 > data$Collect()
      EMPNO GENDER INCOME
  1      1   male 4000.5
  2      2   male 5000.7
  3      3 female 5100.8
  4      4   male 5400.9
  5      5 female 5500.2
  ....
  23    23   male 8576.9
  24    24   male 9560.9
  25    25 female 8794.9

 Call Sampling function:
 >  sampling <- hanaml.Sampling(conn, data, method = 'first_n',
                                sampling.size = 8)
 Expected output:
 > sampling$Collect()
   EMPNO GENDER INCOME
 1     1   male 4000.5
 2     2   male 5000.7
 3     3 female 5100.8
 4     4   male 5400.9
 5     5 female 5500.2
 6     6   male 5540.4
 7     7   male 4500.9
 8     8 female 6000.8

## End(Not run)

[Package hana.ml.r version 1.0.8 Index]