class hana_ml.algorithms.pal.preprocessing.Sampling(method, interval=None, sampling_size=None, random_state=None, percentage=None)

This class is used to choose a small portion of the records as representatives.


method : str

Specifies the sampling method.

Valid options include: 'first_n', 'middle_n', 'last_n', 'every_nth', 'simple_random_with_replacement', 'simple_random_without_replacement', 'systematic', 'stratified_with_replacement', 'stratified_without_replacement'.

For the random methods, the system time is used as the seed unless random_state is set to a positive value.

interval : int, optional

The interval between two samples.

Only required when method is 'every_nth'.

If this parameter is not specified, the sampling_size parameter will be used.

sampling_size : int, optional

Number of samples.

Defaults to 1.
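The position-based methods can be pictured with a small pure-Python sketch. It is not PAL code and does not need a HANA connection. The helper name `sample_by_position` is hypothetical, and the choice that 'every_nth' picks the interval-th, 2*interval-th, ... rows is an assumption made for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_by_position(rows: Sequence[T], method: str, sampling_size: int = 1,
                       interval: int = 1) -> List[T]:
    """Illustrative stand-in for three position-based methods (not PAL code)."""
    if method == "first_n":
        return list(rows[:sampling_size])
    if method == "last_n":
        # Guard against sampling_size == 0, since rows[-0:] would return every row.
        return list(rows[-sampling_size:]) if sampling_size > 0 else []
    if method == "every_nth":
        # Assumption: the interval-th, 2*interval-th, ... rows are selected.
        return list(rows[interval - 1::interval])
    raise ValueError(f"unsupported method in this sketch: {method}")


rows = list(range(1, 21))  # 20 dummy records numbered 1..20
first_three = sample_by_position(rows, "first_n", sampling_size=3)
last_three = sample_by_position(rows, "last_n", sampling_size=3)
every_fifth = sample_by_position(rows, "every_nth", interval=5)
```

With 20 records, the three calls select records 1 to 3, records 18 to 20, and records 5, 10, 15 and 20 respectively.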

random_state : int, optional

Indicates the seed used to initialize the random number generator.

It can be set to 0 or a positive value, where:
  • 0: Uses the system time

  • Others: Uses the specified seed

Defaults to 0.
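The seeding rule can be mimicked in plain Python to show why a fixed, positive random_state makes a random sample repeatable. This is only a sketch built on Python's `random` module, not the PAL random number generator; `make_rng` and `simple_random_without_replacement` are hypothetical helper names.

```python
import random
import time
from typing import List, Sequence


def make_rng(random_state: int = 0) -> random.Random:
    """Return a generator seeded the way the random_state description reads."""
    if random_state < 0:
        raise ValueError("random_state must be 0 or a positive value")
    # 0 means "use the system time"; any other value is used as the seed itself.
    seed = time.time_ns() if random_state == 0 else random_state
    return random.Random(seed)


def simple_random_without_replacement(rows: Sequence[int], sampling_size: int,
                                      random_state: int = 0) -> List[int]:
    """Draw sampling_size distinct rows using the seeded generator."""
    return make_rng(random_state).sample(list(rows), sampling_size)


rows = list(range(100))
run_a = simple_random_without_replacement(rows, 5, random_state=42)
run_b = simple_random_without_replacement(rows, 5, random_state=42)
```

Both runs use the same seed, so they return the same five rows; with random_state left at 0 the seed changes between runs.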

percentage : float, optional

Percentage of the samples.

Use this parameter when sampling_size is not set.

If both sampling_size and percentage are specified, percentage takes precedence.

Defaults to 0.1.
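The precedence between sampling_size and percentage can be sketched as a small resolver. The function `resolve_sample_count` is hypothetical. Treating percentage as a fraction between 0 and 1, rounding to the nearest integer, and falling back to a single sample when neither argument is given are all assumptions, not documented behavior.

```python
from typing import Optional


def resolve_sample_count(total_rows: int, sampling_size: Optional[int] = None,
                         percentage: Optional[float] = None) -> int:
    """Illustrative resolution of how many rows to draw (not PAL code)."""
    if percentage is not None:
        # percentage takes precedence whenever it is given,
        # even if sampling_size is also set.
        # Assumption: percentage is a fraction in [0, 1], rounded to an integer.
        return round(total_rows * percentage)
    if sampling_size is not None:
        return sampling_size
    # Assumption: with neither argument set, use the sampling_size default of 1.
    return 1


both_given = resolve_sample_count(200, sampling_size=8, percentage=0.25)
size_only = resolve_sample_count(200, sampling_size=8)
percentage_only = resolve_sample_count(200, percentage=0.1)
```

For a 200-row input, passing both arguments follows percentage (50 rows), while passing only sampling_size=8 yields 8 rows.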


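>>> from hana_ml.algorithms.pal.preprocessing import Sampling
>>> # df is assumed to be an existing hana_ml DataFrame
>>> smp = Sampling(method='every_nth', interval=5, sampling_size=8)
>>> res = smp.fit_transform(data=df)
>>> res.collect()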


fit_transform(data, features=None)

Samples the input dataset under the specified configuration.


data : DataFrame

Input DataFrame.

features : str or list of str, optional

The column(s) used for stratified sampling.

Only required when method is 'stratified_with_replacement' or 'stratified_without_replacement'.

Defaults to None.
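The role of the features column in stratified sampling can be illustrated without a HANA connection. The sketch below groups rows by one column and samples each group without replacement. How PAL allocates samples across strata is not stated here, so the proportional allocation, the minimum of one row per stratum, and the helper name `stratified_sample` are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List


def stratified_sample(rows: List[dict], feature: str, percentage: float,
                      seed: int = 1) -> List[dict]:
    """Sample each stratum of `feature` without replacement (not PAL code)."""
    strata: Dict[Hashable, List[dict]] = defaultdict(list)
    for row in rows:
        strata[row[feature]].append(row)  # group rows by the stratification column
    rng = random.Random(seed)
    picked: List[dict] = []
    for members in strata.values():
        # Assumption: each stratum contributes the same fraction, at least one row.
        k = max(1, round(len(members) * percentage))
        picked.extend(rng.sample(members, k))
    return picked


# 20 rows with grade "A" and 10 rows with grade "B"
rows = [{"id": i, "grade": "A" if i < 20 else "B"} for i in range(30)]
sample = stratified_sample(rows, feature="grade", percentage=0.5)
counts = {g: sum(1 for r in sample if r["grade"] == g) for g in ("A", "B")}
```

Sampling half of each stratum keeps the 2:1 ratio between the two grades, which a simple random sample of the whole table would not guarantee.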


Returns a DataFrame containing the sampling results, with the same structure as the input DataFrame.

Inherited Methods from PALBase

Besides the methods mentioned above, the Sampling class also inherits methods from the PALBase class. Please refer to PAL Base for more details.