Sampling

class hana_ml.algorithms.pal.preprocessing.Sampling(method, interval=None, sampling_size=None, random_state=None, percentage=None)

This class selects a small subset of the input records to serve as a representative sample.

Parameters
method : str

Specifies the sampling method.

Valid options include: 'first_n', 'middle_n', 'last_n', 'every_nth', 'simple_random_with_replacement', 'simple_random_without_replacement', 'systematic', 'stratified_with_replacement', 'stratified_without_replacement'.

For the random methods, the system time is used as the seed unless random_state is set to a positive value.

interval : int, optional

The interval between two samples.

Only required when method is 'every_nth'.

If this parameter is not specified, the sampling_size parameter will be used.

sampling_size : int, optional

Number of samples to draw.

Defaults to 1.

random_state : int, optional

Indicates the seed used to initialize the random number generator.

It can be set to 0 or a positive value, where:
  • 0: Uses the system time as the seed
  • Others: Uses the specified value as the seed

Defaults to 0.

percentage : float, optional

Percentage of records to sample.

Use this parameter when sampling_size is not set.

If both sampling_size and percentage are specified, percentage takes precedence.

Defaults to 0.1.
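
The interaction between sampling_size, percentage and random_state described above can be sketched in plain Python. This is only an illustration using the standard library, not PAL's implementation: the helper names resolve_sample_count and simple_random_sample are hypothetical, percentage is assumed to be a fraction of the row count (0.1 meaning 10%), the rounding rule is an assumption, and falling back to the documented sampling_size default of 1 when neither value is given is also an assumption.

```python
import random


def resolve_sample_count(n_rows, sampling_size=None, percentage=None):
    """Hypothetical helper: decide how many rows to draw."""
    if percentage is not None:
        # Documented rule: percentage takes precedence over sampling_size.
        # Rounding to the nearest integer is an assumption of this sketch.
        return round(n_rows * percentage)
    if sampling_size is not None:
        return sampling_size
    return 1  # documented default for sampling_size


def simple_random_sample(rows, sampling_size=None, percentage=None, random_state=0):
    """Hypothetical stand-in for simple random sampling without replacement."""
    # random_state == 0 means "no fixed seed"; any other value fixes the seed.
    rng = random.Random(random_state) if random_state else random.Random()
    k = min(resolve_sample_count(len(rows), sampling_size, percentage), len(rows))
    return rng.sample(rows, k)


rows = list(range(1, 11))
a = simple_random_sample(rows, sampling_size=3, random_state=42)
b = simple_random_sample(rows, sampling_size=3, random_state=42)
print(a == b)  # the same non-zero seed yields the same sample
```

With a fixed non-zero seed the two draws are identical, which is the practical reason to set random_state when results need to be reproducible.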

Examples

Original data:

>>> df.collect().head(10)
    EMPNO  GENDER  INCOME
0       1    male  4000.5
1       2    male  5000.7
2       3  female  5100.8
3       4    male  5400.9
4       5  female  5500.2
5       6    male  5540.4
6       7    male  4500.9
7       8  female  6000.8
8       9    male  7120.8
9      10  female  8120.9

Apply the sampling function:

>>> from hana_ml.algorithms.pal.preprocessing import Sampling
>>> smp = Sampling(method='every_nth', interval=5, sampling_size=8)
>>> res = smp.fit_transform(data=df)
>>> res.collect()
   EMPNO  GENDER  INCOME
0      5  female  5500.2
1     10  female  8120.9
2     15    male  9876.5
3     20  female  8705.7
4     25  female  8794.9
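
The row selection seen in this output (EMPNO 5, 10, 15, 20, 25) is consistent with picking every fifth record. A minimal plain-Python sketch of that idea follows; the function every_nth is a hypothetical illustration, not part of hana_ml, and the 25-element list is an assumed stand-in for the EMPNO column.

```python
def every_nth(rows, interval):
    """Return every `interval`-th record, counting records from 1."""
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    # Slices are 0-based, so the interval-th record sits at index interval - 1.
    return rows[interval - 1::interval]


empno = list(range(1, 26))  # assumed stand-in for 25 EMPNO values
print(every_nth(empno, 5))  # [5, 10, 15, 20, 25]
```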

Attributes

None

Methods

fit_transform(data[, features])

Samples the input dataset under the specified configuration.

fit_transform(data, features=None)

Samples the input dataset under the specified configuration.

Parameters
data : DataFrame

Input DataFrame.

features : str or list of str, optional

The column(s) used for stratified sampling.

Only required when method is 'stratified_with_replacement' or 'stratified_without_replacement'.

Defaults to None.

Returns
DataFrame

Sampling results, same structure as defined in the Input DataFrame.
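
To illustrate what stratifying on a column such as GENDER means, here is a small plain-Python sketch of stratified sampling without replacement. It is an assumption-laden illustration, not PAL's algorithm: the function stratified_sample, its parameters key, fraction and seed, and the per-stratum rounding rule are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=None):
    """Draw the same fraction of rows, without replacement, from each stratum."""
    rng = random.Random(seed)
    # Group rows by the value of the stratification column.
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for group in strata.values():
        # Per-stratum sample size; rounding is an assumption of this sketch.
        k = round(len(group) * fraction)
        sample.extend(rng.sample(group, k))
    return sample


# 20 synthetic rows: odd EMPNO values are 'male', even ones 'female'.
rows = [{"EMPNO": i, "GENDER": "male" if i % 2 else "female"} for i in range(1, 21)]
picked = stratified_sample(rows, key="GENDER", fraction=0.5, seed=1)
print(len(picked))  # 5 rows from each of the two strata
```

Sampling each stratum separately keeps the proportion of each GENDER value in the sample close to its proportion in the input, which a simple random draw does not guarantee.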

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

In addition to the methods described above, the Sampling class inherits methods from the PALBase class. Refer to PAL Base for more details.