Sampling

class hana_ml.algorithms.pal.preprocessing.Sampling(method, interval=None, sampling_size=None, random_state=None, percentage=None)

Selects a small portion of the records as a representative sample.

Parameters:

method : str

Specifies the sampling method.

Valid options include: 'first_n', 'middle_n', 'last_n', 'every_nth', 'simple_random_with_replacement', 'simple_random_without_replacement', 'systematic', 'stratified_with_replacement', 'stratified_without_replacement'.

For the random methods, the seed is taken from the system time unless random_state is set to a positive value.

interval : int, optional

The interval between two samples.

Only required when method is 'every_nth'.

If interval is not specified, the sampling_size parameter is used instead.

sampling_size : int, optional

Number of samples to draw.

Defaults to 1.

random_state : int, optional

Indicates the seed used to initialize the random number generator.

It can be set to 0 or a positive value, where:

0: Uses the system time
Others: Uses the specified seed

Defaults to 0.

percentage : float, optional

Percentage of the records to draw as samples.

Use this parameter when sampling_size is not set.

If both sampling_size and percentage are specified, percentage takes precedence.

Defaults to 0.1.
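The interplay of random_state, sampling_size, and percentage described above can be illustrated with a small pure-Python analogue. This is only a sketch under stated assumptions, not the PAL implementation: the helper names resolve_count and simple_random_sample are hypothetical, percentage is assumed to be a fraction of the row count, truncation is assumed for rounding, and Python's random module stands in for PAL's generator.

```python
import random
import time


def resolve_count(n_rows, sampling_size=None, percentage=None):
    """Hypothetical helper: number of rows to draw from n_rows."""
    if percentage is not None:
        # Documented rule: percentage takes precedence when both are given.
        # Assumption: percentage is a fraction of the row count, truncated.
        return int(n_rows * percentage)
    if sampling_size is not None:
        return sampling_size
    # Assumption: fall back to the documented sampling_size default.
    return 1


def simple_random_sample(rows, count, random_state=0):
    """Hypothetical analogue of 'simple_random_without_replacement'."""
    # Documented rule: 0 means "seed from the system time";
    # any other value is used as the seed itself.
    seed = time.time_ns() if random_state == 0 else random_state
    rng = random.Random(seed)
    return rng.sample(rows, count)


rows = list(range(1, 11))
count = resolve_count(len(rows), sampling_size=8, percentage=0.5)
print(count)  # 5
# A fixed, non-zero seed makes the draw repeatable.
print(simple_random_sample(rows, count, random_state=42)
      == simple_random_sample(rows, count, random_state=42))  # True
```

With random_state left at 0, two calls would generally return different samples, which is why a fixed positive seed is the usual choice for reproducible experiments.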

Examples

Original data:

>>> df.collect().head(10)
    EMPNO  GENDER  INCOME
     1    male  4000.5
     2    male  5000.7
     3  female  5100.8
     4    male  5400.9
     5  female  5500.2
     6    male  5540.4
     7    male  4500.9
     8  female  6000.8
     9    male  7120.8
    10  female  8120.9

Apply the sampling function:

>>> smp = Sampling(method='every_nth', interval=5, sampling_size=8)
>>> res = smp.fit_transform(data=df)
>>> res.collect()
   EMPNO  GENDER  INCOME
0      5  female  5500.2
1     10  female  8120.9
2     15    male  9876.5
3     20  female  8705.7
4     25  female  8794.9
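The EMPNO values in the output above (5, 10, 15, 20, 25) are consistent with 'every_nth' keeping each row whose 1-based position is a multiple of interval. A minimal pure-Python sketch of that reading follows. The function name every_nth is hypothetical, the 25-row input is an assumption (only the first 10 rows of the table are shown), and this is not the PAL implementation.

```python
def every_nth(rows, interval):
    """Keep rows whose 1-based position is a multiple of interval."""
    # Slicing from index interval - 1 with step interval selects
    # positions interval, 2 * interval, 3 * interval, ...
    return rows[interval - 1::interval]


# Assumed input: EMPNO values 1..25 (the full table is not shown above).
empno = list(range(1, 26))
print(every_nth(empno, 5))  # [5, 10, 15, 20, 25]
```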

Attributes:

None

Methods

fit_transform(data[, features])

Samples the input dataset under the specified configuration.

fit_transform(data, features=None)

Samples the input dataset under the specified configuration.

Parameters:

data : DataFrame

Input DataFrame.

features : str or list of str, optional

The column(s) used to define the strata for stratified sampling.

Only required when method is 'stratified_with_replacement' or 'stratified_without_replacement'.

Defaults to None.

Returns:

DataFrame: Sampling results, with the same structure as the input DataFrame.
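To illustrate what stratifying on a feature column means, the following pure-Python sketch groups the ten rows shown in the example by GENDER and draws the same fraction from each group without replacement. It is not the PAL implementation: the function name stratified_sample, proportional allocation per stratum, and truncating rounding are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed):
    """Draw `fraction` of each stratum without replacement (assumed scheme)."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    rng = random.Random(seed)
    sample = []
    for group in strata.values():
        # Assumption: per-stratum count is the truncated proportional share.
        sample.extend(rng.sample(group, int(len(group) * fraction)))
    return sample


# GENDER values of EMPNO 1..10 from the example table: 6 male, 4 female.
genders = ["male", "male", "female", "male", "female",
           "male", "male", "female", "male", "female"]
rows = [{"EMPNO": i + 1, "GENDER": g} for i, g in enumerate(genders)]
picked = stratified_sample(rows, "GENDER", 0.5, seed=1)
print(sum(r["GENDER"] == "male" for r in picked))    # 3
print(sum(r["GENDER"] == "female" for r in picked))  # 2
```

Because each stratum is sampled separately, the 6:4 ratio of the input is preserved in the sample, which a simple random draw of five rows would not guarantee.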

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods mentioned above, the Sampling class also inherits methods from the PALBase class. Please refer to PAL Base for more details.