Sampling

class hana_ml.algorithms.pal.preprocessing.Sampling(method, interval=None, sampling_size=None, random_state=None, percentage=None)

Selects a small portion of the records as a representative sample.

Parameters:

method : str

Specifies the sampling method.

Valid options include: 'first_n', 'middle_n', 'last_n', 'every_nth', 'simple_random_with_replacement', 'simple_random_without_replacement', 'systematic', 'stratified_with_replacement', 'stratified_without_replacement'.

For the random methods, the system time is used as the seed unless random_state is set to a positive value.

interval : int, optional

The interval between two samples.

Only required when method is 'every_nth'.

If this parameter is not specified, the sampling_size parameter will be used.

sampling_size : int, optional

Number of samples to draw.

Defaults to 1.

random_state : int, optional

Indicates the seed used to initialize the random number generator.

It can be set to 0 or a positive value, where:

0: Uses the system time
Others: Uses the specified seed

Defaults to 0.

percentage : float, optional

Percentage of the samples.

Use this parameter when sampling_size is not set.

If both sampling_size and percentage are specified, percentage takes precedence.

Defaults to 0.1.
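
The interplay between sampling_size, percentage, and random_state described above can be sketched in plain Python. This is an illustration only, not PAL's implementation: the helper names `resolve_sample_count` and `simple_random_sample` are hypothetical, rounding the percentage down to a whole number of rows is an assumption, and falling back to the sampling_size default when neither value is given is also an assumption, since the text does not say how the two defaults interact.

```python
import random
import time


def resolve_sample_count(n_rows, sampling_size=None, percentage=None):
    """Hypothetical helper: decide how many rows to sample."""
    if percentage is not None:
        # Documented rule: percentage takes precedence when both are given.
        # Assumption: a fractional row count is rounded down.
        return int(n_rows * percentage)
    if sampling_size is not None:
        return sampling_size
    # Assumption: fall back to the documented default of sampling_size.
    return 1


def simple_random_sample(rows, count, random_state=0):
    """Hypothetical helper: simple random sampling without replacement."""
    # Documented rule: 0 means the seed comes from the system time.
    seed = time.time_ns() if random_state == 0 else random_state
    rng = random.Random(seed)
    return rng.sample(rows, count)


rows = list(range(100))
count = resolve_sample_count(len(rows), sampling_size=8, percentage=0.25)
sample_a = simple_random_sample(rows, count, random_state=42)
sample_b = simple_random_sample(rows, count, random_state=42)
```

With a fixed positive random_state the two samples are identical, which is the usual reason to set a seed: the run becomes reproducible.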

Examples

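>>> from hana_ml.algorithms.pal.preprocessing import Sampling
>>> # df is assumed to be an existing hana_ml DataFrame
>>> smp = Sampling(method='every_nth', interval=5, sampling_size=8)
>>> res = smp.fit_transform(data=df)
>>> res.collect()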
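
To make the idea behind 'every_nth' with an interval of 5 concrete, here is a plain-Python sketch over an in-memory list. It does not call PAL and is not its implementation; the function name `every_nth` is hypothetical, and starting at the interval-th row (rather than the first row) is an assumption.

```python
def every_nth(rows, interval):
    """Hypothetical helper: keep the interval-th, 2*interval-th, ... rows."""
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    # Assumption: rows are counted from 1, so the first kept row is number `interval`.
    return rows[interval - 1::interval]


rows = list(range(1, 26))  # 25 rows numbered 1..25
picked = every_nth(rows, interval=5)
```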

Attributes:

None

Methods

`fit_transform`(data[, features])	Samples the input dataset under the specified configuration.
`get_model_metrics`()	Get the model metrics.
`get_score_metrics`()	Get the score metrics.

fit_transform(data, features=None)

Samples the input dataset under the specified configuration.

Parameters:

data : DataFrame

Input DataFrame.

features : str or list of str, optional

The column(s) used for stratified sampling.

Only required when method is 'stratified_with_replacement' or 'stratified_without_replacement'.

Defaults to None.

Returns:

DataFrame: The sampled rows, with the same structure as the input DataFrame.
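
The role of the features column in stratified sampling can be illustrated with a small plain-Python sketch: rows are grouped by the value of the stratifying column, and a sample is drawn from each group. This shows the general technique only, not PAL's implementation; the function name `stratified_sample`, its `fraction` and `seed` parameters, and the column names `ID` and `GRADE` are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=1):
    """Hypothetical helper: sample `fraction` of each stratum without replacement."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)  # group rows by the stratifying column

    rng = random.Random(seed)
    sample = []
    for group in strata.values():
        # Assumption: every stratum contributes at least one row.
        k = max(1, int(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# 8 rows with GRADE 'A' and 4 rows with GRADE 'B'
rows = [{"ID": i, "GRADE": "A" if i < 8 else "B"} for i in range(12)]
picked = stratified_sample(rows, key="GRADE", fraction=0.5)
```

Because each group is sampled separately, the proportions of the stratifying column in the result mirror those of the input, which plain random sampling does not guarantee.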

get_model_metrics()

Get the model metrics.

Returns:

DataFrame: The model metrics.

get_score_metrics()

Get the score metrics.

Returns:

DataFrame: The score metrics.

Inherited Methods from PALBase

In addition to the methods described above, the Sampling class inherits methods from the PALBase class. Refer to PAL Base for details.