Sampling
- class hana_ml.algorithms.pal.preprocessing.Sampling(method, interval=None, sampling_size=None, random_state=None, percentage=None)
This class is used to choose a small portion of the records as representatives.
- Parameters:
- method : str
Specifies the sampling method.
Valid options include: 'first_n', 'middle_n', 'last_n', 'every_nth', 'simple_random_with_replacement', 'simple_random_without_replacement', 'systematic', 'stratified_with_replacement', 'stratified_without_replacement'.
For the random methods, the system time is used as the seed unless random_state is set to a nonzero value.
- interval : int, optional
The interval between two samples.
Only required when method is 'every_nth'. If this parameter is not specified, the sampling_size parameter will be used.
- sampling_size : int, optional
Number of samples.
Defaults to 1.
- random_state : int, optional
Indicates the seed used to initialize the random number generator.
- It can be set to 0 or a positive value, where:
0: Uses the system time
Others: Uses the specified seed
Defaults to 0.
- percentage : float, optional
Percentage of the samples.
Use this parameter when sampling_size is not set.
If both sampling_size and percentage are specified, percentage takes precedence.
Defaults to 0.1.
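The precedence rule between sampling_size and percentage can be illustrated with a small pure-Python sketch. It is not part of hana_ml: resolve_sample_count is a hypothetical helper, and it assumes that percentage is a fraction between 0 and 1 and that fractional results are rounded down, neither of which is stated above.

```python
from typing import Optional


def resolve_sample_count(total_rows: int,
                         sampling_size: Optional[int] = None,
                         percentage: Optional[float] = None) -> int:
    """Hypothetical helper: number of rows a sampler would return."""
    if percentage is not None:
        # Documented rule: percentage takes precedence over sampling_size.
        if not 0.0 <= percentage <= 1.0:
            raise ValueError("percentage must be between 0 and 1")
        # Assumption: round down; the actual rounding is not documented here.
        return int(total_rows * percentage)
    if sampling_size is not None:
        return sampling_size
    # This sketch does not model the documented defaults.
    raise ValueError("set sampling_size or percentage")


# Both given: percentage wins (25% of 200 rows).
both = resolve_sample_count(200, sampling_size=8, percentage=0.25)
# Only sampling_size given: it is used as is.
size_only = resolve_sample_count(200, sampling_size=8)
```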
Examples
>>> from hana_ml.algorithms.pal.preprocessing import Sampling
>>> smp = Sampling(method='every_nth', interval=5, sampling_size=8)
>>> res = smp.fit_transform(data=df)
>>> res.collect()
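To make the positional methods more concrete, the following is a minimal pure-Python sketch of what 'first_n', 'last_n', and 'every_nth' might select from an ordered list of rows. The function names are hypothetical and this is not the PAL implementation; in particular, it assumes that 'every_nth' starts counting so that the interval-th row is the first one picked.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def first_n(rows: Sequence[T], n: int) -> List[T]:
    """Return the first n rows."""
    return list(rows[:n])


def last_n(rows: Sequence[T], n: int) -> List[T]:
    """Return the last n rows (empty when n is 0)."""
    return list(rows[-n:]) if n > 0 else []


def every_nth(rows: Sequence[T], interval: int) -> List[T]:
    """Return every interval-th row."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    # Assumption: the interval-th row is the first pick.
    return list(rows[interval - 1::interval])


rows = list(range(1, 21))          # stand-in for 20 ordered records
picked = every_nth(rows, 5)        # [5, 10, 15, 20]
head = first_n(rows, 3)            # [1, 2, 3]
tail = last_n(rows, 3)             # [18, 19, 20]
```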
- Attributes:
- None
Methods
- fit_transform(data[, features]): Sample the input dataset under the specified configuration.
- get_model_metrics(): Get the model metrics.
- get_score_metrics(): Get the score metrics.
- fit_transform(data, features=None)
Samples the input dataset under the specified configuration.
- Parameters:
- data : DataFrame
Input DataFrame.
- features : str or list of str, optional
The column(s) used for stratified sampling.
Only required when method is 'stratified_with_replacement' or 'stratified_without_replacement'.
Defaults to None.
- Returns:
- DataFrame
Sampling results, same structure as defined in the input DataFrame.
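The role of the features column in stratified sampling, together with the random_state rule, can be sketched in plain Python. This is an illustration under stated assumptions, not the PAL algorithm: stratified_sample is a hypothetical function, it samples without replacement, and it assumes proportional allocation per stratum, rounded down, with at least one row per stratum.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List


def stratified_sample(rows: List[dict], feature: str, percentage: float,
                      random_state: int = 0) -> List[dict]:
    """Hypothetical sketch: sample each stratum without replacement."""
    # 0 means no fixed seed; Python then seeds from OS entropy or the clock.
    rng = random.Random() if random_state == 0 else random.Random(random_state)

    # Group rows by the value of the stratification column.
    strata: Dict[Hashable, List[dict]] = defaultdict(list)
    for row in rows:
        strata[row[feature]].append(row)

    sample: List[dict] = []
    for group in strata.values():
        # Assumption: proportional allocation, rounded down, minimum one row.
        k = max(1, int(len(group) * percentage))
        sample.extend(rng.sample(group, k))
    return sample


# 80 rows of grade "A" and 20 rows of grade "B".
records = [{"id": i, "grade": "A" if i < 80 else "B"} for i in range(100)]
result = stratified_sample(records, "grade", 0.25, random_state=42)
```

With a nonzero random_state the sketch returns the same rows on every call, which mirrors the documented meaning of a fixed seed.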
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
In addition to the methods above, the Sampling class inherits methods from the PALBase class. Refer to PAL Base for more details.