Sampling
- class hana_ml.algorithms.pal.preprocessing.Sampling(method, interval=None, sampling_size=None, random_state=None, percentage=None)
This class is used to choose a small portion of the records as representatives.
- Parameters:
- method : str
Specifies the sampling method.
Valid options include: 'first_n', 'middle_n', 'last_n', 'every_nth', 'simple_random_with_replacement', 'simple_random_without_replacement', 'systematic', 'stratified_with_replacement', 'stratified_without_replacement'.
For the random methods, the system time is used as the seed unless random_state is set to a positive value.
- interval : int, optional
The interval between two samples.
Only required when method is 'every_nth'. If this parameter is not specified, the sampling_size parameter will be used.
- sampling_size : int, optional
Number of samples.
Defaults to 1.
- random_state : int, optional
Indicates the seed used to initialize the random number generator.
- It can be set to 0 or a positive value, where:
0: Uses the system time.
Others: Uses the specified seed.
Defaults to 0.
- percentage : float, optional
Percentage of the samples.
Use this parameter when sampling_size is not set.
If both sampling_size and percentage are specified, percentage takes precedence.
Defaults to 0.1.
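The precedence rule between sampling_size and percentage can be sketched in plain Python. This is only an illustration of the documented rule, not the PAL implementation: the helper name resolve_sample_count is hypothetical, percentage is assumed to be a fraction between 0 and 1 (as the default of 0.1 suggests), and the rounding rule is an assumption.

```python
def resolve_sample_count(total_rows, sampling_size=None, percentage=None):
    """Return how many rows to draw (hypothetical helper, illustration only)."""
    if percentage is not None:
        # Documented rule: percentage takes precedence when both are given.
        if not 0.0 <= percentage <= 1.0:
            raise ValueError("percentage is assumed to be a fraction in [0, 1]")
        # Assumption: the row count is rounded to the nearest integer.
        return round(total_rows * percentage)
    if sampling_size is not None:
        return sampling_size
    # The behaviour when neither is set is not covered by this sketch.
    raise ValueError("set sampling_size or percentage")


both = resolve_sample_count(25, sampling_size=8, percentage=0.2)
size_only = resolve_sample_count(25, sampling_size=8)
print(both, size_only)  # 5 8
```

With both arguments given, the sketch ignores sampling_size=8 and draws 20% of the 25 rows.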
Examples
Original data:
>>> df.collect().head(10)
   EMPNO  GENDER  INCOME
0      1    male  4000.5
1      2    male  5000.7
2      3  female  5100.8
3      4    male  5400.9
4      5  female  5500.2
5      6    male  5540.4
6      7    male  4500.9
7      8  female  6000.8
8      9    male  7120.8
9     10  female  8120.9
Apply the sampling function:
>>> smp = Sampling(method='every_nth', interval=5, sampling_size=8)
>>> res = smp.fit_transform(data=df)
>>> res.collect()
   EMPNO  GENDER  INCOME
0      5  female  5500.2
1     10  female  8120.9
2     15    male  9876.5
3     20  female  8705.7
4     25  female  8794.9
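The row selection in the 'every_nth' example can be reproduced with a short plain-Python sketch. It is an illustration of the selection pattern only, not the PAL procedure: the function every_nth is hypothetical, and the input is assumed to hold exactly 25 rows with EMPNO running from 1 to 25, which the result above suggests but the source does not state.

```python
def every_nth(rows, interval):
    """Keep the rows at 1-based positions interval, 2*interval, ... (illustration only)."""
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    # Start at index interval - 1 and step by interval.
    return rows[interval - 1::interval]


# Assumption: 25 rows, identified here only by their EMPNO values.
empno = list(range(1, 26))
picked = every_nth(empno, 5)
print(picked)  # [5, 10, 15, 20, 25]
```

The selected EMPNO values match those in the result above.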
- Attributes:
- None
Methods
fit_transform(data[, features])
Sample the input dataset under the specified configuration.
- fit_transform(data, features=None)
Samples the input dataset under the specified configuration.
- Parameters:
- data : DataFrame
Input DataFrame.
- features : str or list of str, optional
The column(s) used for stratified sampling.
Only required when method is 'stratified_with_replacement' or 'stratified_without_replacement'.
Defaults to None.
- Returns:
- DataFrame
Sampling results, same structure as defined in the Input DataFrame.
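To make the role of features and random_state more concrete, here is a minimal standard-library sketch of stratified sampling without replacement. It is not the PAL algorithm: the function stratified_sample, the per-stratum rounding, and the use of Python's random module are all assumptions made for illustration. The rows are the ten records shown in the example above, with GENDER as the stratification column.

```python
import random
import time
from collections import Counter, defaultdict


def stratified_sample(rows, key, fraction, random_state=0):
    """Draw the same fraction from every stratum, without replacement (illustration only)."""
    # Mirrors the documented seed rule: 0 means system time, otherwise the given seed.
    seed = time.time_ns() if random_state == 0 else random_state
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for group in strata.values():
        k = round(len(group) * fraction)  # assumption: round to the nearest integer
        sample.extend(rng.sample(group, k))
    return sample


genders = ["male", "male", "female", "male", "female",
           "male", "male", "female", "male", "female"]
rows = [{"EMPNO": i + 1, "GENDER": g} for i, g in enumerate(genders)]

sample = stratified_sample(rows, key="GENDER", fraction=0.5, random_state=42)
counts = Counter(row["GENDER"] for row in sample)
# Half of each stratum: 3 of the 6 male rows and 2 of the 4 female rows.
```

With the class itself, the stratification column would be passed as features to fit_transform, for example features='GENDER' together with method='stratified_without_replacement'.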
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
Inherited Methods from PALBase
Besides the methods mentioned above, the Sampling class also inherits methods from the PALBase class. Refer to PAL Base for more details.