Sampling
- class hana_ml.algorithms.pal.preprocessing.Sampling(method, interval=None, sampling_size=None, random_state=None, percentage=None)
This class is used to choose a small portion of the records as representatives.
- Parameters
- method : str
Specifies the sampling method.
Valid options include: 'first_n', 'middle_n', 'last_n', 'every_nth', 'simple_random_with_replacement', 'simple_random_without_replacement', 'systematic', 'stratified_with_replacement', 'stratified_without_replacement'.
For the random methods, the system time is used as the seed unless random_state is set to a positive value.
- interval : int, optional
The interval between two samples.
Only required when method is 'every_nth'. If this parameter is not specified, the sampling_size parameter will be used.
- sampling_size : int, optional
Number of samples.
Defaults to 1.
- random_state : int, optional
Indicates the seed used to initialize the random number generator.
- It can be set to 0 or a positive value, where:
0: Uses the system time
Others: Uses the specified seed
Defaults to 0.
- percentage : float, optional
Percentage of the samples.
Use this parameter when sampling_size is not set.
If both sampling_size and percentage are specified, percentage takes precedence.
Defaults to 0.1.
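The precedence rule between sampling_size and percentage can be sketched in plain Python. The helper resolve_sample_count below is hypothetical and not part of hana_ml; truncating fractional counts and falling back to a single row when neither argument is given are assumptions of this sketch, not documented PAL behavior.

```python
def resolve_sample_count(n_rows, sampling_size=None, percentage=None):
    """Hypothetical helper: how many rows to draw from n_rows records."""
    if percentage is not None:
        # Documented rule: percentage takes precedence over sampling_size.
        # Truncating the fractional part is an assumption of this sketch.
        return int(n_rows * percentage)
    if sampling_size is not None:
        return sampling_size
    # Assumption: fall back to the documented sampling_size default of 1.
    return 1


print(resolve_sample_count(100, sampling_size=8))                  # 8
print(resolve_sample_count(100, sampling_size=8, percentage=0.5))  # 50
```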
Examples
Original data:
>>> df.collect().head(10)
   EMPNO  GENDER  INCOME
0      1    male  4000.5
1      2    male  5000.7
2      3  female  5100.8
3      4    male  5400.9
4      5  female  5500.2
5      6    male  5540.4
6      7    male  4500.9
7      8  female  6000.8
8      9    male  7120.8
9     10  female  8120.9
Apply the sampling function:
>>> smp = Sampling(method='every_nth', interval=5, sampling_size=8)
>>> res = smp.fit_transform(data=df)
>>> res.collect()
   EMPNO  GENDER  INCOME
0      5  female  5500.2
1     10  female  8120.9
2     15    male  9876.5
3     20  female  8705.7
4     25  female  8794.9
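The row selection in this example can be reproduced without a database connection. The sketch below is a plain-Python illustration, not the PAL implementation: every_nth is a hypothetical helper, and the table is assumed to hold 25 employees numbered 1 to 25, which matches the EMPNO values in the result.

```python
def every_nth(rows, interval):
    """Hypothetical helper: keep the rows at positions interval, 2*interval, ..."""
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    # Positions are counted from 1, hence the offset of interval - 1.
    return rows[interval - 1::interval]


# Assumption: the full table holds 25 employees numbered 1 to 25.
empno = list(range(1, 26))
print(every_nth(empno, 5))  # [5, 10, 15, 20, 25]
```

The sketch ignores sampling_size because an interval is given, which is consistent with the five rows shown in the result.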
- Attributes
- None
Methods
fit_transform(data[, features])    Samples the input dataset under the specified configuration.
- fit_transform(data, features=None)
Samples the input dataset under the specified configuration.
- Parameters
- data : DataFrame
Input DataFrame.
- features : str or list of str, optional
The column(s) used for stratified sampling.
Only required when method is 'stratified_with_replacement' or 'stratified_without_replacement'.
Defaults to None.
- Returns
- DataFrame
Sampling results, with the same structure as the input DataFrame.
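To illustrate what stratifying on a feature column means, here is a minimal plain-Python sketch of stratified sampling without replacement. It is not the PAL algorithm: stratified_sample is a hypothetical helper, and proportional allocation per stratum, truncation of fractional counts, and the minimum of one row per stratum are all assumptions. The seed handling mirrors the documented random_state rule, where 0 means no fixed seed.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, percentage, random_state=0):
    """Hypothetical helper: sample each stratum without replacement."""
    # 0 leaves the generator unseeded (Python then seeds from the OS or clock);
    # any other value is used as a fixed seed.
    rng = random.Random(None if random_state == 0 else random_state)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for members in strata.values():
        # Assumption: at least one row per stratum, fractional counts truncated.
        k = max(1, int(len(members) * percentage))
        sample.extend(rng.sample(members, k))
    return sample


rows = [{"EMPNO": i, "GENDER": "male" if i % 2 else "female"} for i in range(1, 21)]
picked = stratified_sample(rows, key="GENDER", percentage=0.5, random_state=42)
print(sorted(r["EMPNO"] for r in picked))
```

With a nonzero random_state, repeated calls return the same rows, and each GENDER value contributes half of its members to the sample.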
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.