SMOTETomek
- class hana_ml.algorithms.pal.preprocessing.SMOTETomek(smote_amount=None, k_nearest_neighbours=None, thread_ratio=None, random_seed=None, search_method=None, sampling_strategy=None, category_weights=None)
This class combines over-sampling using SMOTE with cleaning (under-sampling) using Tomek links.
- Parameters:
- smote_amount : int, optional
Amount of SMOTE N%. E.g. 200 means 200%, so each minority class sample will generate 2 synthetic samples.
By default (when not specified), synthetic samples are generated until the minority class sample amount matches the majority class sample amount.
- k_nearest_neighbours : int, optional
Number of nearest neighbors (k).
Defaults to 1.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to 0.
- random_seed : int, optional
Specifies the seed for the random number generator.
0: Uses the current time (in seconds) as the seed
Others: Uses the specified value as the seed
Defaults to 0.
- search_method : str, optional
Specifies the search method used to find the k nearest neighbours.
'brute-force'
'kd-tree'
Defaults to 'brute-force'.
- sampling_strategy : str, optional
Specifies the classes targeted by resampling:
'majority' : resamples only the majority class
'non-minority' : resamples all classes except the minority class
'non-majority' : resamples all classes except the majority class
'all' : resamples all classes
Defaults to 'majority'.
- category_weights : float, optional
Represents the weight of category attributes. The value must be greater than or equal to 0.
Examples
>>> smotetomek = SMOTETomek(smote_amount=200, k_nearest_neighbours=2, random_seed=2, search_method='kd-tree', sampling_strategy='all')
>>> res = smotetomek.fit_transform(data=df, label='TYPE', minority_class=2)
>>> res.collect()
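The two steps that the class combines can be illustrated without a database connection. The following is a minimal conceptual sketch in plain Python, not the PAL implementation: the helper names (`synthetic_per_sample`, `nearest`, `tomek_links`, `smote_point`) and the toy data are assumptions made for illustration. It assumes the usual definitions: a Tomek link is a pair of mutual nearest neighbours with different labels, and a SMOTE sample is a random point on the segment between a minority sample and one of its minority neighbours.

```python
import math
import random


def synthetic_per_sample(smote_amount):
    """Synthetic samples per minority sample, e.g. 200 (%) -> 2.

    Assumption: this sketch only handles multiples of 100.
    """
    if smote_amount % 100 != 0:
        raise ValueError("this sketch only handles multiples of 100")
    return smote_amount // 100


def nearest(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j) with i < j that are mutual nearest neighbours
    and carry different labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and nearest(j, points) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


def smote_point(x, neighbour, rng):
    """One synthetic sample on the segment between x and a neighbour."""
    gap = rng.random()  # random position along the segment, in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(x, neighbour))


# Toy data (hypothetical): two features, labels 0 and 1.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0), (5.2, 5.0)]
labels = [0, 1, 0, 0, 1, 1]

links = tomek_links(points, labels)  # points 0 and 1 form the only link
synthetic = smote_point(points[4], points[5], random.Random(2))
```

In this toy data, points 0 and 1 are each other's nearest neighbour and have different labels, so they form a Tomek link and are candidates for removal; which member of a link is actually removed is governed by sampling_strategy.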
- Attributes:
- None
Methods
fit_transform(data[, label, minority_class, ...])
Perform both over-sampling using SMOTE and under-sampling by removing Tomek's links on given datasets.
- fit_transform(data, label=None, minority_class=None, categorical_variable=None, variable_weight=None, key=None)
Perform both over-sampling using SMOTE and under-sampling by removing Tomek's links on given datasets.
- Parameters:
- data : DataFrame
DataFrame that contains the data for resampling via SMOTE and Tomek's links.
- key : str, optional
Specifies the name of the ID column in data.
If data is indexed by a single column, then key defaults to that index column; otherwise there is no default value, and data is considered to have no ID column.
- label : str, optional
Specifies the dependent variable by name.
If not specified, defaults to the last column in data.
- minority_class : str or int, optional
Specifies the minority class value in dependent variable column.
If not specified, all but the majority classes are resampled to match the majority class sample amount.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- variable_weight : dict, optional
Specifies the weights of variables participating in distance calculation in a dictionary:
key : variable(column) name
value : weight for distance calculation
No default value.
- Returns:
- DataFrame
SMOTETomek result, structured the same as data, excluding the key column (if there is one).
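As an illustration of how the optional arguments fit together, the sketch below assembles a keyword dictionary for fit_transform. The helper `build_fit_transform_kwargs` is hypothetical (it is not part of hana_ml), and the column names 'ID', 'GROUP', 'X1' and 'X2' are assumed for the example; only the parameter names come from the signature above.

```python
def build_fit_transform_kwargs(label=None, minority_class=None,
                               categorical_variable=None,
                               variable_weight=None, key=None):
    """Hypothetical helper: collect only the arguments that were set."""
    # A single column name is accepted as well as a list of names.
    if isinstance(categorical_variable, str):
        categorical_variable = [categorical_variable]
    # variable_weight maps column name -> weight for distance calculation.
    if variable_weight is not None:
        for column, weight in variable_weight.items():
            if not isinstance(column, str):
                raise TypeError("variable_weight keys must be column names")
            if not isinstance(weight, (int, float)):
                raise TypeError("variable_weight values must be numeric")
    candidates = {
        "key": key,
        "label": label,
        "minority_class": minority_class,
        "categorical_variable": categorical_variable,
        "variable_weight": variable_weight,
    }
    # Unset arguments are dropped so the documented defaults apply.
    return {name: value for name, value in candidates.items()
            if value is not None}


kwargs = build_fit_transform_kwargs(
    key="ID",                         # assumed ID column name
    label="TYPE",
    minority_class=2,
    categorical_variable="GROUP",     # assumed INTEGER column
    variable_weight={"X1": 1.0, "X2": 0.5},  # assumed feature columns
)

# With a SMOTETomek instance and a HANA DataFrame df, as in the class-level
# example, the call would then look like:
# res = smotetomek.fit_transform(data=df, **kwargs)
```

Leaving an argument out of the dictionary, rather than passing a placeholder, keeps the documented defaults in effect, for example label falling back to the last column in data.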
Inherited Methods from PALBase
In addition to the methods described above, the SMOTETomek class inherits methods from the PALBase class. Refer to PAL Base for more details.