SMOTETomek
- class hana_ml.algorithms.pal.preprocessing.SMOTETomek(smote_amount=None, k_nearest_neighbours=None, thread_ratio=None, random_seed=None, search_method=None, sampling_strategy=None, category_weights=None)
This class combines over-sampling using SMOTE with cleaning (under-sampling) using Tomek links.
- Parameters:
- smote_amount : int, optional
Amount of SMOTE N%. E.g. 200 means 200%, so each minority class sample will generate 2 synthetic samples.
If not specified, synthetic samples are generated until the minority class sample amount matches the majority class sample amount.
- k_nearest_neighbours : int, optional
Number of nearest neighbors (k).
Defaults to 1.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, in which case the function heuristically determines the number of threads to use.
Defaults to 0.
- random_seed : int, optional
Specifies the seed for the random number generator.
0: Uses the current time (in seconds) as the seed
Others: Uses the specified value as the seed
Defaults to 0.
- search_method : str, optional
Specifies the search method used to find the k nearest neighbours.
'brute-force'
'kd-tree'
Defaults to 'brute-force'.
- sampling_strategy : str, optional
Specifies the classes targeted by resampling:
'majority' : resamples only the majority class
'non-minority' : resamples all classes except the minority class
'non-majority' : resamples all classes except the majority class
'all' : resamples all classes
Defaults to 'majority'.
- category_weights : float, optional
Represents the weight of category attributes. The value must be greater than or equal to 0.
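To make the roles of smote_amount, k_nearest_neighbours and random_seed more concrete, here is a minimal pure-Python sketch of the textbook SMOTE over-sampling step. It is an illustration only, not PAL's implementation: the function name smote_sketch, the tuple-based data layout and the use of Euclidean distance are assumptions made for this example.

```python
import math
import random


def smote_sketch(minority, smote_amount=200, k=1, seed=2):
    """Generate synthetic points for `minority` (a list of numeric tuples).

    Hypothetical helper for illustration; not part of hana_ml.
    """
    rng = random.Random(seed)
    per_sample = smote_amount // 100  # 200% -> 2 synthetic samples per sample
    synthetic = []
    for i, x in enumerate(minority):
        # Brute-force search for the k nearest minority neighbours of x.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position along the segment from x to nb
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)]
new_points = smote_sketch(minority, smote_amount=200, k=2, seed=2)
print(len(new_points))  # 3 minority samples x 2 = 6 synthetic samples
```

Each synthetic point lies on the line segment between a minority sample and one of its selected neighbours, so a larger k spreads the new points over more directions.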
Examples
>>> from hana_ml.algorithms.pal.preprocessing import SMOTETomek
>>> smotetomek = SMOTETomek(smote_amount=200, k_nearest_neighbours=2,
...                         random_seed=2, search_method='kd-tree',
...                         sampling_strategy='all')
>>> res = smotetomek.fit_transform(data=df, label='TYPE', minority_class=2)
>>> res.collect()
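The cleaning step named in the class description removes Tomek links: pairs of samples from different classes that are each other's nearest neighbour. The sketch below shows that definition in plain Python. It is not PAL's implementation; the helper names tomek_links and drop_linked are hypothetical, and the removable argument is only an assumed stand-in for how sampling_strategy selects which classes may lose samples (for instance, passing only the majority label to mimic 'majority').

```python
import math


def tomek_links(points, labels):
    """Return index pairs (i, j) with i < j that form Tomek links."""
    def nearest(i):
        # Brute-force search for the closest other point.
        candidates = (j for j in range(len(points)) if j != i)
        return min(candidates, key=lambda j: math.dist(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    # A link needs mutual nearest neighbours with different labels.
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def drop_linked(points, labels, removable):
    """Drop members of Tomek links whose label is in `removable`."""
    drop = {i for pair in tomek_links(points, labels)
            for i in pair if labels[i] in removable}
    return [(p, l) for i, (p, l) in enumerate(zip(points, labels))
            if i not in drop]


points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
labels = ["a", "b", "a", "a", "b"]  # "a" is the majority class here

print(tomek_links(points, labels))  # [(0, 1)]
cleaned = drop_linked(points, labels, {"a"})
print(len(cleaned))  # 4
```

Only the first two points form a link: they are mutual nearest neighbours with different labels, whereas points 2 and 3 are mutual neighbours of the same class.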
- Attributes:
- None
Methods
fit_transform(data[, label, minority_class, ...]): Perform both over-sampling using SMOTE and under-sampling by removing Tomek's links on given datasets.
get_model_metrics(): Get the model metrics.
get_score_metrics(): Get the score metrics.
- fit_transform(data, label=None, minority_class=None, categorical_variable=None, variable_weight=None, key=None)
Perform both over-sampling using SMOTE and under-sampling by removing Tomek's links on given datasets.
- Parameters:
- data : DataFrame
DataFrame that contains the data for resampling via SMOTE and Tomek's links.
- key : str, optional
Specifies the name of the ID column in data.
If data is indexed by a single column, then key defaults to that index column; otherwise there is no default value, and data is considered to have no ID column.
- label : str, optional
Specifies the dependent variable by name.
If not specified, defaults to the last column in data.
- minority_class : str/int, optional
Specifies the minority class value in dependent variable column.
If not specified, all but the majority classes are resampled to match the majority class sample amount.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- variable_weight : dict, optional
Specifies the weights of variables participating in distance calculation in a dictionary:
key : variable (column) name
value : weight for distance calculation
No default value.
- Returns:
- DataFrame
SMOTETomek result, structured the same as data, exclusive of the key column (if there is one).
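As a rough illustration of what a variable_weight dictionary can mean for a distance calculation, the sketch below scales each column's squared difference by its weight before taking the square root. This is one common convention and is only assumed here: the source does not state the exact formula PAL uses. The function name weighted_distance, the column names X1 and X2, and the fallback weight of 1 for unlisted columns are all hypothetical.

```python
import math


def weighted_distance(a, b, variable_weight=None):
    """Weighted Euclidean distance between two rows given as dicts."""
    variable_weight = variable_weight or {}
    total = 0.0
    for col in a:
        # Assumption: columns without an explicit weight use weight 1.
        w = variable_weight.get(col, 1.0)
        total += w * (a[col] - b[col]) ** 2
    return math.sqrt(total)


row_1 = {"X1": 0.0, "X2": 0.0}
row_2 = {"X1": 3.0, "X2": 4.0}

print(weighted_distance(row_1, row_2))               # 5.0
print(weighted_distance(row_1, row_2, {"X2": 0.0}))  # 3.0
```

Under this convention a weight of 0 removes a column from the neighbour search entirely, while a weight above 1 makes differences in that column count for more.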
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
In addition to the methods described above, the SMOTETomek class inherits methods from the PALBase class. Refer to PAL Base for details.