SMOTE
- class hana_ml.algorithms.pal.preprocessing.SMOTE(smote_amount=None, k_nearest_neighbours=None, minority_class=None, thread_ratio=None, random_seed=None, method=None, search_method=None, category_weights=None)
This class handles imbalanced datasets. The synthetic minority over-sampling technique (SMOTE) over-samples the minority class by creating "synthetic" examples in feature space.
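The idea of creating synthetic examples in feature space can be illustrated with a minimal pure-Python sketch. It is not the PAL implementation: the helper names `k_nearest` and `smote_sketch`, the brute-force neighbour search, and the toy data are assumptions made for illustration only. Each synthetic point is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbours.

```python
import math
import random


def k_nearest(sample, others, k):
    """Brute-force search: the k points in `others` closest to `sample`."""
    return sorted(others, key=lambda p: math.dist(sample, p))[:k]


def smote_sketch(minority, n_per_sample, k, seed=0):
    """Generate `n_per_sample` synthetic points for every minority point."""
    rng = random.Random(seed)
    synthetic = []
    for i, sample in enumerate(minority):
        # Neighbours are searched among the other minority samples only.
        others = minority[:i] + minority[i + 1:]
        neighbours = k_nearest(sample, others, k)
        for _ in range(n_per_sample):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # position along the segment, in [0, 1)
            synthetic.append(
                tuple(s + gap * (n - s) for s, n in zip(sample, neighbour))
            )
    return synthetic


# Hypothetical two-dimensional minority class with four samples.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_points = smote_sketch(minority, n_per_sample=2, k=2)
print(len(new_points))  # 8
```

Because every synthetic point is an interpolation between two existing minority samples, the new points stay inside the region already occupied by the minority class.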
- Parameters:
- smote_amount : int, optional
Amount of SMOTE, given as a percentage N%. For example, 200 means 200%, so each minority class sample generates 2 synthetic samples.
By default (when smote_amount is not specified), synthetic samples are generated until the minority class sample amount matches the majority class sample amount.
- k_nearest_neighbours : int, optional
Number of nearest neighbors (k).
Defaults to 1.
- minority_class : str, optional (deprecated)
Specifies the minority class value in the dependent variable column.
If not specified, all classes except the majority class are resampled to match the majority class sample amount.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to 0.
- random_seed : int, optional
Specifies the seed for the random number generator.
0: Uses the current time (in seconds) as seed
Others: Uses the specified value as seed
Defaults to 0.
- method : int, optional (deprecated)
Search method used when finding the k nearest neighbours.
0: Brute force searching
1: KD-tree searching
Defaults to 0.
- search_method : str, optional
Specifies the search method for finding the k nearest neighbours.
'brute-force'
'kd-tree'
Defaults to 'brute-force'.
- category_weights : float, optional
Represents the weight of category attributes. The value must be greater than or equal to 0.
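The arithmetic behind smote_amount can be sketched as follows. The function name `synthetic_count` is hypothetical, and the sketch makes two assumptions: that omitting smote_amount means topping the minority class up to the majority class size, and that percentages which are not multiples of 100 are rounded down. How PAL treats such percentages is not stated here.

```python
def synthetic_count(n_minority, n_majority, smote_amount=None):
    """Number of synthetic minority samples to create.

    smote_amount is N%: 200 means 2 synthetic samples per minority sample.
    When it is None, the minority class is topped up to the majority count.
    """
    if smote_amount is None:
        return max(n_majority - n_minority, 0)
    # Assumption: fractional results are rounded down.
    return n_minority * smote_amount // 100


# Hypothetical class sizes: 50 minority samples, 500 majority samples.
print(synthetic_count(50, 500, smote_amount=200))  # 100
print(synthetic_count(50, 500))                    # 450
```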
Examples
>>> smote = SMOTE(smote_amount=200, k_nearest_neighbours=2, search_method='kd-tree')
>>> res = smote.fit_transform(data=df, label='TYPE', minority_class=2)
>>> res.collect()
- Attributes:
- None
Methods
- fit_transform(data[, label, minority_class, ...]): Upsamples the given dataset using SMOTE with the specified configuration.
- get_model_metrics(): Get the model metrics.
- get_score_metrics(): Get the score metrics.
- fit_transform(data, label=None, minority_class=None, categorical_variable=None, variable_weight=None, key=None)
Upsamples the given dataset using SMOTE with the specified configuration.
- Parameters:
- data : DataFrame
DataFrame containing the data for upsampling via SMOTE.
- key : str, optional
Name of the ID column in data.
If data is indexed by a single column, then key defaults to that index column; otherwise there is no default value, and data is assumed to have no ID column.
- label : str, optional
Specifies the dependent variable by name.
If not specified, defaults to the last column in data.
- minority_class : str/int, optional
Specifies the minority class value in dependent variable column.
If not specified, all but the majority classes are resampled to match the majority class sample amount.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- variable_weight : dict, optional
Specifies the weights of variables participating in distance calculation in a dictionary, illustrated as follows:
{variable_name0 : value0, variable_name1 : value1, ...}.
The values must be no less than 0.
Weights default to 1 for variables not specified.
- Returns:
- DataFrame
SMOTE result, the same structure as defined in the input data.
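The variable_weight rules (non-negative values, a weight of 1 for unspecified variables) can be illustrated with a short sketch. The helper names `resolve_weights` and `weighted_distance` and the column names X1, X2, X3 are hypothetical. The weighted Euclidean formula is an assumption chosen for illustration; the distance formula PAL actually uses is not given here.

```python
import math


def resolve_weights(columns, variable_weight=None):
    """Fill in a weight of 1 for every variable not listed explicitly."""
    variable_weight = variable_weight or {}
    for name, value in variable_weight.items():
        if value < 0:
            raise ValueError(f"weight for {name!r} must be no less than 0")
    return {name: variable_weight.get(name, 1.0) for name in columns}


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance (an assumed formula, not PAL's documented one)."""
    return math.sqrt(sum(w * (a[k] - b[k]) ** 2 for k, w in weights.items()))


# X2 is not listed, so it receives the default weight of 1.
weights = resolve_weights(["X1", "X2", "X3"], {"X1": 2.0, "X3": 0.0})
p = {"X1": 0.0, "X2": 0.0, "X3": 5.0}
q = {"X1": 1.0, "X2": 1.0, "X3": 9.0}
print(weights)
print(weighted_distance(p, q, weights))
```

Under this assumed formula, a weight of 0 removes a variable from the neighbour search entirely, while a weight above 1 makes differences in that variable count for more.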
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
Besides the methods described above, the SMOTE class also inherits methods from the PALBase class. Refer to PAL Base for more details.