SMOTE
- class hana_ml.algorithms.pal.preprocessing.SMOTE(smote_amount=None, k_nearest_neighbours=None, minority_class=None, thread_ratio=None, random_seed=None, method=None, search_method=None, category_weights=None)
This class handles imbalanced datasets. The synthetic minority over-sampling technique (SMOTE) over-samples the minority class by creating "synthetic" examples in feature space.
- Parameters
- smote_amount : int, optional
Amount of SMOTE N%. E.g. 200 means 200%, so each minority class sample will generate 2 synthetic samples.
The synthetic samples are generated until the minority class sample amount matches the majority class sample amount.
- k_nearest_neighbours : int, optional
Number of nearest neighbors (k).
Defaults to 1.
- minority_class : str, optional (deprecated)
Specifies the minority class value in the dependent variable column.
All classes except the majority class are re-sampled to match the majority class sample amount.
- thread_ratio : float, optional
Specifies the ratio of total number of threads that can be used by this function.
The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.
Values outside the range [0, 1] will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
- random_seed : int, optional
Specifies the seed for random number generator.
0: Uses the current time (in seconds) as seed
Others: Uses the specified value as seed
Defaults to 0.
- method : int, optional (deprecated)
Search method used to find the k nearest neighbours.
0: Brute force searching
1: KD-tree searching
Defaults to 0.
- search_method : str, optional
Specifies the search method used to find the k nearest neighbours.
'brute-force'
'kd-tree'
Defaults to 'brute-force'.
- category_weights : float, optional
Represents the weight of category attributes. The value must be greater than or equal to 0.
Examples
>>> smote = SMOTE(smote_amount=200, k_nearest_neighbours=2, search_method='kd-tree')
>>> res = smote.fit_transform(data=df, label='TYPE', minority_class=2)
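For intuition about what smote_amount and k_nearest_neighbours control, here is a minimal pure-Python sketch of the general SMOTE idea: each minority sample is paired with one of its k nearest minority neighbours, and a synthetic point is placed at a random position on the segment between them. This is not the PAL implementation. The function name smote_sketch, the sample points, and the choice of Euclidean distance with brute-force search are assumptions made for illustration, and categorical attributes are not handled.

```python
import math
import random


def smote_sketch(minority, smote_amount=200, k=2, seed=1):
    """Illustrative only: generate synthetic points from numeric minority samples."""
    rng = random.Random(seed)
    per_sample = smote_amount // 100  # 200% -> 2 synthetic samples per original
    synthetic = []
    for i, x in enumerate(minority):
        # Brute-force search: rank the other minority samples by Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # random position along the segment from x to nb
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Hypothetical minority-class samples with two numeric features.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
synthetic = smote_sketch(minority, smote_amount=200, k=2)
print(len(synthetic))  # 8: two synthetic samples for each of the four originals
```

Because every synthetic point is an interpolation between two existing minority samples, it always lies within the range already spanned by the minority class.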
- Attributes
- None
Methods
fit_transform(data[, label, minority_class, ...])
Upsamples the given dataset using SMOTE with the specified configuration.
- fit_transform(data, label=None, minority_class=None, categorical_variable=None, variable_weight=None, key=None)
Upsamples the given dataset using SMOTE with the specified configuration.
- Parameters
- data : DataFrame
DataFrame containing the data to upsample via SMOTE.
- key : str, optional
Name of the ID column in data.
If data is indexed by a single column, then key defaults to that index column; otherwise there is no default value, and data is assumed to have no ID column.
- label : str, optional
Specifies the dependent variable by name.
If not specified, defaults to the 1st column in data.
- minority_class : str/int, optional
Specifies the minority class value in the dependent variable column.
If not specified, all classes except the majority class are resampled to match the majority class sample amount.
- categorical_variable : str/ListOfStrings, optional
Specifies the list of INTEGER columns that should be treated as categorical.
By default, only VARCHAR and NVARCHAR columns are treated as categorical, while numerical (i.e. INTEGER or DOUBLE) columns are treated as continuous.
No default value.
- variable_weight : dict, optional
Specifies the weights of variables participating in distance calculation in a dictionary, illustrated as follows:
{variable_name0 : value0, variable_name1 : value1, ...}.
The values must be no less than 0.
Weights default to 1 for variables not specified.
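The rules for variable_weight (non-negative values, with a default of 1 for unlisted variables) can be checked on the client before calling fit_transform. The helper below is a hypothetical sketch and not part of hana_ml. The name resolve_variable_weights, the column names, and the decision to reject unknown variable names are all assumptions made for illustration.

```python
def resolve_variable_weights(columns, variable_weight=None):
    """Hypothetical helper: expand a partial weight dict to one weight per column."""
    weights = {col: 1.0 for col in columns}  # unspecified variables default to 1
    for name, value in (variable_weight or {}).items():
        if name not in weights:
            # Assumption: treat an unknown variable name as a caller error.
            raise KeyError(f"unknown variable: {name!r}")
        if value < 0:
            raise ValueError(f"weight for {name!r} must be no less than 0")
        weights[name] = float(value)
    return weights


# Hypothetical feature columns; only X1 and X3 are given explicit weights.
weights = resolve_variable_weights(["X1", "X2", "X3"], {"X1": 2.0, "X3": 0})
print(weights)  # {'X1': 2.0, 'X2': 1.0, 'X3': 0.0}
```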
- Returns
- DataFrame
SMOTE result, the same structure as defined in the input data.
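When minority_class is not specified, the description above says that every class except the majority class is resampled up to the majority class size. As a rough way to estimate how many rows the result may gain, the following sketch counts the shortfall per class. The function name synthetic_needed and the label values are hypothetical, and how PAL combines this balancing target with smote_amount is not covered here.

```python
from collections import Counter


def synthetic_needed(labels):
    """Illustrative only: rows each non-majority class lacks to match the majority."""
    counts = Counter(labels)
    majority_size = max(counts.values())
    return {cls: majority_size - n for cls, n in counts.items() if n < majority_size}


# Hypothetical label column: class 1 is the majority with 6 rows.
labels = [1] * 6 + [2] * 2 + [3] * 4
print(synthetic_needed(labels))  # {2: 4, 3: 2}
```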
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.