SMOTE

class hana_ml.algorithms.pal.preprocessing.SMOTE(smote_amount=None, k_nearest_neighbours=None, minority_class=None, thread_ratio=None, random_seed=None, method=None, search_method=None, category_weights=None)

This class handles imbalanced datasets. The synthetic minority over-sampling technique (SMOTE) over-samples the minority class by creating "synthetic" examples in feature space.
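The idea of creating synthetic examples in feature space can be sketched in plain Python. This is a minimal illustration of the interpolation step from the general SMOTE technique, not the PAL implementation; the helper name `synthesize` and the sample values are hypothetical.

```python
import random


def synthesize(sample, neighbour, rng):
    """Return one synthetic point on the segment between sample and neighbour."""
    gap = rng.random()  # uniform draw in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


rng = random.Random(42)
minority_sample = [1.0, 2.0]    # hypothetical minority class point
nearest_neighbour = [3.0, 6.0]  # hypothetical neighbour from the same class
synthetic = synthesize(minority_sample, nearest_neighbour, rng)
```

The synthetic point always lies between the two original points, which is why the new samples stay within the region already occupied by the minority class.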

Parameters:

smote_amount : int, optional

Amount of SMOTE N%. E.g. 200 means 200%, so each minority class sample will generate 2 synthetic samples.

If not specified, synthetic samples are generated until the minority class sample amount matches the majority class sample amount.
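The N% arithmetic can be checked with a short sketch. The function names are hypothetical, and restricting the amount to multiples of 100 is a simplification of this sketch, since the text above does not say how other values are handled.

```python
def synthetic_per_sample(smote_amount):
    """Synthetic samples generated from each minority sample for SMOTE N%."""
    if smote_amount <= 0 or smote_amount % 100 != 0:
        # Simplification of this sketch: only whole multiples of 100 are handled.
        raise ValueError("this sketch only handles positive multiples of 100")
    return smote_amount // 100


def total_synthetic(minority_count, smote_amount):
    """Total synthetic samples added for a minority class of the given size."""
    return minority_count * synthetic_per_sample(smote_amount)


per_sample = synthetic_per_sample(200)  # 200% gives 2 synthetic samples each
added = total_synthetic(50, 200)        # 50 minority samples gain 100 samples
```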

k_nearest_neighbours : int, optional

Number of nearest neighbors (k).

Defaults to 1.

minority_class : str, optional (deprecated)

Specifies the minority class value in dependent variable column.

If not specified, all classes except the majority class are re-sampled to match the majority class sample amount.

thread_ratio : float, optional

Specifies the ratio of available threads to use, from 0 to 1. A value of 0 uses a single thread, while 1 uses all currently available threads. Values outside this range are ignored, and the function determines the number of threads heuristically.

Defaults to 0.

random_seed : int, optional

Specifies the seed for the random number generator.

0: Uses the current time (in seconds) as seed

Others: Uses the specified value as seed

Defaults to 0.
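The seeding rule above can be mimicked with Python's standard `random` module. This is only an analogy for the described behaviour; `make_rng` is a hypothetical helper and PAL's own random number generator is not shown here.

```python
import random
import time


def make_rng(random_seed=0):
    """Seed from the current time (in seconds) when 0, otherwise use the given value."""
    seed = int(time.time()) if random_seed == 0 else random_seed
    return random.Random(seed)


# A fixed non-zero seed gives repeatable draws across runs.
a = make_rng(7)
b = make_rng(7)
same_draws = [a.random() for _ in range(3)] == [b.random() for _ in range(3)]
```

A non-zero seed is the choice to make when the synthetic samples need to be reproducible between runs.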

method : int, optional (deprecated)

Search method used to find the k nearest neighbours.

0: Brute force searching

1: KD-tree searching

Defaults to 0.

search_method : str, optional

Specifies the search method for finding the k nearest neighbours.

'brute-force'

'kd-tree'

Defaults to 'brute-force'.
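For intuition, a brute-force search simply measures the distance from the query point to every other point and keeps the k closest. The sketch below assumes Euclidean distance and uses a hypothetical helper name; it is not PAL's implementation. A KD-tree search is expected to find the same neighbours while avoiding a full scan of the data.

```python
import math


def k_nearest_brute_force(query, points, k):
    """Return the k points closest to query, skipping the query object itself."""
    others = [p for p in points if p is not query]
    return sorted(others, key=lambda p: math.dist(query, p))[:k]


# Hypothetical minority class points.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
neighbours = k_nearest_brute_force(minority[0], minority, k=2)
```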

category_weights : float, optional

Represents the weight of category attributes. The value must be greater than or equal to 0.

Attributes:

None

Methods

fit_transform(data[, label, minority_class, ...])

Upsamples the given dataset using SMOTE with the specified configuration.

Examples

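>>> from hana_ml.algorithms.pal.preprocessing import SMOTE
>>> # df is assumed to be an existing hana_ml DataFrame with a 'TYPE' label column
>>> smote = SMOTE(smote_amount=200,
...               k_nearest_neighbours=2,
...               search_method='kd-tree')
>>> res = smote.fit_transform(data=df,
...                           label='TYPE',
...                           minority_class=2)
>>> res.collect()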

fit_transform(data, label=None, minority_class=None, categorical_variable=None, variable_weight=None, key=None)

Upsamples the given dataset using SMOTE with the specified configuration.

Parameters:

data : DataFrame

Dataframe containing the data for upsampling via SMOTE.

key : str, optional

Name of the ID column in data.

If data is indexed by a single column, then key defaults to that index column. Otherwise there is no default value, and data is assumed to have no ID column.

label : str, optional

Specifies the dependent variable by name.

If not specified, defaults to the last column in data.

minority_class : str or int, optional

Specifies the minority class value in dependent variable column.

If not specified, all classes except the majority class are resampled to match the majority class sample amount.
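The default resampling target can be worked out by counting class sizes. The helper below is a hypothetical sketch of that bookkeeping only, not of how PAL generates the samples.

```python
from collections import Counter


def synthetic_needed(labels):
    """Map each non-majority class to the samples it lacks relative to the majority."""
    counts = Counter(labels)
    majority_size = max(counts.values())
    return {cls: majority_size - n for cls, n in counts.items() if n < majority_size}


# Hypothetical label column: class 1 is the majority with 6 rows.
labels = [1] * 6 + [2] * 2 + [3] * 4
needed = synthetic_needed(labels)
```

Here class 2 needs 4 more samples and class 3 needs 2 more, so every class ends up with 6 rows.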

categorical_variable : str or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

variable_weight : dict, optional

Specifies the weights of variables participating in distance calculation in a dictionary, illustrated as follows:

{variable_name0 : value0, variable_name1 : value1, ...}.

The values must be no less than 0.

Weights default to 1 for variables not specified.
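The effect of per-variable weights on a distance can be illustrated as follows. This sketch assumes a weighted Euclidean distance in which each weight multiplies the squared difference; the exact formula PAL uses is not stated above, and the column names `X1` and `X2` are hypothetical.

```python
import math


def weighted_distance(a, b, variable_weight=None):
    """Weighted Euclidean distance; variables without a weight default to 1."""
    weights = variable_weight or {}
    total = 0.0
    for name in a:
        w = weights.get(name, 1.0)
        if w < 0:
            raise ValueError("weights must be no less than 0")
        total += w * (a[name] - b[name]) ** 2
    return math.sqrt(total)


p = {"X1": 0.0, "X2": 0.0}
q = {"X1": 3.0, "X2": 4.0}
unweighted = weighted_distance(p, q)            # every weight defaults to 1
x1_only = weighted_distance(p, q, {"X2": 0.0})  # X2 is ignored in the distance
```

Under this assumption, a weight of 0 removes a variable from the neighbour search entirely, while a larger weight makes differences in that variable count for more.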

Returns:

DataFrame

The SMOTE result, with the same structure as the input data.

Inherited Methods from PALBase

Besides the methods mentioned above, the SMOTE class also inherits methods from the PALBase class. Please refer to PAL Base for more details.