SMOTETomek

class hana_ml.algorithms.pal.preprocessing.SMOTETomek(smote_amount=None, k_nearest_neighbours=None, thread_ratio=None, random_seed=None, search_method=None, sampling_strategy=None, category_weights=None)

This class combines over-sampling using SMOTE with cleaning (under-sampling) using Tomek links.
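To make the combination concrete, the sketch below illustrates the two ideas in plain Python, independent of PAL. SMOTE creates a synthetic point by interpolating between a minority sample and one of its minority neighbours, and a Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The helper names (`smote_point`, `nearest`, `tomek_links`) and the use of Euclidean distance are assumptions made for illustration; they are not part of hana_ml and do not describe PAL's internal implementation.

```python
import math
import random


def smote_point(x, neighbour, rng):
    """Interpolate a synthetic sample on the segment between x and neighbour."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbour)]


def nearest(i, points):
    """Index of the nearest other point to points[i] (Euclidean distance)."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Pairs (i, j), i < j, of mutual nearest neighbours with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and nearest(j, points) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


# Toy data: samples 0 and 1 sit close together but carry different labels.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.0], [9.0, 9.0]]
labels = [0, 1, 1, 1, 0]
links = tomek_links(points, labels)

# A synthetic sample between two neighbouring samples of class 1.
rng = random.Random(2)
synthetic = smote_point(points[2], points[3], rng)
```

In this toy data the only Tomek link is the pair of samples 0 and 1. A cleaning step would remove one or both members of such a pair, depending on which classes are targeted.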

Parameters
smote_amount : int, optional

Amount of SMOTE, expressed as a percentage N%. For example, 200 means 200%, so each minority class sample generates 2 synthetic samples.

The synthetic samples are generated until the minority class sample amount matches the majority class sample amount.
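As a worked illustration of the percentage, the following sketch computes how many synthetic samples a given `smote_amount` yields. The helper `synthetic_count` is hypothetical and only restates the arithmetic described above (N% means N/100 synthetic samples per minority sample). How PAL treats amounts that are not multiples of 100 is not covered here.

```python
def synthetic_count(n_minority, smote_amount):
    """Synthetic samples produced when each minority sample generates
    smote_amount / 100 new samples (this sketch assumes a multiple of 100)."""
    if smote_amount % 100 != 0:
        raise ValueError("this sketch only handles multiples of 100")
    return n_minority * (smote_amount // 100)


# 30 minority samples at 200% yield 60 synthetic samples, 90 in total.
n_minority = 30
added = synthetic_count(n_minority, 200)
total_minority = n_minority + added
```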

k_nearest_neighbours : int, optional

Number of nearest neighbors (k).

Defaults to 1.

thread_ratio : float, optional

Specifies the ratio of the total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means using only 1 thread, and 1 means using at most all the currently available threads.

Values outside the range [0, 1] are ignored, and this function heuristically determines the number of threads to use.

Defaults to 0.

random_seed : int, optional

Specifies the seed for the random number generator.

  • 0: Uses the current time (in seconds) as seed

  • Others: Uses the specified value as seed

Defaults to 0.

search_method : str, optional

Specifies the search method used to find the k nearest neighbours.

  • 'brute-force'

  • 'kd-tree'

Defaults to 'brute-force'.

sampling_strategy : str, optional

Specifies the classes targeted by resampling:

  • 'majority' : resamples only the majority class

  • 'non-minority' : resamples all classes except the minority class

  • 'non-majority' : resamples all classes except the majority class

  • 'all' : resamples all classes

Defaults to 'majority'.
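The four options can be read as a selection over class frequencies. The sketch below is an interpretation of the list above in plain Python, not PAL code: the function `targeted_classes` is hypothetical, and it assumes a single majority class and a single minority class, both determined by sample count.

```python
from collections import Counter


def targeted_classes(labels, sampling_strategy="majority"):
    """Return the set of classes a strategy targets, by class frequency."""
    counts = Counter(labels)
    ordered = counts.most_common()  # most frequent class first
    majority = ordered[0][0]
    minority = ordered[-1][0]
    classes = set(counts)
    if sampling_strategy == "majority":
        return {majority}
    if sampling_strategy == "non-minority":
        return classes - {minority}
    if sampling_strategy == "non-majority":
        return classes - {majority}
    if sampling_strategy == "all":
        return classes
    raise ValueError(f"unknown sampling_strategy: {sampling_strategy!r}")


# Three classes with 6, 3 and 1 samples respectively.
labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
default_targets = targeted_classes(labels)
```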

category_weights : float, optional

Represents the weight of category attributes. The value must be greater than or equal to 0.

Examples

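>>> from hana_ml.algorithms.pal.preprocessing import SMOTETomek
>>> smotetomek = SMOTETomek(smote_amount=200,
...                         k_nearest_neighbours=2,
...                         random_seed=2,
...                         search_method='kd-tree',
...                         sampling_strategy='all')
>>> # df is a hana_ml DataFrame; its label column 'TYPE' contains the minority class value 2
>>> res = smotetomek.fit_transform(data=df, label='TYPE', minority_class=2)

Attributes
None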

Methods

fit_transform(data[, label, minority_class, ...])

Performs both over-sampling using SMOTE and under-sampling by removing Tomek links on the given dataset.

fit_transform(data, label=None, minority_class=None, categorical_variable=None, variable_weight=None, key=None)

Performs both over-sampling using SMOTE and under-sampling by removing Tomek links on the given dataset.

Parameters
data : DataFrame

DataFrame that contains the data to be resampled via SMOTE and Tomek links.

key : str, optional

Specifies the name of the ID column in data.

If data is indexed by a single column, key defaults to that index column. Otherwise there is no default value, and data is treated as having no ID column.

label : str, optional

Specifies the dependent variable by name.

If not specified, defaults to the 1st column in data.

minority_class : str/int, optional

Specifies the minority class value in the dependent variable column.

If not specified, all classes except the majority class are resampled to match the majority class sample amount.

categorical_variable : str/ListOfStrings, optional

Specifies the list of INTEGER columns that should be treated as categorical.

By default, only VARCHAR and NVARCHAR columns are treated as categorical, while numerical (i.e. INTEGER or DOUBLE) columns are treated as continuous.

No default value.

variable_weight : dict, optional

Specifies the weights of the variables that participate in the distance calculation, as a dictionary:

  • key : variable (column) name

  • value : weight for distance calculation

No default value.
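To illustrate the shape of such a dictionary, the sketch below applies per-column weights in a weighted Euclidean distance. This form of the distance is an assumption chosen for illustration; the text above does not state how PAL applies the weights. The column names `X1`, `X2`, `X3`, the helper `weighted_distance`, and the default weight of 1.0 for unlisted columns are all hypothetical.

```python
import math


def weighted_distance(row_a, row_b, variable_weight):
    """Weighted Euclidean distance between two rows given as dicts.

    Columns missing from variable_weight get weight 1.0 (an assumption).
    """
    return math.sqrt(
        sum(
            variable_weight.get(col, 1.0) * (row_a[col] - row_b[col]) ** 2
            for col in row_a
        )
    )


# key: column name, value: weight for the distance calculation
variable_weight = {"X1": 4.0, "X2": 0.0}

a = {"X1": 0.0, "X2": 10.0, "X3": 0.0}
b = {"X1": 1.0, "X2": -10.0, "X3": 0.0}
d = weighted_distance(a, b, variable_weight)
```

With these weights the large difference in `X2` is ignored entirely, while the unit difference in `X1` counts four times as much as an unweighted column would.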

Returns
DataFrame

SMOTETomek result, structured the same as data, excluding the key column (if there is one).

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides the methods mentioned above, the SMOTETomek class also inherits methods from the PALBase class. Please refer to PAL Base for more details.