TomekLinks

class hana_ml.algorithms.pal.preprocessing.TomekLinks(distance_level=None, minkowski_power=None, thread_ratio=None, search_method=None, sampling_strategy=None, category_weights=None)

Performs under-sampling by removing Tomek's links, that is, pairs of samples from different classes that are each other's nearest neighbors.
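To make the idea concrete, the following is a minimal pure-Python sketch of how Tomek's links can be identified. It is only an illustration, not the PAL implementation: it assumes Euclidean distance and a brute-force search, and the helper names `nearest_neighbor` and `find_tomek_links` are hypothetical.

```python
import math

def nearest_neighbor(points, i):
    """Return the index of the point closest to points[i] (Euclidean, brute force)."""
    best_j, best_d = None, math.inf
    for j, p in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], p)
        if d < best_d:
            best_j, best_d = j, d
    return best_j

def find_tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbors and carry different class labels."""
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    links = []
    for i, j in enumerate(nn):
        if i < j and nn[j] == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

# Toy data: class 'A' is the majority, class 'B' the minority.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0), (5.2, 5.0)]
labels = ['A', 'B', 'A', 'A', 'A', 'A']

links = find_tomek_links(points, labels)

# Under-sample by dropping the majority-class member of each link.
to_drop = {idx for pair in links for idx in pair if labels[idx] == 'A'}
kept = [i for i in range(len(points)) if i not in to_drop]
```

In this toy data only samples 0 and 1 form a link, so dropping the majority-class member removes sample 0 and leaves the other five samples untouched.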

Parameters:

distance_level : str, optional

Specifies the distance metric used to measure the distance between data points.

'manhattan'

'euclidean'

'minkowski'

'chebyshev'

'cosine'

Defaults to 'euclidean'.
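For orientation, the five metrics can be sketched in plain Python as follows. These are the textbook definitions, not code taken from PAL; in particular, defining the cosine distance as one minus the cosine similarity is an assumption, and the function names are illustrative only.

```python
import math

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # Square root of the sum of squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def minkowski(a, b, p=3):
    # Generalizes Manhattan (p=1) and Euclidean (p=2).
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def chebyshev(a, b):
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):
    # Assumed definition: 1 - cosine similarity (undefined for zero vectors).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a = (1.0, 2.0, 3.0)
b = (4.0, 6.0, 3.0)

d_manhattan = manhattan(a, b)   # 3 + 4 + 0 = 7.0
d_euclidean = euclidean(a, b)   # sqrt(9 + 16) = 5.0
d_chebyshev = chebyshev(a, b)   # max(3, 4, 0) = 4.0
d_minkowski = minkowski(a, b, p=3)
d_cosine = cosine(a, b)
```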

minkowski_power : float, optional

Specifies the value of power for Minkowski distance calculation.

Defaults to 3.

Valid only when distance_level is 'minkowski'.

thread_ratio : float, optional

Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.

Defaults to 0.

search_method : str, optional

Specifies the search method used to find the nearest neighbors.

'brute-force'

'kd-tree'

Defaults to 'brute-force'.

sampling_strategy : str, optional

Specifies the classes targeted by resampling:

'majority' : resamples only the majority class

'non-minority' : resamples all classes except the minority class

'non-majority' : resamples all classes except the majority class

'all' : resamples all classes

Defaults to 'majority'.
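The four strategies can be illustrated with a small sketch that maps a list of class labels to the set of classes that would be resampled. This is an assumption-laden illustration rather than PAL code: it takes the majority class to be the one with the most samples and the minority class to be the one with the fewest, ties are resolved arbitrarily, and `targeted_classes` is a hypothetical helper.

```python
from collections import Counter

def targeted_classes(labels, sampling_strategy='majority'):
    """Return the set of class labels targeted by the given strategy."""
    counts = Counter(labels)
    majority = max(counts, key=counts.get)  # class with the most samples
    minority = min(counts, key=counts.get)  # class with the fewest samples
    if sampling_strategy == 'majority':
        return {majority}
    if sampling_strategy == 'non-minority':
        return set(counts) - {minority}
    if sampling_strategy == 'non-majority':
        return set(counts) - {majority}
    if sampling_strategy == 'all':
        return set(counts)
    raise ValueError(f"unknown sampling_strategy: {sampling_strategy!r}")

# Toy label column: 6 samples of 'A', 3 of 'B', 1 of 'C'.
labels = ['A'] * 6 + ['B'] * 3 + ['C']

targets = {s: targeted_classes(labels, s)
           for s in ('majority', 'non-minority', 'non-majority', 'all')}
```

With these labels, 'majority' targets only class 'A', while 'non-minority' targets 'A' and 'B'.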

category_weights : float, optional

Specifies the weight for categorical attributes.

Defaults to 0.707 if not provided.

Attributes:

None

Methods

fit_transform(data[, key, label, ...])

Perform under-sampling on given datasets by removing Tomek's links.

Examples

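>>> from hana_ml.algorithms.pal.preprocessing import TomekLinks
>>> # df is assumed to be an existing hana_ml DataFrame with a label column 'TYPE'
>>> tomeklinks = TomekLinks(search_method='kd-tree',
...                         sampling_strategy='majority')
>>> res = tomeklinks.fit_transform(data=df, label='TYPE')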

fit_transform(data, key=None, label=None, categorical_variable=None, variable_weight=None)

Perform under-sampling on given datasets by removing Tomek's links.

Parameters:

data : DataFrame

DataFrame that contains the training data.

key : str, optional

Specifies the name of the ID column.

If key is not provided, then:

if data is indexed by a single column, then key defaults to that index column;

otherwise, it is assumed that data contains no ID column.

label : str, optional

Specifies the dependent variable by name.

If not specified, defaults to the first non-key column in data.

categorical_variable : str or list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

variable_weight : dict, optional

Specifies the weights of variables participating in distance calculation in a dictionary:

key : variable (column) name

value : weight for distance calculation

No default value.
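As a rough illustration of what such a dictionary might express, the sketch below applies per-column weights inside a Euclidean distance. How PAL actually combines the weights with the distance is not stated here, so the formula, the default weight of 1.0 for unlisted columns, and the helper `weighted_euclidean` are all assumptions made for the example.

```python
import math

def weighted_euclidean(a, b, columns, variable_weight):
    """Euclidean distance with each squared difference scaled by the
    column's weight (assumed formula; unlisted columns get weight 1.0)."""
    total = 0.0
    for name, x, y in zip(columns, a, b):
        w = variable_weight.get(name, 1.0)
        total += w * (x - y) ** 2
    return math.sqrt(total)

columns = ['X1', 'X2']            # hypothetical feature column names
variable_weight = {'X1': 4.0}     # same shape as the variable_weight argument

row_a = (0.0, 0.0)
row_b = (1.5, 4.0)

d_weighted = weighted_euclidean(row_a, row_b, columns, variable_weight)
d_unweighted = weighted_euclidean(row_a, row_b, columns, {})
```

Raising the weight of X1 makes differences in that column count more, so the weighted distance here is larger than the unweighted one.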

Returns:

DataFrame

The under-sampled result, with the same structure as the input data.

Inherited Methods from PALBase

In addition to the methods described above, the TomekLinks class inherits methods from the PALBase class. Please refer to PAL Base for more details.