TomekLinks

class hana_ml.algorithms.pal.preprocessing.TomekLinks(distance_level=None, minkowski_power=None, thread_ratio=None, search_method=None, sampling_strategy=None, category_weights=None)

This class is for performing under-sampling by removing Tomek's links.
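For orientation, a Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The sketch below illustrates that definition in plain Python, assuming Euclidean distance and a brute-force search. It is not the PAL implementation; `find_tomek_links` is a hypothetical helper and the toy data is made up for illustration.

```python
import math

def find_tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek links.

    A pair is a Tomek link when the two samples carry different labels
    and each one is the other's nearest neighbour.
    """
    def nearest(i):
        # Index of the closest other point to points[i] (brute force).
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    nn = [nearest(i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]

# Hypothetical toy data: four samples of class "A", one of class "B".
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
labels = ["A", "A", "A", "B", "A"]

links = find_tomek_links(points, labels)

# Under-sampling the majority class drops the majority member of each link.
majority = max(set(labels), key=labels.count)
to_remove = {k for pair in links for k in pair if labels[k] == majority}
```

Here samples 2 and 3 are mutual nearest neighbours with different labels, so they form the only link, and only the majority-class member (sample 2) would be dropped.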

Parameters:

distance_level : str, optional

Specifies the distance metric used to compare data points.

'manhattan'

'euclidean'

'minkowski'

'chebyshev'

'cosine'

Defaults to 'euclidean'.

minkowski_power : float, optional

Specifies the value of power for Minkowski distance calculation.

Defaults to 3.

Valid only when distance_level is 'minkowski'.
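As a rough illustration of what the power controls, the standard Minkowski distance of order p is the p-th root of the sum of absolute coordinate differences raised to p. The sketch below is plain Python, not PAL code, and `minkowski` is a hypothetical helper name.

```python
def minkowski(x, y, p=3.0):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (0.0, 0.0), (3.0, 4.0)

d1 = minkowski(x, y, p=1)  # order 1 coincides with Manhattan distance
d2 = minkowski(x, y, p=2)  # order 2 coincides with Euclidean distance
d3 = minkowski(x, y, p=3)  # order 3, the documented default power
```

Larger powers give more influence to the largest coordinate difference, so the result shrinks toward the Chebyshev distance as p grows.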

thread_ratio : float, optional

Specifies the ratio of total number of threads that can be used by this function.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range [0, 1] will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

search_method : str, optional

Specifies the search method used to find the nearest neighbours.

'brute-force'

'kd-tree'

Defaults to 'brute-force'.

sampling_strategy : str, optional

Specifies the classes targeted by resampling:

'majority' : resamples only the majority class

'non-minority' : resamples all classes except the minority class

'non-majority' : resamples all classes except the majority class

'all' : resamples all classes

Defaults to 'majority'.
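The four options can be read as selecting a set of classes by frequency. The sketch below takes the option names literally (so 'majority' selects the most frequent class) and is plain Python, not PAL code; `targeted_classes` is a hypothetical helper, and how ties between equally frequent classes are resolved is an assumption.

```python
from collections import Counter

def targeted_classes(labels, sampling_strategy="majority"):
    """Return the set of class labels that a strategy would resample."""
    counts = Counter(labels)
    # Assumption: on ties, the first class encountered wins.
    majority = max(counts, key=counts.get)
    minority = min(counts, key=counts.get)
    classes = set(counts)
    if sampling_strategy == "majority":
        return {majority}
    if sampling_strategy == "non-minority":
        return classes - {minority}
    if sampling_strategy == "non-majority":
        return classes - {majority}
    if sampling_strategy == "all":
        return classes
    raise ValueError(f"unknown sampling_strategy: {sampling_strategy!r}")

# Hypothetical label column: 6 x "A", 3 x "B", 1 x "C".
labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
```

With these labels, 'majority' targets only "A", while 'non-minority' targets "A" and "B".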

category_weights : float, optional

Specifies the weight for categorical attributes.

Defaults to 0.707 if not provided.

Examples

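>>> from hana_ml.algorithms.pal.preprocessing import TomekLinks
>>> # df is an existing hana_ml DataFrame whose label column is 'TYPE'
>>> tomeklinks = TomekLinks(search_method='kd-tree',
...                         sampling_strategy='majority')
>>> res = tomeklinks.fit_transform(data=df, label='TYPE')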

Attributes:

None

Methods

fit_transform(data[, key, label, ...])

Perform under-sampling on given datasets by removing Tomek's links.

fit_transform(data, key=None, label=None, categorical_variable=None, variable_weight=None)

Perform under-sampling on given datasets by removing Tomek's links.

Parameters:

data : DataFrame

DataFrame that contains the data to be under-sampled.

key : str, optional

Specifies the name of the ID column.

If key is not provided, then:

if data is indexed by a single column, then key defaults to that index column;

otherwise, it is assumed that data contains no ID column.

label : str, optional

Specifies the dependent variable by name.

If not specified, defaults to the first non-key column in data.

categorical_variable : str or ListOfStrings, optional

Specifies the INTEGER column(s) that should be treated as categorical.

By default, only VARCHAR and NVARCHAR columns are treated as categorical, while numerical (i.e. INTEGER or DOUBLE) columns are treated as continuous.

No default value.

variable_weight : dict, optional

Specifies the weights of variables participating in distance calculation in a dictionary:

key : variable(column) name

value : weight for distance calculation

No default value.

Returns:

DataFrame

The under-sampled result, with the same structure as the input data.

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

In addition to the methods described above, the TomekLinks class inherits methods from the PALBase class. Refer to PAL Base for more details.