TomekLinks
- class hana_ml.algorithms.pal.preprocessing.TomekLinks(distance_level=None, minkowski_power=None, thread_ratio=None, search_method=None, sampling_strategy=None, category_weights=None)
This class is for performing under-sampling by removing Tomek's links.
- Parameters:
- distance_level : str, optional
Specifies the method used to compute the distance between data points.
'manhattan'
'euclidean'
'minkowski'
'chebyshev'
'cosine'
Defaults to 'euclidean'.
- minkowski_power : float, optional
Specifies the value of power for Minkowski distance calculation.
Defaults to 3.
Valid only when distance_level is 'minkowski'.
- thread_ratio : float, optional
Adjusts the proportion of available threads to use, from 0 to 1. A value of 0 means a single thread is used, while 1 means all currently available threads are used. Values outside this range are ignored, and the function then determines the number of threads heuristically.
Defaults to 0.
- search_method : str, optional
Specifies the search method used to find the K nearest neighbours.
'brute-force'
'kd-tree'
Defaults to 'brute-force'.
- sampling_strategy : str, optional
Specifies the classes targeted by resampling:
'majority' : resamples only the majority class
'non-minority' : resamples all classes except the minority class
'non-majority' : resamples all classes except the majority class
'all' : resamples all classes
Defaults to 'majority'.
- category_weights : float, optional
Specifies the weight for categorical attributes.
Defaults to 0.707 if not provided.
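The distance_level options listed above can be illustrated with a minimal pure-Python sketch. It uses the textbook definitions of each metric and is not taken from the PAL implementation, so details (in particular how 'cosine' is turned into a distance) are assumptions. The function names and the DISTANCES lookup table are hypothetical helpers introduced only for this illustration.

```python
import math

# Textbook distance definitions; the exact PAL formulas are an assumption.

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def minkowski(a, b, p=3.0):
    # Generalised form; p mirrors the documented minkowski_power default of 3.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def chebyshev(a, b):
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):
    # One minus cosine similarity (assumed convention); undefined for zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical lookup keyed by the documented distance_level values.
DISTANCES = {
    'manhattan': manhattan,
    'euclidean': euclidean,
    'minkowski': minkowski,
    'chebyshev': chebyshev,
    'cosine': cosine,
}

print(DISTANCES['euclidean']([0.0, 0.0], [3.0, 4.0]))
```

With p set to 2 the Minkowski form reduces to the Euclidean distance, and with p set to 1 it reduces to the Manhattan distance.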
Examples
>>> tomeklinks = TomekLinks(search_method='kd-tree', sampling_strategy='majority')
>>> res = tomeklinks.fit_transform(data=df, label='TYPE')
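To make the idea behind the class concrete, the following is a minimal local sketch of Tomek-link under-sampling. A Tomek link is a pair of points from different classes that are each other's nearest neighbour. The sketch runs in plain Python without a HANA connection and does not reproduce the PAL procedure: it assumes brute-force search with Euclidean distance, and it assumes that, for each link, only members belonging to the classes selected by sampling_strategy are removed. All function names here are hypothetical.

```python
import math
from collections import Counter

def nearest_neighbour(points, i):
    # Brute-force search for the index of the point closest to points[i].
    best, best_d = None, math.inf
    for j, q in enumerate(points):
        if j != i:
            d = math.dist(points[i], q)
            if d < best_d:
                best, best_d = j, d
    return best

def tomek_links(points, labels):
    # Pairs (i, j), i < j, that are mutual nearest neighbours with different labels.
    nn = [nearest_neighbour(points, i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

def target_classes(labels, sampling_strategy='majority'):
    # Classes eligible for removal, following the documented option names.
    counts = Counter(labels)
    majority = max(counts, key=counts.get)
    minority = min(counts, key=counts.get)
    if sampling_strategy == 'majority':
        return {majority}
    if sampling_strategy == 'non-minority':
        return set(counts) - {minority}
    if sampling_strategy == 'non-majority':
        return set(counts) - {majority}
    if sampling_strategy == 'all':
        return set(counts)
    raise ValueError('unknown sampling_strategy: %r' % sampling_strategy)

def undersample(points, labels, sampling_strategy='majority'):
    # Drop link members whose class is targeted (assumed removal rule).
    targets = target_classes(labels, sampling_strategy)
    drop = {k for pair in tomek_links(points, labels)
            for k in pair if labels[k] in targets}
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]

# Toy data: indices 2 ('A') and 3 ('B') form the only Tomek link.
points = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (1.1, 0.0),
          (3.0, 0.0), (3.5, 0.0), (-1.0, 0.0)]
labels = ['A', 'A', 'A', 'B', 'B', 'B', 'A']

print(tomek_links(points, labels))  # [(2, 3)]
kept_points, kept_labels = undersample(points, labels, 'majority')
print(kept_labels)
```

In this toy data set 'A' is the majority class, so the 'majority' strategy removes only the 'A' member of the link, whereas 'all' would remove both members.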
- Attributes:
- None
Methods
fit_transform(data[, key, label, ...])
Perform under-sampling on given datasets by removing Tomek's links.
- fit_transform(data, key=None, label=None, categorical_variable=None, variable_weight=None)
Perform under-sampling on given datasets by removing Tomek's links.
- Parameters:
- data : DataFrame
DataFrame that contains the training data.
- key : str, optional
Specifies the name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- label : str, optional
Specifies the dependent variable by name.
If not specified, defaults to the first non-key column in data.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- variable_weight : dict, optional
Specifies the weights of variables participating in distance calculation in a dictionary:
key : variable(column) name
value : weight for distance calculation
No default value.
- Returns:
- DataFrame
Under-sampled result, with the same structure as the input data.
Inherited Methods from PALBase
Besides the methods mentioned above, the TomekLinks class also inherits methods from the PALBase class. Please refer to PAL Base for more details.