TomekLinks
- class hana_ml.algorithms.pal.preprocessing.TomekLinks(distance_level=None, minkowski_power=None, thread_ratio=None, search_method=None, sampling_strategy=None, category_weights=None)
Performs under-sampling by removing Tomek's links.
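A Tomek's link is commonly defined as a pair of samples from different classes that are each other's nearest neighbours. The following is a minimal pure-Python sketch of that definition, not the PAL implementation: the helper names `nearest_neighbour` and `tomek_links` and the sample data are hypothetical, and Euclidean distance with a brute-force search is assumed.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def nearest_neighbour(points, i):
    """Return the index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Return pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    links = []
    for i in range(len(points)):
        j = nearest_neighbour(points, i)
        if i < j and labels[i] != labels[j] and nearest_neighbour(points, j) == i:
            links.append((i, j))
    return links


# Hypothetical toy data: points 0 and 1 are very close but belong to different classes.
points = [(0.0, 0.0), (0.2, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
labels = ["A", "B", "A", "A", "B"]

links = tomek_links(points, labels)
print(links)
```

Under-sampling then drops one or both members of each link, depending on which classes are targeted for resampling.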
- Parameters
- distance_levelstr, optional
Specifies the method used to compute the distance between data points. Valid options are:
'manhattan'
'euclidean'
'minkowski'
'chebyshev'
'cosine'
Defaults to 'euclidean'.
- minkowski_powerfloat, optional
Specifies the value of power for Minkowski distance calculation.
Defaults to 3.
Valid only when distance_level is 'minkowski'.
- thread_ratiofloat, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to 0.
- search_methodstr, optional
Specifies the search method used when finding the k nearest neighbours:
'brute-force'
'kd-tree'
Defaults to 'brute-force'.
- sampling_strategystr, optional
Specifies the classes targeted by resampling:
'majority' : resamples only the majority class
'non-minority' : resamples all classes except the minority class
'non-majority' : resamples all classes except the majority class
'all' : resamples all classes
Defaults to 'majority'.
- category_weightsfloat, optional
Specifies the weight for categorical attributes.
Defaults to 0.707 if not provided.
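For reference, the distance measures named under distance_level can be sketched with their textbook definitions. This is an illustration only: the function names are hypothetical, cosine distance is assumed here to mean one minus the cosine similarity, and PAL's exact handling (for example of categorical attributes) may differ.

```python
import math


def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minkowski(a, b, power=3):
    # Generalised form; power=1 gives Manhattan, power=2 gives Euclidean.
    return sum(abs(x - y) ** power for x, y in zip(a, b)) ** (1.0 / power)


def chebyshev(a, b):
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    # Assumed definition: 1 - cosine similarity (undefined for zero vectors).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a, b = (0.0, 0.0), (3.0, 4.0)
print(manhattan(a, b), euclidean(a, b), chebyshev(a, b), minkowski(a, b))
```

The default `power=3` in the sketch mirrors the documented default of minkowski_power.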
- Attributes
- None
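The four sampling_strategy options differ only in which classes are eligible to lose samples. The sketch below shows one reading of those options on a plain list of labels; `targeted_classes` is a hypothetical helper, and the tie-breaking when two classes have equal counts is an assumption, not documented behaviour.

```python
from collections import Counter


def targeted_classes(labels, sampling_strategy="majority"):
    """Return the set of class labels that a given strategy would resample."""
    counts = Counter(labels)
    majority = max(counts, key=counts.get)  # assumption: first class seen wins ties
    minority = min(counts, key=counts.get)
    if sampling_strategy == "majority":
        return {majority}
    if sampling_strategy == "non-minority":
        return set(counts) - {minority}
    if sampling_strategy == "non-majority":
        return set(counts) - {majority}
    if sampling_strategy == "all":
        return set(counts)
    raise ValueError(f"unknown sampling_strategy: {sampling_strategy!r}")


# Hypothetical label column: 5 samples of "A", 3 of "B", 1 of "C".
labels = ["A"] * 5 + ["B"] * 3 + ["C"]
for strategy in ("majority", "non-minority", "non-majority", "all"):
    print(strategy, sorted(targeted_classes(labels, strategy)))
```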
Methods
fit_transform(data[, key, label, ...])
Perform under-sampling on the given dataset by removing Tomek's links.
Examples
>>> tomeklinks = TomekLinks(search_method='kd-tree', sampling_strategy='majority')
>>> res = tomeklinks.fit_transform(data=df, label='TYPE')
- fit_transform(data, key=None, label=None, categorical_variable=None, variable_weight=None)
Perform under-sampling on the given dataset by removing Tomek's links.
- Parameters
- dataDataFrame
DataFrame that contains the training data.
- keystr, optional
Specifies the name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- labelstr, optional
Specifies the dependent variable by name.
If not specified, defaults to the 1st non-key column in data.
- categorical_variablestr or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- variable_weightdict, optional
Specifies the weights of variables participating in distance calculation in a dictionary:
key : variable(column) name
value : weight for distance calculation
No default value.
- Returns
- DataFrame
Under-sampled result, with the same structure as the input data.
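As an illustration of the variable_weight dictionary, the sketch below builds such a mapping and applies it in a weighted Euclidean distance. How PAL actually combines the weights with the chosen distance is not stated here, so the formula, the default weight of 1.0 for unlisted columns, the column names `X1` and `X2`, and the helper `weighted_euclidean` are all assumptions made for the example.

```python
import math

# Keys are variable (column) names, values are their weights.
variable_weight = {"X1": 2.0, "X2": 0.5}


def weighted_euclidean(row_a, row_b, weights):
    """Euclidean distance with each squared difference scaled by its column weight.

    Assumption: columns missing from `weights` get a weight of 1.0.
    """
    return math.sqrt(
        sum(weights.get(col, 1.0) * (row_a[col] - row_b[col]) ** 2 for col in row_a)
    )


row_a = {"X1": 0.0, "X2": 0.0}
row_b = {"X1": 1.0, "X2": 2.0}
d = weighted_euclidean(row_a, row_b, variable_weight)
print(d)
```

A larger weight makes differences in that column count more when neighbours are compared, which in turn changes which pairs qualify as Tomek's links.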