TomekLinks
- class hana_ml.algorithms.pal.preprocessing.TomekLinks(distance_level=None, minkowski_power=None, thread_ratio=None, search_method=None, sampling_strategy=None, category_weights=None)
This class is for performing under-sampling by removing Tomek's links.
- Parameters:
- distance_level : str, optional
Specifies the distance metric used between data points:
'manhattan'
'euclidean'
'minkowski'
'chebyshev'
'cosine'
Defaults to 'euclidean'.
- minkowski_power : float, optional
Specifies the value of power for Minkowski distance calculation.
Defaults to 3.
Valid only when distance_level is 'minkowski'.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
- search_method : str, optional
Specifies the search method used to find the nearest neighbors:
'brute-force'
'kd-tree'
Defaults to 'brute-force'.
- sampling_strategy : str, optional
Specifies the classes targeted by resampling:
'majority' : resamples only the majority class
'non-minority' : resamples all classes except the minority class
'non-majority' : resamples all classes except the majority class
'all' : resamples all classes
Defaults to 'majority'.
- category_weights : float, optional
Specifies the weight for categorical attributes.
Defaults to 0.707 if not provided.
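The distance_level options above correspond to standard distance metrics. The following is a minimal sketch of their textbook definitions for numeric vectors, not the PAL implementation. The helper name `distance` is hypothetical, and defining cosine distance as one minus the cosine similarity is an assumption.

```python
import math


def distance(a, b, distance_level="euclidean", minkowski_power=3.0):
    """Textbook distance between two equal-length numeric vectors.

    Hypothetical helper for illustration only; it is not part of hana_ml.
    """
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if distance_level == "manhattan":
        return sum(diffs)
    if distance_level == "euclidean":
        return math.sqrt(sum(d * d for d in diffs))
    if distance_level == "minkowski":
        # minkowski_power only matters for this metric.
        return sum(d ** minkowski_power for d in diffs) ** (1.0 / minkowski_power)
    if distance_level == "chebyshev":
        return max(diffs)
    if distance_level == "cosine":
        # Assumption: cosine distance = 1 - cosine similarity.
        # Undefined when either vector has zero length.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / (norm_a * norm_b)
    raise ValueError("unknown distance_level: %r" % (distance_level,))
```

With a power of 2 the Minkowski distance reduces to the Euclidean distance, and with a power of 1 it reduces to the Manhattan distance.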
Examples
>>> tomeklinks = TomekLinks(search_method='kd-tree', sampling_strategy='majority')
>>> res = tomeklinks.fit_transform(data=df, label='TYPE')
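To illustrate what removing Tomek's links means, here is a minimal NumPy sketch of the general technique. It runs locally on arrays and is not the PAL implementation. A Tomek link is a pair of points from different classes that are each other's nearest neighbor. The function name `tomek_link_undersample` is hypothetical, the search is brute-force with Euclidean distance, and the rule for which member of a link is removed under each sampling_strategy is an assumption.

```python
from collections import Counter

import numpy as np


def tomek_link_undersample(X, y, sampling_strategy="majority"):
    """Drop points that sit in a Tomek link and belong to a targeted class.

    Hypothetical helper for illustration only; it is not part of hana_ml.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels = y.tolist()

    # Brute-force pairwise Euclidean distances; a point is not its own neighbor.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)

    # Decide which classes may lose points (assumed semantics).
    counts = Counter(labels)
    majority = max(counts, key=counts.get)
    minority = min(counts, key=counts.get)
    if sampling_strategy == "majority":
        targets = {majority}
    elif sampling_strategy == "non-minority":
        targets = set(counts) - {minority}
    elif sampling_strategy == "non-majority":
        targets = set(counts) - {majority}
    elif sampling_strategy == "all":
        targets = set(counts)
    else:
        raise ValueError("unknown sampling_strategy: %r" % (sampling_strategy,))

    # A Tomek link: mutual nearest neighbors with different labels.
    drop = set()
    for i, j in enumerate(nearest.tolist()):
        if nearest[j] == i and labels[i] != labels[j] and labels[i] in targets:
            drop.add(i)

    keep = np.array([i for i in range(len(labels)) if i not in drop], dtype=int)
    return X[keep], y[keep]


# Usage: the points at 3.0 (class 0) and 3.2 (class 1) form a Tomek link.
X = [[0.0], [1.0], [2.5], [3.0], [3.2], [10.0]]
y = [0, 0, 0, 0, 1, 1]
X_res, y_res = tomek_link_undersample(X, y, sampling_strategy="majority")
```

In this example only the majority-class point at 3.0 is removed under 'majority', while 'all' would remove both members of the link.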
- Attributes:
- None
Methods
fit_transform(data[, key, label, ...]) Perform under-sampling on given datasets by removing Tomek's links.
get_model_metrics() Get the model metrics.
get_score_metrics() Get the score metrics.
- fit_transform(data, key=None, label=None, categorical_variable=None, variable_weight=None)
Perform under-sampling on given datasets by removing Tomek's links.
- Parameters:
- data : DataFrame
DataFrame that contains the training data.
- key : str, optional
Specifies the name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- label : str, optional
Specifies the dependent variable by name.
If not specified, defaults to the first non-key column in data.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- variable_weight : dict, optional
Specifies the weights of variables participating in distance calculation in a dictionary:
key : variable(column) name
value : weight for distance calculation
No default value.
- Returns:
- DataFrame
Undersampled result, the same structure as defined in the input data.
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
In addition to the methods described above, the TomekLinks class inherits methods from the PALBase class. Refer to PAL Base for more details.