hanaml.TomekLinks is an R wrapper for the SAP HANA PAL Tomek's links algorithm.

hanaml.TomekLinks(
  data,
  key = NULL,
  features = NULL,
  label = NULL,
  thread.ratio = NULL,
  sampling.strategy = NULL,
  categorical.variable = NULL,
  category.weights = NULL,
  distance.level = NULL,
  minkowski.power = NULL,
  variable.weight = NULL,
  algorithm = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column. If not provided, the data is assumed to have no ID column.
No default value.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

label

character
Specifies the dependent variable by name.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.

sampling.strategy

c("majority", "non-minority", "non-majority", "all"), optional
Specifies the sampling strategy used to resample the input dataset.

  • "majority" resamples only the majority class

  • "non-minority" resamples all classes but the minority class

  • "non-majority" resamples all classes but the majority class

  • "all" resamples all classes

Defaults to "majority".
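
The four strategies can be pictured with a small base R sketch. It is only an illustration, not the PAL implementation: it assumes the usual convention that a sample in a Tomek's link is removed only when its class is one of the resampled classes. The label vector, the link pairs, and the helper names classes_to_resample and rows_to_drop are hypothetical.

```r
# Illustrative class labels: class 1 is the majority, class 2 the minority
y <- c(1, 1, 2, 1, 1, 2, 1, 2, 1, 1)

# Hypothetical Tomek's links, one pair of row indices per matrix row
links <- rbind(c(3, 4), c(5, 6))

# Classes that may lose samples under each sampling.strategy value
classes_to_resample <- function(y, strategy) {
  counts <- table(y)
  majority <- names(counts)[which.max(counts)]
  minority <- names(counts)[which.min(counts)]
  switch(strategy,
    "majority"     = majority,
    "non-minority" = setdiff(names(counts), minority),
    "non-majority" = setdiff(names(counts), majority),
    "all"          = names(counts),
    stop("unknown strategy: ", strategy)
  )
}

# Linked rows whose class is resampled under the given strategy
rows_to_drop <- function(y, links, strategy) {
  linked <- sort(unique(as.vector(links)))
  linked[as.character(y[linked]) %in% classes_to_resample(y, strategy)]
}

rows_to_drop(y, links, "majority")      # linked rows of class 1 only
rows_to_drop(y, links, "non-majority")  # linked rows of class 2 only
rows_to_drop(y, links, "all")           # every linked row
```

With only two classes, "majority" and "non-minority" select the same rows; the two differ once three or more classes are present.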

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior is dependent on what input is given:

  • "VARCHAR" and "NVARCHAR": categorical

  • "INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.

category.weights

double, optional
Specifies the weight for categorical attributes.
Defaults to 0.707.

distance.level

c("manhattan", "euclidean", "minkowski", "chebyshev", "cosine"), optional
Specifies the distance method between sample points.
Defaults to "euclidean".

minkowski.power

double, optional
Specifies the value of the power in the definition of the Minkowski distance.
Valid only when distance.level is set to "minkowski".
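
To make the distance options concrete, the following base R sketch computes each metric for two samples. It is a local illustration under stated assumptions and does not call PAL: the two points are taken from the example data, the mapping of "chebyshev" to dist's "maximum" method is the standard equivalence, the power p = 3 is an arbitrary choice for the demonstration (not a documented default), and cosine distance is taken here as one minus the cosine similarity.

```r
# Two samples from the example data (features X1, X2, X3)
a <- c(7, 1000, 8.5)
b <- c(8, 999, 7.4)
pts <- rbind(a, b)

# Sum of absolute coordinate differences
d_manhattan <- as.numeric(dist(pts, method = "manhattan"))

# Square root of the sum of squared differences
d_euclidean <- as.numeric(dist(pts, method = "euclidean"))

# p-th root of the sum of |difference|^p; p plays the role of minkowski.power
d_minkowski <- as.numeric(dist(pts, method = "minkowski", p = 3))

# Largest absolute coordinate difference (Chebyshev)
d_chebyshev <- as.numeric(dist(pts, method = "maximum"))

# One minus the cosine of the angle between the two vectors
d_cosine <- 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

print(c(manhattan = d_manhattan, euclidean = d_euclidean,
        minkowski = d_minkowski, chebyshev = d_chebyshev,
        cosine = d_cosine))
```

Setting the Minkowski power to 1 or 2 gives the Manhattan and Euclidean distances respectively, which is why the power only matters when "minkowski" is selected.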

variable.weight

named list/vector of double, optional
Specifies the weight of a variable (feature) used in the distance calculation.
The weight value must be non-negative.
The weight defaults to 1 for any variable (feature) not specified.

algorithm

c("brute-force", "kd-tree"), optional
Specifies the searching algorithms for finding the nearest neighbors.

  • "brute-force" uses the brute-force method

  • "kd-tree" uses kd-tree searching

Defaults to "brute-force".

Value

  • DataFrame
    The dataset after sampling. The output table has the same structure as the input table.

Details

For a collection of sample points, a Tomek's link exists between two points of different classes if each is the other's nearest neighbor. The function performs under-sampling by removing samples that form Tomek's links, as selected by sampling.strategy.
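
The mutual nearest neighbor check can be sketched locally in base R. This is a minimal illustration, not the PAL implementation: it assumes plain Euclidean distance on unscaled features, and it applies the usual Tomek's link requirement that the two points carry different labels. The helper name find_tomek_links is hypothetical, and the data.frame is a local copy of the data used in the Examples section.

```r
# Local copy of the example data
df <- data.frame(
  X1 = c(2, 3, 3, 3, 7, 8, 6, 8, 7, 6),
  X2 = c(1, 10, 10, 10, 1000, 1000, 1000, 999, 999, 1000),
  X3 = c(3.5, 7.6, 5.5, 4.7, 8.5, 9.4, 0.34, 7.4, 3.5, 7.0),
  TYPE = c(1, 1, 2, 1, 1, 2, 1, 2, 1, 1)
)

# Return one row per Tomek's link as a pair of row indices (a < b)
find_tomek_links <- function(x, y) {
  d <- as.matrix(dist(x, method = "euclidean"))
  diag(d) <- Inf                       # a point is not its own neighbor
  nn <- unname(apply(d, 1, which.min)) # nearest neighbor of each row
  i <- seq_along(nn)
  mutual <- nn[nn] == i                # the neighbor points back at the row
  opposite <- y != y[nn]               # the two labels differ
  keep <- mutual & opposite & i < nn   # report each pair once
  cbind(a = i[keep], b = nn[keep])
}

links <- find_tomek_links(df[, c("X1", "X2", "X3")], df$TYPE)

# Equivalent of sampling.strategy = "all": drop every linked row
keep_rows <- setdiff(seq_len(nrow(df)), as.vector(links))
reduced <- df[keep_rows, ]
print(links)
print(reduced)
```

On this data the sketch flags rows 3 and 4 and rows 5 and 6, so dropping all linked rows leaves the six rows shown in the Results of the Examples section. Rows 7 and 9 are also mutual nearest neighbors, but they share a label and therefore do not form a link.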

Examples


> data$Collect()
   X1   X2   X3 TYPE
1   2    1 3.50    1
2   3   10 7.60    1
3   3   10 5.50    2
4   3   10 4.70    1
5   7 1000 8.50    1
6   8 1000 9.40    2
7   6 1000 0.34    1
8   8  999 7.40    2
9   7  999 3.50    1
10  6 1000 7.00    1

Call the function:


> result <- hanaml.TomekLinks(data=data,
                              label = "TYPE",
                              algorithm = "kd-tree",
                              sampling.strategy = "all",
                              thread.ratio = 0.1)

Results:


> result$Collect()
   X1   X2   X3 TYPE
1   2    1 3.50    1
2   3   10 7.60    1
3   6 1000 0.34    1
4   8  999 7.40    2
5   7  999 3.50    1
6   6 1000 7.00    1