hanaml.TomekLinks is an R wrapper
for SAP HANA PAL Tomek's links.
hanaml.TomekLinks(
data,
key = NULL,
features = NULL,
label = NULL,
thread.ratio = NULL,
sampling.strategy = NULL,
categorical.variable = NULL,
category.weights = NULL,
distance.level = NULL,
minkowski.power = NULL,
variable.weight = NULL,
algorithm = NULL
)
Arguments
| data |
DataFrame
DataFrame containing the data.
|
| key |
character, optional
Name of the ID column.
If not provided, the data is assumed to have no ID column.
No default value.
|
| features |
character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.
|
| label |
character
Specifies the dependent variable by name.
|
| thread.ratio |
double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads.
Values between 0 and 1 will use up to
that percentage of available threads. Values outside this
range are ignored.
Defaults to 0.
|
| sampling.strategy |
c("majority", "non-minority", "non-majority", "all"), optional
Specifies the sampling strategy used to resample the input dataset.
"majority" resamples only the majority class
"non-minority" resamples all classes but the minority class
"non-majority" resamples all classes but the majority class
"all" resamples all classes
Defaults to "majority". |
| categorical.variable |
character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior is dependent on what input is given:
Valid only for variables of "INTEGER" type; ignored otherwise.
No default value. |
| category.weights |
double, optional
Specifies the weight for categorical attributes.
Defaults to 0.707.
|
| distance.level |
c("manhattan", "euclidean", "minkowski", "chebyshev", "cosine"), optional
Specifies the distance method between sample points.
Defaults to "euclidean".
|
| minkowski.power |
double, optional
Specifies value of power in the definition of minkowski distance.
Valid only when distance.level is set to "minkowski".
|
| variable.weight |
named list/vector of double, optional
Specifies the weight of a variable (feature) participating in the distance calculation.
The weight value must be non-negative.
Weight values default to 1 for non-specified variables (features).
|
| algorithm |
c("brute-force", "kd-tree"), optional
Specifies the search algorithm used to find the nearest neighbors.
Defaults to "brute-force". |
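The options listed for distance.level correspond to standard distance definitions. The base R sketch below illustrates those textbook formulas only; it is not the PAL implementation. The helper name point.distance and its arguments are hypothetical, the default p = 2 is an arbitrary choice for this sketch rather than PAL's default, and variable.weight and category.weights are not modelled.

```r
# Hypothetical helper: textbook distance between two numeric vectors.
# Illustration only; PAL's handling of weights and categorical
# attributes is not reproduced here.
point.distance <- function(a, b, method = "euclidean", p = 2) {
  diff <- abs(a - b)
  switch(method,
    manhattan = sum(diff),
    euclidean = sqrt(sum(diff^2)),
    minkowski = sum(diff^p)^(1 / p),   # p plays the role of minkowski.power
    chebyshev = max(diff),
    cosine    = 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2))),
    stop("unknown method: ", method)
  )
}

a <- c(1, 2, 3)
b <- c(4, 6, 3)
point.distance(a, b, "manhattan")   # 7
point.distance(a, b, "euclidean")   # 5
point.distance(a, b, "chebyshev")   # 4
point.distance(a, b, "minkowski", p = 1)  # 7, same as manhattan
point.distance(a, b, "cosine")
```

With p = 1 the Minkowski distance reduces to the Manhattan distance, and with p = 2 to the Euclidean distance, which is why minkowski.power only matters when distance.level is "minkowski".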
Value
Details
For a collection of sample points, a Tomek's link exists between two points if
they belong to different classes and each is the other's nearest neighbor.
The function performs under-sampling by removing the points that form Tomek's links,
as selected by sampling.strategy.
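The idea can be sketched locally in base R. The sketch assumes the standard definition of a Tomek's link (mutual nearest neighbors carrying different labels), Euclidean distance, and removal of both points of every link, which corresponds to sampling.strategy = "all". The helper name in.tomek.link is hypothetical, and the code works on an ordinary data.frame copied from the example data on this page, not on a HANA DataFrame. It does not reflect how PAL is implemented.

```r
# Local copy of the example data shown in the Examples section.
df <- data.frame(
  X1   = c(2, 3, 3, 3, 7, 8, 6, 8, 7, 6),
  X2   = c(1, 10, 10, 10, 1000, 1000, 1000, 999, 999, 1000),
  X3   = c(3.5, 7.6, 5.5, 4.7, 8.5, 9.4, 0.34, 7.4, 3.5, 7.0),
  TYPE = c(1, 1, 2, 1, 1, 2, 1, 2, 1, 1)
)

# Hypothetical helper: TRUE for every row that is part of a Tomek's link.
in.tomek.link <- function(x, label) {
  d <- as.matrix(dist(x, method = "euclidean"))  # pairwise distances
  diag(d) <- Inf                       # a point is not its own neighbor
  nn <- unname(apply(d, 1, which.min)) # nearest neighbor of each row
  mutual <- nn[nn] == seq_along(nn)    # row i is the nearest neighbor of nn[i]
  differ <- label != label[nn]         # the two points have different labels
  mutual & differ
}

linked <- in.tomek.link(df[, c("X1", "X2", "X3")], df$TYPE)
kept <- df[!linked, ]                  # drop both points of every link
rownames(kept) <- NULL
print(which(linked))
print(kept)
```

On this data the sketch flags rows 3, 4, 5 and 6, and the six remaining rows agree with the result listed in the Examples section. Rows 7 and 9 are also mutual nearest neighbors, but they share the same TYPE and therefore do not form a link.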
Examples
> data$Collect()
X1 X2 X3 TYPE
1 2 1 3.50 1
2 3 10 7.60 1
3 3 10 5.50 2
4 3 10 4.70 1
5 7 1000 8.50 1
6 8 1000 9.40 2
7 6 1000 0.34 1
8 8 999 7.40 2
9 7 999 3.50 1
10 6 1000 7.00 1
Call the function:
> result <- hanaml.TomekLinks(data=data,
label = "TYPE",
algorithm = "kd-tree",
sampling.strategy = "all",
thread.ratio = 0.1)
Results:
> result$Collect()
X1 X2 X3 TYPE
1 2 1 3.50 1
2 3 10 7.60 1
3 6 1000 0.34 1
4 8 999 7.40 2
5 7 999 3.50 1
6 6 1000 7.00 1
See also