hanaml.TomekLinks.Rd
hanaml.TomekLinks is an R wrapper
for SAP HANA PAL Tomek's links.
hanaml.TomekLinks(
data,
key = NULL,
features = NULL,
label = NULL,
thread.ratio = NULL,
sampling.strategy = NULL,
categorical.variable = NULL,
category.weights = NULL,
distance.level = NULL,
minkowski.power = NULL,
variable.weight = NULL,
algorithm = NULL
)
DataFrame
DataFrame containing the data.
character, optional
Name of the ID column.
If not provided, the data is assumed to have no ID column.
No default value.
character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.
character
Specifies the dependent variable by name.
double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads.
Values between 0 and 1 will use up to
that percentage of available threads. Values outside this
range are ignored.
Defaults to 0.
c("majority", "non-minority", "non-majority", "all"), optional
Specifies the sampling strategy used to resample the input dataset.
"majority"
resamples only the majority class
"non-minority"
resamples all classes but the minority class
"non-majority"
resamples all classes but the majority class
"all"
resamples all classes
Defaults to "majority".
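The four strategy values can be illustrated with a small base-R sketch. The helper classes_to_resample is hypothetical and not part of hana.ml.r. It assumes that the majority class is the one with the most rows and the minority class the one with the fewest, with ties resolved by taking the first class in sorted order; PAL may resolve ties differently.

```r
# Hypothetical helper (not part of hana.ml.r): list the classes that each
# sampling.strategy value makes eligible for resampling.
classes_to_resample <- function(labels, strategy = "majority") {
  counts <- table(labels)
  classes <- names(counts)
  majority <- classes[which.max(counts)]  # assumption: largest class
  minority <- classes[which.min(counts)]  # assumption: smallest class
  switch(strategy,
    "majority"     = majority,
    "non-minority" = setdiff(classes, minority),
    "non-majority" = setdiff(classes, majority),
    "all"          = classes,
    stop("unknown sampling strategy: ", strategy)
  )
}

# Toy labels: four rows of "a", three of "b", one of "c".
labels <- c("a", "a", "a", "a", "b", "b", "b", "c")
classes_to_resample(labels, "majority")      # "a"
classes_to_resample(labels, "non-minority")  # "a" "b"
classes_to_resample(labels, "non-majority")  # "b" "c"
classes_to_resample(labels, "all")           # "a" "b" "c"
```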
character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
By default, the column type determines the treatment:
"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.
Valid only for columns of "INTEGER" type; ignored otherwise.
No default value.
double, optional
Specifies the weight for categorical attributes.
Defaults to 0.707.
c("manhattan", "euclidean", "minkowski", "chebyshev", "cosine"), optional
Specifies the distance method between sample points.
Defaults to "euclidean".
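For orientation, the listed distance methods correspond to the following textbook formulas for two numeric vectors. This is a minimal sketch: point_distance is a hypothetical helper, the value p = 2 is an arbitrary illustrative default rather than the documented one, and the cosine variant (1 minus cosine similarity) is an assumption. PAL's internal definitions, including its handling of categorical columns and variable weights, may differ.

```r
# Hypothetical helper (not part of hana.ml.r): textbook distances between
# two numeric vectors x and y of equal length.
point_distance <- function(x, y, method = "euclidean", p = 2) {
  d <- abs(x - y)
  switch(method,
    "manhattan" = sum(d),
    "euclidean" = sqrt(sum(d^2)),
    "minkowski" = sum(d^p)^(1 / p),   # p plays the role of minkowski.power
    "chebyshev" = max(d),
    # assumption: cosine distance taken as 1 - cosine similarity
    "cosine"    = 1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2))),
    stop("unknown distance method: ", method)
  )
}

x <- c(0, 0)
y <- c(3, 4)
point_distance(x, y, "manhattan")          # 7
point_distance(x, y, "euclidean")          # 5
point_distance(x, y, "chebyshev")          # 4
point_distance(x, y, "minkowski", p = 1)   # 7, same as manhattan
```

With p = 1 the Minkowski distance reduces to the Manhattan distance, and with p = 2 to the Euclidean distance.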
double, optional
Specifies the value of the power in the definition of the Minkowski distance.
Valid only when distance.level
is set to "minkowski".
named list/vector of double, optional
Specifies the weight of a variable (feature) participating in the distance calculation.
The weight value must be non-negative.
Weight values default to 1 for non-specified variables (features).
c("brute-force", "kd-tree"), optional
Specifies the search algorithm used to find the nearest neighbors.
"brute-force"
use brute-force method
"kd-tree"
use kd-tree searching
Defaults to "brute-force".
DataFrame
Return dataset after sampling.
The Output Table has the same structure as defined in the Input Table.
For a collection of sample points, a Tomek's link exists between two points of different classes if each is the other's nearest neighbor. The function performs under-sampling by removing the points that form Tomek's links.
> data$Collect()
X1 X2 X3 TYPE
1 2 1 3.50 1
2 3 10 7.60 1
3 3 10 5.50 2
4 3 10 4.70 1
5 7 1000 8.50 1
6 8 1000 9.40 2
7 6 1000 0.34 1
8 8 999 7.40 2
9 7 999 3.50 1
10 6 1000 7.00 1
Call the function:
> result <- hanaml.TomekLinks(data = data,
                              label = "TYPE",
                              algorithm = "kd-tree",
                              sampling.strategy = "all",
                              thread.ratio = 0.1)
Results:
> result$Collect()
X1 X2 X3 TYPE
1 2 1 3.50 1
2 3 10 7.60 1
3 6 1000 0.34 1
4 8 999 7.40 2
5 7 999 3.50 1
6 6 1000 7.00 1
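The result above can be cross-checked locally, without a HANA connection, using a brute-force base-R sketch. The helper tomek_link_rows is hypothetical and not part of hana.ml.r. It assumes unweighted Euclidean distance on the three numeric columns and treats a Tomek's link as a pair of mutual nearest neighbors with different labels. It illustrates the idea and is not the PAL implementation.

```r
# Example data from this page, rebuilt as a local data.frame.
df <- data.frame(
  X1 = c(2, 3, 3, 3, 7, 8, 6, 8, 7, 6),
  X2 = c(1, 10, 10, 10, 1000, 1000, 1000, 999, 999, 1000),
  X3 = c(3.5, 7.6, 5.5, 4.7, 8.5, 9.4, 0.34, 7.4, 3.5, 7.0),
  TYPE = c(1, 1, 2, 1, 1, 2, 1, 2, 1, 1)
)

# Hypothetical helper: indices of rows that take part in a Tomek's link.
tomek_link_rows <- function(features, labels) {
  d <- as.matrix(dist(features, method = "euclidean"))
  diag(d) <- Inf                          # a point is not its own neighbor
  nn <- unname(apply(d, 1, which.min))    # nearest neighbor of each row
  idx <- seq_along(nn)
  mutual <- nn[nn] == idx                 # row i is the nearest neighbor of nn[i]
  differs <- labels != labels[nn]         # the pair spans two classes
  idx[mutual & differs]
}

linked <- tomek_link_rows(df[, c("X1", "X2", "X3")], df$TYPE)
linked   # 3 4 5 6

# Equivalent of sampling.strategy = "all": drop both ends of every link.
result <- df[!(seq_len(nrow(df)) %in% linked), ]
result   # rows 1, 2, 7, 8, 9 and 10 remain
```

Rows 7 and 9 are also mutual nearest neighbors, but they share the same label, so they do not form a Tomek's link and are kept.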