hanaml.SMOTETomek.Rd
hanaml.SMOTETomek is an R wrapper
for SAP HANA PAL SMOTETomek.
hanaml.SMOTETomek(
data,
key = NULL,
features = NULL,
label = NULL,
thread.ratio = NULL,
random.state = NULL,
n.neighbors = NULL,
minority.class = NULL,
smote.amount = NULL,
algorithm = NULL,
sampling.strategy = NULL,
categorical.variable = NULL,
variable.weight = NULL,
category.weights = NULL
)
DataFrame
DataFrame containing the data.
character, optional
Name of the ID column.
If not provided, the data is assumed to have no ID column.
No default value.
character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.
character
Specifies the dependent variable by name.
double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads.
Values between 0 and 1 will use up to
that percentage of available threads. Values outside this
range are ignored.
Defaults to 0.
double, optional
Specifies the seed for random number generation, where 0 means
current system time is used as seed, and other values are simply
real seed values. Defaults to 0.
integer, optional
Specifies the number of nearest neighbors.
Defaults to 1.
character, optional
Specifies the targeted minority class value in the dependent
variable column. When minority.class is not specified, all
classes except the majority class are resampled to match the
majority class sample amount.
integer, optional
Specifies the amount of SMOTE as a percentage N. For example, 200 means
that each minority class sample generates two synthetic samples.
Only valid when minority.class is specified.
When not specified, the algorithm generates samples until
the minority class sample amount matches the majority class
sample amount.
c("brute-force", "kd-tree"), optional
Specifies the search algorithm for finding the nearest neighbors.
"brute-force"
use brute-force method
"kd-tree"
use kd-tree searching
Defaults to "brute-force".
c("majority", "non-minority", "non-majority", "all"), optional
Specifies the sampling strategy to resample the input dataset.
"majority"
resamples only the majority class
"non-minority"
resamples all classes but the minority one
"non-majority"
resamples all classes but the majority one
"all"
resamples all classes
Defaults to "majority".
character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior depends on the column type of the input:
"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.
Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.
numeric, optional
Specifies the weight of a variable participating in distance calculation.
The value must be greater than or equal to 0.
Defaults to 1 for variables not specified.
numeric, optional
Represents the weight of category attributes.
The value must be greater than or equal to 0.
Defaults to 0.707.
DataFrame
DataFrame for the output table.
The output table has the same structure as defined in the input table.
SMOTETomek combines over-sampling and under-sampling using SMOTE and Tomek links.
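To make the two ingredients concrete, the following base-R sketch shows SMOTE-style interpolation and Tomek-link detection on a small local matrix. It is an illustration only and not the PAL implementation: the helper names smote_point and tomek_links are hypothetical, plain Euclidean distance is assumed, and nothing here connects to SAP HANA.

```r
# Illustrative only: hypothetical helpers, not part of hana.ml.r or PAL.

# SMOTE-style synthetic point: a point on the segment between a minority
# sample x and one of its minority-class neighbors nb. By default the
# position along the segment is drawn uniformly at random.
smote_point <- function(x, nb, gap = runif(1)) {
  x + gap * (nb - x)
}

# Tomek links: pairs (i, j) with different labels that are each other's
# nearest neighbor (Euclidean distance is an assumption of this sketch).
tomek_links <- function(X, y) {
  d <- as.matrix(dist(X))                  # pairwise distances
  diag(d) <- Inf                           # a point is not its own neighbor
  nn <- unname(apply(d, 1, which.min))     # nearest-neighbor index per row
  i <- seq_len(nrow(X))
  is_link <- nn[nn] == i & y != y[nn] & i < nn
  cbind(i = i[is_link], j = nn[is_link])
}

# Toy data: rows 1 and 2 are close together but carry different labels.
X <- rbind(c(0, 0), c(0.1, 0), c(5, 5), c(5.2, 5), c(9, 9))
y <- c("1", "2", "1", "1", "2")

links <- tomek_links(X, y)
print(links)

# Midpoint between two minority samples (gap fixed for reproducibility).
p <- smote_point(c(3, 10), c(8, 1000), gap = 0.5)
print(p)
```

On this toy data, rows 1 and 2 form the only Tomek link, because they are mutual nearest neighbors with different labels. Under-sampling then removes one or both members of each link, depending on the sampling strategy.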
> data.df$Collect()
X1 X2 X3 TYPE
1 2 1 3.50 1
2 3 10 7.60 1
3 3 10 5.50 2
4 3 10 4.70 1
5 7 1000 8.50 1
6 8 1000 9.40 2
7 6 1000 0.34 1
8 8 999 7.40 2
9 7 999 3.50 1
10 6 1000 7.00 1
Call the function:
> result = hanaml.SMOTETomek(data=data.df, thread.ratio = 1, random.state = 1,
label = "TYPE", minority.class = "2",
smote.amount = 200, n.neighbors = 2,
algorithm = "kd-tree", sampling.strategy = "all")
Results:
> result$Collect()
X1 X2 X3 TYPE
1 2 1.0000 3.500000 1
2 3 10.0000 7.600000 1
3 3 10.0000 5.500000 2
4 3 10.0000 4.700000 1
5 7 1000.0000 8.500000 1
6 8 1000.0000 9.400000 2
7 6 1000.0000 0.340000 1
8 8 999.0000 7.400000 2
9 7 999.0000 3.500000 1
10 6 1000.0000 7.000000 1
11 8 973.1091 7.350260 2
12 7 888.0711 8.959068 2
13 8 999.0567 7.513491 2
14 8 999.5123 8.424672 2
15 4 131.5100 5.733437 2
16 5 340.7139 6.135345 2
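As a quick sanity check on the result above, the class balance can be inspected once the data is available locally. The sketch below hard-codes the TYPE column from the printed output instead of querying HANA, so it runs without a connection; result.local and n.synthetic are hypothetical names used only for this illustration.

```r
# TYPE column copied from the printed result above (rows 1 to 16).
result.local <- data.frame(
  TYPE = c(1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2)
)

# Count rows per class after resampling.
counts <- table(result.local$TYPE)
print(counts)

# The input had 10 rows, so the remaining rows are synthetic samples.
# With 3 original minority rows and smote.amount = 200, this matches
# 3 * 200 / 100 new rows.
n.synthetic <- nrow(result.local) - 10
print(n.synthetic)
```

With a live connection, the same check would presumably start from the data frame returned by result$Collect().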