hanaml.KNNClassifier is an R wrapper for the SAP HANA PAL KNN algorithm for
classification.
hanaml.KNNClassifier(
data = NULL,
key = NULL,
features = NULL,
label = NULL,
n.neighbors = NULL,
voting.type = NULL,
metric = NULL,
minkowski.power = NULL,
algorithm = NULL,
categorical.variable = NULL,
category.weights = NULL,
string.variable = NULL,
variable.weight = NULL,
resampling.method = NULL,
evaluation.metric = NULL,
fold.num = NULL,
repeat.times = NULL,
param.search.strategy = NULL,
random.search.times = NULL,
random.state = NULL,
timeout = NULL,
progress.indicator.id = NULL,
parameter.range = NULL,
parameter.values = NULL,
thread.ratio = NULL
)
Arguments
| data |
DataFrame
DataFrame containing the data.
|
| key |
character
Name of the ID column.
|
| features |
character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.
|
| label |
character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.
|
| n.neighbors |
integer, optional
Number of nearest neighbors.
Defaults to 1.
|
| voting.type |
("majority", "distance-weighted"), optional
Method used to vote for the most frequent label of the K
nearest neighbors.
Defaults to 'distance-weighted'.
|
| metric |
("manhattan", "euclidean", "minkowski", "chebyshev", "cosine"),
optional
Ways to compute the distance between data points.
Defaults to 'euclidean'.
|
| minkowski.power |
double, optional
Specifies the power used in the Minkowski distance.
Only valid when metric is "minkowski".
Defaults to 3.0.
|
| algorithm |
("brute-force", "kd-tree"), optional
Algorithm used to compute the nearest neighbors.
When metric is "cosine", "kd-tree" searching offers little benefit.
Defaults to 'brute-force'.
|
| categorical.variable |
character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default treatment of a column depends on its data type.
Valid only for columns of "INTEGER" type; ignored otherwise.
No default value.
|
| category.weights |
double, optional
Represents the weight of category attributes.
Defaults to 0.707.
|
| string.variable |
character or list of characters, optional
Indicates a string column storing non-categorical data.
Levenshtein distance is used to calculate similarity between two strings. Ignored if it is not a string column.
Defaults to NULL.
|
| variable.weight |
named list, optional
Specifies the weights of variables participating in distance calculation.
Each weight must be greater than or equal to 0. Variables not specified get a weight of 1.
Defaults to NULL.
|
| resampling.method |
character, optional
Specifies the resampling method for model evaluation or parameter selection. Valid options include:
'cv',
'stratified_cv',
'bootstrap',
'stratified_bootstrap'
If no value is specified for this parameter, neither model evaluation
nor parameter selection is activated.
|
| evaluation.metric |
c("accuracy", "f1_score"), optional
Specifies the evaluation metric for model evaluation or parameter selection.
If not specified, neither model evaluation
nor parameter selection is activated.
|
| fold.num |
integer, optional
Specifies the number of folds for cross-validation (cv).
Mandatory and valid only when resampling.method is 'cv' or 'stratified_cv'.
|
| repeat.times |
numeric, optional
Specifies the number of repeat times for resampling.
Defaults to 1.
|
| param.search.strategy |
c('grid', 'random'), optional
Specifies the method to activate parameter selection.
If not specified, parameter selection is not triggered.
|
| random.search.times |
integer, optional
Specifies the number of times to randomly select candidate parameters for selection.
Mandatory and valid only when param.search.strategy is 'random'.
|
| random.state |
integer, optional
Specifies the seed for random number generation, where 0 means the current system time
is used as the seed, and any other value is used directly as the seed.
|
| timeout |
integer, optional
Specifies maximum running time for model evaluation or parameter selection in seconds.
No timeout when 0 is specified.
|
| progress.indicator.id |
character, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.
|
| parameter.range |
named list/vector, optional
Specifies range of the following parameters for parameter selection:
minkowski.power, category.weights, n.neighbors.
Parameter range should be specified by 3 numbers in the form of c(start, step, end).
Examples:
parameter.range <- list(n.neighbors = c(3, 1, 6)).
If param.search.strategy is 'random', then step has no effect
and thus can be omitted.
|
| parameter.values |
named list/vector, optional
Specifies values of the following parameters for parameter selection:
metric, voting.type, minkowski.power, category.weights, n.neighbors.
|
| thread.ratio |
double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads.
Values between 0 and 1 will use up to
that percentage of available threads. Values outside this
range are ignored.
Defaults to 0.
|
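The c(start, step, end) convention of parameter.range and the explicit lists of parameter.values can be illustrated with a small base R sketch. The helper names expand.range and candidate.grid are hypothetical and not part of hana.ml.r, and the way PAL enumerates candidates internally may differ; the sketch only shows which combinations a grid search over these settings would cover.

```r
# Hypothetical helper (not part of hana.ml.r): expand c(start, step, end)
# into the sequence of candidate values it describes.
expand.range <- function(rng) {
  stopifnot(is.numeric(rng), length(rng) == 3, rng[2] > 0, rng[1] <= rng[3])
  seq(from = rng[1], to = rng[3], by = rng[2])
}

# Hypothetical helper: combine expanded ranges and explicit value lists
# into one data frame with a row per candidate parameter combination.
candidate.grid <- function(parameter.range = list(), parameter.values = list()) {
  candidates <- c(lapply(parameter.range, expand.range), parameter.values)
  expand.grid(candidates, stringsAsFactors = FALSE)
}

grid <- candidate.grid(
  parameter.range  = list(n.neighbors = c(3, 1, 6)),
  parameter.values = list(voting.type = c("majority", "distance-weighted"))
)
print(grid)  # 4 values of n.neighbors x 2 voting types = 8 rows
```

With param.search.strategy set to 'random', the step element has no effect, so only the start and end of each range matter.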
Value
A "KNNClassifier" object with the following attributes:
statistics: DataFrame. Statistics for model evaluation or parameter selection.
Available only when model evaluation or parameter selection is enabled.
optim.param: DataFrame. Optimal parameters selected.
Available only when parameter selection is enabled.
Details
K-Nearest Neighbor (KNN) is a memory-based classification or regression
method with no explicit training phase. For classification, it assumes that
instances should have labels similar to those of their nearest neighbors.
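The idea can be sketched in a few lines of base R. This is a minimal illustration for numeric features only, not the PAL implementation: the function name knn.predict is hypothetical, the inverse-distance weighting is an assumed form of "distance-weighted" voting, and categorical or string features and the cosine metric are left out. The data are the X1, X2 and TYPE columns of the training table in the Examples section.

```r
# Hypothetical base R sketch of KNN classification (numeric features only).
knn.predict <- function(train.x, train.y, query, k = 1,
                        metric = "euclidean", voting = "distance-weighted",
                        p = 3) {
  # Absolute per-feature differences between each training row and the query
  diffs <- abs(sweep(as.matrix(train.x), 2, query))
  d <- switch(metric,
    manhattan = rowSums(diffs),
    euclidean = sqrt(rowSums(diffs^2)),
    minkowski = rowSums(diffs^p)^(1 / p),
    chebyshev = apply(diffs, 1, max),
    stop("unsupported metric: ", metric)
  )
  nn <- order(d)[seq_len(k)]  # indices of the k nearest neighbors
  # Majority voting counts each neighbor once; the distance-weighted variant
  # here uses 1/d (an assumption), with a small constant to avoid dividing by 0.
  w <- if (voting == "majority") rep(1, k) else 1 / (d[nn] + 1e-12)
  scores <- tapply(w, train.y[nn], sum)
  names(scores)[which.max(scores)]  # winning label, returned as character
}

train.x <- data.frame(X1 = c(2, 3, 3, 3, 1, 1, 1, 1, 1, 1),
                      X2 = c(1, 10, 10, 10, 1000, 1000, 1000, 999, 999, 1000))
train.y <- c(1, 10, 10, 1, 1, 10, 99, 99, 10, 10)

knn.predict(train.x, train.y, query = c(3, 10), k = 3, voting = "majority")
# returns "10": the three nearest rows carry the labels 10, 10 and 1
```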
Examples
Training data:
> df.train$Collect()
   ID X1   X2 X3 TYPE
1   0  2    1  A    1
2   1  3   10  A   10
3   2  3   10  B   10
4   3  3   10  C    1
5   4  1 1000  C    1
6   5  1 1000  A   10
7   6  1 1000  B   99
8   7  1  999  A   99
9   8  1  999  B   10
10  9  1 1000  C   10
Call the function:
knc <- hanaml.KNNClassifier(data = df.train, key = "ID",
                            features = c("X1", "X2", "X3"),
                            label = "TYPE", n.neighbors = 3,
                            voting.type = "majority",
                            algorithm = "brute-force",
                            categorical.variable = c("X1"))
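A call that also enables parameter selection might look like the sketch below. It uses only arguments documented above, but the specific values (5 folds, accuracy, a grid over n.neighbors and voting.type) are illustrative assumptions, and the wrapper name fit.knn.with.search is hypothetical. The call is wrapped in a function because it needs the hana.ml.r package and a HANA DataFrame such as df.train to run.

```r
# Sketch only: requires hana.ml.r and a HANA DataFrame with the columns
# shown in the training data above. fit.knn.with.search is a hypothetical name.
fit.knn.with.search <- function(data) {
  hanaml.KNNClassifier(
    data = data, key = "ID",
    features = c("X1", "X2", "X3"), label = "TYPE",
    categorical.variable = c("X1"),
    # Resampling and evaluation settings activate model evaluation
    resampling.method = "cv", fold.num = 5,
    evaluation.metric = "accuracy",
    # A search strategy plus candidates activates parameter selection
    param.search.strategy = "grid",
    parameter.range = list(n.neighbors = c(3, 1, 6)),
    parameter.values = list(voting.type = c("majority", "distance-weighted"))
  )
}

# knc <- fit.knn.with.search(df.train)
```

According to the Value section, the returned object should then carry the statistics and optim.param attributes.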
See also