hanaml.KNNClassifier is an R wrapper for the SAP HANA PAL KNN algorithm for classification.

hanaml.KNNClassifier(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  n.neighbors = NULL,
  voting.type = NULL,
  metric = NULL,
  minkowski.power = NULL,
  algorithm = NULL,
  categorical.variable = NULL,
  category.weights = NULL,
  string.variable = NULL,
  variable.weight = NULL,
  resampling.method = NULL,
  evaluation.metric = NULL,
  fold.num = NULL,
  repeat.times = NULL,
  param.search.strategy = NULL,
  random.search.times = NULL,
  random.state = NULL,
  timeout = NULL,
  progress.indicator.id = NULL,
  parameter.range = NULL,
  parameter.values = NULL,
  thread.ratio = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.

n.neighbors

integer, optional
Number of nearest neighbors.
Defaults to 1.

voting.type

("majority", "distance-weighted"), optional
Method used to vote for the most frequent label of the K nearest neighbors.
Defaults to 'distance-weighted'.
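
The difference between the two voting types can be illustrated with a small base R sketch. It is not the PAL implementation: the inverse-distance weighting, the helper names majority.vote and weighted.vote, and the toy neighbor set are all assumptions made for illustration.

```r
# Toy neighbor set: labels and distances of the K = 3 nearest neighbors.
neighbor.labels <- c("A", "B", "B")
neighbor.dists  <- c(0.1, 2.0, 2.5)

# Majority voting: every neighbor counts equally.
majority.vote <- function(labels) {
  counts <- table(labels)
  names(counts)[which.max(counts)]
}

# Distance-weighted voting: closer neighbors count more.
# ASSUMPTION: weight = 1 / distance; PAL's exact weighting may differ.
weighted.vote <- function(labels, dists, eps = 1e-12) {
  weights <- 1 / (dists + eps)          # eps guards against division by zero
  scores <- tapply(weights, labels, sum)
  names(scores)[which.max(scores)]
}

majority.vote(neighbor.labels)                  # "B": two of three neighbors
weighted.vote(neighbor.labels, neighbor.dists)  # "A": the single very close neighbor wins
```

With the same three neighbors the two schemes disagree, which is why the choice of voting.type can matter when one neighbor is much closer than the others.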

metric

("manhattan", "euclidean", "minkowski", "chebyshev", "cosine"), optional
Ways to compute the distance between data points.
Defaults to 'euclidean'.

minkowski.power

double, optional
Power parameter of the Minkowski distance.
Valid only when metric is "minkowski". Defaults to 3.0.
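
As a rough guide to what the metric options compute, the following base R sketch implements the textbook definitions for two numeric vectors. The function name pairwise.distance is hypothetical, and defining the cosine distance as one minus the cosine similarity is an assumption; PAL may scale or normalize differently.

```r
# Textbook distance definitions for two numeric vectors of equal length.
pairwise.distance <- function(x, y, metric = "euclidean", minkowski.power = 3.0) {
  d <- abs(x - y)
  switch(metric,
    manhattan = sum(d),
    euclidean = sqrt(sum(d^2)),
    minkowski = sum(d^minkowski.power)^(1 / minkowski.power),
    chebyshev = max(d),
    # ASSUMPTION: cosine distance = 1 - cosine similarity.
    cosine    = 1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2))),
    stop("unknown metric: ", metric)
  )
}

x <- c(1, 0)
y <- c(4, 4)
pairwise.distance(x, y, "manhattan")   # 7
pairwise.distance(x, y, "euclidean")   # 5
pairwise.distance(x, y, "chebyshev")   # 4
pairwise.distance(x, y, "minkowski")   # cube root of 3^3 + 4^3 = 91
```

Note that minkowski.power = 1 reproduces the Manhattan distance and minkowski.power = 2 the Euclidean distance.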

algorithm

("brute-force", "kd-tree"), optional
Algorithm used to compute the nearest neighbors.
When metric is "cosine", "kd-tree" search offers little benefit.
Defaults to 'brute-force'.

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
By default, the treatment depends on the column type:

  • "VARCHAR" and "NVARCHAR": categorical

  • "INTEGER" and "DOUBLE": continuous.

Valid only for features of "INTEGER" type; ignored otherwise.
No default value.

category.weights

double, optional
Represents the weight of category attributes.
Defaults to 0.707.

string.variable

character or list of characters, optional
Indicates a string column storing non-categorical data. Levenshtein distance is used to calculate similarity between two strings. Ignored if it is not a string column.
Defaults to NULL.
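
To get a feel for the Levenshtein distance mentioned above, base R's utils::adist computes it locally. This only illustrates the metric itself; how PAL scales or combines it with the other features is not shown here, and the surnames vector is made-up sample data.

```r
# Levenshtein (edit) distance: minimum number of single-character
# insertions, deletions, and substitutions turning one string into another.
utils::adist("kitten", "sitting")[1, 1]   # 3

# Pairwise distances for a small, made-up string column.
surnames <- c("Smith", "Smyth", "Schmidt")
dist.matrix <- utils::adist(surnames)
dimnames(dist.matrix) <- list(surnames, surnames)
dist.matrix["Smith", "Smyth"]             # 1: one substitution (i -> y)
```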

variable.weight

named list, optional
Specifies the weight of each variable in the distance calculation. Each weight must be greater than or equal to 0. Variables that are not specified get a weight of 1.
Defaults to NULL.

resampling.method

character, optional
Specifies the resampling method for model evaluation or parameter selection. Valid options include:
'cv', 'stratified_cv', 'bootstrap', 'stratified_bootstrap'.
If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.

evaluation.metric

c("accuracy", "f1_score"), optional
Specifies the evaluation metric for model evaluation or parameter selection.
If not specified, neither model evaluation nor parameter selection is activated.

fold.num

integer, optional
Specifies the number of folds for cross-validation (cv). Mandatory and valid only when resampling.method is 'cv' or 'stratified_cv'.

repeat.times

numeric, optional
Specifies the number of repeat times for resampling.
Defaults to 1.

param.search.strategy

c('grid', 'random'), optional
Specifies the method used to activate parameter selection. If not specified, parameter selection is not triggered.

random.search.times

integer, optional
Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid only when param.search.strategy is 'random'.

random.state

integer, optional
Specifies the seed for random number generation, where 0 means the current system time is used as the seed, and any other value is used directly as the seed.

timeout

integer, optional
Specifies maximum running time for model evaluation or parameter selection in seconds. No timeout when 0 is specified.

progress.indicator.id

character, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.

parameter.range

named list/vector, optional
Specifies range of the following parameters for parameter selection:
minkowski.power, category.weights, n.neighbors.
Each parameter range should be specified by 3 numbers in the form of c(start, step, end).
Example:
parameter.range <- list(n.neighbors = c(3, 1, 6))
If param.search.strategy is 'random', then step has no effect and can be omitted.
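
The c(start, step, end) form can be read as an arithmetic sequence of candidate values. The base R sketch below expands it locally to show the size of a grid search. It assumes the end value is inclusive, and the helper expand.range and the metric candidates are hypothetical; the actual enumeration happens inside PAL.

```r
# Expand a c(start, step, end) triple into explicit candidate values.
# ASSUMPTION: the end value is included in the range.
expand.range <- function(r) seq(from = r[1], to = r[3], by = r[2])

parameter.range <- list(n.neighbors = c(3, 1, 6))
candidates <- lapply(parameter.range, expand.range)
candidates$n.neighbors   # 3 4 5 6

# Combined with a set of discrete values, a grid search would try
# every combination of the candidates.
grid <- expand.grid(n.neighbors = candidates$n.neighbors,
                    metric = c("manhattan", "euclidean"),
                    stringsAsFactors = FALSE)
nrow(grid)               # 8 combinations
```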

parameter.values

named list/vector, optional
Specifies values of the following parameters for parameter selection:
metric, voting.type, minkowski.power, category.weights, n.neighbors.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.

Value

A "KNNClassifier" object with the following attributes:

  • statistics: DataFrame. Statistics for model evaluation or parameter selection. Available only when model evaluation or parameter selection is enabled.

  • optim.param: DataFrame. Optimal parameters selected. Available only when parameter selection is enabled.

Details

K-Nearest Neighbor (KNN) is a memory-based classification or regression method with no explicit training phase. For classification, it assumes that instances have labels similar to those of their nearest neighbors.
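
The "no explicit training phase" property can be seen in a minimal base R sketch: the training data is simply stored, and all work happens at prediction time. This is a conceptual illustration using Euclidean distance and a simple majority vote, not the PAL implementation; knn.predict and the toy data are hypothetical.

```r
# "Training" only stores the data: there is no model to fit.
train.x <- matrix(c(1, 1,
                    1, 2,
                    8, 8,
                    9, 8), ncol = 2, byrow = TRUE)
train.y <- c("low", "low", "high", "high")

knn.predict <- function(query, x, y, k = 3) {
  # Euclidean distance from the query to every stored instance.
  dists <- sqrt(rowSums(sweep(x, 2, query)^2))
  # Labels of the k closest instances, then a simple majority vote.
  nearest <- y[order(dists)[seq_len(k)]]
  names(which.max(table(nearest)))
}

knn.predict(c(2, 1), train.x, train.y)   # "low"
knn.predict(c(8, 9), train.x, train.y)   # "high"
```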

Examples

Training data:

> df.train$Collect()
   ID X1   X2 X3  TYPE
1   0  2    1  A     1
2   1  3   10  A    10
3   2  3   10  B    10
4   3  3   10  C     1
5   4  1 1000  C     1
6   5  1 1000  A    10
7   6  1 1000  B    99
8   7  1  999  A    99
9   8  1  999  B    10
10  9  1 1000  C    10

Call the function:

knc <- hanaml.KNNClassifier(data = df.train, key = "ID",
                            features = c("X1", "X2", "X3"),
                            label = "TYPE", n.neighbors = 3,
                            voting.type = "majority",
                            algorithm = "brute-force",
                            categorical.variable = c("X1"))
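
Parameter selection can be requested through the same function by combining the resampling and search arguments documented above. The following is a sketch rather than a verified run: the argument values are illustrative, the call is guarded so it only executes when the wrapper and df.train exist in the session, and reading the results with $statistics and $optim.param followed by Collect() is an assumption based on the attributes listed under Value.

```r
# Arguments for a grid search over n.neighbors with 2-fold cross-validation.
# All argument names come from the argument list; the values are illustrative.
search.args <- list(
  key = "ID",
  features = c("X1", "X2", "X3"),
  label = "TYPE",
  categorical.variable = c("X1"),
  resampling.method = "cv",
  evaluation.metric = "accuracy",
  fold.num = 2,
  param.search.strategy = "grid",
  parameter.range = list(n.neighbors = c(1, 1, 3))
)

# Only run when the wrapper and the training DataFrame are available.
if (exists("hanaml.KNNClassifier") && exists("df.train")) {
  knc.cv <- do.call(hanaml.KNNClassifier,
                    c(list(data = df.train), search.args))
  # ASSUMPTION: attributes are HANA DataFrames accessed with `$`.
  print(knc.cv$statistics$Collect())
  print(knc.cv$optim.param$Collect())
}
```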

See also