K-Nearest Neighbor (KNN) Classifier

hanaml.KNNClassifier is an R wrapper for the SAP HANA PAL KNN algorithm for classification.

hanaml.KNNClassifier(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  n.neighbors = NULL,
  voting.type = NULL,
  metric = NULL,
  minkowski.power = NULL,
  algorithm = NULL,
  categorical.variable = NULL,
  category.weights = NULL,
  string.variable = NULL,
  variable.weight = NULL,
  resampling.method = NULL,
  evaluation.metric = NULL,
  fold.num = NULL,
  repeat.times = NULL,
  param.search.strategy = NULL,
  random.search.times = NULL,
  random.state = NULL,
  timeout = NULL,
  progress.indicator.id = NULL,
  parameter.range = NULL,
  parameter.values = NULL,
  thread.ratio = NULL
)

Arguments

data	`DataFrame` DataFrame containing the data.
key	`character` Name of the ID column.
features	`character or list of characters, optional` Names of the feature columns. If not provided, defaults to all non-key, non-label columns of data.
label	`character, optional` Name of the column which specifies the dependent variable. Defaults to the last column of data if not provided.
n.neighbors	`integer, optional` Number of nearest neighbors. Defaults to 1.
voting.type	`("majority", "distance-weighted"), optional` Method used to vote for the most frequent label of the K nearest neighbors. Defaults to 'distance-weighted'.
metric	`("manhattan", "euclidean", "minkowski", "chebyshev", "cosine"), optional` Ways to compute the distance between data points. Defaults to 'euclidean'.
minkowski.power	`double, optional` Power parameter of the Minkowski distance. Only valid when metric is "minkowski". Defaults to 3.0.
algorithm	`("brute-force", "kd-tree"), optional` Algorithm used to compute the nearest neighbors. When metric is "cosine", "kd-tree" search offers little benefit. Defaults to 'brute-force'.
categorical.variable	`character or list/vector of characters, optional` Indicates which features should be treated as categorical variables. By default, "VARCHAR" and "NVARCHAR" columns are treated as categorical, and "INTEGER" and "DOUBLE" columns as continuous. Valid only for columns of "INTEGER" type; ignored otherwise. No default value.
category.weights	`double, optional` Represents the weight of category attributes. Defaults to 0.707.
string.variable	`character or list of characters, optional` Indicates a string column storing non-categorical data. Levenshtein distance is used to calculate similarity between two strings. Ignored if it is not a string column. Defaults to NULL.
variable.weight	`named list, optional` Specifies the weight of a variable participating in distance calculation. The value must be greater than or equal to 0. Defaults to 1 for variables not specified. Defaults to NULL.
resampling.method	`character, optional` Specifies the resampling method for model evaluation or parameter selection. Valid options are: `'cv', 'stratified_cv', 'bootstrap', 'stratified_bootstrap'`. If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.
evaluation.metric	`c("accuracy", "f1_score"), optional` Specifies the evaluation metric for model evaluation or parameter selection. If not specified, neither model evaluation nor parameter selection is activated.
fold.num	`integer, optional` Specifies the fold number for cross-validation (cv). Mandatory and valid only when `resampling.method` is 'cv' or 'stratified_cv'.
repeat.times	`numeric, optional` Specifies the number of repeat times for resampling. Defaults to 1.
param.search.strategy	`c('grid', 'random'), optional` Specifies the method to activate parameter selection. If not specified, parameter selection is not activated.
random.search.times	`integer, optional` Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid only when `param.search.strategy` is 'random'.
random.state	`integer, optional` Specifies the seed for random number generation. A value of 0 means the current system time is used as the seed; any other value is used directly as the seed.
timeout	`integer, optional` Specifies maximum running time for model evaluation or parameter selection in seconds. No timeout when 0 is specified.
progress.indicator.id	`character, optional` Sets an ID of progress indicator for model evaluation or parameter selection. No progress indicator is active if no value is provided.
parameter.range	`named list/vector, optional` Specifies the range of the following parameters for parameter selection: `minkowski.power, category.weights, n.neighbors`. Each range is given as 3 numbers in the form c(start, step, end). Example: `parameter.range <- list(n.neighbors = c(3, 1, 6))`. If `param.search.strategy` is 'random', step has no effect and can be omitted.
parameter.values	`named list/vector, optional` Specifies values of the following parameters for parameter selection: `metric, voting.type, minkowski.power, category.weights, n.neighbors`.
thread.ratio	`double, optional` Controls the proportion of available threads that can be used by this function. The value range is from 0 to 1, where 0 indicates a single thread and 1 indicates all available threads. Values between 0 and 1 use up to that percentage of available threads. Values outside this range are ignored. Defaults to 0.
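
As a rough illustration of the `metric` options listed above, the following base R sketch computes each distance using the standard textbook definitions. The helper name `pairwise.distance` is hypothetical and not part of the package, and the cosine form (one minus cosine similarity) is an assumption; the PAL implementation may differ in detail.

```r
# Hypothetical helper, for illustration only: distance between two numeric
# vectors under each metric name accepted by the `metric` argument.
pairwise.distance <- function(a, b, metric = "euclidean", minkowski.power = 3.0) {
  d <- abs(a - b)
  switch(metric,
    manhattan = sum(d),
    euclidean = sqrt(sum(d^2)),
    minkowski = sum(d^minkowski.power)^(1 / minkowski.power),
    chebyshev = max(d),
    # Assumption: cosine distance taken as 1 - cosine similarity
    cosine    = 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2))),
    stop("unknown metric: ", metric)
  )
}

a <- c(0, 0)
b <- c(3, 4)
pairwise.distance(a, b, "manhattan")   # 7
pairwise.distance(a, b, "euclidean")   # 5
pairwise.distance(a, b, "chebyshev")   # 4
```

With `minkowski.power = 1` the Minkowski distance reduces to Manhattan, and with `minkowski.power = 2` to Euclidean.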

Value

A "KNNClassifier" object with the following attributes:

statistics: DataFrame Statistics for model-evaluation/parameter-selection. Available only when model-evaluation/parameter selection is enabled.
optim.param: DataFrame Optimal parameters selected. Available only when parameter-selection is enabled.

Details

K-Nearest Neighbor (KNN) is a memory-based classification or regression method with no explicit training phase. For classification, it assumes that instances should have labels similar to those of their nearest neighbors.
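
The idea can be sketched in a few lines of base R. This is a minimal illustration, not the PAL implementation: the function name `knn.predict` is hypothetical, Euclidean distance is assumed, and the distance-weighted vote uses a simple inverse-distance weight, which may differ from the weighting PAL applies. For simplicity, only the two numeric columns (X1, X2) of the training data from the Examples section are used, both treated as continuous.

```r
# Hypothetical helper: classify one query point from its k nearest neighbors.
knn.predict <- function(train.x, train.y, query, k = 1,
                        voting.type = "distance-weighted") {
  # Euclidean distance from the query to every training row
  dists <- sqrt(rowSums(sweep(train.x, 2, query)^2))
  nearest <- order(dists)[seq_len(k)]
  if (voting.type == "majority") {
    weights <- rep(1, k)                      # every neighbor counts equally
  } else {
    weights <- 1 / (dists[nearest] + 1e-9)    # assumption: inverse-distance weight
  }
  # Sum the weights per label and return the label with the largest total
  scores <- tapply(weights, as.character(train.y[nearest]), sum)
  names(scores)[which.max(scores)]
}

train.x <- cbind(X1 = c(2, 3, 3, 3, 1, 1, 1, 1, 1, 1),
                 X2 = c(1, 10, 10, 10, 1000, 1000, 1000, 999, 999, 1000))
train.y <- c(1, 10, 10, 1, 1, 10, 99, 99, 10, 10)

knn.predict(train.x, train.y, c(2, 1), k = 3, voting.type = "majority")           # "10"
knn.predict(train.x, train.y, c(2, 1), k = 3, voting.type = "distance-weighted")  # "1"
```

The two calls disagree because the query coincides with a training point labeled 1: under majority voting the two slightly more distant neighbors labeled 10 outvote it, while inverse-distance weighting lets the exact match dominate.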

Examples

Training data:

> df.train$Collect()
   ID X1   X2 X3  TYPE
1   0  2    1  A     1
2   1  3   10  A    10
3   2  3   10  B    10
4   3  3   10  C     1
5   4  1 1000  C     1
6   5  1 1000  A    10
7   6  1 1000  B    99
8   7  1  999  A    99
9   8  1  999  B    10
10  9  1 1000  C    10

Call the function:

knc <- hanaml.KNNClassifier(data = df.train, key = "ID",
                            features = c("X1", "X2", "X3"),
                            label = "TYPE", n.neighbors = 3,
                            voting.type = "majority",
                            algorithm = "brute-force",
                            categorical.variable = c("X1"))
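
A variant of the call above with parameter selection enabled might look like the sketch below. It uses only arguments documented on this page; the specific values (2 folds, a grid over `n.neighbors` from 1 to 3) are illustrative choices, and accessing the returned attributes with `$` is an assumption about the object's interface. The call runs only when hana.ml.r is loaded and `df.train` exists in the session.

```r
# Arguments for a grid search over n.neighbors with 2-fold cross-validation.
# n.neighbors itself is left out because it is supplied via parameter.range.
search.args <- list(
  key = "ID",
  features = c("X1", "X2", "X3"),
  label = "TYPE",
  voting.type = "majority",
  categorical.variable = c("X1"),
  resampling.method = "cv",
  evaluation.metric = "accuracy",
  fold.num = 2,
  param.search.strategy = "grid",
  parameter.range = list(n.neighbors = c(1, 1, 3))  # c(start, step, end)
)

# Guarded so the snippet is harmless without a SAP HANA connection.
if (exists("hanaml.KNNClassifier") && exists("df.train")) {
  knc.cv <- do.call(hanaml.KNNClassifier,
                    c(list(data = df.train), search.args))
  # Assumption: attributes are reachable with `$` and are DataFrames
  print(knc.cv$statistics$Collect())
  print(knc.cv$optim.param$Collect())
}
```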
