K-Nearest Neighbor (KNN) Regressor

hanaml.KNNRegressor is an R wrapper for the SAP HANA PAL KNN algorithm for regression.

hanaml.KNNRegressor(
  data = NULL,
  key = NULL,
  features = NULL,
  label = NULL,
  n.neighbors = NULL,
  aggregate.type = NULL,
  metric = NULL,
  minkowski.power = NULL,
  algorithm = NULL,
  categorical.variable = NULL,
  category.weights = NULL,
  string.variable = NULL,
  variable.weight = NULL,
  resampling.method = NULL,
  evaluation.metric = NULL,
  fold.num = NULL,
  repeat.times = NULL,
  param.search.strategy = NULL,
  random.search.times = NULL,
  random.state = NULL,
  timeout = NULL,
  progress.indicator.id = NULL,
  parameter.range = NULL,
  parameter.values = NULL,
  thread.ratio = NULL,
  reduction.rate = NULL,
  min.resource.rate = NULL,
  aggressive.elimination = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key, non-label columns of data.

label

character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.

n.neighbors

integer, optional
Number of nearest neighbors.
Defaults to 1.

aggregate.type

c("uniform", "distance-weighted"), optional
Method used for averaging the values of the K-nearest neighbors. Defaults to 'distance-weighted'.

metric

c("manhattan", "euclidean", "minkowski", "chebyshev", "cosine"), optional
Ways to compute the distance between data points.
Defaults to 'euclidean'.

minkowski.power

double, optional
When 'Minkowski' is used for metric, this parameter controls the value of power.
Only valid when metric is "minkowski". Defaults to 3.0.
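
The metric options can be illustrated with a minimal base R sketch of the textbook definitions. This is not the PAL implementation; the helper names are hypothetical, and treating the cosine option as 1 minus the cosine similarity is an assumption.

```r
# Textbook distance functions for two numeric vectors of equal length.
# Hypothetical helpers for illustration only, not part of hana.ml.r or PAL.
manhattan <- function(x, y) sum(abs(x - y))
euclidean <- function(x, y) sqrt(sum((x - y)^2))
# p plays the role of minkowski.power (documented default 3.0).
minkowski <- function(x, y, p = 3) sum(abs(x - y)^p)^(1 / p)
chebyshev <- function(x, y) max(abs(x - y))
# Assumption: cosine distance taken as 1 - cosine similarity.
cosine.dist <- function(x, y) {
  1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

a <- c(0, 0)
b <- c(3, 4)
manhattan(a, b)         # 7
euclidean(a, b)         # 5
chebyshev(a, b)         # 4
minkowski(a, b, p = 2)  # 5, coincides with the Euclidean distance
```

With p = 1 the Minkowski distance reduces to the Manhattan distance, and with p = 2 to the Euclidean distance.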

algorithm

c("brute-force", "kd-tree"), optional
Algorithm used to compute the nearest neighbors.
When metric is "cosine", "kd-tree" search offers little benefit.
Defaults to 'brute-force'.

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior is dependent on what input is given:

"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.

category.weights

double, optional
Represents the weight of category attributes.
Defaults to 0.707.

string.variable

character or list of characters, optional
Indicates a string column storing non-categorical data.
Levenshtein distance is used to calculate similarity between two strings.
Ignored if the column is not a string column (e.g., of type VARCHAR or NVARCHAR).
Not valid if metric is set as "cosine".
Defaults to NULL.
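
As a rough illustration of the Levenshtein distance mentioned above, base R provides utils::adist(), which computes a generalized edit distance between strings. The sketch below only shows the distance itself; how PAL scales or combines it with the other features is not covered here.

```r
# Edit distance between two strings using base R (utils::adist).
# Illustration only; PAL computes its own string distances server-side.
d <- utils::adist("kitten", "sitting")
d[1, 1]  # 3: two substitutions and one insertion

# Pairwise distances among several strings are returned as a matrix.
words <- c("kitten", "sitting", "mitten")
dist.matrix <- utils::adist(words)
dist.matrix[1, 3]  # 1: a single substitution
```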

variable.weight

named list, optional
Specifies the weight of a variable participating in distance calculation. The value must be greater than or equal to 0. Defaults to 1 for variables not specified.
Defaults to NULL.

resampling.method

character, optional
Specifies the resampling method for model evaluation or parameter selection.
Valid options are: "cv", "bootstrap", "cv_sha", "bootstrap_sha", "cv_hyperband", "bootstrap_hyperband".
Note that methods like "*_sha" and "*_hyperband" are only applicable to parameter selection, not model evaluation.
If no value is specified, neither model evaluation nor parameter selection is activated.

evaluation.metric

character, optional
Specifies the evaluation metric for model evaluation or parameter selection.
Currently the only valid option is "rmse".
Must be specified together with resampling.method to activate model evaluation or parameter selection.
No default value.

fold.num

integer, optional
Specifies the number of folds for cross-validation (cv). Mandatory and valid only when resampling.method is "cv", "cv_sha" or "cv_hyperband".

repeat.times

numeric, optional
Specifies the number of repeat times for resampling.
Defaults to 1.

param.search.strategy

c("grid", "random"), optional
Specifies the method to activate parameter selection.
If not specified, model parameter selection shall not be triggered.
Defaults to "random" and cannot be changed if resampling.method is set as "cv_hyperband" or "bootstrap_hyperband"; otherwise no default value.

random.search.times

integer, optional
Specifies the number of times to randomly select candidate parameters for selection. Mandatory and valid only when param.search.strategy is set as "random", or when resampling.method is set as "cv_hyperband" or "bootstrap_hyperband".

random.state

numeric, optional
Specifies the seed for random generation.
Use system time when 0 is specified.

timeout

integer, optional
Specifies maximum running time for model evaluation or parameter selection in seconds.
No timeout when 0 is specified.

progress.indicator.id

character, optional
Sets an ID of progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.

parameter.range

a named list/vector, optional
Specifies range of the following parameters for parameter selection:
minkowski.power, category.weights, n.neighbors.
Parameter range should be specified by 3 numbers in the form of c(start, step, end).
Examples:
parameter.range <- list(category.weights = c(0.5, 0.1, 1)).
If param.search.strategy is 'random', then the step has no effect and thus can be omitted.
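
To make the c(start, step, end) convention concrete, the following base R sketch expands ranges into the candidate grid that a grid search would enumerate. The helper expand.range is hypothetical, and treating the end value as inclusive is an assumption; the actual enumeration is done by PAL.

```r
# Expand a c(start, step, end) triple into explicit candidate values.
# Hypothetical helper; assumes the end value is inclusive.
expand.range <- function(r) {
  stopifnot(length(r) == 3)
  seq(from = r[1], to = r[3], by = r[2])
}

parameter.range <- list(category.weights = c(0.5, 0.1, 1),
                        n.neighbors = c(1, 2, 7))

# One row per combination of candidate values.
grid <- expand.grid(lapply(parameter.range, expand.range))
nrow(grid)  # number of parameter combinations in the grid
```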

parameter.values

a named list/vector, optional
Specifies values of the following parameters for parameter selection:
metric, aggregate.type, minkowski.power, category.weights, n.neighbors.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.

reduction.rate

numeric, optional
Specifies reduction rate in SHA or Hyperband method.
Valid when resampling.method is specified and ends with either "sha" or "hyperband" (e.g. "cv_sha", "bootstrap_hyperband").
Defaults to 3.0.

min.resource.rate

numeric, optional
Specifies the minimum resource rate that should be used in SHA or hyperband iteration.
Valid only when resampling.method is specified with a valid option that ends with "sha" or "hyperband" (e.g. "cv_sha", "bootstrap_hyperband").
Defaults to 0.0.

aggressive.elimination

logical, optional
Specifies whether to apply aggressive elimination while using SHA method.

FALSE: do not apply aggressive elimination
TRUE: apply aggressive elimination

Valid only when resampling.method is specified and ends with "sha".
Defaults to FALSE.
Note: Aggressive elimination applies when the data size and the number of parameters to be searched do not match, so that many candidate parameters remain to be searched when the data size reaches its upper limit. If aggressive elimination is applied, the lower bound of the data size is used multiple times first to reduce the number of parameters.

Value

An R6 object of class "KNNRegressor" with the following attributes and methods:

Attributes

statistics: DataFrame
Statistics for model evaluation/parameter selection. Available only when model evaluation or parameter selection is enabled.
optim.param: DataFrame
Optimal parameters selected. Available only when parameter selection is enabled.

Methods

CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)

Usage:


   > knr <- hanaml.KNNRegressor(data=df, key="ID")
   > knr$CreateModelState()

Arguments:

model: DataFrame
DataFrame containing the model for parsing.
Defaults to self$model.
algorithm: character
Specifies the PAL algorithm associated with model.
Defaults to self$pal.algorithm.
func: character
Specifies the functionality for Unified Classification/Regression.
Valid only for object instance of R6Class "UnifiedClassification" or "UnifiedRegression".
Defaults to self$func.
state.description: character
A summary string for the generated model state.
Defaults to "ModelState".
force: logical
Specifies whether or not to replace the existing state for model.
Defaults to FALSE.

After calling this method, an attribute state that contains the parsed info for model shall be assigned to the corresponding R6 object.

DeleteModelState(state=NULL)

Usage:
Assuming we have trained a hanaml model and created its model state, like the following:


   > knr <- hanaml.KNNRegressor(data=df, key="ID")
   > knr$CreateModelState()

After using the model state for real-time scoring, we can delete the state by calling:


   > knr$DeleteModelState()

Arguments:

state: DataFrame
DataFrame containing the state info.
Defaults to self$state.

After calling this method, the specified model state shall be cleaned up and associated memory be released.

Details

K-Nearest Neighbor (KNN) is a memory-based classification or regression method with no explicit training phase. For regression, it assumes that an instance should have a value similar to those of its nearest neighbors.
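
The idea can be sketched in a few lines of base R. This is a simplified illustration, not the PAL algorithm: the function knn.predict is hypothetical, it handles numeric features with Euclidean distance only, and inverse-distance weighting is assumed for the "distance-weighted" option.

```r
# Minimal KNN regression for numeric features (illustration only).
knn.predict <- function(train.x, train.y, query, k = 1,
                        aggregate.type = c("distance-weighted", "uniform")) {
  aggregate.type <- match.arg(aggregate.type)
  # Euclidean distance from the query to every training row.
  d <- sqrt(rowSums(sweep(train.x, 2, query)^2))
  nn <- order(d)[seq_len(k)]  # indices of the k nearest neighbors
  if (aggregate.type == "uniform") {
    return(mean(train.y[nn]))
  }
  # Exact matches would give infinite weights; average them directly.
  if (any(d[nn] == 0)) {
    return(mean(train.y[nn][d[nn] == 0]))
  }
  # Assumption: weights are inversely proportional to distance.
  w <- 1 / d[nn]
  sum(w * train.y[nn]) / sum(w)
}

train.x <- matrix(c(1, 2, 3, 10), ncol = 1)
train.y <- c(1, 2, 3, 10)
knn.predict(train.x, train.y, query = 2.2, k = 2,
            aggregate.type = "uniform")            # 2.5
knn.predict(train.x, train.y, query = 2.2, k = 2,
            aggregate.type = "distance-weighted")  # 2.2
```

The two nearest neighbors of 2.2 are the rows with values 2 and 3. Uniform averaging returns their midpoint, while distance weighting pulls the prediction toward the closer neighbor.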

Examples

Training DataFrame df.train:


> df.train$Collect()
   ID X1   X2 X3 VALUE
1   0  2    1  A     1
2   1  3   10  A    10
3   2  3   10  B    10
4   3  3   10  C     1
5   4  1 1000  C     1
6   5  1 1000  A    10
7   6  1 1000  B    99
8   7  1  999  A    99
9   8  1  999  B    10
10  9  1 1000  C    10

Call the function:

knr <- hanaml.KNNRegressor(data = df.train, key = "ID",
                           features = c("X1", "X2", "X3"),
                           label = "VALUE", n.neighbors = 3,
                           aggregate.type = "uniform",
                           algorithm = "brute-force",
                           categorical.variable = c("X1"))
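
Parameter selection can be requested through the resampling arguments described above. The following is a hedged sketch using only documented arguments; the candidate values for n.neighbors and the choice of 5 folds are illustrative assumptions, and the call itself runs only where hana.ml.r is loaded and df.train is available.

```r
# Arguments for a grid search over n.neighbors with cross-validation.
# The candidate values and fold count are illustrative choices.
search.args <- list(
  key = "ID",
  features = c("X1", "X2", "X3"),
  label = "VALUE",
  categorical.variable = c("X1"),
  resampling.method = "cv",
  evaluation.metric = "rmse",
  fold.num = 5,
  param.search.strategy = "grid",
  parameter.values = list(n.neighbors = c(1, 3, 5))
)

# The call needs a HANA connection, so it is guarded here.
if (exists("hanaml.KNNRegressor") && exists("df.train")) {
  knr.cv <- do.call(hanaml.KNNRegressor,
                    c(list(data = df.train), search.args))
  print(knr.cv$optim.param$Collect())  # selected parameters
  print(knr.cv$statistics$Collect())   # evaluation statistics
}
```

Both resampling.method and evaluation.metric are set because the documentation requires them together, and fold.num is supplied because it is mandatory for "cv".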
