hanaml.KDE is an R wrapper for SAP HANA PAL Kernel Density Estimation.

hanaml.KDE(
  data = NULL,
  key = NULL,
  features = NULL,
  thread.ratio = NULL,
  leaf.size = NULL,
  kernel = NULL,
  algorithm = NULL,
  bandwidth = NULL,
  distance.level = NULL,
  minkowski.power = NULL,
  abs.res.tol = NULL,
  rel.res.tol = NULL,
  resampling.method = NULL,
  evaluation.metric = NULL,
  repeat.times = NULL,
  param.search.strategy = NULL,
  random.state = NULL,
  progress.indicator.id = NULL,
  parameter.range = NULL,
  parameter.values = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character, optional
Name of the ID column.
Defaults to the first column if not provided.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key columns of data.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.

leaf.size

integer, optional
Specifies the number of samples in a tree leaf node.
Only valid when algorithm is 'kd-tree' or 'ball-tree'.
Defaults to 30.

kernel

{'gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'}, optional
Specifies the kernel function used for KDE.

  • 'gaussian'

  • 'tophat'

  • 'epanechnikov'

  • 'exponential'

  • 'linear'

  • 'cosine'

Defaults to 'gaussian'.
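To give a feel for how the six kernel choices differ, the base-R sketch below evaluates the common textbook profile of each kernel at a few scaled distances u = distance / bandwidth. These definitions are an assumption: this page does not state PAL's exact formulas or normalizing constants, and kernel.profiles is a hypothetical name used only for illustration.

```r
# Unnormalized textbook kernel profiles as functions of the scaled distance
# u = distance / bandwidth. ASSUMPTION: PAL's exact definitions and
# normalizing constants may differ from these.
kernel.profiles <- list(
  gaussian     = function(u) exp(-u^2 / 2),
  tophat       = function(u) as.numeric(abs(u) < 1),
  epanechnikov = function(u) pmax(1 - u^2, 0),
  exponential  = function(u) exp(-abs(u)),
  linear       = function(u) pmax(1 - abs(u), 0),
  cosine       = function(u) ifelse(abs(u) < 1, cos(pi * u / 2), 0)
)

# Evaluate every kernel at the same scaled distances:
# one row per u, one column per kernel.
u <- c(0, 0.5, 1.5)
profiles <- sapply(kernel.profiles, function(k) k(u))
rownames(profiles) <- paste0("u=", u)
print(round(profiles, 4))
```

In this sketch the compactly supported kernels (tophat, epanechnikov, linear, cosine) drop to zero once the scaled distance reaches 1, while gaussian and exponential keep a nonzero weight at any distance.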

algorithm

{'brute-force', 'kd-tree', 'ball-tree'}, optional
Specifies the search method used to speed up the calculation.

  • 'brute-force' uses brute-force searching.

  • 'kd-tree' uses KD-tree searching.

  • 'ball-tree' uses Ball-tree searching.

Defaults to 'brute-force'.

bandwidth

double, optional
Specifies the bandwidth.
A value of 0 means that the bandwidth is determined by the internal optimizer.
Defaults to 0.

distance.level

{'manhattan', 'euclidean', 'minkowski', 'chebyshev'}, optional
Specifies the norm used to compute the distance between the training data and the evaluation data.

  • 'manhattan' Manhattan norm (l1 norm).

  • 'euclidean' Euclidean norm (l2 norm).

  • 'minkowski' Minkowski norm (p-norm).

  • 'chebyshev' Chebyshev norm (maximum norm).

Defaults to 'euclidean'.
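The four norms are all members of the Minkowski family, which the following base-R sketch makes concrete. The helper minkowski.distance is hypothetical and only illustrates the mathematics; it is not the code PAL runs.

```r
# p-norm distance between two numeric vectors.
# p = 1 is Manhattan, p = 2 is Euclidean, p = Inf is Chebyshev.
# Hypothetical helper for illustration; not part of hanaml.
minkowski.distance <- function(x, y, p) {
  d <- abs(x - y)
  if (is.infinite(p)) max(d) else sum(d^p)^(1 / p)
}

x <- c(0, 0)
y <- c(3, 4)
dists <- c(
  manhattan = minkowski.distance(x, y, 1),
  euclidean = minkowski.distance(x, y, 2),
  minkowski = minkowski.distance(x, y, 3),   # power 3, the documented default
  chebyshev = minkowski.distance(x, y, Inf)
)
print(dists)
```

For these two points the distance shrinks as the power grows, from 7 (Manhattan) through 5 (Euclidean) down to 4 (Chebyshev).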

minkowski.power

double, optional
Specifies the power (p) of the Minkowski distance. Only valid when distance.level is 'minkowski'.
Defaults to 3.0.

abs.res.tol

double, optional
Specifies the desired absolute tolerance. It allows trading accuracy for computation time.
Defaults to 0.

rel.res.tol

double, optional
Specifies the desired relative tolerance. It allows trading accuracy for computation time.
Defaults to 0.

resampling.method

{'loocv'}, optional
Specifies the resampling method used for model evaluation and parameter selection. Only 'loocv' (leave-one-out cross-validation) is supported.
If no value is specified for this parameter, neither model evaluation nor parameter selection is activated.

evaluation.metric

{'NLL'}, optional
Specifies the evaluation metric for model evaluation or parameter selection. Only 'NLL' (negative log-likelihood) is supported.
If not specified, neither model evaluation nor parameter selection is activated.
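The combination of 'loocv' and 'NLL' can be pictured with a small base-R sketch: each point is scored by the density estimated from all other points, and the negative log of those densities is summed. This is a simplified 1-D Gaussian illustration under stated assumptions, not PAL's implementation; loocv.nll is a hypothetical helper, and whether PAL sums or averages the per-point terms is not stated here.

```r
# Leave-one-out negative log-likelihood of a 1-D Gaussian KDE.
# Hypothetical helper for illustration; not part of hanaml.
loocv.nll <- function(x, h) {
  n <- length(x)
  log.dens <- vapply(seq_len(n), function(i) {
    # Density at x[i], estimated from every point except x[i] itself.
    log(mean(dnorm((x[i] - x[-i]) / h)) / h)
  }, numeric(1))
  -sum(log.dens)
}

# X1 column of the example data on this page, rounded to two decimals.
x <- c(-0.43, 0.88, 0.13, 0.85, 0.29, -0.67, -2.10, 0.77, 0.21, 0.48)

# Score a few candidate bandwidths and keep the one with the lowest NLL.
candidates <- c(0.1, 0.5, 1.0, 3.0)
scores <- vapply(candidates, function(h) loocv.nll(x, h), numeric(1))
best <- candidates[which.min(scores)]
print(data.frame(bandwidth = candidates, nll = scores))
print(best)
```

A very small bandwidth is penalized heavily here because the isolated point near -2.10 receives almost no density from its neighbors once it is left out.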

repeat.times

numeric, optional
Specifies the number of repeat times for resampling.
Defaults to 1.

param.search.strategy

{"grid", "random"}, optional
Specifies the search strategy used for parameter selection. If not specified, parameter selection is not triggered.

random.state

double, optional
Specifies the seed for random number generation. A value of 0 means the current system time is used as the seed; any other value is used directly as the seed.
Defaults to 0.

progress.indicator.id

character, optional
Sets the ID of the progress indicator for model evaluation or parameter selection.
No progress indicator is active if no value is provided.

parameter.range

list/vector, optional
Specifies the range of the bandwidth parameter for parameter selection.
The range should be specified by 3 numbers in the form of c(start, step, end).
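One plausible reading of the c(start, step, end) form is an evenly spaced list of candidates, which base R produces with seq(). The sketch below assumes that interpretation for a grid search; how PAL treats the step under the "random" strategy is not described on this page, and expand.range is a hypothetical helper.

```r
# Expand c(start, step, end) into the candidate values it describes.
# ASSUMPTION: candidates are start, start + step, ... up to end.
# Hypothetical helper for illustration; not part of hanaml.
expand.range <- function(range) {
  stopifnot(length(range) == 3, range[2] > 0, range[1] <= range[3])
  seq(from = range[1], to = range[3], by = range[2])
}

bandwidth.grid <- expand.range(c(0.5, 0.5, 2.0))
print(bandwidth.grid)
```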

parameter.values

list/vector, optional
Specifies the candidate values of the bandwidth parameter for parameter selection, for example list(bandwidth = c(0.68129, 1.0, 3.0, 5.0)).

Value

A 'KDE' object with the following attributes:

  • statistics: DataFrame
    Statistics for model evaluation or parameter selection.
    Available only when model evaluation or parameter selection is enabled.

  • optim.param: DataFrame
    Optimal parameters selected.
    Available only when parameter selection is enabled.

Details

Kernel Density Estimation is a nonparametric approach for obtaining a continuous probability density estimate of a random variable from sampled data points.
It is a widely used technique in unsupervised learning, feature engineering, and data modeling.
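As a rough illustration of the idea, a 1-D Gaussian kernel density estimate can be written in a few lines of base R: the density at a point is the average of kernels centered on the training points, scaled by the bandwidth. The data and the helper kde.estimate below are made up for illustration and do not reflect PAL's internals.

```r
# 1-D Gaussian KDE: f(x) = (1 / (n * h)) * sum_i K((x - x_i) / h)
# Hypothetical helper for illustration; not part of hanaml.
kde.estimate <- function(x.eval, x.train, h) {
  vapply(x.eval,
         function(x0) mean(dnorm((x0 - x.train) / h)) / h,
         numeric(1))
}

x.train <- c(-1.2, -0.3, 0.1, 0.4, 1.5)   # made-up sample
h <- 0.5                                   # bandwidth

# Density estimates at three evaluation points.
dens <- kde.estimate(c(-1, 0, 1), x.train, h)
print(dens)

# Sanity check: the estimate integrates to roughly 1 over a wide interval.
total.mass <- integrate(kde.estimate, -10, 10, x.train = x.train, h = h)$value
print(total.mass)
```

A larger bandwidth spreads each kernel over a wider interval and yields a smoother estimate, which is why the bandwidth is the parameter tuned by parameter selection.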

Examples

Input DataFrame data.df.fit:


> data.df.fit$Collect()
   ID         X1          X2
1   0 -0.4257698 -1.39613035
2   1  0.8841004  1.38149350
3   2  0.1341262 -0.03222389
4   3  0.8455036  2.86792078
5   4  0.2884408  1.51333705
6   5 -0.6667847  1.24498042
7   6 -2.1029683 -1.42832694
8   7  0.7699024 -0.47300711
9   8  0.2102913  0.32843074
10  9  0.4823225 -0.43796174

Call the function:

estimation <- hanaml.KDE(
           data = data.df.fit, leaf.size = 10,
           algorithm = "kd-tree", distance.level = "euclidean",
           kernel = "gaussian",
           parameter.values = list(bandwidth = c(0.68129, 1.0, 3.0, 5.0)),
           resampling.method = "loocv", evaluation.metric = "NLL",
           repeat.times = 2, param.search.strategy = "grid",
           random.state = 1, progress.indicator.id = "TEST")

Output:


> estimation$optim.param$Collect()
  PARAM_NAME INT_VALUE DOUBLE_VALUE STRING_VALUE
1  BANDWIDTH        NA            1         <NA>

See also