hanaml.KDE.Rd
hanaml.KDE is an R wrapper for SAP HANA PAL Kernel Density Estimation.
hanaml.KDE(
data = NULL,
key = NULL,
features = NULL,
thread.ratio = NULL,
leaf.size = NULL,
kernel = NULL,
algorithm = NULL,
bandwidth = NULL,
distance.level = NULL,
minkowski.power = NULL,
abs.res.tol = NULL,
rel.res.tol = NULL,
resampling.method = NULL,
evaluation.metric = NULL,
repeat.times = NULL,
param.search.strategy = NULL,
random.state = NULL,
progress.indicator.id = NULL,
parameter.range = NULL,
parameter.values = NULL
)
DataFrame
DataFrame containing the data.
character, optional
Name of the ID column.
Defaults to the first column if not provided.
character or list of characters, optional
Names of the feature columns.
If not provided, defaults to all non-key columns of data.
double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads.
Values between 0 and 1 use up to
that percentage of available threads. Values outside this
range are ignored.
Defaults to 0.
integer, optional
Specifies the number of samples in a tree leaf node.
Only valid when algorithm is 'kd-tree' or 'ball-tree'.
Defaults to 30.
{'gaussian', 'tophat', 'epanechnikov',
'exponential', 'linear', 'cosine'}, optional
Specifies kernel function used for KDE.
'gaussian'
'tophat'
'epanechnikov'
'exponential'
'linear'
'cosine'
Defaults to 'gaussian'.
{'brute-force', 'kd-tree', 'ball-tree'}, optional
Specifies data structure used to speed up the calculation
process.
'brute-force'
use brute force searching.
'kd-tree'
use KD-tree searching.
'ball-tree'
use Ball-tree searching.
Defaults to 'brute-force'.
double, optional
Specifies the bandwidth.
A value of 0 means that the bandwidth is determined by the
internal optimizer.
Defaults to 0.
{'manhattan', 'euclidean',
'minkowski', 'chebyshev'}, optional
Specifies the norm used to compute distance between the
train data and evaluation data.
'manhattan'
manhattan norm (l1 norm).
'euclidean'
euclidean norm (l2 norm).
'minkowski'
minkowski norm (p-norm).
'chebyshev'
chebyshev norm (maximum norm).
Defaults to 'euclidean'.
double, optional
Specifies the power p of the Minkowski distance (p-norm).
Only valid when distance.level is 'minkowski'.
Defaults to 3.0.
double, optional
Specifies the desired absolute tolerance.
A larger tolerance trades accuracy for shorter computation time.
Defaults to 0.
double, optional
Specifies the desired relative tolerance.
A larger tolerance trades accuracy for shorter computation time.
Defaults to 0.
{'loocv'}, optional
Specifies the resampling method used for model evaluation and
parameter selection. Only 'loocv' is supported.
If no value is specified for this parameter, neither model
evaluation nor parameter selection is activated.
{'NLL'}, optional
Specifies the evaluation metric for model evaluation or parameter
selection. Only 'NLL' is supported.
If not specified, neither model evaluation
nor parameter selection is activated.
numeric, optional
Specifies the number of times resampling is repeated.
Defaults to 1.
{"grid", "random"}, optional
Specifies the method to activate parameter selection.
If not specified, parameter selection is not triggered.
double, optional
Specifies the seed for random number generation.
A value of 0 uses the current system time as the seed;
any other value is used as the seed itself.
Defaults to 0.
character, optional
Sets the ID of the progress indicator for model evaluation or
parameter selection.
No progress indicator is active if no value is provided.
list/vector, optional
Specifies the range of the bandwidth parameter for parameter
selection.
The range is given by 3 numbers in the form
c(start, step, end).
list/vector, optional
Specifies candidate values of the bandwidth parameter for parameter
selection, for example list(bandwidth = c(0.68129, 1.0, 3.0, 5.0)).
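The distance.level and minkowski.power arguments above select among four standard vector norms. The following is a minimal sketch in base R of the textbook definitions, intended only to show how the options differ. The helper vec_distance is hypothetical and not part of hanaml, and the PAL implementation may differ in detail.

```r
# Hypothetical helper (not part of hanaml): distance between two numeric
# vectors under the norms named by distance.level.
vec_distance <- function(x, y, level = "euclidean", power = 3.0) {
  d <- abs(x - y)
  switch(level,
    manhattan = sum(d),                    # l1 norm
    euclidean = sqrt(sum(d^2)),            # l2 norm
    minkowski = sum(d^power)^(1 / power),  # p-norm, p = power
    chebyshev = max(d),                    # maximum norm
    stop("unknown distance level: ", level)
  )
}

a <- c(0, 0)
b <- c(3, 4)
vec_distance(a, b, "manhattan")             # 7
vec_distance(a, b, "euclidean")             # 5
vec_distance(a, b, "chebyshev")             # 4
vec_distance(a, b, "minkowski", power = 3)  # 91^(1/3), about 4.498
```

With power = 1 the Minkowski distance reduces to the Manhattan distance, and with power = 2 to the Euclidean distance.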
A 'KDE' object with the following attributes:
statistics: DataFrame
Statistics for model evaluation or parameter selection.
Available only when model evaluation or parameter selection is enabled.
optim.param: DataFrame
Optimal parameters selected.
Available only when parameter-selection is enabled.
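To illustrate what a grid search with resampling.method = 'loocv' and evaluation.metric = 'NLL' does conceptually, here is a minimal sketch in base R. It assumes a one-dimensional Gaussian kernel and a summed negative log-likelihood; the helper loocv_nll is hypothetical and not part of hanaml, and PAL may aggregate or normalize the metric differently. Because it uses only the X1 column of the example input, the selected value need not match the multivariate result reported by the function.

```r
# Hypothetical helper (not part of hanaml): leave-one-out negative
# log-likelihood of a 1-D Gaussian KDE with bandwidth h.
loocv_nll <- function(x, h) {
  n <- length(x)
  # k[i, j] is the Gaussian kernel centred at x[j], evaluated at x[i].
  k <- outer(x, x, function(a, b) dnorm(a, mean = b, sd = h))
  diag(k) <- 0                   # leave point i out of its own estimate
  dens <- rowSums(k) / (n - 1)   # leave-one-out density at each x[i]
  -sum(log(dens))
}

# X1 column of the example input shown on this page.
x1 <- c(-0.4257698, 0.8841004, 0.1341262, 0.8455036, 0.2884408,
        -0.6667847, -2.1029683, 0.7699024, 0.2102913, 0.4823225)

# Candidate bandwidths, as passed through parameter.values in the example.
candidates <- c(0.68129, 1.0, 3.0, 5.0)

# Grid search: score every candidate and keep the one with the lowest NLL.
nll <- sapply(candidates, function(h) loocv_nll(x1, h))
best <- candidates[which.min(nll)]
```

In this sketch, nll plays the role of the per-candidate scores reported in statistics, and best corresponds to the BANDWIDTH row of optim.param.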
Kernel Density Estimation is a nonparametric approach to
obtain a continuous probability density estimate of a random
variable from sampled data points.
It is widely used in unsupervised learning,
feature engineering, and data modeling.
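As a concrete illustration of the idea, a one-dimensional Gaussian kernel density estimate is the average of normal densities centred at the samples, with the bandwidth as their standard deviation. The base R sketch below assumes this textbook form; kde_gaussian is a hypothetical helper, not part of hanaml, and does not reproduce the PAL implementation.

```r
# Hypothetical helper (not part of hanaml): 1-D Gaussian KDE evaluated at
# each query point as the mean of normal densities centred at the samples.
kde_gaussian <- function(query, samples, h) {
  sapply(query, function(q) mean(dnorm(q, mean = samples, sd = h)))
}

samples <- c(-1, 0, 1)
grid <- seq(-6, 6, by = 0.01)
dens <- kde_gaussian(grid, samples, h = 0.5)

# A density estimate is non-negative and integrates to about 1
# (checked here with a simple Riemann sum over the grid).
all(dens >= 0)
sum(dens) * 0.01
```

A smaller bandwidth gives a spikier estimate that follows individual samples, while a larger one gives a smoother estimate.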
Input DataFrame data.df.fit:
> data.df.fit$Collect()
ID X1 X2
1 0 -0.4257698 -1.39613035
2 1 0.8841004 1.38149350
3 2 0.1341262 -0.03222389
4 3 0.8455036 2.86792078
5 4 0.2884408 1.51333705
6 5 -0.6667847 1.24498042
7 6 -2.1029683 -1.42832694
8 7 0.7699024 -0.47300711
9 8 0.2102913 0.32843074
10 9 0.4823225 -0.43796174
Call the function:
estimation <- hanaml.KDE(
data = data.df.fit, leaf.size = 10,
algorithm = "kd-tree", distance.level = "euclidean",
kernel = "gaussian",
parameter.values = list(bandwidth = c(0.68129, 1.0, 3.0, 5.0)),
resampling.method = "loocv", evaluation.metric = "NLL",
repeat.times = 2, param.search.strategy = "grid",
random.state = 1, progress.indicator.id = "TEST")
Output:
> estimation$optim.param$Collect()
PARAM_NAME INT_VALUE DOUBLE_VALUE STRING_VALUE
1 BANDWIDTH NA 1 <NA>