hanaml.DBSCAN is an R wrapper for the SAP HANA PAL DBSCAN algorithm.
hanaml.DBSCAN(
data = NULL,
key = NULL,
features = NULL,
minpts = NULL,
eps = NULL,
thread.ratio = NULL,
metric = NULL,
minkowski.power = NULL,
categorical.variable = NULL,
category.weights = NULL,
algorithm = NULL,
save.model = NULL,
string.variable = NULL,
variable.weight = NULL
)
Arguments
| data |
DataFrame
DataFrame containing the data.
|
| key |
character
Name of the ID column.
|
| features |
character or list of characters, optional
Names of features columns.
If not provided, it defaults to all non-key columns of data.
|
| minpts |
integer, optional
The minimum number of points required to form a cluster.
Note that minpts and eps must be provided together by the user;
otherwise, both parameters are determined automatically.
|
| eps |
double, optional
The scan radius.
Note that minpts and eps must be provided together by the user;
otherwise, both parameters are determined automatically.
|
| thread.ratio |
double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads. Values between 0 and 1 will use up to
that percentage of available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads
used is then heuristically determined.
Defaults to -1.
|
| metric |
character, optional
Ways to compute the distance between two points. Valid metric options include:
"manhattan"
"euclidean"
"minkowski"
"chebyshev"
"standardized.euclidean"
"cosine"
Defaults to "euclidean".
|
| minkowski.power |
integer, optional
When "minkowski" is chosen for metric, this parameter
controls the value of the power.
Only applicable when metric is "minkowski".
Defaults to 3.
|
| categorical.variable |
character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior is dependent on what input is given:
Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.
|
| category.weights |
double, optional
Represents the weight of category attributes.
Defaults to 0.707.
|
| algorithm |
{"brute.force", "kd.tree"}, optional
Ways to search for neighbours.
Defaults to "kd.tree".
|
| save.model |
logical, optional
If TRUE, the generated model will be saved.
save.model must be TRUE to call.
Defaults to TRUE.
|
| string.variable |
character or list of characters, optional
Indicates a string column storing non-categorical data.
Levenshtein distance is used to calculate similarity between two strings. Ignored if it is not a string column.
Defaults to NULL.
|
| variable.weight |
named list, optional
Specifies the weight of a variable participating in distance calculation.
The value must be greater than or equal to 0. Defaults to 1 for variables not specified.
Defaults to NULL.
|
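To give some intuition for the metric and minkowski.power arguments, the following base R sketch computes several of the listed distances between two numeric points. It is an illustration only, not the PAL implementation: point_distance is a hypothetical helper, the cosine formula (1 minus cosine similarity) is an assumption, and "standardized.euclidean" is left out because it needs per-feature variances from the full data set. The two sample points are the V1 and V2 values of rows 1 and 4 in the example data shown in the Examples section.

```r
# Hypothetical helper (not part of hanaml): distance between two numeric
# vectors under one of the metric names listed above.
point_distance <- function(x, y, metric = "euclidean", power = 3) {
  stopifnot(is.numeric(x), is.numeric(y), length(x) == length(y))
  d <- abs(x - y)
  switch(metric,
    manhattan = sum(d),                      # sum of absolute differences
    euclidean = sqrt(sum(d^2)),              # straight-line distance
    minkowski = sum(d^power)^(1 / power),    # generalises the two above
    chebyshev = max(d),                      # largest single-coordinate gap
    # Assumption: cosine distance taken as 1 - cosine similarity.
    cosine    = 1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2))),
    stop("unsupported metric: ", metric)
  )
}

a <- c(0.10, 0.10)  # V1, V2 of row 1 in the example data
b <- c(0.11, 0.11)  # V1, V2 of row 4 in the example data

print(point_distance(a, b, "manhattan"))
print(point_distance(a, b, "euclidean"))
print(point_distance(a, b, "minkowski", power = 3))
print(point_distance(a, b, "chebyshev"))
```

With power = 1 the Minkowski distance coincides with Manhattan, and with power = 2 it coincides with Euclidean, which is why minkowski.power only matters when metric is "minkowski".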
Value
Returns a "DBSCAN" object with the following attributes:
Examples
Input DataFrame data:
> data$Collect()
ID V1 V2 V3
1 1 0.10 0.10 B
2 2 0.11 0.10 A
3 3 0.10 0.11 C
4 4 0.11 0.11 B
......
28 28 16.11 16.11 A
29 29 20.11 20.12 C
30 30 15.12 15.11 A
Call the function
> DBSCAN <- hanaml.DBSCAN(data, key = "ID",
                          thread.ratio = 0.2,
                          metric = "manhattan")
Output:
> DBSCAN$labels$Collect()
ID CLUSTER.ID
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
......
28 28 -1
29 29 -1
30 30 -1
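Once the labels have been collected into a local data frame, they can be summarised with base R. The sketch below is a minimal illustration under two assumptions: the labels data frame is a local stand-in built only from the rows printed above (not the full 30-row result), and a CLUSTER.ID of -1 is taken to mark noise points that belong to no cluster.

```r
# Local stand-in for DBSCAN$labels$Collect(), using only the rows printed above.
labels <- data.frame(
  ID = c(1, 2, 3, 4, 5, 28, 29, 30),
  CLUSTER.ID = c(0, 0, 0, 0, 0, -1, -1, -1)
)

# Assumption: CLUSTER.ID == -1 marks noise points.
is_noise <- labels$CLUSTER.ID == -1

# IDs of the points flagged as noise.
noise_ids <- labels$ID[is_noise]

# Number of points in each real cluster (noise excluded).
cluster_sizes <- table(labels$CLUSTER.ID[!is_noise])

print(noise_ids)
print(cluster_sizes)
```

Separating noise from real clusters before counting avoids treating -1 as if it were an ordinary cluster in later summaries.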
See also