DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

hanaml.DBSCAN is an R wrapper for the SAP HANA PAL DBSCAN algorithm.
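
For orientation, the idea behind density-based clustering can be sketched in a few lines of base R. This is an illustrative brute-force version, not the PAL implementation: the function name `dbscan_sketch` is hypothetical, it uses Euclidean distance only, and it assumes that a point counts itself toward `minpts`, which may differ from PAL's convention.

```r
# Minimal brute-force DBSCAN sketch in base R (illustration only, not PAL).
dbscan_sketch <- function(x, eps, minpts) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- as.matrix(dist(x, method = "euclidean"))  # pairwise distances

  # Indices of all points within eps of each point (the point itself included).
  neighbours <- lapply(seq_len(n), function(i) unname(which(d[i, ] <= eps)))
  # A core point has at least minpts points in its eps-neighbourhood.
  is_core <- vapply(neighbours, length, integer(1)) >= minpts

  labels <- rep(-1L, n)  # -1 means "noise or not yet assigned"
  cluster <- -1L
  for (i in seq_len(n)) {
    if (!is_core[i] || labels[i] != -1L) next
    cluster <- cluster + 1L  # cluster IDs start at 0
    labels[i] <- cluster
    queue <- neighbours[[i]]
    # Grow the cluster through every density-reachable point.
    while (length(queue) > 0) {
      j <- queue[1]
      queue <- queue[-1]
      if (labels[j] != -1L) next
      labels[j] <- cluster
      if (is_core[j]) queue <- c(queue, neighbours[[j]])
    }
  }
  labels
}

# Four tightly packed points and one distant point.
pts <- rbind(c(0.10, 0.10), c(0.11, 0.10), c(0.10, 0.11),
             c(0.11, 0.11), c(16.11, 16.11))
sketch_labels <- dbscan_sketch(pts, eps = 0.05, minpts = 3)
print(sketch_labels)
```

Points that are never reached from a core point keep the label -1, which is how isolated observations end up outside every cluster.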

hanaml.DBSCAN(
  data = NULL,
  key = NULL,
  features = NULL,
  minpts = NULL,
  eps = NULL,
  thread.ratio = NULL,
  metric = NULL,
  minkowski.power = NULL,
  categorical.variable = NULL,
  category.weights = NULL,
  algorithm = NULL,
  save.model = NULL,
  string.variable = NULL,
  variable.weight = NULL
)

Arguments

data	`DataFrame` DataFrame containing the data.
key	`character` Name of the ID column.
features	`character or list of characters, optional` Names of the feature columns. If not provided, it defaults to all non-key columns of `data`.
minpts	`integer, optional` The minimum number of points required to form a cluster. minpts and eps must be provided together; otherwise both parameters are determined automatically.
eps	`double, optional` The scan radius. minpts and eps must be provided together; otherwise both parameters are determined automatically.
thread.ratio	`double, optional` Controls the proportion of available threads that this function can use. The value ranges from 0 to 1, where 0 indicates a single thread and 1 indicates all available threads. Values between 0 and 1 use up to that percentage of available threads. Values outside the range from 0 to 1 are ignored, and the number of threads is then determined heuristically. Defaults to -1.
metric	`character, optional` Method used to compute the distance between two points. Valid options are `"manhattan"`, `"euclidean"`, `"minkowski"`, `"chebyshev"`, `"standardized.euclidean"`, and `"cosine"`. Defaults to "euclidean".
minkowski.power	`integer, optional` The power used by the Minkowski metric. Only applicable when metric is "minkowski". Defaults to 3.
categorical.variable	`character or list/vector of characters, optional` Indicates which features should be treated as categorical variables. The default behavior depends on the column type: "VARCHAR" and "NVARCHAR" columns are categorical; "INTEGER" and "DOUBLE" columns are continuous. Valid only for variables of "INTEGER" type; ignored otherwise. No default value.
category.weights	`double, optional` The weight of categorical attributes. Defaults to 0.707.
algorithm	`{"brute.force", "kd.tree"}, optional` Method used to search for neighbours. Defaults to "kd.tree".
save.model	`logical, optional` If TRUE, the generated model will be saved. save.model must be TRUE to call. Defaults to TRUE.
string.variable	`character or list of characters, optional` Indicates a string column storing non-categorical data. Levenshtein distance is used to calculate the similarity between two strings. Ignored if it is not a string column. Defaults to NULL.
variable.weight	`named list, optional` Specifies the weight of a variable participating in the distance calculation. The value must be greater than or equal to 0. Variables not specified are given a weight of 1. Defaults to NULL.
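
To make the `metric` options concrete, the following base R sketch computes each listed distance for a pair of numeric vectors. The function names are hypothetical, and the formulas are the standard textbook definitions; PAL's exact definitions (for example, how the cosine distance is derived from the cosine similarity, or which variances the standardized Euclidean metric uses) are assumptions here and may differ.

```r
# Textbook distance functions for two numeric vectors x and y
# (illustration only; not taken from the PAL implementation).
manhattan <- function(x, y) sum(abs(x - y))
euclidean <- function(x, y) sqrt(sum((x - y)^2))
# p plays the role of minkowski.power; 3 mirrors the documented default.
minkowski <- function(x, y, p = 3) sum(abs(x - y)^p)^(1 / p)
chebyshev <- function(x, y) max(abs(x - y))
# Cosine distance taken as 1 - cosine similarity (an assumption).
cosine <- function(x, y) {
  1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}
# Standardized Euclidean: each squared difference is divided by a
# per-feature variance v (v is a hypothetical argument).
standardized_euclidean <- function(x, y, v) sqrt(sum((x - y)^2 / v))

x <- c(0, 0)
y <- c(3, 4)
metric_demo <- c(
  manhattan = manhattan(x, y),
  euclidean = euclidean(x, y),
  minkowski = minkowski(x, y, p = 3),
  chebyshev = chebyshev(x, y)
)
print(metric_demo)
```

Because `eps` is a radius in the chosen metric, the same `eps` value can select very different neighbourhoods when `metric` changes.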

Value

Returns a "DBSCAN" object with the following attributes:

labels : DataFrame
Label assigned to each sample.
model : DataFrame
PMML model. Set to NULL if no PMML model was requested.

Examples

Input DataFrame data:

> data$Collect()
    ID     V1     V2 V3
1    1   0.10   0.10  B
2    2   0.11   0.10  A
3    3   0.10   0.11  C
4    4   0.11   0.11  B
......
28  28  16.11  16.11  A
29  29  20.11  20.12  C
30  30  15.12  15.11  A

Call the function:

> DBSCAN <- hanaml.DBSCAN(data, key = "ID",
                          thread.ratio = 0.2,
                          metric = "manhattan")

Output:

> DBSCAN$labels$Collect()
      ID    CLUSTER.ID
1      1             0
2      2             0
3      3             0
4      4             0
5      5             0
......
28    28            -1
29    29            -1
30    30            -1
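
Once the labels have been collected into a local data frame, they can be summarized with base R. The sketch below rebuilds only the rows visible in the output above as a stand-in for `DBSCAN$labels$Collect()`, and it assumes that a CLUSTER.ID of -1 marks noise points, which is the usual DBSCAN convention but is not stated explicitly on this page.

```r
# Stand-in for DBSCAN$labels$Collect(): only the rows shown in the output above.
labels <- data.frame(
  ID = c(1, 2, 3, 4, 5, 28, 29, 30),
  CLUSTER.ID = c(0, 0, 0, 0, 0, -1, -1, -1)
)

# Assumption: -1 marks points that belong to no cluster (noise).
is_noise <- labels$CLUSTER.ID == -1

# Number of points per cluster, noise excluded.
cluster_sizes <- table(labels$CLUSTER.ID[!is_noise])
# IDs of the points treated as noise.
noise_ids <- labels$ID[is_noise]

print(cluster_sizes)
print(noise_ids)
```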
