DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

hanaml.DBSCAN is an R wrapper for the SAP HANA PAL DBSCAN algorithm.

hanaml.DBSCAN(
  data = NULL,
  key = NULL,
  features = NULL,
  minpts = NULL,
  eps = NULL,
  thread.ratio = NULL,
  metric = NULL,
  minkowski.power = NULL,
  categorical.variable = NULL,
  category.weights = NULL,
  algorithm = NULL,
  save.model = NULL,
  string.variable = NULL,
  variable.weight = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character
Name of the ID column.

features

character or list of characters, optional
Names of the feature columns.
If not provided, it defaults to all non-key columns of data.

minpts

integer, optional
The minimum number of points required to form a cluster.
Note that minpts and eps must be provided together; if neither is provided, both parameters are determined automatically.

eps

double, optional
The scan radius.
Note that minpts and eps must be provided together; if neither is provided, both parameters are determined automatically.
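To make the roles of eps and minpts concrete, here is a minimal base R sketch of the usual DBSCAN core-point test. It is an illustration only, not PAL's implementation: the values eps = 0.5 and minpts = 3 are hypothetical, the points are the V1/V2 values of the rows shown in the Examples section, the distance is Euclidean, and counting the point itself among its neighbours is an assumption.

```r
# V1/V2 values of the rows shown in the Examples input (IDs 1-4 and 28-30).
pts <- rbind(
  c(0.10, 0.10), c(0.11, 0.10), c(0.10, 0.11), c(0.11, 0.11),
  c(16.11, 16.11), c(20.11, 20.12), c(15.12, 15.11)
)

eps    <- 0.5  # hypothetical scan radius, chosen for illustration
minpts <- 3    # hypothetical minimum number of points, chosen for illustration

# Pairwise Euclidean distances as a full square matrix.
d <- as.matrix(dist(pts))

# Number of points within eps of each point (assumption: the point itself counts).
neighbour_count <- rowSums(d <= eps)

# A point is a core point if its eps-neighbourhood holds at least minpts points.
is_core <- neighbour_count >= minpts
print(data.frame(V1 = pts[, 1], V2 = pts[, 2], neighbour_count, is_core))
```

With these illustrative values, the four points near (0.1, 0.1) qualify as core points and the three distant points do not.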

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads used is then heuristically determined.
Defaults to -1.

metric

character, optional
Ways to compute the distance between two points. Valid metric options include:

"manhattan"
"euclidean"
"minkowski"
"chebyshev"
"standardized.euclidean"
"cosine"

Defaults to "euclidean".

minkowski.power

integer, optional
When "minkowski" is chosen for metric, this parameter controls the value of the power. Only applicable when metric is "minkowski".
Defaults to 3.
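As a sketch of what the power controls, the general Minkowski distance between two points x and y is (sum(|x_i - y_i|^p))^(1/p). The helper minkowski_distance below is a hypothetical name used only for illustration in base R; it is not part of hanaml and does not claim to reproduce PAL's internal computation.

```r
# Hypothetical helper illustrating the general Minkowski distance formula.
minkowski_distance <- function(x, y, p = 3) {
  stopifnot(length(x) == length(y), p >= 1)
  sum(abs(x - y)^p)^(1 / p)
}

a <- c(0, 0)
b <- c(3, 4)

minkowski_distance(a, b, p = 1)  # p = 1 reduces to the Manhattan distance
minkowski_distance(a, b, p = 2)  # p = 2 reduces to the Euclidean distance
minkowski_distance(a, b, p = 3)  # p = 3 matches the documented default power

# Cross-check against base R's dist() with the same power.
as.numeric(dist(rbind(a, b), method = "minkowski", p = 3))
```

Larger powers give more weight to the coordinate with the largest difference.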

categorical.variable

character or list/vector of characters, optional
Indicates which features should be treated as categorical variables.
The default behavior is dependent on what input is given:

"VARCHAR" and "NVARCHAR": categorical
"INTEGER" and "DOUBLE": continuous.

Valid only for variables of "INTEGER" type; ignored otherwise.
No default value.

category.weights

double, optional
Represents the weight of category attributes.
Defaults to 0.707.

algorithm

{"brute.force", "kd.tree"}, optional
Ways to search for neighbours.
Defaults to "kd.tree".

save.model

logical, optional
If TRUE, the generated model will be saved. save.model must be TRUE to call.
Defaults to TRUE.

string.variable

character or list of characters, optional
Indicates a string column storing non-categorical data. Levenshtein distance is used to calculate similarity between two strings. Ignored if it is not a string column.
Defaults to NULL.
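For intuition about the Levenshtein distance mentioned above, base R's adist() computes a generalized Levenshtein (edit) distance between strings. The sketch below runs locally in R and only illustrates the measure; it is an assumption that PAL's own string distance behaves the same way in every detail (for example, any normalization it may apply).

```r
# Levenshtein distance: the minimum number of single-character insertions,
# deletions and substitutions needed to turn one string into the other.
words <- c("kitten", "sitting", "mitten")

# adist() returns a matrix of pairwise edit distances.
edit_distances <- adist(words)
dimnames(edit_distances) <- list(words, words)
print(edit_distances)

# Distance between one specific pair of strings.
kitten_sitting <- adist("kitten", "sitting")[1, 1]
```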

variable.weight

named list, optional
Specifies the weight of a variable participating in distance calculation. The value must be greater than or equal to 0. Defaults to 1 for variables not specified.
Defaults to NULL.

Value

Returns an R6 object of class "DBSCAN" with the following attributes and methods:
Attributes

labels : DataFrame
Label assigned to each sample.
model : DataFrame
PMML model. Set to NULL if no PMML model was requested.

Methods

CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)

Usage:


   > dbs <- hanaml.DBSCAN(data=df, key="ID")
   > dbs$CreateModelState()

Arguments:

model: DataFrame
DataFrame containing the model for parsing.
Defaults to self$model.
algorithm: character
Specifies the PAL algorithm associated with model.
Defaults to self$pal.algorithm.
func: character
Specifies the functionality for Unified Classification/Regression.
Valid only for object instance of R6Class "UnifiedClassification" or "UnifiedRegression".
Defaults to self$func.
state.description: character
A summary string for the generated model state.
Defaults to "ModelState".
force: logical
Specifies whether or not to replace the existing state for the model.
Defaults to FALSE.

After calling this method, an attribute named state, containing the parsed info for the model, is assigned to the corresponding R6 object.

DeleteModelState(state=NULL)

Usage:
Assuming we have trained a hanaml model and created its model state, like the following:


   > dbs <- hanaml.DBSCAN(data=df, key="ID")
   > dbs$CreateModelState()

After using the model state for real-time scoring, we can delete the state by calling:


   > dbs$DeleteModelState()

Arguments:

state: DataFrame
DataFrame containing the state info.
Defaults to self$state.

After calling this method, the specified model state is cleaned up and the associated memory is released.

Examples

Input DataFrame data:


> data$Collect()
    ID     V1     V2 V3
1    1   0.10   0.10  B
2    2   0.11   0.10  A
3    3   0.10   0.11  C
4    4   0.11   0.11  B
......
28  28  16.11  16.11  A
29  29  20.11  20.12  C
30  30  15.12  15.11  A

Call the function


> DBSCAN <- hanaml.DBSCAN(data, key = "ID",
                          thread.ratio = 0.2,
                          metric = "manhattan")

Output:


> DBSCAN$labels$Collect()
      ID    CLUSTER.ID
1      1             0
2      2             0
3      3             0
4      4             0
5      5             0
......
28    28            -1
29    29            -1
30    30            -1
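
Once collected, the labels can be summarised with ordinary R tools. The sketch below uses a local data.frame holding only the rows shown in the output above as a stand-in for the result of DBSCAN$labels$Collect(), and assumes the PAL convention that a CLUSTER.ID of -1 marks noise points.

```r
# Stand-in for DBSCAN$labels$Collect(), restricted to the rows shown above.
labels <- data.frame(
  ID         = c(1, 2, 3, 4, 5, 28, 29, 30),
  CLUSTER.ID = c(0, 0, 0, 0, 0, -1, -1, -1)
)

# Number of points per cluster ID (assumption: -1 denotes noise).
cluster_sizes <- table(labels$CLUSTER.ID)
print(cluster_sizes)

# IDs of the points labelled as noise.
noise_ids <- labels$ID[labels$CLUSTER.ID == -1]

# Number of actual clusters, excluding the noise label.
n_clusters <- length(setdiff(unique(labels$CLUSTER.ID), -1))
```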
