hanaml.GeoDBSCAN is an R wrapper for the SAP HANA PAL GeoDBSCAN algorithm.

hanaml.GeoDBSCAN(
  data = NULL,
  key = NULL,
  features = NULL,
  minpts = NULL,
  eps = NULL,
  thread.ratio = NULL,
  metric = NULL,
  minkowski.power = NULL,
  algorithm = NULL,
  save.model = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

key

character
Name of ID column.

features

character, optional
Name of the feature column. GeoDBSCAN only supports one feature.
If not provided, it defaults to the first non-ID column.

minpts

integer, optional
The minimum number of points required to form a cluster.
Note that minpts and eps must be provided together; otherwise both parameters are determined automatically.

eps

double, optional
The scan radius.
Note that minpts and eps must be provided together; otherwise both parameters are determined automatically.
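
As a rough illustration of how minpts and eps interact, the base R sketch below marks the core points of a toy data set. It is not the PAL implementation: the point values are hypothetical, the Euclidean metric is assumed, and counting a point as a member of its own neighbourhood is an assumption that may differ from PAL.

```r
# Toy 2-D points (hypothetical values, not the PAL test data).
pts <- matrix(c(0.10, 0.10,
                0.11, 0.10,
                0.10, 0.11,
                0.11, 0.11,
                5.00, 5.00),
              ncol = 2, byrow = TRUE)

eps    <- 0.05  # scan radius
minpts <- 3     # minimum neighbourhood size

# Pairwise Euclidean distances between all points.
d <- as.matrix(dist(pts, method = "euclidean"))

# Neighbourhood size of each point within eps.
# The point itself is counted here (an assumption of this sketch).
n_neighbours <- rowSums(d <= eps)

# A point is a core point when its neighbourhood holds at least minpts points.
is_core <- n_neighbours >= minpts
```

With these values the four tightly grouped points are core points, while the isolated fifth point is not and would end up as noise.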

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads.
Values outside the range from 0 to 1 are ignored, and the actual number of threads used is then heuristically determined.
Defaults to -1.

metric

character, optional
Ways to compute the distance between two points. Valid metric options include:

  • 'manhattan'

  • 'euclidean'

  • 'minkowski'

  • 'chebyshev'

  • 'standardized.euclidean'

  • 'cosine'

Defaults to "euclidean".
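
For orientation, the base R sketch below computes several of these metrics for two hypothetical points using their textbook definitions. Whether PAL uses exactly these formulas is an assumption, in particular taking the cosine distance as one minus the cosine similarity. The standardized Euclidean metric is left out because it depends on per-dimension variances of the whole data set.

```r
# Two hypothetical 2-D points.
a <- c(1, 0)
b <- c(4, 4)

manhattan <- sum(abs(a - b))        # sum of absolute differences: 7
euclidean <- sqrt(sum((a - b)^2))   # straight-line distance: 5
chebyshev <- max(abs(a - b))        # largest coordinate difference: 4

# Minkowski distance with power p; p = 3 mirrors the documented
# default of minkowski.power.
p <- 3
minkowski <- sum(abs(a - b)^p)^(1 / p)

# Cosine distance, assumed here to be 1 - cosine similarity.
cosine <- 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
```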

minkowski.power

integer, optional
Controls the value of the power when metric is 'minkowski'. Only applicable when metric is 'minkowski'.
Defaults to 3.

algorithm

{"brute.force", "kd.tree"}, optional
Ways to search for neighbours.
Defaults to "kd.tree".

save.model

logical, optional
If TRUE, the generated model will be saved. save.model must be TRUE to call.
Defaults to TRUE.

Value

A "GeoDBSCAN" object with the following attributes:

  • labels : DataFrame
    Label assigned to each sample. -1 means the point is labeled as noise.

  • model : DataFrame
    PMML model.
    Set to NULL if no PMML model was requested.

Examples

In SAP HANA, the test table PAL_GEO_DBSCAN_DATA_TBL can be created by the following SQL:

 CREATE COLUMN TABLE PAL_GEO_DBSCAN_DATA_TBL (
                            "ID" INTEGER,
                            "POINT" ST_GEOMETRY);

Input DataFrame for clustering:

> data$Collect()
   ID                      POINT
1   1     SRID=0;POINT (0.1 0.1)
2   2    SRID=0;POINT (0.11 0.1)
3   3    SRID=0;POINT (0.1 0.11)
4   4   SRID=0;POINT (0.11 0.11)
5   5   SRID=0;POINT (0.12 0.11)
......
28 28 SRID=0;POINT (16.11 16.11)
29 29 SRID=0;POINT (20.11 20.12)
30 30 SRID=0;POINT (15.12 15.11)

Call the function:

> GeoDBSCAN <- hanaml.GeoDBSCAN(data,
                                key = "ID",
                                thread.ratio = 0.2,
                                metric = "manhattan")

Output:

> GeoDBSCAN$labels$Collect()
      ID    CLUSTER.ID
1      1             0
2      2             0
3      3             0
4      4             0
5      5             0
......
28    28            -1
29    29            -1
30    30            -1
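
Once the labels have been collected into a local data.frame, the noise points can be separated with ordinary R. The sketch below rebuilds only the rows visible in the output above as a stand-in (the elided rows are not reproduced); the variable names are hypothetical.

```r
# Local stand-in for the collected labels, limited to the rows shown above.
labels <- data.frame(
  ID = c(1, 2, 3, 4, 5, 28, 29, 30),
  CLUSTER.ID = c(0, 0, 0, 0, 0, -1, -1, -1)
)

# IDs of the points labelled as noise (CLUSTER.ID of -1).
noise_ids <- labels$ID[labels$CLUSTER.ID == -1]

# Number of points per label, noise included.
counts <- table(labels$CLUSTER.ID)
```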