GeometryDBSCAN

class hana_ml.algorithms.pal.clustering.GeometryDBSCAN(minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, algorithm=None, save_model=True)

GeometryDBSCAN is a variant of DBSCAN that accepts only geometry points as input data. It works with geospatial data where distances between points can be computed in a geometrically efficient manner. Currently, GeometryDBSCAN accepts only 2-D points.

Parameters:

minpts : int, optional

Specifies the minimum number of points required to form a cluster.

Note

minpts and eps must be provided together; otherwise, both parameters are determined automatically.

eps : float, optional

Specifies the scanning radius around a point.

Note

minpts and eps must be provided together; otherwise, both parameters are determined automatically.

thread_ratio : float, optional

Specifies the ratio of available threads to use, from 0 to 1. A value of 0 means a single thread, and a value of 1 means all currently available threads. Values outside this range are ignored, and the function determines the number of threads heuristically.

Defaults to -1.

metric : {'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'standardized_euclidean', 'cosine'}, optional

Defines the metric used to compute the distance between two points.

Defaults to 'euclidean'.

minkowski_power : int, optional

Specifies the power used by the Minkowski distance.

Only applicable when metric is 'minkowski'.

Defaults to 3.

algorithm : {'brute-force', 'kd-tree'}, optional

Specifies the method used to search for neighbouring data points.

Defaults to 'kd-tree'.

save_model : bool, optional

If set to True, the generated model is saved.

save_model must be set to True in order to use the function predict().

Defaults to True.
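
To make the interplay of eps, minpts, metric and minkowski_power concrete, the following is a minimal pure-Python sketch. It is not the PAL implementation. The helper names minkowski_distance and core_points and the sample points are hypothetical, and the sketch assumes the usual DBSCAN convention that a point counts as its own neighbour.

```python
def minkowski_distance(p, q, power=3):
    """Minkowski distance between two points given as coordinate tuples.

    power=1 yields the Manhattan distance, power=2 the Euclidean distance.
    """
    return sum(abs(a - b) ** power for a, b in zip(p, q)) ** (1.0 / power)


def core_points(points, eps, minpts, power=3):
    """Return the indices of points that have at least `minpts` neighbours
    (the point itself included, by assumption) within radius `eps`."""
    cores = []
    for i, p in enumerate(points):
        neighbours = sum(
            1 for q in points if minkowski_distance(p, q, power) <= eps
        )
        if neighbours >= minpts:
            cores.append(i)
    return cores


# Illustrative 2-D points: three close together and one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]

manhattan = minkowski_distance((0.0, 0.0), (3.0, 4.0), power=1)
euclidean = minkowski_distance((0.0, 0.0), (3.0, 4.0), power=2)
cores = core_points(points, eps=1.5, minpts=3, power=2)
```

In this sketch the three nearby points each have three neighbours within eps=1.5 and so qualify as core points, while the distant point does not. A smaller eps or a larger minpts would leave fewer core points.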

Examples

In SAP HANA, the test table PAL_GEO_DBSCAN_DATA_TBL can be created by the following SQL:

CREATE COLUMN TABLE PAL_GEO_DBSCAN_DATA_TBL (
    "ID" INTEGER,
    "POINT" ST_GEOMETRY);

Input DataFrame df:

>>> df = conn.table("PAL_GEO_DBSCAN_DATA_TBL")
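
The CREATE statement above produces an empty table, so rows must be inserted before fitting. As a hedged sketch, the hypothetical helper insert_statement below builds INSERT statements that store each point with the SAP HANA spatial constructor NEW ST_POINT(x, y). The coordinates are illustrative only and are not the data behind the outputs shown in this section.

```python
def insert_statement(table, row_id, x, y):
    """Build one INSERT statement that stores (x, y) as a 2-D geometry point."""
    return (
        f"INSERT INTO {table} VALUES "
        f"({int(row_id)}, NEW ST_POINT({float(x)}, {float(y)}))"
    )


# Illustrative (ID, x, y) triples; replace them with real data.
sample_points = [(1, 0.1, 0.1), (2, 0.11, 0.1), (3, 0.1, 0.11)]

statements = [
    insert_statement("PAL_GEO_DBSCAN_DATA_TBL", row_id, x, y)
    for row_id, x, y in sample_points
]
```

Each generated statement can then be executed against the database, for example through a database cursor.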

Create a GeometryDBSCAN instance:

>>> geo_dbscan = GeometryDBSCAN(thread_ratio=0.2, metric='manhattan')

Perform fit():

>>> geo_dbscan.fit(data=df, key='ID')

Output:

>>> geo_dbscan.labels_.collect()
     ID  CLUSTER_ID
0     1           0
1     2           0
2     3           0
......
28   29          -1
29   30          -1

>>> geo_dbscan.model_.collect()
    ROW_INDEX    MODEL_CONTENT
0      0         {"Algorithm":"DBSCAN","Cluster":[{"ClusterID":...

Perform fit_predict():

>>> result = geo_dbscan.fit_predict(data=df, key='ID')

Output:

>>> result.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
......
28  29          -1
29  30          -1
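
The labels can be summarised with a few lines of plain Python. The sketch below uses only the rows visible in the output above and assumes the usual DBSCAN convention that a CLUSTER_ID of -1 marks a noise point. The names rows, sizes and noise_ids are hypothetical.

```python
from collections import Counter

# The rows visible in the output above, as (ID, CLUSTER_ID) pairs.
# With a live connection, these pairs could be taken from result.collect().
rows = [(1, 0), (2, 0), (3, 0), (29, -1), (30, -1)]

NOISE = -1  # assumed label for points that belong to no cluster

# Number of points per cluster, noise excluded.
sizes = Counter(cluster_id for _, cluster_id in rows if cluster_id != NOISE)

# IDs of the points labelled as noise.
noise_ids = [row_id for row_id, cluster_id in rows if cluster_id == NOISE]
```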

Attributes:

labels_ : DataFrame

Label assigned to each sample.

model_ : DataFrame

Model content. Set to None if save_model is False.

Methods

`fit`(data[, key, features])	Fit the model to the training dataset.
`fit_predict`(data[, key, features])	Fit with the dataset and return the labels.
`get_model_metrics`()	Get the model metrics.
`get_score_metrics`()	Get the score metrics.

fit(data, key=None, features=None)

Fit the model to the training dataset.

Parameters:

data : DataFrame

DataFrame containing the data.

It must contain at least two columns: an ID column and a column of data type 'ST_GEOMETRY' that stores 2-D geometry points.

key : str, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

features : str, optional

Name of the column that stores the geometry points. Its data type must be 'ST_GEOMETRY'.

If not provided, it defaults to the first non-key column.
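
The defaulting rules for key and features can be sketched in plain Python. The helper resolve_columns is hypothetical and is not part of hana_ml. It only mirrors the rules described above: key falls back to the index column, and features falls back to the first non-key column.

```python
def resolve_columns(columns, key=None, index=None, features=None):
    """Apply the documented defaults for `key` and `features`.

    columns  -- column names of the input data, in order
    index    -- name of the index column, or None if no index is set
    """
    if key is None:
        key = index  # fall back to the index column if it is set
    if key is None:
        raise ValueError("key must be provided when data has no index column")
    if features is None:
        # Default to the first column that is not the key.
        features = next((c for c in columns if c != key), None)
    if features is None:
        raise ValueError("data must contain a non-key column")
    return key, features


explicit = resolve_columns(["ID", "POINT"], key="ID")
from_index = resolve_columns(["ID", "POINT"], index="ID")
```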

fit_predict(data, key=None, features=None)

Fit with the dataset and return the labels.

Parameters:

data : DataFrame

DataFrame containing the data. It must contain at least two columns: an ID column and a column that stores 2-D geometry points.

key : str, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

features : str, optional

Name of the column that stores the 2-D geometry points.

If not provided, it defaults to the first non-key column.

Returns:

DataFrame: Label assigned to each sample.

get_model_metrics()

Get the model metrics.

Returns:

DataFrame: The model metrics.

get_score_metrics()

Get the score metrics.

Returns:

DataFrame: The score metrics.

Inherited Methods from PALBase

Besides the methods listed above, the GeometryDBSCAN class also inherits methods from the PALBase class. Please refer to PAL Base for more details.