GeometryDBSCAN

class hana_ml.algorithms.pal.clustering.GeometryDBSCAN(minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, algorithm=None, save_model=True)

This class is a geometry version of DBSCAN that accepts only geometry points as input data. Currently, only 2-D points are supported.

Parameters:

minpts : int, optional

The minimum number of points required to form a cluster.

Note

Provide minpts and eps together; otherwise, both parameters are determined automatically.

eps : float, optional

The scan radius.

Note

Provide minpts and eps together; otherwise, both parameters are determined automatically.

thread_ratio : float, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to -1.

metric : {'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'standardized_euclidean', 'cosine'}, optional

Method used to compute the distance between two points.

Defaults to 'euclidean'.

minkowski_power : int, optional

The power (exponent) used in the Minkowski distance.

Only applicable when metric is 'minkowski'.

Defaults to 3.

algorithm : {'brute-force', 'kd-tree'}, optional

Method used to search for neighbours.

Defaults to 'kd-tree'.

save_model : bool, optional

If True, the generated model is saved.

save_model must be True in order to call predict().

Defaults to True.
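To make the roles of minpts, eps, metric and minkowski_power more concrete, here is a minimal pure-Python sketch of textbook DBSCAN over 2-D points. It is not the PAL implementation: the function names distance and dbscan are hypothetical, only four of the six metrics are covered, neighbours are found by brute force, and a point is assumed to count as its own neighbour when compared against minpts.

```python
import math


def distance(p, q, metric="euclidean", minkowski_power=3):
    """Distance between two 2-D points under some of the listed metrics."""
    dx, dy = abs(p[0] - q[0]), abs(p[1] - q[1])
    if metric == "manhattan":
        return dx + dy
    if metric == "euclidean":
        return math.hypot(dx, dy)
    if metric == "chebyshev":
        return max(dx, dy)
    if metric == "minkowski":
        power = minkowski_power
        return (dx ** power + dy ** power) ** (1.0 / power)
    raise ValueError(f"metric not covered by this sketch: {metric}")


def dbscan(points, eps, minpts, metric="euclidean", minkowski_power=3):
    """Textbook DBSCAN with brute-force neighbour search; -1 marks noise."""

    def neighbours(i):
        # Assumption: a point counts as its own neighbour (distance 0 <= eps).
        return [j for j in range(len(points))
                if distance(points[i], points[j], metric, minkowski_power) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < minpts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            reachable = neighbours(j)
            if len(reachable) >= minpts:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels


# Made-up sample: two dense groups and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(sample, eps=1.5, minpts=3))  # [0, 0, 0, 0, 1, 1, 1, -1]
```

In this sketch, a Minkowski power of 1 reproduces the Manhattan distance and a power of 2 reproduces the Euclidean distance, which is one way to reason about the effect of minkowski_power.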

Examples

In SAP HANA, the test table PAL_GEO_DBSCAN_DATA_TBL can be created by the following SQL:

CREATE COLUMN TABLE PAL_GEO_DBSCAN_DATA_TBL (
  "ID" INTEGER,
  "POINT" ST_GEOMETRY);
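The table above is created empty. As a rough sketch of how it might be populated, the helper below builds INSERT statements for a list of 2-D coordinates. The helper name point_insert_statements and the sample coordinates are hypothetical, and the use of the ST_POINT(x, y) constructor from SAP HANA's spatial SQL is an assumption about how the points are supplied; the generated statements would still need to be executed through a SQL client or connection of your choice.

```python
def point_insert_statements(table, points, start_id=1):
    """Build one INSERT statement per (x, y) pair, numbering IDs from start_id."""
    statements = []
    for offset, (x, y) in enumerate(points):
        row_id = start_id + offset
        # float() rejects non-numeric input before it reaches the SQL text.
        statements.append(
            f"INSERT INTO {table} VALUES "
            f"({row_id}, NEW ST_POINT({float(x)}, {float(y)}))"
        )
    return statements


# Made-up coordinates, for illustration only.
sample_points = [(0.1, 0.1), (0.2, 0.15), (5.0, 5.0)]
for statement in point_insert_statements("PAL_GEO_DBSCAN_DATA_TBL", sample_points):
    print(statement)
```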

Then, create the input DataFrame df for clustering:

>>> df = conn.table("PAL_GEO_DBSCAN_DATA_TBL")

Create a GeometryDBSCAN instance:

>>> geo_dbscan = GeometryDBSCAN(thread_ratio=0.2, metric='manhattan')

Perform fit on the given data:

>>> geo_dbscan.fit(data=df, key='ID')

Expected output:

>>> geo_dbscan.labels_.collect()
     ID  CLUSTER_ID
0     1           0
1     2           0
2     3           0
......
28   29          -1
29   30          -1

>>> geo_dbscan.model_.collect()
    ROW_INDEX    MODEL_CONTENT
0      0         {"Algorithm":"DBSCAN","Cluster":[{"ClusterID":...

Perform fit_predict on the given data:

>>> result = geo_dbscan.fit_predict(df, key='ID')

Expected output:

>>> result.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
......
28  29          -1
29  30          -1
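Once the labels have been collected, a common next step is to count how many points fall into each cluster. The sketch below does this on plain (ID, CLUSTER_ID) pairs. The helper name summarize_labels and the sample rows are hypothetical, and treating CLUSTER_ID -1 as noise is an assumption based on the usual DBSCAN convention rather than something stated in this section.

```python
from collections import Counter


def summarize_labels(rows, noise_label=-1):
    """Return ({cluster_id: size}, noise_count) for (ID, CLUSTER_ID) pairs."""
    counts = Counter(cluster_id for _, cluster_id in rows)
    # Assumption: noise_label (-1) marks points that belong to no cluster.
    noise = counts.pop(noise_label, 0)
    return dict(sorted(counts.items())), noise


# Made-up rows, not the contents of PAL_GEO_DBSCAN_DATA_TBL.
rows = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, -1)]
sizes, noise = summarize_labels(rows)
print(sizes, noise)  # {0: 3, 1: 2} 1
```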

Attributes:

labels_ : DataFrame

Label assigned to each sample.

model_ : DataFrame

Model content. Set to None if save_model is False.

Methods

`fit`(data[, key, features])	Fit the Geometry DBSCAN model when given the training dataset.
`fit_predict`(data[, key, features])	Fit with the dataset and return the labels.

fit(data, key=None, features=None)

Fit the Geometry DBSCAN model when given the training dataset.

Parameters:

data : DataFrame

DataFrame containing the data for applying geometry DBSCAN.

It must contain at least two columns: one ID column, and another for storing 2-D geometry points.

key : str, optional

Name of the ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

features : str, optional

Name of the column for storing geometry points.

If not provided, it defaults to the first non-key column.
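The defaulting rules for key and features described above can be sketched as a small stand-alone function. This is an illustration of the documented behaviour only, not the library's code: the name resolve_columns and its arguments are hypothetical, with columns standing in for the DataFrame's column names and index for its index column, if any.

```python
def resolve_columns(columns, index=None, key=None, features=None):
    """Pick the key and feature column names following the documented defaults."""
    if key is None:
        key = index  # fall back to the index column when one is set
    if key is None:
        raise ValueError("key must be provided when data has no index column")
    if key not in columns:
        raise ValueError(f"key column not found: {key}")
    if features is None:
        non_key = [name for name in columns if name != key]
        if not non_key:
            raise ValueError("data needs a geometry column besides the key")
        features = non_key[0]  # first non-key column
    return key, features


print(resolve_columns(["ID", "POINT"], key="ID"))    # ('ID', 'POINT')
print(resolve_columns(["ID", "POINT"], index="ID"))  # ('ID', 'POINT')
```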

fit_predict(data, key=None, features=None)

Fit with the dataset and return the labels.

Parameters:

data : DataFrame

DataFrame containing the data. The structure is as follows.

It must contain at least two columns: one ID column, and another for storing 2-D geometry points.

key : str, optional

Name of the ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

features : str, optional

Name of the column for storing 2-D geometry points.

If not provided, it defaults to the first non-key column.

Returns:

DataFrame: Label assigned to each sample.

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

In addition to the methods described above, the GeometryDBSCAN class inherits methods from the PALBase class. Refer to PAL Base for details.