GeometryDBSCAN
- class hana_ml.algorithms.pal.clustering.GeometryDBSCAN(minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, algorithm=None, save_model=True)
GeometryDBSCAN is a geometry version of DBSCAN, which only accepts geometry points as input data. It works with geospatial data where distances between points can be computed in a geometrically efficient manner. Currently GeometryDBSCAN only accepts 2D points.
- Parameters:
- minpts : int, optional
Represents the minimum number of points required to form a cluster.
Note: minpts and eps need to be provided together by the user; otherwise these two parameters are determined automatically.
- eps : float, optional
Specifies the scanning radius around a point.
Note: minpts and eps need to be provided together by the user; otherwise these two parameters are determined automatically.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to -1.
- metric : {'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'standardized_euclidean', 'cosine'}, optional
Defines the metric used to compute the distance between two points.
Defaults to 'euclidean'.
- minkowski_power : int, optional
When minkowski is chosen for metric, this parameter controls the value of power.
Only applicable when metric is 'minkowski'.
Defaults to 3.
- algorithm : {'brute-force', 'kd-tree'}, optional
Represents the chosen method to search for neighbouring data points.
Defaults to 'kd-tree'.
- save_model : bool, optional
If set to True, the generated model will be saved.
Note that save_model has to be set to True in order to use the function predict().
Defaults to True.
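To make the interplay of minpts, eps, metric and the 'brute-force' neighbour search more concrete, the following is a minimal pure-Python sketch of DBSCAN on 2-D points. It is an illustration only and not PAL's implementation: the names distance and dbscan_sketch are hypothetical, only four of the six documented metrics are covered, and counting a point as its own neighbour is an assumption of this sketch.

```python
import math
from collections import deque


def distance(p, q, metric="euclidean", minkowski_power=3):
    """Distance between two 2-D points for a few of the documented metrics."""
    dx, dy = abs(p[0] - q[0]), abs(p[1] - q[1])
    if metric == "manhattan":
        return dx + dy
    if metric == "euclidean":
        return math.hypot(dx, dy)
    if metric == "chebyshev":
        return max(dx, dy)
    if metric == "minkowski":
        return (dx ** minkowski_power + dy ** minkowski_power) ** (1.0 / minkowski_power)
    raise ValueError(f"metric not covered by this sketch: {metric!r}")


def dbscan_sketch(points, minpts, eps, metric="euclidean", minkowski_power=3):
    """Brute-force DBSCAN; returns one label per point, -1 meaning noise."""
    n = len(points)
    # Brute-force neighbour search: compare every point with every point
    # (a point counts as its own neighbour in this sketch).
    neighbours = [
        [j for j in range(n)
         if distance(points[i], points[j], metric, minkowski_power) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = not visited yet
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < minpts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        labels[i] = cluster_id
        queue = deque(neighbours[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reachable from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbours[j]) >= minpts:
                queue.extend(neighbours[j])  # j is a core point, keep expanding
        cluster_id += 1
    return labels


# Made-up sample: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan_sketch(points, minpts=3, eps=1.5))  # [0, 0, 0, 1, 1, 1, -1]
```

A smaller eps or a larger minpts makes it harder for a point to qualify as a core point, so more points end up labelled -1.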
Examples
In SAP HANA, the test table PAL_GEO_DBSCAN_DATA_TBL can be created by the following SQL:
CREATE COLUMN TABLE PAL_GEO_DBSCAN_DATA_TBL ( "ID" INTEGER, "POINT" ST_GEOMETRY);
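The table above is empty after creation. As a hedged sketch of how it might be populated, the helper below builds INSERT statements that encode each point as well-known text. The helper name point_insert_sql and the sample coordinates are hypothetical, and the use of the SAP HANA spatial function ST_GeomFromText with the column's default spatial reference system is an assumption that should be checked against your system.

```python
def point_insert_sql(table, rows):
    """Build one INSERT statement per (id, x, y) row, encoding the point as WKT."""
    statements = []
    for point_id, x, y in rows:
        wkt = f"POINT({x} {y})"
        statements.append(
            f"INSERT INTO {table} VALUES ({int(point_id)}, ST_GeomFromText('{wkt}'))"
        )
    return statements


# Made-up coordinates, not the PAL sample data.
sample_rows = [(1, 0.5, 0.5), (2, 1.5, 0.5)]
for statement in point_insert_sql("PAL_GEO_DBSCAN_DATA_TBL", sample_rows):
    print(statement)
```

The generated statements could then be executed through whatever SQL client or cursor is already in use.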
Input DataFrame df:
>>> df = conn.table("PAL_GEO_DBSCAN_DATA_TBL")
Create a GeometryDBSCAN instance:
>>> geo_dbscan = GeometryDBSCAN(thread_ratio=0.2, metric='manhattan')
Perform fit():
>>> geo_dbscan.fit(data=df, key='ID')
Output:
>>> geo_dbscan.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
......
28  29          -1
29  30          -1
>>> geo_dbscan.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          0  {"Algorithm":"DBSCAN","Cluster":[{"ClusterID":...
Perform fit_predict():
>>> result = geo_dbscan.fit_predict(data=df, key='ID')
Output:
>>> result.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
......
28  29          -1
29  30          -1
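Once the labels are collected, a common next step is to count how many points fell into each cluster. The sketch below does this with the standard library only. The function name summarize_labels and the sample rows are hypothetical, and it assumes that CLUSTER_ID -1 marks noise points, as is conventional for DBSCAN.

```python
from collections import Counter


def summarize_labels(rows):
    """Count points per cluster; rows are (ID, CLUSTER_ID) pairs."""
    sizes = Counter(cluster_id for _, cluster_id in rows)
    noise = sizes.pop(-1, 0)  # assumption: CLUSTER_ID -1 marks noise points
    return dict(sizes), noise


# Hypothetical rows in the shape of the output shown above; with hana_ml they
# could be taken from result.collect().itertuples(index=False).
rows = [(1, 0), (2, 0), (3, 0), (29, -1), (30, -1)]
cluster_sizes, noise_count = summarize_labels(rows)
print(cluster_sizes, noise_count)
```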
- Attributes:
- labels_ : DataFrame
Label assigned to each sample.
- model_ : DataFrame
Model content. Set to None if save_model is False.
Methods
fit(data[, key, features]): Fit the model to the training dataset.
fit_predict(data[, key, features]): Fit with the dataset and return the labels.
get_model_metrics(): Get the model metrics.
get_score_metrics(): Get the score metrics.
- fit(data, key=None, features=None)
Fit the model to the training dataset.
- Parameters:
- data : DataFrame
DataFrame containing the data.
It must contain at least two columns: an ID column and a column of type 'ST_GEOMETRY' that stores 2-D geometry points.
- key : str, optional
Name of the ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features : str, optional
Name of the column of type 'ST_GEOMETRY' that stores the geometry points.
If not provided, it defaults to the first non-key column.
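The default rules for key and features described above can be sketched as a small standalone function. This is only an illustration of the documented behaviour, not the library's code: the name resolve_columns and its arguments are hypothetical, with columns standing in for the column names of data and index for data.index.

```python
def resolve_columns(columns, index=None, key=None, features=None):
    """Pick the ID and geometry columns following the documented defaults."""
    if key is None:
        key = index  # fall back to the DataFrame's index column, if one is set
    if key is None:
        raise ValueError("key must be given when data has no index column")
    if key not in columns:
        raise ValueError(f"unknown key column: {key!r}")
    if features is None:
        non_key = [c for c in columns if c != key]
        if not non_key:
            raise ValueError("data needs a geometry column besides the key")
        features = non_key[0]  # first non-key column
    return key, features


print(resolve_columns(["ID", "POINT"], key="ID"))  # ('ID', 'POINT')
```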
- fit_predict(data, key=None, features=None)
Fit with the dataset and return the labels.
- Parameters:
- data : DataFrame
DataFrame containing the data. It must contain at least two columns: an ID column and a column that stores 2-D geometry points.
- key : str, optional
Name of the ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features : str, optional
Name of the column that stores the 2-D geometry points.
If not provided, it defaults to the first non-key column.
- Returns:
- DataFrame
Label assigned to each sample.
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
Besides the methods mentioned above, the GeometryDBSCAN class also inherits methods from the PALBase class. Please refer to PAL Base for more details.