GeometryDBSCAN
- class hana_ml.algorithms.pal.clustering.GeometryDBSCAN(minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, algorithm=None, save_model=True)
This class is a geometry version of DBSCAN that accepts only geometry points as input data. Currently, only 2-D points are supported.
- Parameters
- minpts : int, optional
The minimum number of points required to form a cluster.
Note
minpts and eps must be provided together by the user; otherwise, both parameters are determined automatically.
- eps : float, optional
The scan radius.
Note
minpts and eps must be provided together by the user; otherwise, both parameters are determined automatically.
- thread_ratio : float, optional
Controls the proportion of available threads to use.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to -1.
- metric : {'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'standardized_euclidean', 'cosine'}, optional
Ways to compute the distance between two points.
Defaults to 'euclidean'.
- minkowski_power : int, optional
When 'minkowski' is chosen for metric, this parameter controls the value of the power.
Only applicable when metric is 'minkowski'.
Defaults to 3.
- algorithm : {'brute-force', 'kd-tree'}, optional
Ways to search for neighbours.
Defaults to 'kd-tree'.
- save_model : bool, optional
If True, the generated model will be saved.
save_model must be True in order to call predict().
Defaults to True.
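To make the relationship between metric and minkowski_power concrete, here is a minimal pure-Python sketch of the Minkowski distance between two 2-D points. The function name minkowski_distance is hypothetical and is not part of hana_ml; the sketch only illustrates the general formula and makes no claim about how PAL computes distances internally.

```python
def minkowski_distance(p, q, power=3):
    """Minkowski distance between two points given as coordinate tuples.

    Illustrative only: `power` plays the role that minkowski_power plays
    in the parameter list above.
    """
    return sum(abs(a - b) ** power for a, b in zip(p, q)) ** (1.0 / power)


a, b = (0.0, 0.0), (3.0, 4.0)

manhattan = minkowski_distance(a, b, power=1)  # power 1 gives the Manhattan distance
euclidean = minkowski_distance(a, b, power=2)  # power 2 gives the Euclidean distance
cubic = minkowski_distance(a, b, power=3)      # the documented default power

# As the power grows, the value approaches the Chebyshev distance,
# which is the largest per-coordinate difference.
chebyshev = max(abs(x - y) for x, y in zip(a, b))

print(manhattan, euclidean, cubic, chebyshev)
```

For a fixed pair of points, a larger power never increases the distance, so the same eps admits more neighbours under power 3 than under power 1.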
Examples
In SAP HANA, the test table PAL_GEO_DBSCAN_DATA_TBL can be created by the following SQL:
CREATE COLUMN TABLE PAL_GEO_DBSCAN_DATA_TBL ( "ID" INTEGER, "POINT" ST_GEOMETRY);
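The table above starts out empty. As a rough sketch of how rows could be prepared, the following Python builds INSERT statements that express each 2-D point as Well-Known Text. The helper names point_wkt and insert_sql and the sample coordinates are hypothetical, and the sketch assumes that the SAP HANA spatial function ST_GeomFromText is available in the target system; it only produces SQL strings and does not execute them.

```python
def point_wkt(x, y):
    """Format a 2-D point as Well-Known Text, e.g. 'POINT (0.5 1.0)'."""
    return f"POINT ({x} {y})"


def insert_sql(row_id, x, y, table="PAL_GEO_DBSCAN_DATA_TBL"):
    """Build one INSERT statement for the (ID, POINT) table layout shown above.

    Assumption: ST_GeomFromText converts the WKT string into an ST_GEOMETRY value.
    """
    return (
        f"INSERT INTO {table} VALUES "
        f"({row_id}, ST_GeomFromText('{point_wkt(x, y)}'))"
    )


# Hypothetical sample data: (ID, x, y)
sample_points = [(1, 0.5, 0.5), (2, 0.5, 1.0), (3, 9.0, 9.5)]
statements = [insert_sql(i, x, y) for i, x, y in sample_points]

for stmt in statements:
    print(stmt)
```

Interpolating values into SQL strings is acceptable for a fixed numeric test set like this one; for untrusted input, parameterized statements would be the safer choice.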
Then, create the input DataFrame df for clustering:
>>> df = conn.table("PAL_GEO_DBSCAN_DATA_TBL")
Create a GeometryDBSCAN instance:
>>> geo_dbscan = GeometryDBSCAN(thread_ratio=0.2, metric='manhattan')
Perform fit on the given data:
>>> geo_dbscan.fit(data=df, key='ID')
Expected output:
>>> geo_dbscan.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
......
28  29          -1
29  30          -1
>>> geo_dbscan.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          0  {"Algorithm":"DBSCAN","Cluster":[{"ClusterID":...
Perform fit_predict on the given data:
>>> result = geo_dbscan.fit_predict(df, key='ID')
Expected output:
>>> result.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
......
28  29          -1
29  30          -1
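In the output above, CLUSTER_ID values of 0 and higher identify clusters, while -1 marks points treated as noise. The following pure-Python sketch of textbook DBSCAN with a brute-force neighbour search shows how eps and minpts produce such labels. It is an illustration under stated assumptions, not PAL's implementation: the function dbscan and the sample points are hypothetical, the Euclidean metric is used, and a point is assumed to count towards its own neighbourhood.

```python
import math

NOISE = -1


def dbscan(points, eps, minpts):
    """Return one label per point: a cluster ID (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # Brute-force scan: indices of all points within eps of points[i].
        # Assumption: the point itself is included in its own neighbourhood.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        seeds = neighbours(i)
        if len(seeds) < minpts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            reachable = neighbours(j)
            if len(reachable) >= minpts:
                queue.extend(reachable)  # j is a core point, so keep expanding
        cluster_id += 1
    return labels


# Hypothetical sample: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (5, 5)]
labels = dbscan(points, eps=1.5, minpts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

The isolated point at (5, 5) has no neighbour within eps other than itself, so it never reaches minpts and keeps the noise label.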
- Attributes
- labels_ : DataFrame
Label assigned to each sample.
- model_ : DataFrame
Model content. Set to None if save_model is False.
Methods
fit(data[, key, features]): Fit the Geometry DBSCAN model when given the training dataset.
fit_predict(data[, key, features]): Fit with the dataset and return the labels.
- fit(data, key=None, features=None)
Fit the Geometry DBSCAN model when given the training dataset.
- Parameters
- data : DataFrame
DataFrame containing the data for applying geometry DBSCAN.
It must contain at least two columns: one ID column, and another for storing 2-D geometry points.
- key : str, optional
Name of the ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features : str, optional
Name of the column for storing 2-D geometry points.
If not provided, it defaults to the first non-key column.
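The defaulting rules for key and features described above can be summarised in a small sketch. The helper resolve_columns is hypothetical and is not part of hana_ml; it only mirrors the documented behaviour on a plain list of column names, with index standing in for the DataFrame's index column.

```python
def resolve_columns(columns, key=None, index=None, features=None):
    """Pick the ID column and the geometry column following the documented defaults.

    columns  -- column names of the input data, in order
    key      -- explicit ID column name, if given
    index    -- index column of the data, if one is set
    features -- explicit geometry column name, if given
    """
    if key is None:
        key = index  # fall back to the index column when key is omitted
    if key is None:
        raise ValueError("key must be provided when data has no index column")
    if features is None:
        non_key = [c for c in columns if c != key]
        if not non_key:
            raise ValueError("data needs a geometry column besides the ID column")
        features = non_key[0]  # first non-key column
    return key, features


print(resolve_columns(["ID", "POINT"], key="ID"))  # ('ID', 'POINT')
print(resolve_columns(["ID", "NOTE", "POINT"], index="ID", features="POINT"))  # ('ID', 'POINT')
```

When the table holds extra columns ahead of the geometry column, as in the second call, passing features explicitly avoids picking up the wrong column.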
- fit_predict(data, key=None, features=None)
Fit with the dataset and return the labels.
- Parameters
- data : DataFrame
DataFrame containing the data for applying geometry DBSCAN.
It must contain at least two columns: one ID column, and another for storing 2-D geometry points.
- key : str, optional
Name of the ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features : str, optional
Name of the column for storing 2-D geometry points.
If not provided, it defaults to the first non-key column.
- Returns
- DataFrame
Label assigned to each sample.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.