DBSCAN
- class hana_ml.algorithms.pal.clustering.DBSCAN(minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, categorical_variable=None, category_weights=None, algorithm=None, save_model=True)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based data clustering algorithm that finds a number of clusters starting from the estimated density distribution of corresponding nodes.
- Parameters
- minpts : int, optional
The minimum number of points required to form a cluster.
Note
minpts and eps need to be provided together by the user, or both are determined automatically.
- eps : float, optional
The scan radius.
Note
minpts and eps need to be provided together by the user, or both are determined automatically.
- thread_ratio : float, optional
Controls the proportion of available threads to use.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.
Values between 0 and 1 will use that percentage of available threads.
Values outside this range tell PAL to heuristically determine the number of threads to use.
Defaults to heuristically determined.
- metric : {'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'standardized_euclidean', 'cosine'}, optional
Ways to compute the distance between two points.
Defaults to 'euclidean'.
- minkowski_power : int, optional
Controls the value of the power when computing the Minkowski distance. Only applicable when metric is 'minkowski'.
Defaults to 3.
- categorical_variable : str or a list of str, optional
Specifies column(s) in the data that should be treated as categorical.
Defaults to None.
- category_weights : float, optional
Represents the weight of category attributes.
Defaults to 0.707.
- algorithm : {'brute-force', 'kd-tree'}, optional
Ways to search for neighbours.
Defaults to 'kd-tree'.
- save_model : bool, optional
If True, the generated model will be saved.
save_model must be True in order to call predict().
Defaults to True.
Examples
Input dataframe df for clustering:
>>> df.collect()
    ID     V1     V2 V3
0    1   0.10   0.10  B
1    2   0.11   0.10  A
2    3   0.10   0.11  C
3    4   0.11   0.11  B
4    5   0.12   0.11  A
5    6   0.11   0.12  E
6    7   0.12   0.12  A
7    8   0.12   0.13  C
8    9   0.13   0.12  D
9   10   0.13   0.13  D
10  11   0.13   0.14  A
11  12   0.14   0.13  C
12  13  10.10  10.10  A
13  14  10.11  10.10  F
14  15  10.10  10.11  E
15  16  10.11  10.11  E
16  17  10.11  10.12  A
17  18  10.12  10.11  B
18  19  10.12  10.12  B
19  20  10.12  10.13  D
20  21  10.13  10.12  F
21  22  10.13  10.13  A
22  23  10.13  10.14  A
23  24  10.14  10.13  D
24  25   4.10   4.10  A
25  26   7.11   7.10  C
26  27  -3.10  -3.11  C
27  28  16.11  16.11  A
28  29  20.11  20.12  C
29  30  15.12  15.11  A
Create a DBSCAN instance:
>>> dbscan = DBSCAN(thread_ratio=0.2, metric='manhattan')
Perform fit on the given data:
>>> dbscan.fit(data=df, key='ID')
Expected output:
>>> dbscan.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
3    4           0
4    5           0
5    6           0
6    7           0
7    8           0
8    9           0
9   10           0
10  11           0
11  12           0
12  13           1
13  14           1
14  15           1
15  16           1
16  17           1
17  18           1
18  19           1
19  20           1
20  21           1
21  22           1
22  23           1
23  24           1
24  25          -1
25  26          -1
26  27          -1
27  28          -1
28  29          -1
29  30          -1
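In the example above, minpts and eps are left unset and determined automatically. A minimal sketch with both parameters supplied explicitly (the values below are illustrative assumptions, not tuned for this data):
>>> dbscan_explicit = DBSCAN(minpts=5, eps=0.5, metric='manhattan')
>>> dbscan_explicit.fit(data=df, key='ID')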
- Attributes
- labels_ : DataFrame
Label assigned to each sample.
- model_ : DataFrame
Model content. Set to None if save_model is False.
Methods
- create_model_state([model, function, ...]): Create PAL model state.
- delete_model_state([state]): Delete PAL model state.
- fit(data[, key, features, ...]): Fit the DBSCAN model when given the training dataset.
- fit_predict(data[, key, features, ...]): Fit with the dataset and return the labels.
- predict(data[, key, features]): Assign clusters to data based on a fitted model.
- set_model_state(state): Set the model state by state information.
- fit(data, key=None, features=None, categorical_variable=None, string_variable=None, variable_weight=None)
Fit the DBSCAN model when given the training dataset.
- Parameters
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not set, key must be provided.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- categorical_variable : str or a list of str, optional
Specifies INTEGER column(s) that should be treated as categorical.
Other INTEGER columns will be treated as continuous.
Defaults to None.
- string_variable : str or a list of str, optional
Indicates a string column storing non-categorical data. Levenshtein distance is used to calculate the similarity between two strings. Ignored if the column is not a string column.
Defaults to None.
- variable_weight : dict, optional
Specifies the weight of a variable participating in the distance calculation. The value must be greater than or equal to 0. Defaults to 1 for variables not specified.
Defaults to None.
- Returns
- A fitted object of class "DBSCAN".
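A minimal sketch of a fit() call using the optional distance-related parameters; the DataFrame df_train, the column names 'CATEGORY' and 'COMMENT', and the weight value are assumptions for illustration only:
>>> dbscan = DBSCAN(minpts=5, eps=0.5)
>>> dbscan.fit(data=df_train, key='ID',
...            categorical_variable='CATEGORY',    # INTEGER column treated as categorical
...            string_variable='COMMENT',          # string column compared via Levenshtein distance
...            variable_weight={'COMMENT': 0.5})   # weight of 'COMMENT' in the distance calculation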
- fit_predict(data, key=None, features=None, categorical_variable=None, string_variable=None, variable_weight=None)
Fit with the dataset and return the labels.
- Parameters
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not set, key must be provided.
- features : str or a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- categorical_variable : str or a list of str, optional
Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.
Defaults to None.
- string_variable : str or a list of str, optional
Indicates a string column storing non-categorical data. Levenshtein distance is used to calculate the similarity between two strings. Ignored if the column is not a string column.
Defaults to None.
- variable_weight : dict, optional
Specifies the weight of a variable participating in the distance calculation. The value must be greater than or equal to 0.
Defaults to 1 for variables not specified.
- Returns
- DataFrame
Label assigned to each sample.
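A minimal sketch of fit_predict(), which is equivalent to calling fit() and then reading the labels_ attribute (reusing the df from the Examples section above):
>>> labels = dbscan.fit_predict(data=df, key='ID')
>>> labels.collect()    # same content as dbscan.labels_.collect()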
- predict(data, key=None, features=None)
Assign clusters to data based on a fitted model. The output structure of this method does not match that of fit_predict().
- Parameters
- data : DataFrame
Data points to match against computed clusters.
This dataframe's column structure should match that of the data used for fit().
- key : str, optional
Name of the ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not set, key must be provided.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- Returns
- DataFrame
Cluster assignment results, with 3 columns:
- ID: the data point ID, with name and type taken from the input ID column.
- CLUSTER_ID: type INTEGER, the ID of the cluster the data point is assigned to.
- DISTANCE: type DOUBLE, the distance between the data point and the nearest core point.
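A minimal sketch of predict() on a fitted model; df_new is a hypothetical DataFrame whose column structure matches the training data, and save_model must have been left at its default of True:
>>> assignments = dbscan.predict(data=df_new, key='ID')
>>> assignments.collect()    # columns: ID, CLUSTER_ID, DISTANCE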
- create_model_state(model=None, function=None, pal_funcname='PAL_CLUSTER_ASSIGNMENT', state_description=None, force=False)
Create PAL model state.
- Parameters
- model : DataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- function : str, optional
Specify the function in the unified API.
A placeholder parameter, not effective for cluster assignment.
- pal_funcname : int or str, optional
PAL function name. Must be a valid PAL procedure that supports model state.
Defaults to 'PAL_CLUSTER_ASSIGNMENT'.
- state_description : str, optional
Description of the state as model container.
Defaults to None.
- force : bool, optional
If True, the existing state will be deleted.
Defaults to False.
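A minimal sketch of the model-state workflow, assuming dbscan has already been fitted and that the default arguments are sufficient:
>>> dbscan.create_model_state()    # create a PAL model state from dbscan.model_
>>> dbscan.delete_model_state()    # delete the state (defaults to dbscan.state) when no longer needed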
- delete_model_state(state=None)
Delete PAL model state.
- Parameters
- state : DataFrame, optional
Specifies the state.
Defaults to self.state.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.
- set_model_state(state)
Set the model state by state information.
- Parameters
- state : DataFrame or dict
If state is a DataFrame, it has the following structure:
- NAME: VARCHAR(100), must contain the values STATE_ID, HINT, HOST and PORT.
- VALUE: VARCHAR(1000), the value corresponding to NAME.
If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
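A minimal sketch of passing a dict to set_model_state(); all values below are placeholders, not real state or connection details:
>>> state_info = {'STATE_ID': '<state id>',
...               'HINT': '<hint>',
...               'HOST': '<hana host>',
...               'PORT': '<sql port>'}
>>> dbscan.set_model_state(state_info)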