DBSCAN
- class hana_ml.algorithms.pal.clustering.DBSCAN(minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, categorical_variable=None, category_weights=None, algorithm=None, save_model=True)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based data clustering algorithm that finds a number of clusters starting from the estimated density distribution of corresponding nodes.
It separates high-density regions from low-density ones, allowing it to discover clusters of arbitrary shape in data containing noise and outliers.
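The density idea described above can be illustrated with a small pure-Python sketch. It is a toy brute-force version for intuition only, not the PAL implementation: the function name `dbscan`, the constant `NOISE`, and the choice to count a point as its own neighbour are all assumptions made for this example.

```python
from math import dist

NOISE = -1  # label used in this sketch for points that join no cluster


def dbscan(points, eps, minpts):
    """Toy brute-force DBSCAN; returns one cluster label per point."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < minpts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= minpts:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels


points = [(0.10, 0.10), (0.11, 0.10), (0.10, 0.11),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (20.0, 20.0)]
labels = dbscan(points, eps=0.5, minpts=3)
print(labels)
```

The two dense groups each become a cluster, while the isolated point keeps the noise label.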
- Parameters:
- minpts: int, optional
Represents the minimum number of points required to form a cluster.
Note: minpts and eps must be provided together; otherwise both parameters are determined automatically.
- eps: float, optional
Specifies the scanning radius around a point.
Note: minpts and eps must be provided together; otherwise both parameters are determined automatically.
- thread_ratio: float, optional
Specifies the fraction of available threads to use, from 0 to 1. A value of 0 uses a single thread, while 1 uses all currently available threads. Values outside this range are ignored, and the number of threads is determined heuristically.
Defaults to heuristically determined.
- metric: {'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'standardized_euclidean', 'cosine'}, optional
Method used to compute the distance between two points.
Defaults to 'euclidean'.
- minkowski_power: int, optional
Specifies the power of the Minkowski distance. Only applicable when metric is 'minkowski'.
Defaults to 3.
- categorical_variable: str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- category_weights: float, optional
Represents the weight of category attributes.
Defaults to 0.707.
- algorithm: {'brute-force', 'kd-tree'}, optional
Represents the chosen method to search for neighbouring data points.
Defaults to 'kd-tree'.
- save_model: bool, optional
If True, the generated model is saved. save_model must be True in order to use predict().
Defaults to True.
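For reference, four of the metric options listed above can be written out using their textbook definitions. This is only a sketch: the helper name `distance` is hypothetical, PAL's internal computation may differ in detail, and 'standardized_euclidean' and 'cosine' are left out.

```python
def distance(p, q, metric="euclidean", minkowski_power=3):
    """Textbook distance between two equal-length numeric points."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if metric == "manhattan":
        return sum(diffs)
    if metric == "euclidean":
        return sum(d ** 2 for d in diffs) ** 0.5
    if metric == "minkowski":
        # Generalises manhattan (power 1) and euclidean (power 2).
        return sum(d ** minkowski_power for d in diffs) ** (1 / minkowski_power)
    if metric == "chebyshev":
        return max(diffs)
    raise ValueError(f"metric not covered by this sketch: {metric!r}")


p, q = (0, 0), (3, 4)
for name in ("manhattan", "euclidean", "minkowski", "chebyshev"):
    print(name, distance(p, q, metric=name))
```

The defaults in the sketch mirror the documented ones: 'euclidean' for metric and 3 for minkowski_power.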
Examples
Input DataFrame df:
>>> df.collect()
    ID     V1     V2 V3
0    1   0.10   0.10  B
1    2   0.11   0.10  A
...
28  29  20.11  20.12  C
29  30  15.12  15.11  A
Create a DBSCAN instance:
>>> dbscan = DBSCAN(thread_ratio=0.2, metric='manhattan')
Perform fit():
>>> dbscan.fit(data=df, key='ID')
Output:
>>> dbscan.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
...
27  28          -1
28  29          -1
29  30          -1
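Once the labels are available on the client, a quick summary of cluster sizes can be useful. The sketch below uses only the rows visible in the output above, written as plain (ID, CLUSTER_ID) tuples. It assumes that a CLUSTER_ID of -1 marks points that belong to no cluster, which is the usual DBSCAN convention but is not stated explicitly in this section.

```python
from collections import Counter

# (ID, CLUSTER_ID) pairs copied from the visible rows of the output above.
rows = [(1, 0), (2, 0), (3, 0), (28, -1), (29, -1), (30, -1)]

sizes = Counter(cluster_id for _, cluster_id in rows)
noise_count = sizes.pop(-1, 0)  # assumption: -1 means "not in any cluster"

print("cluster sizes:", dict(sizes))
print("unclustered points:", noise_count)
```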
- Attributes:
- labels_: DataFrame
Label assigned to each sample.
- model_: DataFrame
Model content. Set to None if save_model is False.
Methods
create_model_state([model, function, ...]): Create PAL model state.
delete_model_state([state]): Delete PAL model state.
fit(data[, key, features, ...]): Fit the model to the training dataset.
fit_predict(data[, key, features, ...]): Fit with the dataset and return the labels.
get_model_metrics(): Get the model metrics.
get_score_metrics(): Get the score metrics.
predict(data[, key, features]): Assign clusters to data based on a fitted model.
set_model_state(state): Set the model state by state information.
- fit(data, key=None, features=None, categorical_variable=None, string_variable=None, variable_weight=None)
Fit the model to the training dataset.
- Parameters:
- data: DataFrame
DataFrame containing the data.
- key: str, optional
Name of ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features: a list of str, optional
A list of names of the feature columns. Since SAP HANA Cloud 24 QRC03, the supported data types for features include the VECTOR type, in addition to the previously supported INTEGER, DOUBLE, DECIMAL(p, s), VARCHAR, and NVARCHAR.
If features is not provided, it defaults to all non-key columns.
- categorical_variable: str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- string_variable: str or a list of str, optional
Indicates a string column that stores non-categorical data. Levenshtein distance is used to calculate the similarity between two strings. Ignored if the column is not a string column.
Defaults to None.
- variable_weight: dict, optional
Specifies the weight of a variable participating in distance calculation. The value must be greater than or equal to 0. Variables not specified have a weight of 1.
Defaults to None.
- Returns:
- A fitted object of class "DBSCAN".
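The string_variable parameter above relies on Levenshtein distance, that is, the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The following is a standard dynamic-programming sketch of that distance. The function name `levenshtein` is chosen for this example, and how PAL converts the distance into a similarity score is not specified in this section.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using two rolling rows."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```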
- fit_predict(data, key=None, features=None, categorical_variable=None, string_variable=None, variable_weight=None)
Fit with the dataset and return the labels.
- Parameters:
- data: DataFrame
DataFrame containing the data.
- key: str, optional
Name of ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features: a list of str, optional
A list of names of the feature columns. Since SAP HANA Cloud 24 QRC03, the supported data types for features include the VECTOR type, in addition to the previously supported INTEGER, DOUBLE, DECIMAL(p, s), VARCHAR, and NVARCHAR.
If features is not provided, it defaults to all non-key columns.
- categorical_variable: str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- string_variable: str or a list of str, optional
Indicates a string column that stores non-categorical data. Levenshtein distance is used to calculate the similarity between two strings. Ignored if the column is not a string column.
Defaults to None.
- variable_weight: dict, optional
Specifies the weight of a variable participating in distance calculation. The value must be greater than or equal to 0.
Variables not specified have a weight of 1.
- Returns:
- DataFrame
Label assigned to each sample.
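To make the variable_weight rule concrete (non-negative weights, with unspecified variables weighing 1), here is a sketch of a weighted Manhattan distance. The helper name `weighted_manhattan` is hypothetical, and the assumption that each weight multiplies the per-variable absolute difference is made only for illustration; this section does not say where exactly PAL applies the weight.

```python
def weighted_manhattan(p, q, variable_weight=None):
    """Sum of per-variable absolute differences, each scaled by its weight.

    p and q map column names to numeric values; variable_weight maps
    column names to non-negative weights.
    """
    weights = variable_weight or {}
    total = 0.0
    for name in p:
        w = weights.get(name, 1.0)  # variables not specified weigh 1
        if w < 0:
            raise ValueError(f"weight for {name!r} must be >= 0")
        total += w * abs(p[name] - q[name])
    return total


p = {"V1": 0.0, "V2": 0.0}
q = {"V1": 1.0, "V2": 2.0}
print(weighted_manhattan(p, q))               # all weights default to 1
print(weighted_manhattan(p, q, {"V2": 0.5}))  # V2 counts half as much
```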
- predict(data, key=None, features=None)
Assign clusters to data based on a fitted model. This function does not support data of VECTOR type. The output structure of this method does not match that of fit_predict().
- Parameters:
- data: DataFrame
Data points to match against computed clusters.
The column structure of this DataFrame should match that of the data used for fit().
- key: str, optional
Name of ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features: a list of str, optional
Names of feature columns.
If features is not provided, it defaults to all non-key columns.
- Returns:
- DataFrame
Cluster assignment results, with 3 columns:
Data point ID, with name and type taken from the input ID column.
CLUSTER_ID, type INTEGER, representing the cluster the data point is assigned to.
DISTANCE, type DOUBLE, representing the distance between the data point and the nearest core point.
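The relationship between CLUSTER_ID and DISTANCE in the result can be pictured with a nearest-core-point lookup. The sketch below is not the PAL procedure. The function name `assign`, the Euclidean metric, and the rule that a point farther than eps from every core point gets -1 are all assumptions made for this example.

```python
from math import dist


def assign(point, core_points, eps):
    """Return (cluster_id, distance) based on the nearest core point.

    core_points is a non-empty list of (coordinates, cluster_id) pairs.
    """
    cluster_id, nearest = min(
        ((cid, dist(point, core)) for core, cid in core_points),
        key=lambda pair: pair[1],
    )
    if nearest > eps:
        return -1, nearest  # assumption: too far from any core point
    return cluster_id, nearest


core_points = [((0.0, 0.0), 0), ((10.0, 10.0), 1)]
print(assign((0.0, 0.3), core_points, eps=0.5))
print(assign((5.0, 5.0), core_points, eps=0.5))
```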
- create_model_state(model=None, function=None, pal_funcname='PAL_CLUSTER_ASSIGNMENT', state_description=None, force=False)
Create PAL model state.
- Parameters:
- model: DataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- function: str, optional
Specify the function in the unified API.
A placeholder parameter, not effective for cluster assignment.
- pal_funcname: int or str, optional
PAL function name. Must be a valid PAL procedure that supports model state.
Defaults to 'PAL_CLUSTER_ASSIGNMENT'.
- state_description: str, optional
Description of the state as model container.
Defaults to None.
- force: bool, optional
If True, the existing state is deleted.
Defaults to False.
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- state: DataFrame, optional
Specifies the state.
Defaults to self.state.
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100), which must include STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values corresponding to NAME.
If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
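When passing a dict, a quick check for the four required keys before calling set_model_state() can catch mistakes early. The helper `missing_state_keys` below is hypothetical and not part of hana_ml, and the values in the example dict are placeholders.

```python
REQUIRED_STATE_KEYS = {"STATE_ID", "HINT", "HOST", "PORT"}


def missing_state_keys(state: dict) -> set:
    """Return the required keys that are absent from a state dict."""
    return REQUIRED_STATE_KEYS - set(state)


# Placeholder values for illustration only.
state = {"STATE_ID": "<state-id>", "HINT": "<hint>",
         "HOST": "<host>", "PORT": "<port>"}
print(missing_state_keys(state))                # empty set: all keys present
print(sorted(missing_state_keys({"STATE_ID": "<state-id>"})))
```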
Inherited Methods from PALBase
Besides the methods mentioned above, the DBSCAN class also inherits methods from the PALBase class. Please refer to PAL Base for more details.