HDBSCAN

class hana_ml.algorithms.pal.clustering.HDBSCAN(thread_ratio=None, min_cluster_size=None, max_cluster_size=None, min_sample=None, cluster_selection_eps=None, allow_single_cluster=False, metric=None, minkowski_power=None, category_weights=None, algorithm=None, min_leaf_kdtree=None)

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm. HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a newer algorithm that is also based on density, but applies it hierarchically. It first builds a hierarchy of clusters from the density of the points, covering every possible way of splitting the points across different density levels. Then, instead of selecting clusters at one fixed density as DBSCAN does, HDBSCAN selects a set of flat clusters from that hierarchy so as to maximize a measure called stability. Because of this selection method, HDBSCAN can extract clusters of different densities, which makes it more robust. It also removes DBSCAN's "radius" hyper-parameter, for which a suitable value is difficult to choose.
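To make "density of points" more concrete, the sketch below computes the mutual reachability distance that the general HDBSCAN literature uses as the basis for the hierarchy. It is an illustration only: the page above does not say how PAL computes density internally, and the function names and the convention of excluding a point from its own neighbor list are assumptions made for this example.

```python
import math


def core_distances(points, min_sample):
    """Distance from each point to its min_sample-th nearest *other* point.

    Assumption for this sketch: a point is not counted as its own neighbor.
    """
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[min_sample - 1])
    return cores


def mutual_reachability(points, min_sample):
    """Pairwise max(core(a), core(b), d(a, b)); zero on the diagonal."""
    cores = core_distances(points, min_sample)
    n = len(points)
    return [
        [
            0.0 if i == j
            else max(cores[i], cores[j], math.dist(points[i], points[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Three points close together and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
cores = core_distances(points, min_sample=2)
mreach = mutual_reachability(points, min_sample=2)
```

Points in sparse regions have large core distances, so they are pushed away from everything else. A hierarchy built on these distances therefore separates dense groups from isolated points without a fixed radius.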

Parameters:

thread_ratio : float, optional

Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function determines the number of threads heuristically.

Defaults to 1.0.

min_cluster_size : int, optional

Specifies the minimum number of points in a cluster.

Defaults to 5.

max_cluster_size : int, optional

Specifies the maximum number of points in a cluster. 0 means no limitation. This size might be overridden in some cases.

Defaults to 0.

min_sample : int, optional

A heuristic value to indicate the minimum density to form a cluster.

Defaults to the value of min_cluster_size.

cluster_selection_eps : float, optional

Clusters whose linking value is less than this parameter are merged. This is used to merge small clusters. 0.0 means no such merging.

Defaults to 0.0.

allow_single_cluster : bool, optional

Indicates whether the whole dataset is allowed to form a single cluster.

Defaults to False.

metric : {'Manhattan', 'Euclidean', 'Minkowski', 'Chebyshev'}, optional

Specifies the metric used to measure the distance between two points.

Defaults to 'Euclidean'.

minkowski_power : float, optional

Specifies the power of the Minkowski distance. Only applicable when metric is 'minkowski'.

Defaults to 3.0.

category_weights : float, optional

Represents the weight of categorical variables.

Defaults to 0.707.

algorithm : {'brute-force', 'kd-tree'}, optional

Specifies the method used to run HDBSCAN. Options are 'brute-force' and 'kd-tree'. A KD tree can accelerate the process when N >> 2^D, where N is the number of points and D is the number of continuous dimensions of the data.

Defaults to 'brute-force'.

min_leaf_kdtree : int, optional

KD-tree parameter that specifies the minimum number of points contained in a leaf node.

Defaults to 16.
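As a reminder of what the four metric options compute, here is a minimal pure-Python sketch of the standard textbook definitions for continuous features. It does not reproduce PAL's implementation and does not model how category_weights affects categorical columns. The function names are illustrative only.

```python
def minkowski(a, b, power):
    """Minkowski distance; `power` plays the role of minkowski_power."""
    return sum(abs(x - y) ** power for x, y in zip(a, b)) ** (1.0 / power)


def manhattan(a, b):
    """Minkowski distance with power 1."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Minkowski distance with power 2."""
    return minkowski(a, b, 2)


def chebyshev(a, b):
    """Largest absolute difference along any single dimension."""
    return max(abs(x - y) for x, y in zip(a, b))


a, b = (0.0, 0.0), (3.0, 4.0)
d_manhattan = manhattan(a, b)      # 3 + 4
d_euclidean = euclidean(a, b)      # sqrt(9 + 16)
d_minkowski3 = minkowski(a, b, 3)  # cube root of 27 + 64
d_chebyshev = chebyshev(a, b)      # max(3, 4)
```

For the same pair of points the distance shrinks as the power grows, with Chebyshev as the limiting case.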

Attributes:

labels_ : DataFrame

Label assigned to each sample.

Methods

`fit`(data[, key, features, categorical_variable])	Fit the HDBSCAN clustering to the training dataset.
`fit_predict`(data[, key, features, ...])	Apply HDBSCAN to the given data and return the cluster labels.

Examples

>>> from hana_ml.algorithms.pal.clustering import HDBSCAN
>>> hdbscan = HDBSCAN(min_cluster_size=3, min_sample=3,
...                   allow_single_cluster=True, algorithm='brute-force',
...                   metric='euclidean', thread_ratio=1.0)

Perform fit():

>>> hdbscan.fit(data=df, key='ID')
>>> hdbscan.labels_.collect()

Perform fit_predict():

>>> res = hdbscan.fit_predict(data=df, key='ID')
>>> res.collect()
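The examples above assume that df already exists as a hana_ml DataFrame. The following sketch shows one way the pieces might fit together, starting from a table in SAP HANA. The helper names, the table name passed in, and the factor used to interpret "N >> 2^D" are all assumptions made for illustration. The documentation does not define a threshold for choosing 'kd-tree'.

```python
def choose_algorithm(n_points, n_dims, factor=100):
    """Pick 'kd-tree' when n_points is much larger than 2**n_dims.

    `factor` is an assumed reading of '>>'; tune it for your own data.
    """
    return 'kd-tree' if n_points >= factor * 2 ** n_dims else 'brute-force'


def cluster_table(conn, table_name, key='ID'):
    """Hypothetical helper: run HDBSCAN on a HANA table and return the labels.

    `conn` is expected to be a hana_ml.dataframe.ConnectionContext.
    """
    # Imported lazily so choose_algorithm() works without hana_ml installed.
    from hana_ml.algorithms.pal.clustering import HDBSCAN

    df = conn.table(table_name)
    n_points = df.count()
    # Approximation: treats every non-key column as a continuous dimension.
    n_dims = len(df.columns) - 1
    hdbscan = HDBSCAN(min_cluster_size=5,
                      algorithm=choose_algorithm(n_points, n_dims))
    return hdbscan.fit_predict(data=df, key=key)
```

With a connection created through hana_ml.dataframe.ConnectionContext, a call such as cluster_table(conn, 'MY_TABLE').collect() would return the labels as a pandas DataFrame, where 'MY_TABLE' is a placeholder table name.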

fit(data, key=None, features=None, categorical_variable=None)

Fit the HDBSCAN clustering to the training dataset.

Parameters:

data : DataFrame

DataFrame containing the data.

key : str, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. Otherwise, key must be specified.

features : a list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variable : str or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

Returns:

A fitted object of class "HDBSCAN".

fit_predict(data, key=None, features=None, categorical_variable=None)

Apply HDBSCAN to the given data and return the cluster labels.

Parameters:

data : DataFrame

DataFrame containing the data.

key : str, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. Otherwise, key must be specified.

features : a list of str, optional

Names of the feature columns. Since SAP HANA Cloud 24 QRC03, the supported data types for features include VECTOR TYPE, in addition to the previously supported INTEGER, DOUBLE, DECIMAL(p, s), VARCHAR, and NVARCHAR.

If features is not provided, it defaults to all non-key columns.

categorical_variable : str or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

Returns:

DataFrame: Label assigned to each sample.
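Once the labels have been collected, a common next step is to count the points per cluster. The sketch below works on plain (id, label) pairs and rests on two assumptions that this page does not confirm: that the result has exactly two columns in that order, and that noise points are labeled -1. Check both against your actual output.

```python
from collections import Counter

NOISE_LABEL = -1  # assumption: label used for noise points


def summarize_labels(rows):
    """Return ({cluster_label: size}, noise_count) from (id, label) pairs."""
    counts = Counter(label for _, label in rows)
    noise = counts.pop(NOISE_LABEL, 0)
    return dict(counts), noise


# Made-up rows for illustration, not output from a real run.
rows = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1), (7, -1)]
sizes, noise = summarize_labels(rows)
```

With a real result, the rows could be obtained from the collected pandas DataFrame, for example with res.collect().itertuples(index=False, name=None).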

Inherited Methods from PALBase

Besides the methods described above, the HDBSCAN class also inherits methods from the PALBase class. Refer to PAL Base for more details.