HDBSCAN
- class hana_ml.algorithms.pal.clustering.HDBSCAN(thread_ratio=None, min_cluster_size=None, max_cluster_size=None, min_sample=None, cluster_selection_eps=None, allow_single_cluster=False, metric=None, minkowski_power=None, category_weights=None, algorithm=None, min_leaf_kdtree=None)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm. HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is also based on the idea of density, but applies it hierarchically. It first builds a hierarchy of clusters based on the density of points, which covers all possible ways of splitting the points at different density levels. Then, instead of selecting clusters at a fixed density as DBSCAN does, HDBSCAN selects a set of flat clusters from that hierarchy so as to maximize a measure called stability. Because of this selection method, HDBSCAN can extract clusters of different densities, which makes it more robust. It also removes the "radius" hyper-parameter of DBSCAN, which is difficult to choose when deciding on a proper density.
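In the HDBSCAN literature, the density hierarchy described above is commonly built on a "mutual reachability" distance. The following is a minimal sketch of that distance in plain Python. It illustrates the general idea only and is not taken from the PAL implementation. The function names are hypothetical, and using min_sample as the neighbor count is an assumption.

```python
import math


def core_distances(points, min_sample):
    """Distance from each point to its min_sample-th nearest other point.

    Assumption: the point itself is not counted as a neighbor;
    implementations differ on this detail.
    """
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[min_sample - 1])
    return cores


def mutual_reachability(points, min_sample):
    """Pairwise max(core(a), core(b), d(a, b)), with 0 on the diagonal."""
    core = core_distances(points, min_sample)
    n = len(points)
    return [
        [
            0.0 if i == j
            else max(core[i], core[j], math.dist(points[i], points[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Three nearby points and one isolated point (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
mreach = mutual_reachability(points, min_sample=2)
```

Points in sparse regions have large core distances, so they are pushed away from everything else. This is what lets a hierarchy built on this distance separate dense groups from isolated points.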
- Parameters:
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 1.0.
- min_cluster_size : int, optional
Specifies the minimum number of points in a cluster.
Defaults to 5.
- max_cluster_size : int, optional
Specifies the maximum number of points in a cluster. 0 means no limitation. This size might be overridden in some cases.
Defaults to 0.
- min_sample : int, optional
A heuristic value to indicate the minimum density to form a cluster.
Defaults to the value of min_cluster_size.
- cluster_selection_eps : float, optional
Clusters of linking value less than this parameter will be merged. Used to merge small clusters. 0.0 means no such merging.
Defaults to 0.0.
- allow_single_cluster : bool, optional
Indicates whether to allow the whole dataset to form a single cluster.
Defaults to False.
- metric : {'Manhattan', 'Euclidean', 'Minkowski', 'Chebyshev'}, optional
Specifies the metric used to measure the distance between two points.
Defaults to 'Euclidean'.
- minkowski_power : float, optional
Specifies the power used by the Minkowski distance. Only applicable when metric is 'minkowski'.
Defaults to 3.0.
- category_weights : float, optional
Represents the weight of categorical variables.
Defaults to 0.707.
- algorithm : {'brute-force', 'kd-tree'}, optional
Specifies the method used to carry out HDBSCAN. Options are 'brute-force' and 'kd-tree'. A KD tree can be used to accelerate the process when N >> 2^D, where N is the number of points and D is the number of continuous dimensions of the data.
Defaults to 'brute-force'.
- min_leaf_kdtree : int, optional
KD tree related parameter to specify the minimum number of points contained in a leaf node.
Defaults to 16.
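To make the four metric options concrete, here is a small plain-Python sketch of the standard textbook definitions of these distances. The function names are hypothetical and the code is only an illustration; it does not show how PAL computes distances internally or how category_weights enters the calculation.

```python
def minkowski(a, b, power):
    """Minkowski distance of order `power` between two equal-length points."""
    return sum(abs(x - y) ** power for x, y in zip(a, b)) ** (1.0 / power)


def manhattan(a, b):
    """Sum of absolute coordinate differences (Minkowski with power 1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance (Minkowski with power 2)."""
    return minkowski(a, b, 2)


def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


a, b = (0.0, 0.0), (3.0, 4.0)  # made-up points
distances = {
    "Manhattan": manhattan(a, b),
    "Euclidean": euclidean(a, b),
    "Minkowski": minkowski(a, b, 3.0),  # 3.0 is the documented default power
    "Chebyshev": chebyshev(a, b),
}
```

For the same pair of points, the distance shrinks as the power grows: Manhattan is the largest and Chebyshev the smallest of the four.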
Examples
>>> hdbscan = HDBSCAN(min_cluster_size=3, min_sample=3, allow_single_cluster=True, algorithm='brute-force', metric='euclidean', thread_ratio=1.0)
Perform fit():
>>> hdbscan.fit(data=df, key='ID')
>>> hdbscan.labels_.collect()
Perform fit_predict():
>>> res = hdbscan.fit_predict(data=df, key='ID')
>>> res.collect()
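The constructor example above uses 'brute-force'. As a rough sketch of how the N >> 2^D guidance for the algorithm parameter could be turned into a choice, the hypothetical helper below picks a value from the number of points and continuous dimensions. The helper name and the factor of 10 used to stand in for "much greater than" are assumptions, not part of hana_ml.

```python
def choose_algorithm(n_points, n_continuous_dims, factor=10):
    """Return 'kd-tree' when N >= factor * 2^D, else 'brute-force'.

    `factor` is an illustrative stand-in for "much greater than".
    """
    if n_points >= factor * 2 ** n_continuous_dims:
        return "kd-tree"
    return "brute-force"


# Keyword arguments that could then be passed as HDBSCAN(**params).
params = {
    "min_cluster_size": 3,
    "algorithm": choose_algorithm(n_points=1_000_000, n_continuous_dims=3),
}
```

With many points in few dimensions the helper selects 'kd-tree'. With few points in many dimensions it falls back to 'brute-force', the documented default.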
- Attributes:
- labels_ : DataFrame
Label assigned to each sample.
Methods
fit(data[, key, features, categorical_variable])
Fit the HDBSCAN clustering to the training dataset.
fit_predict(data[, key, features, ...])
Invoke HDBSCAN on the given data and return the cluster labels.
- fit(data, key=None, features=None, categorical_variable=None)
Fit the HDBSCAN clustering to the training dataset.
- Parameters:
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- Returns:
- A fitted object of class "HDBSCAN".
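The defaulting rules for features and the two accepted forms of categorical_variable can be illustrated with a short plain-Python sketch. Both helpers and the column names are hypothetical. They mirror the documented behavior for illustration and are not hana_ml's internal code.

```python
def resolve_features(columns, key, features=None):
    """Mirror the documented default: all non-key columns when features is None."""
    if features is None:
        return [c for c in columns if c != key]
    return list(features)


def as_list(categorical_variable):
    """Accept a single column name or a list of names, as the parameter allows."""
    if categorical_variable is None:
        return []
    if isinstance(categorical_variable, str):
        return [categorical_variable]
    return list(categorical_variable)


columns = ["ID", "X", "Y", "GROUP"]  # hypothetical column names
default_features = resolve_features(columns, key="ID")
categorical = as_list("GROUP")
```

Here GROUP stands for an INTEGER column that should be treated as categorical. Any INTEGER feature not listed would be treated as continuous, as described above.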
- fit_predict(data, key=None, features=None, categorical_variable=None)
Invoke HDBSCAN on the given data and return the cluster labels.
- Parameters:
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.
- features : a list of str, optional
Names of the feature columns. Since SAP HANA Cloud 24 QRC03, the supported data types for features include the VECTOR type, in addition to the previously supported INTEGER, DOUBLE, DECIMAL(p, s), VARCHAR, and NVARCHAR.
If features is not provided, it defaults to all non-key columns, that is, every column except the key column is treated as a feature.
- categorical_variable : str or a list of str, optional
Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.
No default value.
- Returns:
- DataFrame
Label assigned to each sample.
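Once the returned labels have been collected to the client, they can be summarized locally. The sketch below groups sample IDs by cluster label in plain Python. It assumes each row is an (ID, label) pair, because the exact column names and order of the returned DataFrame are not stated here. The helper name and the data are made up.

```python
from collections import defaultdict


def group_by_label(rows):
    """Map each cluster label to the list of sample IDs assigned to it."""
    clusters = defaultdict(list)
    for sample_id, label in rows:
        clusters[label].append(sample_id)
    return dict(clusters)


rows = [(1, 0), (2, 0), (3, 1), (4, 1), (5, 1)]  # made-up (ID, label) pairs
clusters = group_by_label(rows)
sizes = {label: len(ids) for label, ids in clusters.items()}
```

With a collected pandas DataFrame, rows in this shape could be obtained with itertuples(index=False), provided the result really has exactly these two columns in this order.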
Inherited Methods from PALBase
Besides the methods mentioned above, the HDBSCAN class also inherits methods from the PALBase class. Please refer to PAL Base for more details.