DBSCAN

class hana_ml.algorithms.pal.clustering.DBSCAN(minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, categorical_variable=None, category_weights=None, algorithm=None, save_model=True)

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based data clustering algorithm that finds a number of clusters starting from the estimated density distribution of corresponding nodes.

It separates high-density regions from low-density ones, allowing it to discover clusters of arbitrary shape in data containing noise and outliers.

Parameters:
minptsint, optional

Represents the minimum number of points required to form a cluster.

Note

minpts and eps need to be provided together by user or these two parameters are automatically determined.

epsfloat, optional

Specifies the scanning radius around a point.

Note

minpts and eps need to be provided together by user or these two parameters are automatically determined.

thread_ratiofloat, optional

Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all possible current threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to heuristically determined.

metric{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'standardized_euclidean', 'cosine'}, optional

Ways to compute the distance between two points.

Defaults to 'euclidean'.

minkowski_powerint, optional

When minkowski is chosen for metric, this parameter controls the value of power. Only applicable when metric is 'minkowski'.

Defaults to 3.

categorical_variablestr or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

algorithm{'brute-force', 'kd-tree'}, optional

Represents the chosen method to search for neighbouring data points.

Defaults to 'kd-tree'.

save_modelbool, optional

If set to true, the generated model will be saved. It must be mentioned that save_model has to be set to true in order to utilize the function predict().

Defaults to True.

Examples

Input DataFrame df:

>>> df.collect()
    ID     V1     V2 V3
0    1   0.10   0.10  B
1    2   0.11   0.10  A
...
28  29  20.11  20.12  C
29  30  15.12  15.11  A

Create a DSBCAN instance:

>>> dbscan = DBSCAN(thread_ratio=0.2, metric='manhattan')

Perform fit():

>>> dbscan.fit(data=df, key='ID')

Output:

>>> dbscan.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
...
27  28          -1
28  29          -1
29  30          -1
Attributes:
labels_DataFrame

Label assigned to each sample.

model_DataFrame

Model content. Set to None if save_model is False.

Methods

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

fit(data[, key, features, ...])

Fit the model to the training dataset.

fit_predict(data[, key, features, ...])

Fit with the dataset and return the labels.

predict(data[, key, features])

Assign clusters to data based on a fitted model.

set_model_state(state)

Set the model state by state information.

fit(data, key=None, features=None, categorical_variable=None, string_variable=None, variable_weight=None)

Fit the model to the training dataset.

Parameters:
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

string_variablestr or a a list of str, optional

Indicates a string column storing not categorical data. Levenshtein distance is used to calculate similarity between two strings. Ignored if it is not a string column.

Defaults to None.

variable_weightdict, optional

Specifies the weight of a variable participating in distance calculation. The value must be greater or equal to 0. Defaults to 1 for variables not specified.

Defaults to None.

Returns:
A fitted object of class "DBSCAN".
fit_predict(data, key=None, features=None, categorical_variable=None, string_variable=None, variable_weight=None)

Fit with the dataset and return the labels.

Parameters:
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresstr or a list of str, optional

Names of the features columns.

If features is not provided, it defaults to all the non-key columns.

categorical_variablestr or a list of str, optional

Specifies which INTEGER columns should be treated as categorical, with all other INTEGER columns treated as continuous.

No default value.

string_variablestr or a list of str, optional

Indicates a string column storing not categorical data. Levenshtein distance is used to calculate similarity between two strings. Ignored if it is not a string column.

Defaults to None.

variable_weightdict, optional

Specifies the weight of a variable participating in distance calculation. The value must be greater or equal to 0.

Defaults to 1 for variables not specified.

Returns:
DataFrame

Label assigned to each sample.

predict(data, key=None, features=None)

Assign clusters to data based on a fitted model. The output structure of this method does not match that of fit_predict().

Parameters:
dataDataFrame

Data points to match against computed clusters.

This dataframe's column structure should match that of the data used for fit().

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional.

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

Returns:
DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, type INTEGER, representing the cluster the data point is assigned to.

  • DISTANCE, type DOUBLE, representing the distance between the data point and the nearest core point.

create_model_state(model=None, function=None, pal_funcname='PAL_CLUSTER_ASSIGNMENT', state_description=None, force=False)

Create PAL model state.

Parameters:
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

A placeholder parameter, not effective for cluster assignment.

pal_funcnameint or str, optional

PAL function name. Must be a valid PAL procedure that supports model state.

Defaults to 'PAL_CLUSTER_ASSIGNMENT'.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True it will delete the existing state.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters:
stateDataFrame, optional

Specified the state.

Defaults to self.state.

set_model_state(state)

Set the model state by state information.

Parameters:
state: DataFrame or dict

If state is DataFrame, it has the following structure:

  • NAME: VARCHAR(100), it mush have STATE_ID, HINT, HOST and PORT.

  • VALUE: VARCHAR(1000), the values according to NAME.

If state is dict, the key must have STATE_ID, HINT, HOST and PORT.

Inherited Methods from PALBase

Besides those methods mentioned above, the DBSCAN class also inherits methods from PALBase class, please refer to PAL Base for more details.