DBSCAN

class hana_ml.algorithms.pal.clustering.DBSCAN(minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, categorical_variable=None, category_weights=None, algorithm=None, save_model=True)

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based data clustering algorithm that finds a number of clusters starting from the estimated density distribution of corresponding nodes.

Parameters
minptsint, optional

The minimum number of points required to form a cluster.

Note

minpts and eps need to be provided together by user or these two parameters are automatically determined.

epsfloat, optional

The scan radius.

Note

minpts and eps need to be provided together by user or these two parameters are automatically determined.

thread_ratiofloat, optional

Controls the proportion of available threads to use.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside this range tell PAL to heuristically determine the number of threads to use.

Defaults to heuristically determined.

metric{'manhattan', 'euclidean', 'minkowski', 'chebyshev', 'standardized_euclidean', 'cosine'}, optional

Ways to compute the distance between two points.

Defaults to 'euclidean'.

minkowski_powerint, optional

When minkowski is chosen for metric, this parameter controls the value of power. Only applicable when metric is minkowski.

Defaults to 3.

categorical_variablestr or a list of str, optional

Specifies column(s) in the data that should be treated as categorical.

Defaults to None.

category_weightsfloat, optional

Represents the weight of category attributes.

Defaults to 0.707.

algorithm{'brute-force', 'kd-tree'}, optional

Ways to search for neighbours.

Defaults to 'kd-tree'.

save_modelbool, optional

If true, the generated model will be saved.

save_model must be True in order to call predict().

Defaults to True.

Examples

Input dataframe df for clustering:

>>> df.collect()
    ID     V1     V2 V3
0    1   0.10   0.10  B
1    2   0.11   0.10  A
2    3   0.10   0.11  C
3    4   0.11   0.11  B
4    5   0.12   0.11  A
5    6   0.11   0.12  E
6    7   0.12   0.12  A
7    8   0.12   0.13  C
8    9   0.13   0.12  D
9   10   0.13   0.13  D
10  11   0.13   0.14  A
11  12   0.14   0.13  C
12  13  10.10  10.10  A
13  14  10.11  10.10  F
14  15  10.10  10.11  E
15  16  10.11  10.11  E
16  17  10.11  10.12  A
17  18  10.12  10.11  B
18  19  10.12  10.12  B
19  20  10.12  10.13  D
20  21  10.13  10.12  F
21  22  10.13  10.13  A
22  23  10.13  10.14  A
23  24  10.14  10.13  D
24  25   4.10   4.10  A
25  26   7.11   7.10  C
26  27  -3.10  -3.11  C
27  28  16.11  16.11  A
28  29  20.11  20.12  C
29  30  15.12  15.11  A

Create DSBCAN instance:

>>> dbscan = DBSCAN(thread_ratio=0.2, metric='manhattan')

Perform fit on the given data:

>>> dbscan.fit(data=df, key='ID')

Expected output:

>>> dbscan.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
3    4           0
4    5           0
5    6           0
6    7           0
7    8           0
8    9           0
9   10           0
10  11           0
11  12           0
12  13           1
13  14           1
14  15           1
15  16           1
16  17           1
17  18           1
18  19           1
19  20           1
20  21           1
21  22           1
22  23           1
23  24           1
24  25          -1
25  26          -1
26  27          -1
27  28          -1
28  29          -1
29  30          -1
Attributes
labels_DataFrame

Label assigned to each sample.

model_DataFrame

Model content. Set to None if save_model is False.

Methods

create_model_state([model, function, ...])

Create PAL model state.

delete_model_state([state])

Delete PAL model state.

fit(data[, key, features, ...])

Fit the DBSCAN model when given the training dataset.

fit_predict(data[, key, features, ...])

Fit with the dataset and return the labels.

predict(data[, key, features])

Assign clusters to data based on a fitted model.

set_model_state(state)

Set the model state by state information.

fit(data, key=None, features=None, categorical_variable=None, string_variable=None, variable_weight=None)

Fit the DBSCAN model when given the training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical.

Other INTEGER columns will be treated as continuous.

Defaults to None.

string_variablestr or a list of str, optional

Indicates a string column storing not categorical data. Levenshtein distance is used to calculate similarity between two strings. Ignored if it is not a string column.

Defaults to None.

variable_weightdict, optional

Specifies the weight of a variable participating in distance calculation. The value must be greater or equal to 0. Defaults to 1 for variables not specified.

Defaults to None.

Returns
A fitted object of class "DBSCAN".
fit_predict(data, key=None, features=None, categorical_variable=None, string_variable=None, variable_weight=None)

Fit with the dataset and return the labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresstr or a list of str, optional

Names of the features columns.

If features is not provided, it defaults to all the non-key columns.

categorical_variablestr or a list of str, optional

Specifies INTEGER column(s) that should be treated as categorical. Other INTEGER columns will be treated as continuous.

Defaults to None.

string_variablestr or a list of str, optional

Indicates a string column storing not categorical data. Levenshtein distance is used to calculate similarity between two strings. Ignored if it is not a string column.

Defaults to None.

variable_weightdict, optional

Specifies the weight of a variable participating in distance calculation. The value must be greater or equal to 0.

Defaults to 1 for variables not specified.

Returns
DataFrame

Label assigned to each sample.

predict(data, key=None, features=None)

Assign clusters to data based on a fitted model. The output structure of this method does not match that of fit_predict().

Parameters
dataDataFrame

Data points to match against computed clusters.

This dataframe's column structure should match that of the data used for fit().

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If the index column of data is not provided, please enter the value of key.

featuresa list of str, optional.

Names of feature columns.

If features is not provided, it defaults to all non-key columns.

Returns
DataFrame

Cluster assignment results, with 3 columns:

  • Data point ID, with name and type taken from the input ID column.

  • CLUSTER_ID, type INTEGER, representing the cluster the data point is assigned to.

  • DISTANCE, type DOUBLE, representing the distance between the data point and the nearest core point.

create_model_state(model=None, function=None, pal_funcname='PAL_CLUSTER_ASSIGNMENT', state_description=None, force=False)

Create PAL model state.

Parameters
modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

A placeholder parameter, not effective for cluster assignment.

pal_funcnameint or str, optional

PAL function name. Must be a valid PAL procedure that supports model state.

Defaults to 'PAL_CLUSTER_ASSIGNMENT'.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True it will delete the existing state.

Defaults to False.

delete_model_state(state=None)

Delete PAL model state.

Parameters
stateDataFrame, optional

Specified the state.

Defaults to self.state.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

set_model_state(state)

Set the model state by state information.

Parameters
state: DataFrame or dict

If state is DataFrame, it has the following structure:

  • NAME: VARCHAR(100), it mush have STATE_ID, HINT, HOST and PORT.

  • VALUE: VARCHAR(1000), the values according to NAME.

If state is dict, the key must have STATE_ID, HINT, HOST and PORT.

Inherited Methods from PALBase

Besides those methods mentioned above, the DBSCAN class also inherits methods from PALBase class, please refer to PAL Base for more details.