AffinityPropagation

class hana_ml.algorithms.pal.clustering.AffinityPropagation(affinity, n_clusters, max_iter=None, convergence_iter=None, damping=None, preference=None, seed_ratio=None, times=None, minkowski_power=None, thread_ratio=None)

Affinity Propagation is an algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars. It operates by simultaneously considering all data points as potential exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges.
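The message exchange described above can be illustrated with a minimal pure-Python sketch of the textbook algorithm. It is not PAL's implementation: the function name `affinity_propagation` is hypothetical, similarity is assumed to be the negative squared Euclidean distance, and `preference` here is the raw self-similarity placed on the diagonal, which is an assumption and differs from the [0,1] parameter documented below. The `damping`, `max_iter` and `convergence_iter` arguments only mirror the roles described in the parameter list.

```python
def affinity_propagation(points, preference, damping=0.9,
                         max_iter=500, convergence_iter=100):
    """Textbook affinity propagation on a small list of numeric tuples."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    # Similarity = negative squared Euclidean distance; diagonal = preference.
    s = [[-sum((x - y) ** 2 for x, y in zip(p, q)) for q in points]
         for p in points]
    for k in range(n):
        s[k][k] = preference
    r = [[0.0] * n for _ in range(n)]  # responsibilities r[i][k]
    a = [[0.0] * n for _ in range(n)]  # availabilities a[i][k]
    exemplars, stable = [], 0
    for _ in range(max_iter):
        # Responsibility: how well k suits i compared with i's best alternative.
        for i in range(n):
            scores = [a[i][k] + s[i][k] for k in range(n)]
            for k in range(n):
                rival = max(v for j, v in enumerate(scores) if j != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - rival)
        # Availability: support that candidate k has collected from other points.
        for k in range(n):
            support = [max(0.0, r[i][k]) for i in range(n)]
            support[k] = r[k][k]
            total = sum(support)
            for i in range(n):
                new = total - support[i]
                if i != k:
                    new = min(0.0, new)
                a[i][k] = damping * a[i][k] + (1 - damping) * new
        # A point is an exemplar when its self-messages sum to a positive value.
        current = [k for k in range(n) if a[k][k] + r[k][k] > 0]
        stable = stable + 1 if current and current == exemplars else 0
        exemplars = current
        if stable >= convergence_iter:
            break
    if not exemplars:
        return [], [-1] * n
    labels = []
    for i in range(n):
        if i in exemplars:
            labels.append(exemplars.index(i))
        else:
            # Assign each remaining point to its most similar exemplar.
            labels.append(max(range(len(exemplars)),
                              key=lambda c: s[i][exemplars[c]]))
    return exemplars, labels
```

The sketch stops early once the exemplar set has stayed unchanged for `convergence_iter` consecutive iterations, and otherwise runs for `max_iter` iterations.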

Parameters:

affinity : {'manhattan', 'standardized_euclidean', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}

Ways to compute the distance between two points.
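As a rough illustration of what the metric names refer to, the sketch below gives common textbook definitions in pure Python. The function names are hypothetical helpers, not part of hana_ml. Two conventions are assumptions that PAL may not share: cosine distance is taken as one minus the cosine similarity, and standardized Euclidean distance divides each squared difference by a per-feature variance supplied by the caller.

```python
import math

def manhattan(p, q):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(p, q))

def euclidean(p, q):
    # Straight-line distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def minkowski(p, q, power=3):
    # Generalizes manhattan (power=1) and euclidean (power=2).
    return sum(abs(x - y) ** power for x, y in zip(p, q)) ** (1.0 / power)

def chebyshev(p, q):
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(p, q))

def cosine(p, q):
    # Assumed convention: 1 - cosine similarity; undefined for zero vectors.
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return 1.0 - dot / norm

def standardized_euclidean(p, q, variances):
    # Assumed convention: squared differences scaled by per-feature variance.
    return math.sqrt(sum((x - y) ** 2 / v
                         for x, y, v in zip(p, q, variances)))
```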

n_clusters : int

Number of clusters.

0: the Affinity Propagation clustering result is not adjusted.

Non-zero int: if Affinity Propagation produces more than n_clusters clusters, PAL merges them so that the final number of clusters equals n_clusters.

max_iter : int, optional

Specifies the maximum number of iterations.

Defaults to 500.

convergence_iter : int, optional

Specifies the number of iterations for which cluster stability should be maintained. If the clusters remain stable for the specified number of iterations, the algorithm terminates.

Defaults to 100.

damping : float, optional

Controls the updating velocity. Value range: (0, 1).

Defaults to 0.9.

preference : float, optional

Determines the preference. Value range: [0,1].

Defaults to 0.5.

seed_ratio : float, optional

Selects a portion of the input data (seed_ratio * data_number rows) as seed, where data_number is the number of rows in the input data.

Value range: (0,1].

If seed_ratio is set to 1, the entire input dataset will be used as seed data.

Defaults to 1.

times : int, optional

Specifies the number of sampling iterations. Only valid when seed_ratio is less than 1.

Defaults to 1.

minkowski_power : int, optional

Specifies the power parameter for the Minkowski distance calculation. This parameter is relevant only when affinity is set to 'minkowski'.

Defaults to 3.

thread_ratio : float, optional

Specifies the fraction of available threads to use, from 0 to 1. A value of 0 uses a single thread, and 1 uses all currently available threads. Values outside this range are ignored, and the number of threads is then determined heuristically.

Defaults to 0.
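The value ranges documented for the parameters can be checked on the client side before an instance is created. The helper below is a hypothetical sketch, not part of hana_ml. It only encodes the ranges stated in this parameter list and returns a list of messages rather than raising, because an out-of-range thread_ratio is documented as being ignored rather than rejected.

```python
def check_ap_params(damping=None, preference=None,
                    seed_ratio=None, thread_ratio=None):
    """Return messages for values outside the documented ranges."""
    problems = []
    if damping is not None and not 0 < damping < 1:
        problems.append("damping must be in (0, 1)")
    if preference is not None and not 0 <= preference <= 1:
        problems.append("preference must be in [0, 1]")
    if seed_ratio is not None and not 0 < seed_ratio <= 1:
        problems.append("seed_ratio must be in (0, 1]")
    if thread_ratio is not None and not 0 <= thread_ratio <= 1:
        # Documented behaviour: the value is ignored, not rejected.
        problems.append("thread_ratio outside [0, 1] will be ignored")
    return problems
```

A value of None is skipped, since None means the documented default is used.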

Examples

Input DataFrame df:

>>> df.collect()
    ID  ATTRIB1  ATTRIB2
0    1     0.10     0.10
1    2     0.11     0.10
...
22  23    10.13    10.14
23  24    10.14    10.13

Create an AffinityPropagation instance:

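>>> from hana_ml.algorithms.pal.clustering import AffinityPropagation
>>> ap = AffinityPropagation(
...     affinity='euclidean',
...     n_clusters=0,
...     max_iter=500,
...     convergence_iter=100,
...     damping=0.9,
...     preference=0.5,
...     seed_ratio=None,
...     times=None,
...     minkowski_power=None,
...     thread_ratio=1)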

Perform fit():

>>> ap.fit(data=df, key='ID')

Expected output:

>>> ap.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
...
22  23           1
23  24           1

Attributes:

labels_ : DataFrame

Label assigned to each sample.

Methods

`fit`(data[, key, features])	Fit the model to the training dataset.
`fit_predict`(data[, key, features])	Fit with the dataset and return labels.
`get_model_metrics`()	Get the model metrics.
`get_score_metrics`()	Get the score metrics.

fit(data, key=None, features=None)

Fit the model to the training dataset.

Parameters:

data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.
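The defaulting rules for key and features can be sketched in plain Python. The helper `resolve_key_and_features` is hypothetical and only mirrors the rules stated above. It takes column names as plain strings and does not touch a real hana_ml DataFrame.

```python
def resolve_key_and_features(columns, index=None, key=None, features=None):
    """Apply the documented defaults for key and features."""
    if key is None:
        if index is None:
            # No explicit key and no index column: the caller must supply one.
            raise ValueError("key must be given when data has no index column")
        key = index
    if features is None:
        # Default: every column except the key.
        features = [c for c in columns if c != key]
    return key, list(features)
```

For the example table, `resolve_key_and_features(['ID', 'ATTRIB1', 'ATTRIB2'], key='ID')` yields `'ID'` as the key and the two attribute columns as features.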

fit_predict(data, key=None, features=None)

Fit with the dataset and return labels.

Parameters:

data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

features : list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

Returns:

DataFrame: Labels of each point.
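A minimal sketch of calling fit_predict() in one step is shown below. The wrapper name `cluster_labels` is hypothetical, and `df` is assumed to be a hana_ml DataFrame obtained from an existing connection with an ID column named 'ID', as in the example above. Only the constructor arguments and the fit_predict() signature documented on this page are used.

```python
def cluster_labels(df, key="ID", features=None):
    """Fit AffinityPropagation on a hana_ml DataFrame and return the labels."""
    # Imported lazily so this module loads even where hana_ml is not installed.
    from hana_ml.algorithms.pal.clustering import AffinityPropagation

    # n_clusters=0 leaves the Affinity Propagation result unadjusted.
    ap = AffinityPropagation(affinity="euclidean", n_clusters=0)
    return ap.fit_predict(data=df, key=key, features=features)
```

Calling `cluster_labels(df).collect()` would then fetch the labels to the client, in the same way as `ap.labels_.collect()` in the example.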

get_model_metrics()

Get the model metrics.

Returns:

DataFrame: The model metrics.

get_score_metrics()

Get the score metrics.

Returns:

DataFrame: The score metrics.

Inherited Methods from PALBase

In addition to the methods described above, the AffinityPropagation class inherits methods from the PALBase class. Please refer to PAL Base for more details.