AffinityPropagation

class hana_ml.algorithms.pal.clustering.AffinityPropagation(affinity, n_clusters, max_iter=None, convergence_iter=None, damping=None, preference=None, seed_ratio=None, times=None, minkowski_power=None, thread_ratio=None)

Affinity Propagation is an algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars. It operates by simultaneously considering all data points as potential exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges.
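The message exchange described above can be sketched in plain Python. This is a minimal illustration of the textbook responsibility/availability updates, not PAL's implementation. The function name affinity_propagation, the use of negative squared Euclidean distance as similarity, and the raw-similarity meaning of preference are assumptions made for this sketch; in particular, preference here is not the [0, 1] value accepted by the class.

```python
def affinity_propagation(points, preference=None, damping=0.9,
                         max_iter=500, convergence_iter=100):
    """Textbook Affinity Propagation on a small list of numeric tuples."""
    n = len(points)
    # Similarity s[i][k]: negative squared Euclidean distance (assumption).
    s = [[-sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
         for p in points]
    if preference is None:
        # Assumed fallback: median of the off-diagonal similarities.
        off = sorted(s[i][k] for i in range(n) for k in range(n) if i != k)
        preference = off[len(off) // 2]
    for k in range(n):
        s[k][k] = preference  # self-similarity: willingness to be an exemplar

    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities
    exemplars, last, stable = [], None, 0
    for _ in range(max_iter):
        # Responsibility: how well k suits i compared with i's best alternative.
        for i in range(n):
            scores = [a[i][k] + s[i][k] for k in range(n)]
            for k in range(n):
                rival = max(v for j, v in enumerate(scores) if j != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - rival)
        # Availability: support that k has gathered from the other points.
        for k in range(n):
            pos = [max(0.0, r[i][k]) for i in range(n)]
            support = sum(pos) - pos[k]
            for i in range(n):
                if i == k:
                    target = support
                else:
                    target = min(0.0, r[k][k] + support - pos[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * target
        # A point is an exemplar when its self-evidence is positive.
        exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
        if exemplars and exemplars == last:
            stable += 1
            if stable >= convergence_iter:
                break  # exemplar set unchanged for convergence_iter rounds
        else:
            last, stable = exemplars, 0

    if not exemplars:
        return [], [-1] * n
    labels = []
    for i in range(n):
        if i in exemplars:
            labels.append(exemplars.index(i))
        else:
            # Assign each remaining point to its most similar exemplar.
            labels.append(max(range(len(exemplars)),
                              key=lambda c: s[i][exemplars[c]]))
    return exemplars, labels


# Two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
exemplars, labels = affinity_propagation(points, preference=-50.0)
```

The damping factor blends each new message with its previous value, which is what keeps the updates from oscillating, and the loop stops early once the exemplar set has stayed the same for convergence_iter consecutive rounds.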

Parameters
affinity{'manhattan', 'standardized_euclidean', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}

Method used to compute the distance between two points.

n_clustersint

Number of clusters.

  • 0: the Affinity Propagation cluster result is not adjusted.

  • Non-zero int: if Affinity Propagation produces more clusters than n_clusters, PAL merges the result so that the number of clusters equals n_clusters.

No default value as it is mandatory.

max_iterint, optional

Maximum number of iterations.

Defaults to 500.

convergence_iterint, optional

The algorithm stops when the clusters remain unchanged for the specified number of iterations.

Defaults to 100.

dampingfloat, optional

Controls the updating velocity. Value range: (0, 1).

Defaults to 0.9.

preferencefloat, optional

Determines the preference. Value range: [0,1].

Defaults to 0.5.

seed_ratiofloat, optional

Selects a portion of the input data (seed_ratio * data_number rows) as seed, where data_number is the number of rows in the input data.

Value range: (0,1].

If seed_ratio is 1, all the input data will be the seed.

Defaults to 1.

timesint, optional

The number of times sampling is performed. Only valid when seed_ratio is less than 1.

Defaults to 1.

minkowski_powerint, optional

The power of the Minkowski method. Only valid when affinity is 'minkowski'.

Defaults to 3.
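To illustrate what the power controls, the sketch below uses the standard textbook definitions of the Minkowski and Chebyshev distances. The helper names minkowski and chebyshev are hypothetical, and the source does not state that PAL computes the distances exactly this way.

```python
def minkowski(x, y, p=3):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def chebyshev(x, y):
    """Chebyshev distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (0.0, 0.0), (3.0, 4.0)
d1 = minkowski(x, y, p=1)   # p=1 is the Manhattan distance: 3 + 4
d2 = minkowski(x, y, p=2)   # p=2 is the Euclidean distance: sqrt(9 + 16)
d3 = minkowski(x, y, p=3)   # p=3 matches the documented default power
d_inf = chebyshev(x, y)     # the limit of the Minkowski distance as p grows
```

As p increases, the distance shrinks from the Manhattan value toward the Chebyshev value, so a larger power gives more weight to the single largest coordinate difference.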

thread_ratiofloat, optional

Specifies the ratio of the total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means using only 1 thread, and 1 means using at most all the currently available threads. Values outside the range are ignored, and this function heuristically determines the number of threads to use.

Defaults to 0.

Examples

Input dataframe df for clustering:

>>> df.collect()
    ID  ATTRIB1  ATTRIB2
0    1     0.10     0.10
1    2     0.11     0.10
2    3     0.10     0.11
3    4     0.11     0.11
4    5     0.12     0.11
5    6     0.11     0.12
6    7     0.12     0.12
7    8     0.12     0.13
8    9     0.13     0.12
9   10     0.13     0.13
10  11     0.13     0.14
11  12     0.14     0.13
12  13    10.10    10.10
13  14    10.11    10.10
14  15    10.10    10.11
15  16    10.11    10.11
16  17    10.11    10.12
17  18    10.12    10.11
18  19    10.12    10.12
19  20    10.12    10.13
20  21    10.13    10.12
21  22    10.13    10.13
22  23    10.13    10.14
23  24    10.14    10.13

Create an AffinityPropagation instance:

>>> ap = AffinityPropagation(
            affinity='euclidean',
            n_clusters=0,
            max_iter=500,
            convergence_iter=100,
            damping=0.9,
            preference=0.5,
            seed_ratio=None,
            times=None,
            minkowski_power=None,
            thread_ratio=1)

Perform fit on the given data:

>>> ap.fit(data=df, key='ID')

Expected output:

>>> ap.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
3    4           0
4    5           0
5    6           0
6    7           0
7    8           0
8    9           0
9   10           0
10  11           0
11  12           0
12  13           1
13  14           1
14  15           1
15  16           1
16  17           1
17  18           1
18  19           1
19  20           1
20  21           1
21  22           1
22  23           1
23  24           1

Attributes
labels_DataFrame

Label assigned to each sample, structured as follows:

  • ID, record ID.

  • CLUSTER_ID, the range is from 0 to n_clusters - 1.
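As a small illustration of working with this structure, the sketch below groups record IDs by cluster. It assumes the (ID, CLUSTER_ID) pairs have already been fetched to the client, for example from the pandas DataFrame returned by ap.labels_.collect(); here the rows variable is filled in by hand with the values from the example output, so the snippet runs without a database connection.

```python
from collections import defaultdict

# (ID, CLUSTER_ID) pairs mirroring the example output: IDs 1-12 in cluster 0,
# IDs 13-24 in cluster 1. In practice these would come from the labels_ result.
rows = [(record_id, 0 if record_id <= 12 else 1) for record_id in range(1, 25)]

# Collect the member IDs of each cluster.
members = defaultdict(list)
for record_id, cluster_id in rows:
    members[cluster_id].append(record_id)

# Cluster sizes, useful as a quick sanity check on the result.
sizes = {cluster_id: len(ids) for cluster_id, ids in members.items()}
```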

Methods

fit(data[, key, features])

Fit the model when given the training dataset.

fit_predict(data[, key, features])

Fit with the dataset and return labels.

fit(data, key=None, features=None)

Fit the model when given the training dataset.

Parameters
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

fit_predict(data, key=None, features=None)

Fit with the dataset and return labels.

Parameters
dataDataFrame

DataFrame containing the data.

keystr, optional

Name of ID column.

Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.

featuresa list of str, optional

Names of the feature columns.

If features is not provided, it defaults to all non-key columns.

Returns
DataFrame

Labels of each point.
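A hedged usage sketch for fit_predict follows. The wrapper name cluster_labels is hypothetical, and df is assumed to be an existing hana_ml DataFrame with an ID column, such as the one shown in the Examples section. The import is placed inside the function so that the snippet can be loaded without a HANA connection; the call itself only uses the constructor arguments and method signature documented on this page.

```python
def cluster_labels(df, key="ID", features=None):
    """Fit Affinity Propagation on df and return the label DataFrame.

    Hypothetical helper: df is assumed to be a hana_ml DataFrame.
    """
    # Imported lazily so that defining this helper does not require hana_ml.
    from hana_ml.algorithms.pal.clustering import AffinityPropagation

    ap = AffinityPropagation(affinity="euclidean",
                             n_clusters=0,      # keep the raw cluster result
                             damping=0.9,
                             preference=0.5)
    # features=None falls back to all non-key columns, as documented above.
    return ap.fit_predict(data=df, key=key, features=features)
```

Calling cluster_labels(df) on the example data would be expected to return the same ID and CLUSTER_ID columns as labels_ after fit.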

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

In addition to the methods described above, the AffinityPropagation class inherits methods from the PALBase class. Refer to PAL Base for more details.