AffinityPropagation
- class hana_ml.algorithms.pal.clustering.AffinityPropagation(affinity, n_clusters, max_iter=None, convergence_iter=None, damping=None, preference=None, seed_ratio=None, times=None, minkowski_power=None, thread_ratio=None)
Affinity Propagation is an algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars. It operates by simultaneously considering all data points as potential exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges.
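The message exchange described above can be sketched in plain Python. This is a minimal illustration of the textbook responsibility/availability updates, not PAL's implementation. It assumes negative squared Euclidean distance as the similarity and treats preference as the raw self-similarity placed on the diagonal; PAL documents its own preference range and may scale it differently. The function name affinity_propagation_sketch and the sample points are hypothetical.

```python
def affinity_propagation_sketch(points, preference, damping=0.9,
                                max_iter=500, convergence_iter=100):
    """Return (exemplar indices, label per point) for a small list of points."""
    n = len(points)
    # Similarity matrix: negative squared Euclidean distance (an assumption).
    S = [[-sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
         for p in points]
    for k in range(n):
        S[k][k] = preference  # self-similarity controls how many exemplars emerge
    R = [[0.0] * n for _ in range(n)]  # responsibilities
    A = [[0.0] * n for _ in range(n)]  # availabilities
    exemplars, stable = [], 0
    for _ in range(max_iter):
        # Responsibility: how well suited k is as exemplar for i,
        # compared with the best competing candidate.
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(scores[j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # Availability: accumulated support for k from the other points.
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
        # A point is an exemplar when its self-responsibility plus
        # self-availability is positive.
        current = [k for k in range(n) if A[k][k] + R[k][k] > 0]
        stable = stable + 1 if current and current == exemplars else 0
        exemplars = current
        if stable >= convergence_iter:
            break  # exemplar set unchanged for convergence_iter iterations
    if not exemplars:
        return [], [-1] * n
    # Assign every point to its most similar exemplar.
    labels = [max(range(len(exemplars)), key=lambda c: S[i][exemplars[c]])
              for i in range(n)]
    for c, k in enumerate(exemplars):
        labels[k] = c
    return exemplars, labels


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (10.0, 10.0), (11.0, 10.0), (12.0, 10.0)]
exemplars, labels = affinity_propagation_sketch(points, preference=-10.0)
print(exemplars, labels)
```

The damping factor blends each new message with the previous one, which is why values close to 1 update slowly but help avoid oscillation.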
- Parameters:
- affinity : {'manhattan', 'standardized_euclidean', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}
Distance metric used to compute the distance between two points.
- n_clusters : int
Number of clusters.
0: Does not adjust the Affinity Propagation cluster result.
Non-zero int: If the number of clusters found by Affinity Propagation is larger than n_clusters, PAL merges the result so that the number of clusters equals the value specified for n_clusters.
- max_iter : int, optional
Specifies the maximum number of iterations.
Defaults to 500.
- convergence_iter : int, optional
Specifies the number of iterations for which cluster stability should be maintained. If the clusters remain stable for the specified number of iterations, the algorithm terminates.
Defaults to 100.
- damping : float, optional
Controls the updating velocity. Value range: (0, 1).
Defaults to 0.9.
- preference : float, optional
Determines the preference. Value range: [0,1].
Defaults to 0.5.
- seed_ratio : float, optional
Selects a portion (seed_ratio * data_number rows) of the input data as seed, where data_number is the number of rows in the input data.
Value range: (0,1].
If seed_ratio is set to 1, the entire input dataset is used as seed data.
Defaults to 1.
- times : int, optional
Specifies the number of sampling iterations. Only valid when seed_ratio is less than 1.
Defaults to 1.
- minkowski_power : int, optional
Specifies the power parameter for the Minkowski distance. This parameter is relevant only when affinity is set to 'minkowski'.
Defaults to 3.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to 0.
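Most of the affinity options correspond to well-known distance functions. The sketch below uses their textbook definitions in plain Python to show what each option measures and where minkowski_power enters. PAL's exact formulas may differ, in particular for 'cosine'; 'standardized_euclidean' is omitted because its normalization is not described here. The function names are illustrative and not part of hana_ml.

```python
import math


def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    # Straight-line distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minkowski(a, b, p=3):
    # Generalizes manhattan (p=1) and euclidean (p=2); p plays the role
    # of minkowski_power.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def chebyshev(a, b):
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    # One minus the cosine of the angle between the vectors (assumed definition).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a, b = (0.0, 0.0), (3.0, 4.0)
print(manhattan(a, b), euclidean(a, b), minkowski(a, b, p=3), chebyshev(a, b))
print(cosine((1.0, 0.0), (0.0, 1.0)))
```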
Examples
Input DataFrame df:
>>> df.collect()
    ID  ATTRIB1  ATTRIB2
0    1     0.10     0.10
1    2     0.11     0.10
...
22  23    10.13    10.14
23  24    10.14    10.13
Create an AffinityPropagation instance:
>>> ap = AffinityPropagation(
...     affinity='euclidean',
...     n_clusters=0,
...     max_iter=500,
...     convergence_iter=100,
...     damping=0.9,
...     preference=0.5,
...     seed_ratio=None,
...     times=None,
...     minkowski_power=None,
...     thread_ratio=1)
Perform fit():
>>> ap.fit(data=df, key='ID')
Expected output:
>>> ap.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
...
22  23           1
23  24           1
- Attributes:
- labels_ : DataFrame
Label assigned to each sample.
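Once labels_ has been collected to the client, each row pairs an ID with a CLUSTER_ID. A small sketch of grouping such rows by cluster follows. It uses plain tuples in place of the collected result and includes only the four rows visible in the example output; the elided rows are not reproduced. The variable names are illustrative.

```python
from collections import defaultdict

# (ID, CLUSTER_ID) pairs taken from the visible rows of the example output.
rows = [(1, 0), (2, 0), (23, 1), (24, 1)]

# Collect the sample IDs that belong to each cluster.
members = defaultdict(list)
for sample_id, cluster_id in rows:
    members[cluster_id].append(sample_id)

# Number of samples per cluster.
cluster_sizes = {cluster_id: len(ids) for cluster_id, ids in members.items()}
print(dict(members), cluster_sizes)
```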
Methods
fit(data[, key, features])    Fit the model to the training dataset.
fit_predict(data[, key, features])    Fit with the dataset and return labels.
get_model_metrics()    Get the model metrics.
get_score_metrics()    Get the score metrics.
- fit(data, key=None, features=None)
Fit the model to the training dataset.
- Parameters:
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be specified.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- fit_predict(data, key=None, features=None)
Fit with the dataset and return labels.
- Parameters:
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be specified.
- features : list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- Returns:
- DataFrame
Labels of each point.
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
In addition to the methods described above, the AffinityPropagation class inherits methods from the PALBase class. Refer to PAL Base for more details.