AffinityPropagation
- class hana_ml.algorithms.pal.clustering.AffinityPropagation(affinity, n_clusters, max_iter=None, convergence_iter=None, damping=None, preference=None, seed_ratio=None, times=None, minkowski_power=None, thread_ratio=None)
Affinity Propagation is an algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars. It operates by simultaneously considering all data points as potential exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges.
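The message exchange described above can be sketched in plain Python. This is a minimal illustration and not PAL's implementation: the function name `affinity_propagation_sketch`, the use of negative squared Euclidean distance as the similarity, and the median-similarity default for the self-preference are all assumptions made for this example. It needs at least two points.

```python
from statistics import median


def affinity_propagation_sketch(points, damping=0.9, max_iter=500,
                                convergence_iter=100, preference=None):
    """Toy affinity propagation over a list of equal-length numeric tuples.

    Returns one label per point; each label is the index of its exemplar.
    """
    n = len(points)
    # Similarity (assumption): negative squared Euclidean distance.
    s = [[-sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
         for p in points]
    if preference is None:
        # Assumption: median off-diagonal similarity as self-similarity.
        preference = median(s[i][k] for i in range(n)
                            for k in range(n) if i != k)
    for k in range(n):
        s[k][k] = preference

    r = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    a = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    exemplars, previous, stable = [], None, 0

    for _ in range(max_iter):
        # Responsibility: how well k suits i compared with the best rival.
        for i in range(n):
            for k in range(n):
                rival = max(a[i][j] + s[i][j] for j in range(n) if j != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - rival)

        # Availability: accumulated support for k being an exemplar.
        for k in range(n):
            positive = [max(0.0, r[i][k]) for i in range(n)]
            total = sum(positive) - positive[k]  # excludes the self term
            for i in range(n):
                if i == k:
                    proposed = total
                else:
                    proposed = min(0.0, r[k][k] + total - positive[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * proposed

        # Stop once the exemplar set has been stable for long enough.
        exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
        stable = stable + 1 if exemplars == previous else 0
        previous = exemplars
        if exemplars and stable >= convergence_iter:
            break

    if not exemplars:
        # Fallback so every point still receives a label.
        exemplars = [max(range(n), key=lambda k: r[k][k] + a[k][k])]

    # Exemplars label themselves; other points join the most similar exemplar.
    return [i if i in exemplars else max(exemplars, key=lambda k: s[i][k])
            for i in range(n)]


points = [(0.0,), (0.2,), (0.5,), (10.0,), (10.3,), (10.4,)]
labels = affinity_propagation_sketch(points, max_iter=200, convergence_iter=20)
print(labels)  # one exemplar index per point
```

The `damping`, `max_iter` and `convergence_iter` arguments mirror the roles of the parameters documented below, but their exact semantics inside PAL may differ from this sketch.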
- Parameters
- affinity : {'manhattan', 'standardized_euclidean', 'euclidean', 'minkowski', 'chebyshev', 'cosine'}
Method used to compute the distance between two points.
- n_clusters : int
Number of clusters.
0: does not adjust the Affinity Propagation cluster result.
Non-zero int: If the number of clusters found by Affinity Propagation is larger than n_clusters, PAL merges the result so that the number of clusters equals the value specified for n_clusters.
No default value as it is mandatory.
- max_iter : int, optional
Maximum number of iterations.
Defaults to 500.
- convergence_iter : int, optional
The algorithm ends when the clusters remain unchanged for the specified number of iterations.
Defaults to 100.
- damping : float, optional
Controls the updating velocity. Value range: (0, 1).
Defaults to 0.9.
- preference : float, optional
Determines the preference. Value range: [0,1].
Defaults to 0.5.
- seed_ratio : float, optional
Selects a portion (seed_ratio * data_number) of the input data as seeds, where data_number is the number of rows in the input data.
Value range: (0,1].
If seed_ratio is 1, all the input data will be used as seeds.
Defaults to 1.
- times : int, optional
The sampling times. Only valid when seed_ratio is less than 1.
Defaults to 1.
- minkowski_power : int, optional
The power of the Minkowski method. Only valid when affinity is 'minkowski'.
Defaults to 3.
- thread_ratio : float, optional
Specifies the ratio of total number of threads that can be used by this function. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside the range will be ignored and this function heuristically determines the number of threads to use.
Defaults to 0.
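As a rough illustration of the `affinity` options listed above, the following sketch gives textbook definitions of the distance measures in plain Python. The exact formulas PAL applies (in particular how 'standardized_euclidean' scales each feature and how 'cosine' is turned into a distance) are not stated here, so those two are assumptions; the function names and the `std` argument are hypothetical.

```python
import math


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def minkowski(x, y, power=3):
    """Generalised distance; power=1 is Manhattan, power=2 is Euclidean."""
    return sum(abs(a - b) ** power for a, b in zip(x, y)) ** (1.0 / power)


def chebyshev(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def standardized_euclidean(x, y, std):
    """Euclidean distance after dividing each feature by its standard
    deviation (assumption: std holds one positive value per feature)."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, std)))


def cosine(x, y):
    """One minus the cosine similarity (assumption); undefined for zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


p, q = (0.0, 0.0), (3.0, 4.0)
for name, value in [("manhattan", manhattan(p, q)),
                    ("euclidean", euclidean(p, q)),
                    ("minkowski", minkowski(p, q, power=3)),
                    ("chebyshev", chebyshev(p, q))]:
    print(name, value)
```

The `power` argument plays the role of `minkowski_power`, which is why that parameter only matters when the Minkowski method is selected.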
Examples
Input dataframe df for clustering:
>>> df.collect()
    ID  ATTRIB1  ATTRIB2
0    1     0.10     0.10
1    2     0.11     0.10
2    3     0.10     0.11
3    4     0.11     0.11
4    5     0.12     0.11
5    6     0.11     0.12
6    7     0.12     0.12
7    8     0.12     0.13
8    9     0.13     0.12
9   10     0.13     0.13
10  11     0.13     0.14
11  12     0.14     0.13
12  13    10.10    10.10
13  14    10.11    10.10
14  15    10.10    10.11
15  16    10.11    10.11
16  17    10.11    10.12
17  18    10.12    10.11
18  19    10.12    10.12
19  20    10.12    10.13
20  21    10.13    10.12
21  22    10.13    10.13
22  23    10.13    10.14
23  24    10.14    10.13
Create an AffinityPropagation instance:
>>> ap = AffinityPropagation(
...     affinity='euclidean',
...     n_clusters=0,
...     max_iter=500,
...     convergence_iter=100,
...     damping=0.9,
...     preference=0.5,
...     seed_ratio=None,
...     times=None,
...     minkowski_power=None,
...     thread_ratio=1)
Perform fit on the given data:
>>> ap.fit(data=df, key='ID')
Expected output:
>>> ap.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
3    4           0
4    5           0
5    6           0
6    7           0
7    8           0
8    9           0
9   10           0
10  11           0
11  12           0
12  13           1
13  14           1
14  15           1
15  16           1
16  17           1
17  18           1
18  19           1
19  20           1
20  21           1
21  22           1
22  23           1
23  24           1
- Attributes
- labels_ : DataFrame
Label assigned to each sample, structured as follows:
ID, record ID.
CLUSTER_ID, the range is from 0 to n_clusters - 1.
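Once a label table with the ID and CLUSTER_ID columns described above has been collected, a common next step is to summarise it. The sketch below uses plain (ID, CLUSTER_ID) tuples as a stand-in for the collected rows; the sample rows and the helper names `cluster_sizes` and `members_by_cluster` are hypothetical and not part of hana_ml.

```python
from collections import Counter, defaultdict

# Stand-in rows shaped like the label table: (ID, CLUSTER_ID).
rows = [(1, 0), (2, 0), (3, 0), (13, 1), (14, 1)]


def cluster_sizes(rows):
    """Count how many records fall into each cluster."""
    return dict(Counter(cluster_id for _, cluster_id in rows))


def members_by_cluster(rows):
    """Group record IDs by the cluster they were assigned to."""
    groups = defaultdict(list)
    for record_id, cluster_id in rows:
        groups[cluster_id].append(record_id)
    return dict(groups)


print(cluster_sizes(rows))
print(members_by_cluster(rows))
```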
Methods
fit(data[, key, features]): Fit the model when given the training dataset.
fit_predict(data[, key, features]): Fit with the dataset and return labels.
- fit(data, key=None, features=None)
Fit the model when given the training dataset.
- Parameters
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- fit_predict(data, key=None, features=None)
Fit with the dataset and return labels.
- Parameters
- data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of ID column.
Defaults to the index column of data (i.e. data.index) if it is set. If data has no index column, key must be provided.
- features : a list of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-key columns.
- Returns
- DataFrame
Labels of each point.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.