Wednesday, April 29, 2015

Affinity Propagation Clustering Algorithm

Affinity Propagation (AP)[1] is a relatively new clustering algorithm based on the concept of "message passing" between data points. AP does not require the number of clusters to be determined or estimated before running the algorithm.

“An algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars. It operates by simultaneously considering all data points as potential exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges.”[1]

 

Let x1 through xn be a set of data points, with no assumptions made about their internal structure, and let s be a function that quantifies the similarity between any two points, such that s(xi, xj) > s(xi, xk) iff xi is more similar to xj than to xk.
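
For example, a common choice for s (and, as far as I can tell, what scikit-learn's default `euclidean` affinity corresponds to) is the negative squared Euclidean distance. A minimal NumPy sketch:

import numpy as np

def similarity_matrix(X):
    # s(i, j) = -||xi - xj||^2 : larger (less negative) means more similar
    diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
    return -np.sum(diff ** 2, axis=-1)

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1]])
S = similarity_matrix(points)
# S[0, 1] > S[0, 2], so point 0 is more similar to point 1 than to point 2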

Algorithm

The algorithm proceeds by alternating between two message-passing steps, which update two matrices:

  • The "responsibility" matrix R has values r(i, k) that quantify how well-suited xk is to serve as the exemplar for xi, relative to other candidate exemplars for xi.
    • Responsibility is updated according to:

r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \left\{ a(i,k') + s(i,k') \right\}

  • The "availability" matrix A contains values a(i, k) represents how "appropriate" it would be for xi to pick xk as its exemplar, taking into account other points' preference for xkas an exemplar.
    • Availability is updated as:
a(i,k) \leftarrow \min \left( 0, r(k,k) + \sum_{i' \not\in \{i,k\}} \max(0, r(i',k)) \right) for i \neq k and
a(k,k) \leftarrow \sum_{i' \neq k} \max(0, r(i',k)).

The input to the algorithm is {s(i, j)} for i, j ∈ {1, ..., N} (data similarities and preferences).

Both matrices are initialized to all zeroes.
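
To make the update rules concrete, here is a minimal, unoptimized NumPy sketch of a single round of (undamped) message passing. It follows the formulas above, but it is not how scikit-learn implements it; real implementations also apply damping to avoid oscillations.

import numpy as np

def update_messages(S, R, A):
    # S: similarities, R: responsibilities, A: availabilities (all N x N)
    N = S.shape[0]
    # r(i,k) <- s(i,k) - max_{k' != k} { a(i,k') + s(i,k') }
    AS = A + S
    for i in range(N):
        for k in range(N):
            R[i, k] = S[i, k] - np.delete(AS[i], k).max()
    # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))  for i != k
    # a(k,k) <- sum_{i' != k} max(0, r(i',k))
    Rp = np.maximum(R, 0)
    for k in range(N):
        col = Rp[:, k].copy()
        col[k] = 0.0                  # drop the i' = k term
        total = col.sum()             # sum_{i' != k} max(0, r(i',k))
        for i in range(N):
            if i == k:
                A[k, k] = total
            else:
                A[i, k] = min(0.0, R[k, k] + total - col[i])
    return R, A

Iterating these updates until the exemplar choices (the argmax of R + A along each row) stop changing yields the final exemplars and clusters.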

 

Let us implement the algorithm.

I will be using Python's sklearn.cluster.AffinityPropagation, applied to my previously generated data set[2].

from sklearn.cluster import AffinityPropagation

# Compute Affinity Propagation
af = AffinityPropagation().fit(X)
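
The data set itself comes from [2], so the generation code is not repeated here. A comparable X can be produced along these lines (n_samples, cluster_std and random_state are my own assumptions; only the centers are taken from later in this post):

from sklearn.datasets import make_blobs

centers = [[5, 5], [0, 0], [1, 5], [5, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers,
                            cluster_std=0.5, random_state=0)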

Parameters


All parameters are optional



  • damping : Damping factor between 0.5 and 1 (float, default: 0.5)

  • convergence_iter : Number of iterations with no change in the number of estimated clusters (int, optional, default: 15)

  • max_iter : Maximum number of iterations (int, default: 200)

  • copy : Make a copy of input data (boolean, default: True)

  • preference : Preferences for each point - points with larger preference values are more likely to be chosen as exemplars. The number of exemplars, i.e. of clusters, is influenced by the input preference value. If preferences are not passed as arguments, they will be set to the median of the input similarities. (array-like, shape (n_samples,) or float)

  • affinity : Which affinity to use. At the moment `precomputed` and `euclidean` are supported.  (string, optional, default=`euclidean`)

  • verbose : Whether to be verbose (boolean, default: False)

The scikit-learn implementation can be found here[4].
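
As a quick illustration of how these parameters are passed (the values below are arbitrary, chosen only to show the API):

from sklearn.cluster import AffinityPropagation

af = AffinityPropagation(damping=0.9, max_iter=200,
                         convergence_iter=15, preference=-50).fit(X)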


Attributes



  • cluster_centers_indices_ : Indices of cluster centers (array)

  • cluster_centers_ : Cluster centers  (array)

  • labels_ : Labels of each point (array)

  • affinity_matrix_ : Stores the affinity matrix used in `fit` (array)

  • n_iter_ : Number of iterations taken to converge (int)
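
After fitting, these attributes can be read straight off the fitted estimator, for example:

cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)

print('Estimated number of clusters: %d' % n_clusters_)
print('Converged after %d iterations' % af.n_iter_)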

I will be using the same result comparison metrics that we used for DBSCAN[2]. The charting will be updated for AP.
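
For reference, these scores can be computed with sklearn.metrics roughly as below (labels_true is assumed to be the ground-truth labels kept from the data generation step):

from sklearn import metrics

labels = af.labels_
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))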


Estimated number of clusters: 6
Homogeneity: 1.000
Completeness: 0.801
V-measure: 0.890
Adjusted Rand Index: 0.819
Adjusted Mutual Information: 0.799
Silhouette Coefficient: 0.574


[Figure: Affinity Propagation clustering result on the sample data set]


When the data set is more spread out, with the standard deviation (sd) increased from 0.5 to 0.9:


[Figure: clustering result when the standard deviation is increased to 0.9]
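
A sketch of regenerating the data with the larger spread (assuming the same make_blobs-style generation as in [2], where cluster_std is the per-cluster standard deviation):

from sklearn.datasets import make_blobs
from sklearn.cluster import AffinityPropagation

centers = [[5, 5], [0, 0], [1, 5], [5, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers,
                            cluster_std=0.9, random_state=0)
af = AffinityPropagation().fit(X)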


The sample data set center points are [[5, 5], [0, 0], [1, 5], [5, -1]]. Let us try tuning the algorithm's parameters to get better clustering.
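
For instance, the preference and damping parameters can be adjusted; lower (more negative) preference values produce fewer exemplars, and higher damping gives slower but more stable convergence. The values below are only illustrative, not the ones behind the figures:

af = AffinityPropagation(preference=-50, damping=0.9).fit(X)
print('Estimated number of clusters: %d' % len(af.cluster_centers_indices_))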


Let us see the effect of the number of iterations (max_iter) on AP.
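
The runs below can be produced along these lines, varying max_iter and checking how many iterations were actually used (the plotting code is in the gist linked at the end):

for max_iter in (30, 75, 150, 200):
    af = AffinityPropagation(max_iter=max_iter).fit(X)
    print('max_iter=%d -> %d clusters (stopped after %d iterations)'
          % (max_iter, len(af.cluster_centers_indices_), af.n_iter_))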



[Figures: clustering results with 30 iterations (left) and 75 iterations (right)]



[Figures: clustering results with 150 iterations (left) and 200 iterations (right)]


Gist : https://gist.github.com/Madhuka/2e27dce9680f42619b83#file-affinity-propagation-py


References


[1] Brendan J. Frey; Delbert Dueck (2007). "Clustering by passing messages between data points". Science 315 (5814): 972–976.


[2] http://madhukaudantha.blogspot.com/2015/04/density-based-clustering-algorithm.html





[3] http://www.cs.columbia.edu/~delbert/docs/DDueck-thesis_small.pdf


[4] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/affinity_propagation_.py#L256
