Wednesday, April 29, 2015

Affinity Propagation Clustering Algorithm

Affinity Propagation (AP)[1] is a relatively new clustering algorithm based on the concept of "message passing" between data points. AP does not require the number of clusters to be determined or estimated before running the algorithm.

“An algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars. It operates by simultaneously considering all data points as potential exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges.”[1]

 

Let x1 through xn be a set of data points, with no assumptions made about their internal structure, and let s be a function that quantifies the similarity between any two points, such that s(xi, xj) > s(xi, xk) iff xi is more similar to xj than to xk.
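
For example, a common choice for s (and, as far as I can tell, what scikit-learn's default `euclidean` affinity corresponds to) is the negative squared Euclidean distance. A minimal NumPy sketch:

import numpy as np

def similarity_matrix(X):
    # s(i, j) = -||xi - xj||^2 : larger (less negative) means more similar
    diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
    return -np.sum(diff ** 2, axis=-1)

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1]])
S = similarity_matrix(points)
# S[0, 1] > S[0, 2], so point 0 is more similar to point 1 than to point 2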

Algorithm

The algorithm proceeds by alternating between two message-passing steps, which update two matrices:

  • The "responsibility" matrix R has values r(i, k) that quantify how well-suited xk is to serve as the exemplar for xi, relative to other candidate exemplars for xi.
    • Responsibility is updated according to:

r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \left\{ a(i,k') + s(i,k') \right\}

  • The "availability" matrix A contains values a(i, k) represents how "appropriate" it would be for xi to pick xk as its exemplar, taking into account other points' preference for xkas an exemplar.
    • Availability is updated as:
a(i,k) \leftarrow \min \left( 0, r(k,k) + \sum_{i' \not\in \{i,k\}} \max(0, r(i',k)) \right) for i \neq k and
a(k,k) \leftarrow \sum_{i' \neq k} \max(0, r(i',k)).

The input to the algorithm is {s(i, j)} for i, j ∈ {1, ..., N} (data similarities and preferences).

Both matrices are initialized to all zeroes.
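
To make the update rules concrete, here is a minimal, unoptimized NumPy sketch of a single round of (undamped) message passing. It follows the formulas above, but it is not how scikit-learn implements it; real implementations also apply damping to avoid oscillations.

import numpy as np

def update_messages(S, R, A):
    # S: similarities, R: responsibilities, A: availabilities (all N x N)
    N = S.shape[0]
    # r(i,k) <- s(i,k) - max_{k' != k} { a(i,k') + s(i,k') }
    AS = A + S
    for i in range(N):
        for k in range(N):
            R[i, k] = S[i, k] - np.delete(AS[i], k).max()
    # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))  for i != k
    # a(k,k) <- sum_{i' != k} max(0, r(i',k))
    Rp = np.maximum(R, 0)
    for k in range(N):
        col = Rp[:, k].copy()
        col[k] = 0.0                  # drop the i' = k term
        total = col.sum()             # sum_{i' != k} max(0, r(i',k))
        for i in range(N):
            if i == k:
                A[k, k] = total
            else:
                A[i, k] = min(0.0, R[k, k] + total - col[i])
    return R, A

Iterating these updates until the exemplar choices (the argmax of R + A along each row) stop changing yields the final exemplars and clusters.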

 

Let us implement the algorithm.

I will be using Python's sklearn.cluster.AffinityPropagation, applied to my previously generated data set[2].

from sklearn.cluster import AffinityPropagation

# Compute Affinity Propagation
af = AffinityPropagation().fit(X)
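
The data set itself comes from [2], so the generation code is not repeated here. A comparable X can be produced along these lines (n_samples, cluster_std and random_state are my own assumptions; only the centers are taken from later in this post):

from sklearn.datasets import make_blobs

centers = [[5, 5], [0, 0], [1, 5], [5, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers,
                            cluster_std=0.5, random_state=0)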

Parameters


All parameters are optional



  • damping : Damping factor between 0.5 and 1 (float, default: 0.5)

  • convergence_iter : Number of iterations with no change in the number of estimated clusters (int, optional, default: 15)

  • max_iter : Maximum number of iterations (int, default: 200)

  • copy : Make a copy of input data (boolean, default: True)

  • preference : Preferences for each point - points with larger preference values are more likely to be chosen as exemplars. The number of exemplars, i.e. of clusters, is influenced by the input preference value. If preferences are not passed as arguments, they will be set to the median of the input similarities. (array-like, shape (n_samples,) or float)

  • affinity : Which affinity to use. At the moment `precomputed` and `euclidean` are supported.  (string, optional, default=`euclidean`)

  • verbose : Whether to be verbose (boolean, default: False)

The scikit-learn implementation can be found here[4].
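
As a quick illustration of how these parameters are passed (the values below are arbitrary, chosen only to show the API):

from sklearn.cluster import AffinityPropagation

af = AffinityPropagation(damping=0.9, max_iter=200,
                         convergence_iter=15, preference=-50).fit(X)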


Attributes



  • cluster_centers_indices_ : Indices of cluster centers (array)

  • cluster_centers_ : Cluster centers  (array)

  • labels_ : Labels of each point (array)

  • affinity_matrix_ : Stores the affinity matrix used in `fit` (array)

  • n_iter_ : Number of iterations taken to converge (int)
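
After fitting, these attributes can be read straight off the fitted estimator, for example:

cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)

print('Estimated number of clusters: %d' % n_clusters_)
print('Converged after %d iterations' % af.n_iter_)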

I will be using the same result comparison metrics that we used for DBSCAN[2]. The charting will be updated for AP.
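
For reference, these scores can be computed with sklearn.metrics roughly as below (labels_true is assumed to be the ground-truth labels kept from the data generation step):

from sklearn import metrics

labels = af.labels_
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))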


Estimated number of clusters: 6
Homogeneity: 1.000
Completeness: 0.801
V-measure: 0.890
Adjusted Rand Index: 0.819
Adjusted Mutual Information: 0.799
Silhouette Coefficient: 0.574


[Figure: Affinity Propagation clustering result on the sample data set]


When the data set is more spread out, with the standard deviation (sd) increased from 0.5 to 0.9:


[Figure: clustering result when the standard deviation is increased to 0.9]
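
A sketch of regenerating the data with the larger spread (assuming the same make_blobs-style generation as in [2], where cluster_std is the per-cluster standard deviation):

from sklearn.datasets import make_blobs
from sklearn.cluster import AffinityPropagation

centers = [[5, 5], [0, 0], [1, 5], [5, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers,
                            cluster_std=0.9, random_state=0)
af = AffinityPropagation().fit(X)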


The sample data set center points are [[5, 5], [0, 0], [1, 5], [5, -1]]. Let us try tuning the algorithm's parameters to get better clustering.
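
For instance, the preference and damping parameters can be adjusted; lower (more negative) preference values produce fewer exemplars, and higher damping gives slower but more stable convergence. The values below are only illustrative, not the ones behind the figures:

af = AffinityPropagation(preference=-50, damping=0.9).fit(X)
print('Estimated number of clusters: %d' % len(af.cluster_centers_indices_))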


Let us see the effect of the number of iterations (max_iter) on AP.
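
The runs below can be produced along these lines, varying max_iter and checking how many iterations were actually used (the plotting code is in the gist linked at the end):

for max_iter in (30, 75, 150, 200):
    af = AffinityPropagation(max_iter=max_iter).fit(X)
    print('max_iter=%d -> %d clusters (stopped after %d iterations)'
          % (max_iter, len(af.cluster_centers_indices_), af.n_iter_))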



[Figures: clustering results with 30 iterations (left) and 75 iterations (right)]



[Figures: clustering results with 150 iterations (left) and 200 iterations (right)]


Gist : https://gist.github.com/Madhuka/2e27dce9680f42619b83#file-affinity-propagation-py


References


[1] Brendan J. Frey; Delbert Dueck (2007). "Clustering by passing messages between data points". Science 315 (5814): 972–976.


[2] http://madhukaudantha.blogspot.com/2015/04/density-based-clustering-algorithm.html





[3] http://www.cs.columbia.edu/~delbert/docs/DDueck-thesis_small.pdf


[4] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/affinity_propagation_.py#L256
