Tuesday, April 7, 2015

Density-based clustering algorithm (DBSCAN) and Implementation

Density-based spatial clustering of applications with noise (DBSCAN)[1] is a density-based clustering algorithm. Given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions. In 2014, the algorithm was awarded the Test of Time Award at the leading data mining conference, KDD.

Density Definition

  • ε (eps) neighborhood – the set of objects within a radius of ε from an object
  • “High density” – the ε-neighborhood of an object contains at least MinPts objects

[Figure: ε-neighborhood of radius ε around an object, illustrating the MinPts threshold]
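To make these two definitions concrete, here is a minimal NumPy sketch; the eps_neighborhood helper and the sample points are my own, purely for illustration.

import numpy as np

def eps_neighborhood(X, i, eps):
    # Indices of all objects within a radius of eps from point i (including i).
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

# With MinPts = 3, point 0 is in a "high density" region: its
# eps-neighborhood contains at least 3 objects.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [5.0, 5.0]])
print(eps_neighborhood(X, 0, eps=0.5))            # [0 1 2]
print(len(eps_neighborhood(X, 0, eps=0.5)) >= 3)  # True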

 

Core, Border & Outlier

A point is a core point if it has at least a specified number of points (MinPts) within Eps. These are points in the interior of a cluster.

A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.

Any point that is neither a core point nor a border point is an outlier (noise point).
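Building on the eps_neighborhood sketch above, these three definitions can be applied directly (again an illustrative sketch of my own, not the library's implementation):

def classify_points(X, eps, min_pts):
    neighborhoods = [eps_neighborhood(X, i, eps) for i in range(len(X))]
    # Core points have at least MinPts points within Eps.
    core = {i for i, nb in enumerate(neighborhoods) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighborhoods):
        if i in core:
            labels.append('core')
        elif any(j in core for j in nb):
            labels.append('border')  # not core, but near a core point
        else:
            labels.append('noise')   # outlier
    return labels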

 

Density-reachability

A point p is directly density-reachable from a point q if q is a core point and p lies within q's ε-neighborhood. More generally, p is density-reachable from q if there is a chain of points p1 = q, ..., pn = p where each point is directly density-reachable from the previous one. Reachability is not a symmetric relation: by definition, no point is reachable from a non-core point, regardless of distance. Two points p and q are density-connected if there is a point o such that both p and q are density-reachable from o. Density-connectedness is symmetric.

Let's check this with an example:

[Figure: points q, p1, p2 and p forming a density-reachability chain]
– A point p is directly density-reachable from p2
– p2 is directly density-reachable from p1
– p1 is directly density-reachable from q
– p <- p2 <- p1 <- q form a chain

p is (indirectly) density-reachable from q
q is not density-reachable from p, since p is not a core point
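The asymmetry is easy to verify in code. The sketch below is my own illustration (reusing the eps_neighborhood helper from above, with q and p given as row indices into X): it walks chains of core points starting from q and reports whether p can be reached.

def density_reachable(X, eps, min_pts, q, p):
    # p is density-reachable from q if a chain of core points, starting
    # at q, leads to a point whose eps-neighborhood contains p.
    core = {i for i in range(len(X))
            if len(eps_neighborhood(X, i, eps)) >= min_pts}
    if q not in core:
        return False  # nothing is reachable from a non-core point
    frontier, seen = [q], {q}
    while frontier:
        i = frontier.pop()
        for j in eps_neighborhood(X, i, eps):
            if j == p:
                return True
            if j in core and j not in seen:
                seen.add(j)
                frontier.append(j)
    return False

On the chain above, density_reachable would return True from q to p but False from p to q, exactly because p is not a core point.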

 

Algorithm

DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      NeighborPts = regionQuery(P, eps)
      if sizeof(NeighborPts) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, NeighborPts, C, eps, MinPts)
         
expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point P' in NeighborPts
      if P' is not visited
         mark P' as visited
         NeighborPts' = regionQuery(P', eps)
         if sizeof(NeighborPts') >= MinPts
            NeighborPts = NeighborPts joined with NeighborPts'
      if P' is not yet member of any cluster
         add P' to cluster C
         
regionQuery(P, eps)
   return all points within P's eps-neighborhood (including P)
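The pseudocode translates almost line for line into Python. Here is a naive, unoptimized sketch of my own (regionQuery recomputes all distances on each call; the scikit-learn implementation used below is what you would run in practice):

import numpy as np

def region_query(X, i, eps):
    # All points within i's eps-neighborhood (including i).
    return list(np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0])

def dbscan(X, eps, min_pts):
    labels = [None] * len(X)  # None = unvisited, -1 = NOISE
    c = -1
    for i in range(len(X)):
        if labels[i] is not None:  # already visited
            continue
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # mark P as NOISE
            continue
        c += 1         # C = next cluster
        labels[i] = c  # add P to cluster C
        seeds = list(neighbors)  # expandCluster
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = c  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = c
            j_neighbors = region_query(X, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # NeighborPts joined with NeighborPts'
    return labels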

 

Let's implement the algorithm.

I will be using Python's sklearn.cluster.DBSCAN, which performs DBSCAN clustering from a vector array or distance matrix. I will be using my sample data that was generated from Gaussian blobs; the explanation and code can be found in my previous post[2].

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.3, min_samples=10).fit(X)

Parameters



  • eps : The maximum distance between two samples for them to be considered as in the same neighborhood (float, optional).

  • min_samples : The number of samples in a neighborhood for a point to be considered as a core point. This includes the point itself. (int, optional)

  • metric : The metric to use when calculating distance between instances in a feature array. (string, or callable)

  • algorithm : The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors.
    ({‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional)

  • leaf_size : Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required. (int, optional, default = 30)
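For reference, a call that spells out all of these parameters explicitly might look like the following; the values are illustrative, not tuned:

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.3, min_samples=10, metric='euclidean',
            algorithm='auto', leaf_size=30)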


Attributes



  • core_sample_indices_ : Indices of core samples. (array, shape = [n_core_samples])

  • components_ : Copy of each core sample found by training. (array, shape = [n_core_samples, n_features])

  • labels_ : Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1. (array, shape = [n_samples])
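After fitting, these attributes can be inspected directly (assuming X is the sample data from the previous post):

db = DBSCAN(eps=0.3, min_samples=10).fit(X)
print(db.core_sample_indices_[:5])  # indices of the first few core samples
print(db.components_.shape)         # (n_core_samples, n_features)
print(set(db.labels_))              # cluster labels; -1 marks noise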

Memory complexity



  • This scikit-learn implementation bulk-computes all neighborhood queries, which increases the memory complexity to O(n·d), where d is the average number of neighbors, while the original DBSCAN had memory complexity O(n).

Methods



  • fit(X[, y, sample_weight]) :    Perform DBSCAN clustering from features or distance matrix.

  • fit_predict(X[, y, sample_weight]): Performs clustering on X and returns cluster labels.

  • get_params([deep]) : Get parameters for this estimator.

  • set_params(**params) : Set the parameters of this estimator.
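Note that fit_predict is just a shortcut, so the two lines below should produce the same labels:

labels = DBSCAN(eps=0.3, min_samples=10).fit(X).labels_
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)  # equivalent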


# Compute DBSCAN
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics

# X and labels_true come from the sample data generated in the previous post.
db = DBSCAN(eps=0.3, min_samples=10).fit(X)

# zeros_like returns an array of zeros with the same shape as the given
# array; dtype=bool overrides the data type of the result.
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)

# core_sample_indices_ holds the indices of the core samples
# (array, shape = [n_core_samples]).
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

# Print results
print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))


The DBSCAN results are evaluated with the following measures:



  • Homogeneity
    A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class.

  • Completeness
    A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.

  • V-measure[3]
    The V-measure is the harmonic mean of homogeneity and completeness (a toy example follows this list):
    v = 2 * (homogeneity * completeness) / (homogeneity + completeness)

  • Adjusted Rand Index
    The Rand Index computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned to the same or to different clusters in the predicted and true clusterings. The Adjusted Rand Index is the Rand Index adjusted for chance.

  • Adjusted Mutual Information (AMI)[4]
    It is an adjustment of the Mutual Information (MI) score to account for chance. It accounts for the fact that the MI is generally higher for two clusterings with a larger number of clusters, regardless of whether there is actually more information shared.

  • Silhouette Coefficient[5]
    The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that the Silhouette Coefficient is only defined if 2 <= n_labels <= n_samples - 1.
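As a quick sanity check of the harmonic-mean relation given above, here is a toy example with made-up label vectors:

from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 2, 2, 2]  # clusters are pure, but class 0 is split

h = metrics.homogeneity_score(labels_true, labels_pred)
c = metrics.completeness_score(labels_true, labels_pred)
v = metrics.v_measure_score(labels_true, labels_pred)
print(h, c, v)
assert abs(v - 2 * h * c / (h + c)) < 1e-9  # v is the harmonic mean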

Sample outcome from the data


[Image: console output listing the estimated number of clusters and the metric scores]


 


Charting for DBSCAN


Here is a simple chart to show the DBSCAN results: noise points are drawn in black, and the points of each cluster get their own color.


import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    # Core points of the cluster, drawn with larger markers.
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)

    # Non-core (border) points of the cluster, drawn smaller.
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

[Image: scatter plot of the resulting clusters; core points large, border points small, noise in black]


Here is the full code:


import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs

# Generate sample data: 2-D Gaussian blobs around four centers
centers = [[5, 5], [0, 0], [1, 5], [5, -1]]
X, labels_true = make_blobs(n_samples=500, n_features=2, centers=centers,
                            cluster_std=0.9, center_box=(1, 10.0),
                            shuffle=True, random_state=0)

# Compute DBSCAN
db = DBSCAN(eps=0.5, min_samples=10).fit(X)

# zeros_like returns an array of zeros with the same shape as the given
# array; dtype=bool overrides the data type of the result.
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)

# core_sample_indices_ holds the indices of the core samples
# (array, shape = [n_core_samples]).
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

# Print results
print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f" % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f" % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))

# Draw the chart
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

Gist: https://gist.github.com/Madhuka/b1d55804bb5d23c94103


 


References


[1] Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). Simoudis, Evangelos; Han, Jiawei; Fayyad, Usama M., eds. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231. ISBN 1-57735-004-9. CiteSeerX: 10.1.1.71.1980.


[2] http://madhukaudantha.blogspot.com/2015/04/scikit-learn-to-generate-isotropic.html


[3] Rosenberg, Andrew, and Julia Hirschberg. "V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure." EMNLP-CoNLL. Vol. 7. 2007.


[4] Vinh, Nguyen Xuan, Julien Epps, and James Bailey. "Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance." The Journal of Machine Learning Research 11 (2010): 2837-2854.


[5] Rousseeuw, Peter J. "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis." Journal of computational and applied mathematics 20 (1987): 53-65.
