Tuesday, April 7, 2015

scikit-learn to generate isotropic Gaussian blobs

Scikit-learn is an open source machine learning library for the Python programming language. It features various classification, regression and clustering algorithms, such as support vector machines, logistic regression, naive Bayes, random forests, gradient boosting, k-means, DBSCAN, decision trees, Gaussian processes, manifold learning, Gaussian mixture models, nearest neighbors and semi-supervised classification, along with tools for model selection and feature selection.

I have been working with these for my Big Data research over the last few weeks and thought I would share what I learned. I am planning to cover some of them on my blog. Before that, we need some sample data. So here I will go through generating isotropic Gaussian blobs [1] for clustering, using 'sklearn.datasets.make_blobs'.

Returns:
X : array of shape [n_samples, n_features] (the generated samples)

y : array of shape [n_samples] (the integer labels for cluster membership of each sample)

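# make_blobs can be imported directly from sklearn.datasets;
# the older sklearn.datasets.samples_generator path is gone in newer releases
from sklearn.datasets import make_blobs

# Generate blobs with the default parameters
X, y = make_blobs()

print(X.shape)
print(y)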


The output will be:


image
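Since the output above survives only as a picture, here is a minimal sketch of what the default call returns. It assumes a scikit-learn release where make_blobs is importable from sklearn.datasets and where the defaults listed in this post apply (100 samples, 2 features, 3 centers). The coordinates are random, so only the shape and the set of labels are worth checking.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Default call: no arguments, so the library defaults are used
X, y = make_blobs()

# X holds one row per sample and one column per feature
print(X.shape)

# y holds one integer cluster label per sample
print(np.unique(y))
```

Under those assumptions the shape should be (100, 2) and the labels should be 0, 1 and 2.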


Parameters:   



  • n_samples : The total number of points, divided equally among the clusters (int, default=100)

  • n_features : The number of features for each sample (int, default=2)

  • centers : The number of centers to generate, or the fixed center locations (int or array of shape [n_centers, n_features], default=3)

  • cluster_std : The standard deviation of the clusters (float or sequence of floats, default=1.0)

  • center_box : The bounding box for each cluster center when the centers are generated at random (pair of floats (min, max), default=(-10.0, 10.0))

  • shuffle : Whether to shuffle the samples (boolean, default=True)

  • random_state : If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random (int, RandomState instance or None, default=None)
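Two of these parameters are easy to check in code: n_samples is split evenly across the clusters, and a fixed random_state gives the same data on every run. The snippet below is a small sketch of both checks, assuming make_blobs is importable from sklearn.datasets. The names same_data and counts are just illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Two calls with the same seed should produce identical data
X1, y1 = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)
X2, y2 = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)
same_data = np.array_equal(X1, X2) and np.array_equal(y1, y2)

# Count how many points were assigned to each cluster label
counts = np.bincount(y1)

print(same_data)
print(counts)
```

Because 100 does not divide evenly by 3, the cluster sizes should differ by at most one point.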



from sklearn.datasets import make_blobs

# 6 points in 2 dimensions, spread over 5 randomly placed centers
X, y = make_blobs(n_samples=6, n_features=2, centers=5, cluster_std=1.0, center_box=(1, 10.0), shuffle=True, random_state=0)


Play Time!!!


Here you will get an array of shape (6, 2); some Python 2 setups print it as (6L, 2L).


To get a better idea of the generated data, we can use Matplotlib [3]. Add the lines below to the Python code to see how the points are distributed over the X and Y axes.



import matplotlib.pyplot as plt

# Plot the training points
plt.scatter(X[:, 0], X[:, 1])
plt.xlabel('X axis')
plt.ylabel('Y axis')

plt.show()


Here is how our data looks:


image


Let's increase our sample size to 1000 and distribute the points over 4 centers with a standard deviation of 0.5.


from sklearn.datasets import make_blobs

# Fixed center locations, one [x, y] pair per cluster
centers = [[2, 2], [8, 9], [9, 5], [3, 9]]
X, y = make_blobs(n_samples=1000, n_features=2, centers=centers, cluster_std=0.5, center_box=(1, 10.0), shuffle=True, random_state=0)

print(X.shape)
print(y)

import matplotlib.pyplot as plt

# Plot the training points
plt.scatter(X[:, 0], X[:, 1])
plt.xlabel('X axis')
plt.ylabel('Y axis')

plt.show()

image


I want the data points to spread out more around each cluster center, so I increase the standard deviation to 0.9.


image 
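The difference between a standard deviation of 0.5 and 0.9 can also be measured rather than judged by eye. The sketch below regenerates the same four centers at both settings and computes how far the points sit from their own cluster center. The helper name measured_spread is hypothetical, and the snippet assumes the labels in y follow the order of the centers list.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same four fixed centers as in the example above
centers = np.array([[2, 2], [8, 9], [9, 5], [3, 9]])

def measured_spread(cluster_std):
    """Empirical standard deviation of the points around their own center."""
    X, y = make_blobs(n_samples=1000, n_features=2, centers=centers,
                      cluster_std=cluster_std, random_state=0)
    # Subtract each point's cluster center, then pool all coordinates
    residuals = X - centers[y]
    return residuals.std()

spread_tight = measured_spread(0.5)
spread_wide = measured_spread(0.9)

print(spread_tight)
print(spread_wide)
```

Both numbers should land close to the requested cluster_std, with the second clearly larger than the first.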


To make the plot easier to read, we can give each cluster its own color.


# One color per cluster label (up to 7 clusters)
colors = ['r','g','b','c','k','y','m']
c = []
# Look up the color for each sample's cluster label
for i in y:
    c.append(colors[i])
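The loop above can also be written as a single NumPy indexing step. This is only an alternative sketch, not something the rest of the post depends on. The label array here is a small hand-written stand-in for the y returned by make_blobs.

```python
import numpy as np

# Stand-in labels for illustration; in the post these come from make_blobs
y = np.array([0, 2, 1, 0, 3])

# Same color codes as above, stored as an array so it can be indexed by y
colors = np.array(['r', 'g', 'b', 'c', 'k', 'y', 'm'])

# Indexing with the label array picks one color per sample
c = colors[y]

print(c)
```

The resulting array can be passed to plt.scatter through its c argument in the same way as the list built by the loop.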

Finally, your code will look like this:


# Adding sample data
from sklearn.datasets import make_blobs
centers = [[2, 2], [8, 9], [9, 5], [3, 9], [4, 4], [0, 0], [2, 5]]
X, y = make_blobs(n_samples=5000, n_features=2, centers=centers, cluster_std=0.9, center_box=(1, 10.0), shuffle=True, random_state=0)

# Print our generated sample data
print(X[:, 0])
print(y)

# Drawing a chart for our generated dataset
import matplotlib.pyplot as plt

# Set colors for the clusters
colors = ['r','g','b','c','k','y','m']
c = []
for i in y:
    c.append(colors[i])

# Plot the training points
plt.scatter(X[:, 0], X[:, 1], c=c)
plt.gray()
plt.xlabel('X axis')
plt.ylabel('Y axis')

plt.show()

image


Try a few variations:


image

image


 




[1] http://madhukaudantha.blogspot.com/2014/10/gaussian-function.html


[2] http://madhukaudantha.blogspot.com/2015/03/basic-functionality-of-series-or.html


[3] http://matplotlib.org/examples/index.html#examples-index

1 comment:

  1. Nice Madhuka! Stumbled on it searching for "isotropic Gaussian blobs". Nuwan Waidyanatha
