
K-Means Clustering


Clustering is a type of Unsupervised Learning in which you find patterns in the data you are working with (in shape, size, color, etc.) that can be used to group data items into clusters. This method is commonly used on smaller datasets to model and analyze the data.





Clusters


Basically, there are 3 types of Clustering :

(i)  Exclusive Clustering

(ii) Overlapping Clustering

(iii) Hierarchical Clustering


(i) Exclusive Clustering :


Exclusive Clustering is hard clustering, in which each data point or item belongs to exactly one cluster. An example of this kind of clustering is K-Means Clustering.

hard_Clusters


Both of these clusters are completely separate and share no data points with each other.

(ii) Overlapping Clustering :


Overlapping Clustering is soft clustering, in which a data point or item can belong to multiple clusters at once. An example of this kind of clustering is Fuzzy C-Means Clustering.

Soft_Clusters


As you can see, some of the blue data points overlap the red data points.

(iii) Hierarchical Clustering :


When clusters have a parent-child relationship, forming a tree-like structure, it is Hierarchical Clustering.
In Hierarchical Clustering, there are 2 types of approach :

(a) Agglomerative :


This follows the Bottom-Up approach: every data point starts as its own cluster, and the closest clusters are merged successively as we move up the hierarchy.

Bottom-up


(b) Divisive :


Divisive Clustering is a Top-Down approach: we begin with the whole set as a single cluster and proceed to divide it into successively smaller clusters.

Top-Down
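For the curious, here is a minimal sketch of the agglomerative (Bottom-Up) approach described above, using SciPy; the toy points and the choice of Ward linkage are made up purely for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five made-up 2-D points
X = np.array([[1, 2], [1, 4], [5, 8], [6, 8], [9, 1]])

# Agglomerative (Bottom-Up) clustering: Ward linkage repeatedly merges the closest clusters
Z = linkage(X, method='ward')

# Cut the merge tree so that we end up with 2 flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)      # cluster label for each of the 5 points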


Out of these several types of clustering, our main focus today will be on K-Means Clustering.

Our Agenda for this Blog post :

1.)  What is K-Means Clustering?

2.)  How does K-Means Clustering work?

3.)  Applications of K-Means Clustering

4.)  Advantages and Disadvantages of K-Means Clustering

5.)  K-Means Clustering Example  


1.) What is K-Means Clustering?


K-Means Clustering is an Unsupervised Learning algorithm; unlike in Supervised Learning, we don’t have any labeled data. The number of groups or clusters is represented by “K”. The algorithm then runs iteratively, assigning each data point to one of the “K” groups based on the features provided.
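To give a quick taste before the theory, here is a minimal sketch using scikit-learn's KMeans on a few made-up points (the data and parameter values here are purely illustrative):

import numpy as np
from sklearn.cluster import KMeans

# Six made-up 2-D points that form two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)   # K = 2
labels = km.fit_predict(X)                             # one cluster index per point
print(labels)   # e.g. [1 1 1 0 0 0] : each point belongs to exactly one cluster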


2.) How does K-Means Clustering Work?


Clusters


Let’s say we have a data set like this, where the X and Y axes represent 2 different features, and we want to identify clusters in it. When the data set is given to us, we have no information on any target variable, so we don’t actually know what we are looking for; all we are trying to do is identify some structure in this data set.

Clusters_division


One way of looking at this is as the 2 clusters C1 and C2: just by visual examination we can say that this data set has these 2 clusters, and K-Means helps us identify them. As I mentioned earlier, “K” in K-Means clustering is a free parameter; we have to decide the value of “K” before we run the algorithm. As you can see in the image here, the value of our “K” is 2.

Now there are 6 essential steps in K-Means :

(a) Identify the points :


The very first step is to pick “K” random points, which we consider the centers (centroids) of the clusters.

NOTE: “K” should be equal to the number of clusters, and each cluster should have exactly one centroid.

Cluster_centroids


Start with “K” centroids by putting them at random places on the 2D plane; here K = 2, so the number of random points generated is 2.

(Start with K centroids by putting them at a random place, Here K = 2)
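In code, this step is just picking K starting centroids. A minimal NumPy sketch (X is assumed to be an (n, 2) array of our data points; choosing random data points as the starting centroids is a common variant of placing them anywhere on the plane):

import numpy as np

rng = np.random.default_rng(0)

def init_centroids(X, K):
    # Step (a): pick K distinct data points at random as the starting centroids
    idx = rng.choice(len(X), size=K, replace=False)
    return X[idx]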

(b) Identify the Distance :


The second step is to identify the distance of each data point from the centroids.
Looking at the diagram below: if a point is closer to Centroid 1 (our red star), we say that point belongs to the red cluster, and if it is closer to Centroid 2 (our blue star), we say it belongs to the blue cluster. A simple way of visualizing this split is to draw 2 lines: one line between the red and blue centroids, and a second line perpendicular to the first, drawn exactly at the midpoint between the two centroids.

Centroids_division


Dividing the plane on both sides of that line produces 2 clusters: the red cluster and the blue cluster. Congratulations, we now have 2 imperfect, clunky clusters. Next we will start improving them, making them better at every stage. The way we do that is by first adjusting the centroids, the red and blue stars from which we created the 2 clusters. Focus on the red cluster: look at all the points that belong to it and find their center of gravity; then move Centroid 1 (our red star) to that center. We do the same for the blue cluster and Centroid 2 (our blue star).

point_division


(Compute the distance of every point from centroid and cluster them accordingly)
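The perpendicular bisector is the geometric picture; in code we simply assign every point to its nearest centroid, which is equivalent. Continuing the NumPy sketch from step (a):

def assign_clusters(X, centroids):
    # Step (b): distance from every point to every centroid, shape (n, K)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)     # index of the nearest centroid for each point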


(c) Adjust the Centroids :


centroids_position


So basically this is what we get when we make the adjustment. Now we repeat the same process again: we recompute the distance of each of these points from the centroids, the red star and the blue star. If a point is nearer to the red star (Centroid 1), it belongs to the red cluster, and if it is closer to the blue star (Centroid 2), it belongs to the blue cluster.

(Adjust centroids so that they become center of gravity for the given cluster)
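The "center of gravity" of a cluster is just the mean of its points, so in code the adjustment step is a per-cluster mean. Continuing the sketch (this assumes no cluster ends up empty):

def update_centroids(X, labels, K):
    # Step (c): move each centroid to the mean (center of gravity) of its cluster
    return np.array([X[labels == k].mean(axis=0) for k in range(K)])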

(d) Re-Cluster every point :


Then we repeat the earlier step once again, i.e. we again draw a line between the two centroids and a perpendicular line at its midpoint.

centroids_position_division


Now notice that some points which once belonged to the blue cluster now belong to the red cluster, because they are now nearer to the red star, our red centroid.

(Again re-cluster every point based on their distance with centroid)

(e) Adjust Again :


Now we keep repeating this process: recalculate the centroids, recalculate the distance of each individual data point from those centroids, and readjust the clusters, until none of the data points changes its cluster.
(Again Adjust centroids)

(f) Recompute clusters :


Now if you look at the image above, there is only one point, Pn, that changes its cluster: it was in the blue cluster and is now in the red cluster. Notice that after this, no more adjustments to our clusters are possible.

centroids_clusters


Hence we can say that these are our final clusters.

(Recompute clusters and repeat this till data points stop changing clusters)

So those are the 6 steps of K-Means.
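Putting the 6 steps together, a minimal from-scratch K-Means loop might look like the sketch below, reusing the helper functions sketched above (scikit-learn's KMeans does the same job, with smarter initialization and many optimizations):

def kmeans(X, K, max_iters=100):
    centroids = init_centroids(X, K)                 # step (a): random starting centroids
    labels = assign_clusters(X, centroids)           # step (b): first clustering
    for _ in range(max_iters):
        centroids = update_centroids(X, labels, K)   # steps (c)/(e): adjust centroids
        new_labels = assign_clusters(X, centroids)   # steps (d)/(f): re-cluster every point
        if np.array_equal(new_labels, labels):       # stop when no point changes cluster
            break
        labels = new_labels
    return labels, centroids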

Now the most important point here is that we need to supply “K” to our algorithm. I know you have this question: what is the optimal value of “K”? Here we have a 2-dimensional space, but in reality we will have many features, and it is not easy (I’d rather say it is difficult) to visualize that data on a scatter plot. So which “K” should we start with?

Your wait for the answer is over, guys. There is a technique called the Elbow Method which will help us find the optimal value of “K”.

Q.) What is the Elbow Method and how do we implement it?


Before diving directly into the Elbow Method, let’s say we took our data set, drew a scatter plot, showed it to different individuals, and asked them how many clusters they see in it.
Some might say they see 4 clusters in this scatter plot.


four_centroids_clusters

Some individuals might say there are 6 clusters in this scatter plot.

six_centroids_clusters


Notice that different individuals have different opinions. Our job is to find the optimal value of “K” for our scatter plot, and this is where the Elbow Method comes to our rescue.

Elbow Method :


Elbow_organ


NOTE: In K-Means, if we minimize the distance between points within a cluster, we are automatically maximizing the distance between the clusters.

Now what we do here is initially take some value of “K”, say K = 2, and compute the Sum of Squared Errors (SSE). That means, for each of the clusters, we compute the distance of each individual data point from its centroid, square it, and then sum it all up.

SSE1_and_SSE2


So for the red cluster, we get Sum of Squared Error 1 (SSE1). Similarly, for the second cluster, we get Sum of Squared Error 2 (SSE2). We do this for all of our clusters, and in the end we get the total sum of squared errors:

SSE = SSE1 + SSE2 + SSE3 + … + SSEk

The squaring step is essential because it removes negative values, which makes the calculation straightforward.
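To make the definition concrete, here is a minimal sketch of computing the SSE by hand for given labels and centroids, assuming X is a NumPy array as in the sketches above (with scikit-learn, the fitted model exposes this same quantity as km.inertia_, which we use later):

def sse(X, labels, centroids):
    # Total SSE: sum over clusters of squared distances of points to their centroid
    total = 0.0
    for k, c in enumerate(centroids):
        diffs = X[labels == k] - c
        total += (diffs ** 2).sum()    # squaring removes the negative signs
    return total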
Now that we have computed the SSE for K = 2, we repeat the same process for K = 3, 4, ..., n (n: any number you want).
Once we have those numbers, we draw a plot.

Elbow


In this graph, on the X-axis we have taken the value of “K” from 1 to 10, and on the Y-axis we have the SSE. Now we join all the points.
You can see that as we increase the number of clusters, the error decreases. This is intuitive if you think about it: at the extreme, every data point becomes its own individual cluster, and the SSE becomes almost zero.

Now the important thing here is to find out the elbow.

Elbow_point


Now the elbow in this graph is at 3. The point where the graph’s curve suddenly changes shape, bending like an elbow, is actually our best value of “K”.

Elbow_graph


Now here is a little code for that :


from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

K_range = range(1,10)
SSE = []                                 # SSE for K=1,2,3,... will be saved here
for K in K_range:
    km = KMeans(n_clusters=K)
    km.fit(df[['Age','Income']])         # fit on our two features (df is our dataframe)
    SSE.append(km.inertia_)              # inertia_ is scikit-learn's SSE for the fit
SSE


# Plot into chart
plt.xlabel('K')
plt.ylabel('SSE')
plt.plot(K_range,SSE)

Output :


Elbow_graph_example

Don't worry if you didn't get everything in this code; I have explained it below in the practical section.



3.) Applications of K-Means Clustering



K-Means clustering is used in a variety of domains, such as :


(i)   Marketing

(ii)  Biology

(iii) Entertainment

(iv) Academic Performance

(v)  Search Engine 



(i) Marketing :


Marketing


Amazon uses clustering in its popular product recommendation system. For example, let’s say you want to purchase sports shoes: when you type “sports shoes” in Amazon’s search bar, you will see all kinds of sports shoes of different brands and with different features. This is where clustering comes in; it shows you the cluster of products that you want.


(ii) Biology :


Biology

Biologists use various features to separate different types of creatures into mammals, birds, reptiles, plants, etc. Mammals, for example, generally do not lay eggs and give birth to their young directly, whereas birds and reptiles are species which lay eggs.


(iii) Entertainment :


Entertainment


The best example in the entertainment industry is Netflix: it recommends movies based on your previous watch history.


(iv) Academic Performance :


Education


Based on exam scores, students are categorized into different categories such as A, B, C, D.


(v) Search Engine :


Google_search


Clustering forms part of the backbone of search engines. When a search is performed, the results need to be grouped together according to the user's requirement.


4.) Advantages and Disadvantages of K-Means Clustering :


pros_and_cons_of_kmeans



5.) K-Means Clustering Example :


The problem we are going to solve is this: we have a data set containing the age and income of different people. By clustering these data points into various groups, we want to find out some characteristics of those groups. Maybe a group belongs to a particular region in America where the salaries are higher or lower, or to a certain profession where the salaries are higher versus lower.

So here is our Dataset :



# Import all the essential libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


# Import our data set

data = pd.read_csv(r"D:\MY_ML\income.csv")   # raw string so backslashes aren't treated as escapes
data.head()

Output :

head_values



# extracting information from a dataset

data.info()

Output :

values



# Describing the dataset

data.describe().transpose()

Output :

values



# Plotting scatter plot

plt.scatter(data['Age'],data['Income'])

Output :

scatter_plot


As we plot the scatter plot, we can easily see the 3 clusters present in the diagram.



Now let's create a model using the sklearn (scikit-learn) library.

# Create Model

from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)                   # number of clusters is 3
km

Output :


KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)


Run the K-Means algorithm on Age and Income and compute the clusters, where we told our algorithm to identify 3 clusters :

#cluster 1 = 0
#cluster 2 = 1
#cluster 3 = 2

y_pred = km.fit_predict(data[['Age','Income']])   # fit the model and predict a cluster for every row
y_pred

Output :


array([0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2])


# Appending a new column

data['Cluster'] = y_pred
data.head()

Output :

head_values



# Separating those 3 clusters into different dataframe

df1 = data[data.Cluster == 0]
df2 = data[data.Cluster == 1]
df3 = data[data.Cluster == 2]

plt.scatter(df1.Age,df1['Income'],color='red')
plt.scatter(df2.Age,df2['Income'],color='green')
plt.scatter(df3.Age,df3['Income'],color='blue')

plt.xlabel('Age')
plt.ylabel('Income')

Output :



scatter_plot

As we can clearly see, our scatter plot has a little problem. The green cluster looks perfectly fine, but the red and blue ones don't. This problem occurs because our scaling is not right: our Y-axis ranges from 40000 to 160000, while the range of the X-axis is very narrow, hardly 20. When we don't scale our features properly, we can face this kind of problem. That's why we have to do some preprocessing and use MinMaxScaler, which will help us normalize our features.
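For reference, MinMaxScaler rescales each feature into the [0, 1] range using x' = (x - min) / (max - min). A minimal equivalent in plain pandas would be the sketch below (illustration only; run either this or the scaler code that follows, not both, or you would normalize twice):

# Manual min-max normalization, equivalent to MinMaxScaler (illustration only)
for col in ['Income', 'Age']:
    data[col] = (data[col] - data[col].min()) / (data[col].max() - data[col].min())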


# Using MinMaxScaler

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaler.fit(data[['Income']])
data['Income'] = scaler.transform(data[['Income']])

scaler.fit(data[['Age']])
data['Age'] = scaler.transform(data[['Age']])

data.head()

Output :

head_values

Again run the K-Means algorithm on Age and Income, which are now normalized, and compute the clusters, where we told our algorithm to identify 3 clusters :

#cluster 1 = 0
#cluster 2 = 1
#cluster 3 = 2

km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(data[['Age','Income']])
y_predicted

Output :


array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])


Dropping the Cluster column because it was computed on the un-normalized data, and adding a new column, cluster, where the data points are normalized.

data['cluster'] = y_predicted
data.drop('Cluster',axis=1,inplace=True)
data.head()

Output :

head_values


Now that we have normalized our data and created a new column from it, let's plot the graph and see what changes we can see.

df1 = data[data.cluster == 0]
df2 = data[data.cluster == 1]
df3 = data[data.cluster == 2]

plt.scatter(df1.Age,df1['Income'],color='red')
plt.scatter(df2.Age,df2['Income'],color='green')
plt.scatter(df3.Age,df3['Income'],color='blue')

plt.xlabel('Age')
plt.ylabel('Income')

Output :

scatter_plot

Now we can clearly see that our red, green, and blue clusters are properly arranged. Now we are ready to find the centroids. To do that, we have a handy attribute named "cluster_centers_" which is very easy to use.


# Centroids
# (Contains (X,Y) co-ordinates)

km.cluster_centers_

Output :


array([[0.72268908, 0.8974359 ],
[0.1372549 , 0.11633428],
[0.85294118, 0.2022792 ]])


As we have successfully found all the centroids with their respective coordinates, let's plot them. Here "[:,0]" is our first column (Age), "[:,1]" is our second column (Income), and "s" is the marker size.

df1 = data[data.cluster == 0]
df2 = data[data.cluster == 1]
df3 = data[data.cluster == 2]

plt.figure(figsize=(10,8))

plt.scatter(df1.Age,df1['Income'],color='red')
plt.scatter(df2.Age,df2['Income'],color='green')
plt.scatter(df3.Age,df3['Income'],color='blue')

plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],color="black",marker="*",label="centroid",s=300)

plt.xlabel('Age')
plt.ylabel('Income')
plt.legend()

Output :



centroids_scatter_plot

As you can see that "*" this are the center of our clusters. Now it's time to find out our elbow plot method. Basically, this example which we are currently solving is easy but when you will face some wide data set which has 40-50 features it will become messy and hard to understand. So at this point our Elbow plot used to help us. As we saw in theory we go through the number of "K" like we go from 1 to 10 then we try to calculate SSE which is Sum of Square Error and plot them and try to find the Elbow.

# Elbow plot

k_range = range(1,10)                    # define K-range

SSE = []
for k in k_range:
    km = KMeans(n_clusters=k)            # each iteration creates a new model with n_clusters = k
    km.fit(data[['Age','Income']])       # fit on both of our important features
    SSE.append(km.inertia_)              # inertia_ gives us the SSE; append it to our list
SSE

Output :


[5.43401151198818,
2.091136388699078,
0.4750783498553096,
0.3625079900797329,
0.2664030124668416,
0.22020960864009398,
0.17299621932455464,
0.13265419827245162,
0.10383752586603559]


# Plotting our Elbow Graph

plt.xlabel('K')
plt.ylabel('Sum of Square Error')
plt.plot(k_range,SSE)

Output :

Elbow_graph


So as you can see in the above graph, the value of our "K" is 3, and I'd say we made quite a good choice; but with a complex dataset, it's more complicated.


Hey guys, did you like our content? Please leave a comment below, give us any suggestions you have, and don't forget to share this content with your friends.
