K-Means Clustering

Siddhartha Nimmaturi
4 min read · Apr 11, 2020

Clustering algorithms are unsupervised learning algorithms. As the name suggests, they form clusters in the data. They rely on an old idea from geometry: distance. Points that lie close to one another are grouped together, and each point is placed in the cluster whose centre is nearest to it.

In clustering, we group the data into segments: ‘n’ data points are partitioned into ‘k’ clusters (a concept that comes from signal processing, where it is known as vector quantization). Because the data has no labels, we do not know in advance which groups we are looking for. A good clustering minimizes the variation within each cluster while maximizing the variation between clusters.

For example, suppose a company wants to increase brand awareness of a product among its existing and future customers. It could run a single campaign aimed at everyone, but customers have different tastes, so it is better to run several campaigns, each aimed at one segment. Clustering finds these segments from attributes such as age, income, and gender. By targeting each segment separately, the company can achieve better results and profits.

Two widely used types of clustering algorithms are:

i) K-Means Clustering

ii) Hierarchical Clustering

K-Means Clustering

K-Means Clustering is one of the most frequently used clustering algorithms. It uses Euclidean distance to measure how far apart the data points are.
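As a small illustration of the distance measure, here is a minimal sketch using only the standard library. The two customers and the helper name euclidean_distance are made up for this example and are not part of the original code.

```python
import math

# Hypothetical (apparel, beauty) values for two customers
customer_a = (3.0, 4.0)
customer_b = (0.0, 0.0)


def euclidean_distance(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


distance = euclidean_distance(customer_a, customer_b)
print(distance)  # 5.0
```

Python 3.8 and later also ship math.dist, which computes the same quantity.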

K-Means Clustering performs the following steps:

i) Choose the number of clusters, K.

ii) Select K random points as the initial centroids (not necessarily points from our data).

iii) Assign each data point to its closest centroid. This forms K clusters.

iv) Compute the new centroid of each cluster (the mean of its points) and move the centroid there.

v) Reassign each data point to the new closest centroid. If any assignment changed, go back to step iv; otherwise the model is ready.
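The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the code from the linked repository: the function name kmeans, the toy points, and the choice to start from K randomly picked data points are all assumptions made for the example.

```python
import numpy as np


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means loop following steps i) to v)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step ii: start from k distinct data points (an assumption of this
    # sketch; the initial centroids need not come from the data).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps iii and v: assign every point to its closest centroid.
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no assignment changed, so we are done
        labels = new_labels
        # Step iv: move each centroid to the mean of its cluster.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Toy data made up for this example: three small groups of points
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 1.0], [8.2, 1.1], [7.9, 0.9],
    [8.0, 8.0], [8.1, 7.9], [7.8, 8.2],
])
labels, centroids = kmeans(points, k=3)
print(labels)
print(centroids)
```

The result depends on the random starting centroids, and a poor start can end in a poor grouping. Libraries such as scikit-learn reduce this risk by running the algorithm several times and keeping the best run.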

K-Means Clustering Model Implementation

Customer Spends Data

Dataset:- The dataset comes from an online grocery store and records the annual spending (in INR) of 20 customers on apparel and on beauty & healthcare products. It consists of three columns: Customer ID, amount spent on apparel, and amount spent on beauty & healthcare.

Scatter Plot

To visualize the amounts spent on apparel and on beauty & healthcare, we use a scatter plot. The plot clearly shows 3 clusters in how customers spend on the two categories.

Normalizing Features

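Now, we will normalize the data to bring all the values onto a common scale. Without this step, the feature with the larger range would dominate the Euclidean distance. The sklearn.preprocessing module provides scalers for this.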
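A minimal sketch of this step is shown below. It assumes StandardScaler, which rescales each column to zero mean and unit variance; the original code may use a different scaler from sklearn.preprocessing, and the spend values here are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical annual spends in INR: (apparel, beauty & healthcare)
spends = np.array([
    [5000.0, 18000.0],
    [21000.0, 4000.0],
    [19000.0, 17000.0],
    [6000.0, 16000.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(spends)

# After scaling, every column has mean 0 and standard deviation 1
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```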

Creating Clusters using the Normalized features

K-Means Clusters

The clustering code creates a new column named clusterid_new, which holds the cluster assigned to each customer. To make the clusters easy to tell apart, we draw each one with its own marker symbol.
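The clustering step could look roughly like the sketch below. Only the column name clusterid_new comes from this article; the synthetic 20-customer data, the other column names, and the KMeans settings are assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 20-customer dataset (not the real data)
rng = np.random.default_rng(42)
centers = [(5000, 20000), (20000, 5000), (20000, 20000)]
sizes = [7, 7, 6]
rows = np.vstack([
    rng.normal(loc=center, scale=800, size=(n, 2))
    for center, n in zip(centers, sizes)
])
customers = pd.DataFrame(rows, columns=["apparel", "beauty_and_healthcare"])

# Scale the two features, then fit K-Means with 3 clusters
features = customers[["apparel", "beauty_and_healthcare"]]
scaled = StandardScaler().fit_transform(features)
model = KMeans(n_clusters=3, n_init=10, random_state=42)

# Store the cluster label of every customer in a new column
customers["clusterid_new"] = model.fit_predict(scaled)
print(customers["clusterid_new"].value_counts())
```

The cluster numbers themselves (0, 1, 2) are arbitrary labels, so which group receives which number can differ from run to run unless random_state is fixed.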

How Many Clusters to Select?

There are two main methods for finding the number of clusters: the Dendrogram and the Elbow Method. I have chosen the elbow method.

Elbow Method:-

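The Elbow Method is used to select a suitable number of clusters for K-Means clustering. We run K-Means for a range of values of K and plot the within-cluster sum of squares against K. The resulting line chart looks like a bent arm, and the point where it bends (the elbow) indicates the number of clusters, because adding more clusters beyond that point brings only a small improvement.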

Elbow Method

From this elbow diagram, we can see that the elbow is at 3 clusters. This agrees with the scatter plot above, where we already saw 3 clusters.
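The numbers behind an elbow chart can be computed as in the sketch below. It uses synthetic, already-scaled points with three groups (an assumption for illustration, not the article's data) and reads the within-cluster sum of squares from the inertia_ attribute of scikit-learn's KMeans.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic points with three well-separated groups (made up for this sketch)
rng = np.random.default_rng(0)
centers = [(0.0, 3.0), (3.0, 0.0), (3.0, 3.0)]
points = np.vstack([
    rng.normal(loc=center, scale=0.2, size=(7, 2)) for center in centers
])

inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    # inertia_ is the within-cluster sum of squared distances
    inertias.append(model.inertia_)

for k, value in enumerate(inertias, start=1):
    print(k, round(value, 2))
```

Plotting inertias against K gives the arm-shaped line. On data like this, the value drops sharply up to K = 3 and only slightly afterwards.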

Visualizing K-Means

Visualizing the Data
K-Means with 3 Clusters

The yellow dots represent the centroids. The initial centroids are selected at random and need not be points from our data; the algorithm then moves them until each one settles at the centre of its cluster. Each data point is assigned to the cluster of its nearest centroid.
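To see where the fitted centroids lie, we can read them from the model. The sketch below assumes scikit-learn's KMeans and StandardScaler and uses invented spend values; since the model is fitted on scaled data, the centroids are converted back to INR with inverse_transform.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical annual spends in INR: (apparel, beauty & healthcare)
spends = np.array([
    [5000.0, 19000.0], [5500.0, 21000.0], [4800.0, 20500.0],
    [20000.0, 5000.0], [21000.0, 4500.0], [19500.0, 5200.0],
    [20000.0, 20000.0], [21000.0, 19500.0], [19800.0, 21000.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(spends)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)

# Centroids live in the scaled space; map them back to INR
centroids_scaled = model.cluster_centers_
centroids_inr = scaler.inverse_transform(centroids_scaled)
print(centroids_inr)
```

Each row of centroids_inr is the average spend of the customers in that cluster, which is the position where a centroid marker would be drawn on a plot in original units.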

Interpreting the Results:-

So, from the above diagram, we have 3 clusters, namely Fashionista, Beauty Obsession & Healthcare, and Both. The company can target each of these groups in future campaigns, and it can use the clusters to understand how customers spend and which areas to focus on for better outcomes.

Data with Clusters

Finally, every customer has been assigned to a cluster: 0 for Fashionista, 1 for Beauty Obsession, and 2 for customers who spend on both product categories. In this way, K-Means Clustering is done. Yet there is a lot more to learn in clustering. For example, there is another form, Hierarchical Clustering, in which (in its agglomerative version) each data point starts as its own cluster and the closest clusters are merged step by step.

This is a small dataset; with more data and many clusters, the results become harder to visualize and interpret.

Please check the following link for the full code:

https://github.com/siddhu21/Customer-Spends-K-means/blob/master/Kmeans_customerspends.py

Till then Enjoy Machine Learning 😉😃!!!!!!!
