Clustering Algorithms and Performance Measures
Clustering is an unsupervised machine learning method that is widely used for statistical data analysis across many fields and domains. The task is to divide a set of observations into subsets, called clusters or groups, such that observations in the same group share similar features, and those shared characteristics differentiate one cluster from another.
The K-Means Clustering Algorithm
This algorithm assumes that the total number of clusters within the given data is already known. The method is also referred to as flat clustering, and it is an iterative mechanism. The algorithm works as follows (a from-scratch sketch of these steps appears after the list):
- Define the number of clusters assumed to be present in the data (K)
- Assign each input data point to one of these K clusters
- Iteratively refine the assignments by moving each data point to the cluster whose centroid is closest
- Recompute the cluster centroids on every iteration until they settle into their final, stable locations
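Before moving to the library implementation, a minimal from-scratch sketch can make these steps concrete. This is an illustrative sketch only (the function name kmeans_sketch, the random initialization, and the convergence check are assumptions, not part of scikit-learn):
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    # Step 1: choose K initial centroids at random from the data points
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Steps 2-3: assign every point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it
        # (a centroid that loses all its points simply stays where it is)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Stop once the centroids no longer move between iterations
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids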
# Let us now move towards creating a Clustering Algorithm in Python
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
# Import the KMeans package from Sci-Kit Learn
from sklearn.cluster import KMeans
# Creating randomly generated data in the form of Blobs
# Total Samples = 1000, Total Centers = 6, Standard Deviation of Clusters = 0.40
# Note: in older scikit-learn versions this import lived in sklearn.datasets.samples_generator
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=1000, centers=6, cluster_std=0.40)
# Plot the clusters on a Seaborn Scatterplot
plt.scatter(X[:, 0], X[:, 1], s=25);
plt.show()
>>> [Output: scatter plot of the generated data]
# We see six distinct clusters on the plot
# Let us now use the KMeans model to find the centroids of these clusters
kmeans = KMeans(n_clusters=6)
# Fit the model with the Training data
kmeans.fit(X)
# Run the model on the given data and plot the resulting graph
y_kmeans = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=75, cmap='viridis')
centers = kmeans.cluster_centers_
>>> [Output: the same scatter plot, with each cluster now uniquely colored]
# We can now see each cluster uniquely identified
# Let us verify the implementation by overlaying the centroids
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
plt.show()
>>> [Output: centroids of the six clusters marked in black]
Measuring the Performance of Clustering
Since clustering algorithms are not about predicting a value against a known answer, supervised metrics like the accuracy score do not apply. There are, however, various methods through which the quality of a clustering can be measured. Because real-world data is rarely pre-segregated into neatly labeled sections, drawing such inferences is difficult. We will look at one such method, called Silhouette Analysis, to measure the performance of our clustering algorithm.
Analysis of the silhouette score: The score ranges between [-1, 1]. Its values can be interpreted as follows:
- Score near +1: indicates that the sample is far away from its nearest neighboring cluster, i.e. it is well matched to its own cluster.
- Score of 0: indicates that the sample lies on, or very close to, the decision boundary between two neighboring clusters.
- Score near -1: indicates that the sample has likely been assigned to the wrong cluster.
Formula for Calculation:
silhouette score = (p - q) / max(p, q)
In the above formula,
p represents the mean distance from the given sample to the points in its nearest neighboring cluster, and
q represents the mean distance from the given sample to all other points within its own cluster.
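To make the formula concrete, here is a small illustrative helper that computes the score for a single sample directly from these definitions. The function name silhouette_for_sample is an assumption for this sketch, not a scikit-learn API, and it assumes the sample's own cluster contains at least two points:
import numpy as np

def silhouette_for_sample(X, labels, i):
    # q: mean distance from sample i to the other points in its own cluster
    own = (labels == labels[i])
    own[i] = False
    q = np.linalg.norm(X[own] - X[i], axis=1).mean()
    # p: smallest mean distance from sample i to the points of any other cluster
    p = min(np.linalg.norm(X[labels == c] - X[i], axis=1).mean()
            for c in set(labels) if c != labels[i])
    return (p - q) / max(p, q)
In practice, scikit-learn's metrics.silhouette_score averages this per-sample quantity over the dataset, which is what we use below.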
# Import all required libraries along with Sci-kit learn metrics function
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans
# The metrics function will help us use the silhouette score construct
from sklearn import metrics
# Create the sample space using make_blobs method
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=1500, centers=8, cluster_std=0.60)
# We have created a sample space with 1500 entries and 8 centers
# Create an empty list for the scores and an array of candidate cluster counts
scores = []
values = np.arange(4, 20)
# Iterate through the array and initialize each of the values for clustering with given k value
# 'k-means++' : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence.
for num_clusters in values:
    kmeans = KMeans(init='k-means++', n_clusters=num_clusters, n_init=10)
    kmeans.fit(X)
    # Compute the silhouette score, using the Euclidean distance between points
    score = metrics.silhouette_score(X, kmeans.labels_, metric='euclidean', sample_size=len(X))
    # Print the number of clusters tried and the resulting score
    print("\nNumber of clusters =", num_clusters)
    print("Silhouette score =", score)
    scores.append(score)
>>> Number of clusters = 19
>>> Silhouette score = 0.3294464776999117
# Using numpy’s argmax function, we can now get the number of optimal clusters needed for this sample data
num_clusters = np.argmax(scores) + values[0]
print('\nIdeal number of clusters =', num_clusters)
>>> Ideal number of clusters = 4
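To see where the silhouette peaks, the collected scores can also be plotted against the candidate values of K. This is a small optional addition reusing the variables from the script above:
# Plot the silhouette score for each candidate number of clusters
plt.plot(values, scores, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Silhouette score')
plt.show()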
Grouping Data and Approaching Nearest Neighbors
All classification and clustering problems, at their root, solve problems with data through a grouping mechanism. This mechanism typically involves creating clusters of data and predicting which cluster a new data point will fall into. Along similar lines exists another family of algorithms that identifies the Nearest Neighbors of a data point. The idea behind finding nearest neighbors is to locate the points in the trained data that lie closest to a given input, as the short sketch below illustrates.
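As a brief illustration, scikit-learn's NearestNeighbors class can index a set of training points and return the closest ones to a new input. The toy data below is made up purely for demonstration:
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 2-D training data (illustrative values only)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [5.2, 4.9], [9.0, 1.0]])

# Index the training points and query the 3 closest to a new observation
nn = NearestNeighbors(n_neighbors=3).fit(X_train)
distances, indices = nn.kneighbors([[5.0, 5.0]])
print("Indices of nearest neighbors:", indices[0])
print("Distances:", distances[0])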
- Algorithms that work on data grouping can be used for making predictions where choices are available. For example, recommending a specific type of garment to a user based on their shopping and browsing history. In this scenario, the algorithm uses the user's preferences and history to group together a cluster of choices they are likely to make.
- Another advantage of using a clustering mechanism is that, once the groups are formed, the algorithm easily adapts to newly presented data. Since clusters generalize to various shapes and sizes, handling new data becomes faster and less error-prone.
- On the other hand, since classification narrows every decision down to one of a fixed set of classes, the core of these algorithms is a smaller, predefined set of groups. Therefore, identifying new data is faster.
Coming Up Next
The focus of this chapter was on grouping data through classification and clustering problems. Although the two methodologies follow different approaches, their core idea is the same: segregating data into groups. We also observed various use cases in which these algorithms can be applied.
- Understanding Regression in Machine Learning
- Feature Engineering and Ensemble Modelling in Python
- Improving Performance of Learning Algorithms