Hierarchical Clustering: It’s just the order of clusters!

Sanchita Paul · Published in Analytics Vidhya · Apr 22, 2021

What is Hierarchical Clustering? By definition, it is an unsupervised learning method that builds a hierarchy of similar groups, either from the top down or from the bottom up.

There are two major types of this clustering: Agglomerative and Divisive.

I will explain agglomerative clustering first and then divisive clustering. Agglomerative clustering can be understood very well through heat maps.

Clustering can be visualised in various ways, but I have found heat maps to be the most intuitive.

Heatmaps are available in the seaborn library; they give us a fair understanding of the correlation between variables.
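Here is a minimal sketch of such a heatmap. The DataFrame `data`, its seven measurements, and its four variables are all made-up placeholders; substitute your own measurements.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical data: 7 measurements, each described by 4 variables.
rng = np.random.default_rng(42)
data = pd.DataFrame(
    rng.normal(size=(7, 4)),
    index=[f"measurement_{i}" for i in range(1, 8)],
    columns=[f"var_{j}" for j in range(1, 5)],
)

# Correlation between variables, drawn as a heatmap.
sns.heatmap(data.corr(), annot=True, cmap="viridis")
plt.show()
```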

The first step of this process is to treat each element as a single-point cluster.

In our example, each measurement starts out as its own cluster.

In the second step, we take the two closest data points, i.e. the most similar ones, and merge them into one cluster. This is done by computing the distance between the points.

There are various ways to measure the distance between data points; the choice is somewhat arbitrary and should be made according to what the data requires.

Now, even without calculating any distances, we can rely on the intuition the heat map gives us. Don’t some measurements look very similar to each other in colour?

I am taking the Euclidean distance between the measurements to find the clusters that are most similar (minimum distance). Let me show you one example:

Measurements 6 and 7 will form one cluster because the distance between them is the minimum.

The Euclidean distance can be calculated as ((difference in variable 1)² + (difference in variable 2)² + (difference in variable 3)² + (difference in variable 4)²)^(1/2).

This, as you must have guessed, is the minimum distance, so measurements 6 and 7 form a single cluster.
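As a sketch of this step, we can compute all pairwise Euclidean distances with SciPy and pick out the closest pair. `data` is the hypothetical DataFrame from the heatmap sketch above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# All pairwise Euclidean distances between measurements, as a square matrix.
dists = squareform(pdist(data.values, metric="euclidean"))
np.fill_diagonal(dists, np.inf)  # ignore the zero self-distances

# The pair of measurements with the minimum distance merges first.
i, j = np.unravel_index(np.argmin(dists), dists.shape)
print(f"Closest pair: {data.index[i]} and {data.index[j]} "
      f"(distance = {dists[i, j]:.3f})")
```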

In the same way, we can form more small clusters.

In the third step, we combine these smaller clusters into even bigger clusters, and this continues until we are left with one big cluster.
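With SciPy, this whole bottom-up process is a single call: `linkage` repeats the merge step until one cluster remains. The sketch below reuses the hypothetical `data` from earlier; the `method` argument selects one of the cluster-distance definitions discussed next.

```python
from scipy.cluster.hierarchy import linkage

# Each row of Z records one merge:
# (cluster a, cluster b, distance between them, size of the new cluster).
Z = linkage(data.values, method="average", metric="euclidean")
print(Z)
```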

The types of distances between clusters (linkage methods):

1. Single Linkage: the shortest distance between any two points in clusters “r” and “s”. It comes with the problem of not being able to separate clusters properly when there is noise, and it can prematurely merge groups with a few close pairs even if those groups are dissimilar overall.

2. Complete Linkage: the longest distance between any two points in clusters “r” and “s”. This tends to be biased towards globular clusters and can break up large clusters.

3. Average Linkage: the distance between two clusters is defined as the average distance between all pairs of points, one from “r” and one from “s”. This is also biased towards globular clusters.

4. Ward’s Method: this is similar to average linkage, except that it works with the sum of squared distances between points Pi and Pj.

Mathematically, it is written as:

sim(C1, C2) = ∑ (dist(Pi, Pj))² / (|C1| ∗ |C2|)

where Pi ranges over the points of C1 and Pj over the points of C2.
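As a sketch for intuition, the formula above can be transcribed directly into Python. This is not SciPy’s internal implementation, just a literal reading of the expression:

```python
import numpy as np

def sim(C1, C2):
    # Average squared Euclidean distance between the points of two clusters,
    # as in the formula above: the sum of dist(Pi, Pj)^2 over all pairs,
    # divided by |C1| * |C2|.
    C1, C2 = np.asarray(C1), np.asarray(C2)
    diffs = C1[:, None, :] - C2[None, :, :]   # all pairwise differences
    sq_dists = (diffs ** 2).sum(axis=-1)      # squared Euclidean distances
    return sq_dists.sum() / (len(C1) * len(C2))

# Two tiny made-up clusters of 2D points.
print(sim([[0, 0], [1, 0]], [[3, 4], [4, 4]]))  # 25.5
```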

How Should We Choose the Number of Clusters in Hierarchical Clustering?

A dendrogram is a tree-like diagram that records the sequence of merges or splits. The longer the vertical lines in the dendrogram, the greater the distance between the clusters they join.

We can set a threshold distance and draw a horizontal line at that height (generally, we try to set the threshold so that it cuts the tallest vertical line).
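Here is a sketch of this in code, reusing the hypothetical `Z` linkage matrix and `data` from the earlier examples; the threshold of 2.5 is an arbitrary illustrative value.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster

# Draw the dendrogram and the horizontal cutting line.
dendrogram(Z, labels=list(data.index))
plt.axhline(y=2.5, color="red", linestyle="--")
plt.show()

# Every merge above the threshold is cut, yielding the final clusters.
labels = fcluster(Z, t=2.5, criterion="distance")
print(labels)  # cluster id for each measurement
```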


Hi, I am Sanchita: an engineer, a math enthusiast, an AlmaBetter Data Science trainee, and a writer at Analytics Vidhya.