How to Perform Hierarchical Clustering in RStudio

Aug 2, 2023

Key Points

Hierarchical clustering is a type of unsupervised learning that groups observations based on their similarity or dissimilarity without specifying the number of clusters beforehand.
To perform hierarchical clustering in RStudio, you must install and load two packages: factoextra and cluster. Then, you need to scale your data using the scale() function and perform hierarchical clustering using the agnes() function from the cluster package.
To visualize and interpret your clustering results, you can use a dendrogram, a tree-like diagram showing how the clusters are nested within each other. You can plot a dendrogram using the fviz_dend() function from the factoextra package.
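
The key points above can be sketched as a short script. This is a minimal illustration rather than a definitive recipe: it assumes the built-in USArrests dataset stands in for your own data, and the choices of complete linkage and k = 4 are arbitrary. If factoextra is not installed, the sketch falls back to a base R plot.

```r
# Minimal sketch of the workflow: scale, cluster, plot.
# Assumption: USArrests (shipped with R) stands in for your own data.
library(cluster)

df <- scale(USArrests)                # standardize each column (mean 0, sd 1)
hc <- agnes(df, method = "complete")  # agglomerative clustering, complete linkage

# fviz_dend() comes from factoextra; use a base R dendrogram if it is missing.
if (requireNamespace("factoextra", quietly = TRUE)) {
  print(factoextra::fviz_dend(hc, k = 4))  # k = 4 is an illustrative choice only
} else {
  plot(as.hclust(hc), main = "Dendrogram (base R fallback)")
}
```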

Hierarchical clustering is a type of unsupervised learning, meaning you don’t need to have predefined labels or categories for your data. Instead, you let the algorithm discover the structure and patterns in your data by grouping observations based on their similarity or dissimilarity.

One of the advantages of hierarchical clustering is that you don’t need to specify the number of clusters beforehand, unlike other methods, such as k-means clustering. Instead, you can use a graphical representation called a dendrogram to visualize the hierarchy of clusters and decide how many clusters you want to use based on your analysis goals.
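
As a rough illustration of choosing the number of clusters after the tree is built, the sketch below cuts one tree at several candidate values of k. The dataset (USArrests), the linkage method, and the range 2 to 4 are assumptions made for the example.

```r
library(cluster)

df <- scale(USArrests)            # assumption: built-in data as a stand-in
hc <- agnes(df, method = "ward")

# cutree() expects an hclust-style tree, so convert the agnes result first.
tree <- as.hclust(hc)

# One column of cluster labels per candidate k; the tree is built only once.
labels_by_k <- cutree(tree, k = 2:4)
head(labels_by_k)

# Cluster sizes for one candidate.
table(cutree(tree, k = 4))
```

Because the hierarchy is computed once, trying another k only means cutting the same tree at a different level.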

In this tutorial, you will learn:

What hierarchical clustering is and how it works
How to perform hierarchical clustering in RStudio using the agnes() function from the cluster package
How to choose the best method for measuring the distance between clusters
How to plot and interpret a dendrogram
How to cut the dendrogram at different levels to obtain different numbers of clusters
How to evaluate the quality of clusters using various metrics

What is Hierarchical Clustering, and How Does It Work?

The basic idea of hierarchical clustering is to start with each observation as its own cluster and then repeatedly merge the two most similar clusters until all observations are in one big cluster. The result is a tree-like structure that shows how the clusters are nested within each other.
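
To make the merging process concrete, here is a small sketch on five hypothetical one-dimensional points. It uses base R's hclust() rather than agnes() only because it needs no extra packages; the point values and names are invented for illustration.

```r
# Five hypothetical 1-D points: two tight pairs and one distant point.
x <- c(a = 1.0, b = 1.2, c = 5.0, d = 5.3, e = 9.0)

hc <- hclust(dist(x), method = "complete")

# Each row of $merge is one fusion step (n - 1 rows for n observations).
# Negative entries are single observations; positive entries refer to a
# cluster formed in an earlier row.
hc$merge

# Dissimilarity at which each fusion happened.
hc$height
```

The closest pairs are fused first, and the final row joins the last two clusters into the single cluster that contains every observation.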

There are two main steps in hierarchical clustering:

Calculate the pairwise dissimilarity between each pair of observations in the dataset. Choose a distance metric that suits your data type and analysis objectives. For example, you can use Euclidean distance for continuous numerical data or Jaccard distance for binary data, such as presence/absence indicators.
Fuse observations into clusters. You need to choose a method for determining how close two clusters are and which ones to merge at each step. Several methods are available, such as:
Complete linkage: Use the maximum distance between two observations from different clusters as the cluster distance.
Single linkage: Use the minimum distance between two observations from different clusters as the cluster distance.
Average linkage: Use the average distance between all pairs of observations from different clusters as the cluster distance.
Centroid linkage: Use the distance between the centroids (mean vectors) of two clusters as the cluster distance.
Ward’s method: Use the increase in the total within-cluster variance after merging two clusters as the cluster distance.
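
The two steps above can be sketched in R as follows. This is an illustration under stated assumptions: USArrests stands in for your data, the small 0/1 matrix is invented, and the agglomerative coefficient that agnes() reports is only one possible way to compare linkage methods.

```r
library(cluster)

df <- scale(USArrests)   # assumption: built-in data as a stand-in

# Step 1: pairwise dissimilarities (Euclidean for continuous data).
d <- dist(df, method = "euclidean")

# For 0/1 data, base R's "binary" method gives the Jaccard distance.
# Hypothetical presence/absence matrix, for illustration only.
b <- rbind(x = c(1, 1, 0, 0), y = c(1, 0, 1, 0), z = c(0, 0, 1, 1))
jac <- dist(b, method = "binary")
jac

# Step 2: fuse clusters with several linkage methods and compare the
# agglomerative coefficient (values closer to 1 suggest stronger structure).
methods <- c("complete", "single", "average", "ward")
ac <- sapply(methods, function(m) agnes(d, method = m)$ac)
round(ac, 3)

# Centroid linkage is not an agnes() option; hclust() offers it and is
# typically run on squared Euclidean distances.
hc_centroid <- hclust(d^2, method = "centroid")
```

The agglomerative coefficient tends to grow with the number of observations, so it is best used to compare methods on the same dataset rather than across datasets.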

Which method works best depends on your data and analysis goals. For example, complete linkage tends to produce compact clusters of similar diameter, while single linkage tends to produce long, chain-like clusters because it merges on the single closest pair of points.
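
One rough way to see this difference is to cut a complete-linkage tree and a single-linkage tree into the same number of groups and compare the group sizes. The sketch below assumes USArrests as example data and an arbitrary k = 4, and uses base R's hclust() for brevity.

```r
df <- scale(USArrests)   # assumption: built-in data as a stand-in
d  <- dist(df)

hc_complete <- hclust(d, method = "complete")
hc_single   <- hclust(d, method = "single")

# Cut both trees into the same number of groups (k = 4 is arbitrary)
# and compare how the observations are spread across them.
sizes_complete <- table(cutree(hc_complete, k = 4))
sizes_single   <- table(cutree(hc_single,   k = 4))

sizes_complete
sizes_single
```

With single linkage, it is common to see one large cluster alongside a few very small ones, which is the chaining effect described above; how pronounced it is depends on the data.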