What is Hierarchical Clustering?
Hierarchical clustering is a popular unsupervised learning technique used to group similar data points into clusters based on their proximity or similarity. It builds a hierarchy of clusters by recursively merging or dividing clusters until a desired stopping criterion is met. The result is a tree-like structure called a dendrogram, which illustrates the relationships and similarities between the data points.
Here are the key steps and concepts involved in hierarchical clustering:
Distance or Similarity Measurement: A distance or similarity measure is used to quantify how alike two data points are. Common choices include Euclidean distance, Manhattan distance, and cosine similarity (a similarity score, usually converted to a cosine distance before clustering). The choice of measure depends on the nature of the data and the problem at hand.
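The three measures above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming NumPy is available; libraries such as SciPy provide optimized versions of these same functions.

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences.
    return np.sum(np.abs(a - b))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors: 1 = same direction, 0 = orthogonal.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.0, 3.0])
b = np.array([4.0, 0.0])
print(euclidean(a, b))          # 5.0 (3-4-5 triangle)
print(manhattan(a, b))          # 7.0
print(cosine_similarity(a, b))  # 0.0 (the vectors are orthogonal)
```

Note how the two distance measures disagree on how far apart the same pair of points is; this is why the choice of measure affects the resulting clusters.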
Agglomerative and Divisive Clustering: Hierarchical clustering can be performed in two ways: agglomerative and divisive. Agglomerative clustering starts with each data point as a separate cluster and iteratively merges the closest clusters until a stopping criterion is met. Divisive clustering starts with all data points in a single cluster and recursively divides the clusters until the desired number of clusters is obtained.
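The agglomerative procedure can be sketched directly: start with singleton clusters and repeatedly merge the closest pair. The function below is a naive illustrative implementation (quadratic-or-worse, single linkage only, hypothetical names), not a production algorithm.

```python
import numpy as np

def agglomerative_single_linkage(points, n_clusters):
    # Start with each point in its own cluster (lists of point indices).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = (None, None, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between clusters is the
                # minimum pairwise distance across them.
                d = min(np.linalg.norm(points[a] - points[b])
                        for a in clusters[i] for b in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        # Merge the two closest clusters and drop the absorbed one.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
print(agglomerative_single_linkage(pts, 2))  # two clusters: {0, 1} and {2, 3}
```

Divisive clustering runs the other way, starting from one all-inclusive cluster and splitting; it is used less often in practice because good splits are harder to search for than good merges.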
Linkage Methods: During the merging or dividing process, a linkage method determines how the distance or similarity between clusters is calculated. Common linkage methods include:
Single Linkage: Calculates the distance between the closest pair of points in different clusters.
Complete Linkage: Calculates the distance between the farthest pair of points in different clusters.
Average Linkage: Calculates the average distance between all pairs of points in different clusters.
Ward’s Method: Minimizes the increase in the total within-cluster variance when merging clusters.
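SciPy's `scipy.cluster.hierarchy.linkage` implements all four methods, so their effect can be compared on the same data. A small sketch with two well-separated groups of points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two tight groups of 2-D points, far apart from each other.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)

# linkage returns an (n-1) x 4 matrix; each row records one merge:
# (cluster_a, cluster_b, merge_distance, new_cluster_size).
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    # The final row joins the two big groups; its distance depends on the linkage.
    print(method, round(Z[-1, 2], 3))
```

Single linkage reports the smallest merge distance for that final join and complete linkage the largest, with average linkage in between, which matches the definitions above.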
Dendrogram: A dendrogram is a tree-like structure that visualizes the clustering process. It illustrates the order in which clusters are merged or divided and provides insights into the relationships between the data points. The height or distance at which clusters are merged or divided in the dendrogram indicates the dissimilarity between the clusters.
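SciPy can also build the dendrogram. The sketch below extracts the tree layout as data (via `no_plot=True`); in an interactive session you would omit that flag and let matplotlib draw the figure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
Z = linkage(X, method="average")

# With no_plot=True, dendrogram returns the tree layout as a dict
# instead of drawing it: 'ivl' is the leaf order along the x-axis,
# 'dcoord' holds the heights (merge distances) of the U-shaped links.
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])     # leaf labels in dendrogram order
print(tree["dcoord"])  # one entry per merge; taller links = more dissimilar clusters
```

The tallest link corresponds to the final merge of the two distant groups, which is exactly the height a reader would inspect when deciding where to cut.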
Determining the Number of Clusters: The number of clusters to be extracted from the hierarchical clustering process is not predetermined. It can be determined by interpreting the dendrogram or by using a threshold distance or a predefined number of clusters. Cutting the dendrogram at a certain height or distance results in a specific number of clusters.
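Both ways of cutting the tree are available through SciPy's `fcluster`: by a distance threshold or by a requested cluster count. A small sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
Z = linkage(X, method="ward")

# Cut at a distance threshold: merges above t stay separate clusters.
labels_by_distance = fcluster(Z, t=3.0, criterion="distance")

# Or request a fixed number of clusters directly.
labels_by_count = fcluster(Z, t=2, criterion="maxclust")

print(labels_by_distance)  # e.g. points 0,1 share one label; 2,3 another
print(labels_by_count)
```

Because the full merge history is stored in `Z`, changing the cut height yields a different clustering without re-running the algorithm, which is one of hierarchical clustering's practical advantages.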
Practical Considerations: Hierarchical clustering is computationally expensive for large datasets: storing the pairwise distance matrix takes O(n²) memory, and a naive agglomerative implementation takes O(n³) time (optimized variants reach roughly O(n² log n)). It is also sensitive to the choice of distance measure and linkage method, and tie-breaking can make results depend on the order of the data points. Preprocessing steps such as scaling or normalization may be necessary so that no single feature dominates the distance calculations.
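The scaling point matters because features measured on different scales contribute unequally to a distance. A minimal sketch of z-score standardization (the feature names here are hypothetical):

```python
import numpy as np

# Two features on very different scales: income (in currency units) vs. age (years).
# Without scaling, Euclidean distance is dominated almost entirely by income.
X = np.array([[30000.0, 25.0],
              [32000.0, 50.0],
              [90000.0, 30.0]])

# Z-score standardization: subtract each column's mean, divide by its std,
# so every feature has zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # [1. 1.]
```

The same effect can be obtained with `sklearn.preprocessing.StandardScaler`; either way, the scaled matrix is what gets passed to the clustering routine.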
Hierarchical clustering is widely used in various domains, including biology, social sciences, customer segmentation, image analysis, and more. It is particularly useful when the underlying structure of the data is hierarchical and when the interpretation of cluster relationships is important. The dendrogram provides a visual representation of the clustering process, allowing users to make informed decisions about the number and composition of clusters.