Understanding Hierarchical Clustering with R's hclust Function and Clustering Methods

Introduction to Hierarchical Clustering

Hierarchical clustering groups data points into a nested hierarchy of clusters based on their pairwise similarity. It is widely used in machine learning, statistics, and data analysis. In this article, we will work through hierarchical clustering using the hclust function in R.

The hclust Function

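The hclust function in R performs agglomerative hierarchical clustering. It takes two main parameters: a dissimilarity structure, as produced by dist(), and the linkage method. A dist object holds the distance between every pair of data points; it stores only the lower triangle of the square distance matrix, since that matrix is symmetric. The linkage method determines how the distance between two clusters is computed from the distances between their members. The default is "complete"; other options include "single", "average", and "ward.D2".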
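To make the role of the linkage method concrete, here is a minimal sketch on three made-up points (the names pts, d, and merge_heights are illustrative and not part of the original example). The points lie on a line at distances 5, 10, and 15 from each other, so the merge heights can be checked by hand.

```r
# Three toy points; pairwise Euclidean distances are a-b = 5, b-c = 10, a-c = 15
pts <- matrix(c(0, 0,
                3, 4,
                9, 12),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("a", "b", "c"), c("x", "y")))

d <- dist(pts)   # Euclidean distance by default
class(d)         # "dist"
as.matrix(d)     # the same distances expanded to a full square matrix

# Run hclust once per linkage method and collect the merge heights
methods <- c("single", "complete", "average")
merge_heights <- sapply(methods, function(m) hclust(d, method = m)$height)
merge_heights
# First row: a and b merge at height 5 under every method.
# Second row: {a, b} joins c at 10 (single), 15 (complete), 12.5 (average).
```

Single linkage uses the smallest distance between members of the two clusters, complete linkage the largest, and average linkage the mean, which is why the second merge height differs across the three columns.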

Basic Example: Clustering the USArrests Dataset

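To demonstrate the usage of the hclust function, let’s consider the well-known USArrests dataset in R. This dataset contains 1973 arrest rates per 100,000 residents for murder, assault, and rape in each of the 50 US states, along with the percentage of each state's population living in urban areas.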

# Load the datasets library
library(datasets)

# Load the USArrests dataset
data(USArrests)

# Calculate the distance matrix using the dist() function
distance_matrix <- dist(USArrests)

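# Perform hierarchical clustering on the USArrests dataset using average linkage
# ("ave" also works through partial matching, but the full name is clearer)
hc <- hclust(distance_matrix, method = "average")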

Understanding the Results

The hclust function returns an object of class hclust, which records the sequence of merges and the height at which each merge took place. Passing this object to plot() draws the hierarchy as a tree diagram (a dendrogram).
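As a rough illustration of what the returned object holds, the sketch below rebuilds the clustering and looks at a few of its components before plotting it. The plot title and the hang and cex settings are arbitrary choices made here for readability.

```r
# Rebuild the clustering so this snippet runs on its own
hc <- hclust(dist(USArrests), method = "average")

class(hc)        # "hclust"
names(hc)        # components such as merge, height, order, labels, method

# Each row of merge describes one agglomeration step: negative entries are
# single observations, positive entries refer to clusters from earlier rows
head(hc$merge)

# Height (cluster distance) at which each merge happened
head(hc$height)

# Draw the dendrogram; hang = -1 aligns all labels at the bottom
plot(hc, hang = -1, cex = 0.6,
     main = "USArrests, average linkage", xlab = "", sub = "")
```

With 50 states there are 49 merge steps, so merge has 49 rows and height has 49 entries.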

However, we are interested in obtaining a list of clusters instead of a tree plot. This is where the cutree() function comes into play.

Cutting the Tree: Obtaining Cluster Labels

The cutree() function is used to cut the tree into a specified number of branches or clusters. We can use this function to obtain cluster labels for each data point.

# Cut the tree into 2 clusters
cluster_labels <- cutree(hc, k = 2)

# Print the cluster labels
print(cluster_labels)
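Since the goal is a list of clusters rather than a plot, one possible way to get there is to split the state names by their label. This is only a sketch; cluster_sizes and cluster_members are names introduced here for illustration.

```r
# Rebuild the clustering so this snippet runs on its own
hc <- hclust(dist(USArrests), method = "average")
cluster_labels <- cutree(hc, k = 2)

# How many states fall into each cluster
cluster_sizes <- table(cluster_labels)
cluster_sizes

# A list with one character vector of state names per cluster
cluster_members <- split(names(cluster_labels), cluster_labels)
cluster_members
```

The list elements are named after the cluster labels, so cluster_members[["1"]] returns the states in the first cluster.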

Customizing the Cutting Process

Instead of fixing the number of clusters with k, we can use the h parameter to cut the tree at a given height, that is, at a threshold on the merge distance. If both k and h are supplied, k takes precedence and h is ignored, so only one of them should be given.

# Cut the tree at a height of 110
custom_cluster_labels <- cutree(hc, h = 110)

# Print the custom cluster labels
print(custom_cluster_labels)
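The relationship between the two parameters can be checked directly. The sketch below assumes the same average-linkage tree as above and a threshold of 110; by_height and n_clusters are illustrative names. Cutting at height h leaves one cluster more than the number of merges that happened above h.

```r
# Rebuild the clustering so this snippet runs on its own
hc <- hclust(dist(USArrests), method = "average")
h <- 110

# Cut by height and count the resulting clusters
by_height <- cutree(hc, h = h)
n_clusters <- length(unique(by_height))
n_clusters

# Same number, derived from the merge heights: merges above h are undone
sum(hc$height > h) + 1

# Cutting by that k gives the same assignment as cutting by h
all(by_height == cutree(hc, k = n_clusters))

# When both are given, k wins and h is ignored
identical(cutree(hc, k = 2, h = h), cutree(hc, k = 2))
```

A height threshold is handy when the distance scale is meaningful for the data, while k is the simpler choice when a fixed number of groups is needed.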

Output Format: A Vector of Cluster Labels

When k or h is a single value, cutree() returns a named integer vector with one element per data point. The names are the observation labels (here, the state names), and each value is the cluster label assigned to that data point.

# Example output:
#
#        Alabama         Alaska        Arizona       Arkansas     California
#              1              1              1              2              1
#       Colorado    Connecticut       Delaware        Florida        Georgia
#              2              2              1              1              2
#         Hawaii          Idaho       Illinois        Indiana           Iowa
#              2              2              1              2              2
#         Kansas       Kentucky      Louisiana          Maine       Maryland
#              2              2              1              2              1
#  Massachusetts       Michigan      Minnesota    Mississippi       Missouri
#              2              1              2              1              2
#        Montana       Nebraska         Nevada  New Hampshire     New Jersey
#              2              2              1              2              2
#     New Mexico       New York North Carolina   North Dakota           Ohio
#              1              1              1              2              2
#       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina
#              2              2              2              2              1
#   South Dakota      Tennessee          Texas           Utah        Vermont
#              2              2              2              2              2
#       Virginia     Washington  West Virginia      Wisconsin        Wyoming
#              2              2              2              2              2
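Because the labels come back in the same order as the rows of the input data, they can be attached to the original data frame as a new column. The following is a minimal sketch; the column name cluster and the choice to summarise with the mean are assumptions made for illustration.

```r
# Rebuild the clustering so this snippet runs on its own
hc <- hclust(dist(USArrests), method = "average")
labels <- cutree(hc, k = 2)

# The result is a named integer vector aligned with the rows of USArrests
is.integer(labels)
identical(names(labels), rownames(USArrests))

# Attach the labels and compute the mean of each variable per cluster
clustered <- cbind(USArrests, cluster = labels)
cluster_means <- aggregate(. ~ cluster, data = clustered, FUN = mean)
cluster_means
```

Comparing the per-cluster means is a quick way to see which variables separate the groups.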

Alternative Methods: Vector or Matrix Output

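cutree() accepts only the tree, k, and h; it has no arguments for switching the output format. A matrix is returned when k (or h) is a vector, with one row per data point and one column per requested cut. Character labels can be derived from the integer vector after the fact.

# Example: Matrix output, one column for each value of k
cluster_matrix <- cutree(hc, k = 2:4)

# Example: Vector output with cluster labels as character strings
cluster_labels <- cutree(hc, k = 2)
cluster_names <- setNames(paste0("cluster_", cluster_labels), names(cluster_labels))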

Conclusion

In this article, we have covered hierarchical clustering with the hclust function in R: its inputs, how to cut the resulting tree with cutree(), and the format of the cluster labels it returns. Applying these steps to our own datasets can reveal how the observations group together.

Last modified on 2024-09-30