Understanding the hclust Function and Clustering in R
Introduction to Hierarchical Clustering
Hierarchical clustering is a method of grouping data points into clusters based on their similarity. It is widely used in fields such as machine learning, statistics, and data analysis. In this article, we explore hierarchical clustering using the hclust function in R.
The hclust Function
The hclust function in R performs agglomerative hierarchical clustering. It takes two main arguments: a dissimilarity structure, as produced by the dist() function, and a linkage method (the method argument, which defaults to "complete"). The dissimilarity structure holds the pairwise distances between data points; the linkage method determines how the distance between two clusters is computed from those pairwise distances when clusters are merged.
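The effect of the linkage method is easiest to see on a toy example where the distances can be checked by hand. The sketch below clusters three points on a line and compares the merge heights produced by three common linkage methods (the data here are illustrative, not from the article):

```r
# Toy example: three points on a line, so pairwise distances are 1, 4, and 5
x <- c(0, 1, 5)
d <- dist(x)  # a dissimilarity object, not a plain square matrix

hc_complete <- hclust(d, method = "complete")  # farthest-neighbour linkage (the default)
hc_average  <- hclust(d, method = "average")   # mean of pairwise distances (UPGMA)
hc_single   <- hclust(d, method = "single")    # nearest-neighbour linkage

# All three first merge the points 0 and 1 at height 1; the height of the
# second merge (cluster {0, 1} with point 5) depends on the linkage:
hc_complete$height  # 1.0, 5.0  (max of 4 and 5)
hc_average$height   # 1.0, 4.5  (mean of 4 and 5)
hc_single$height    # 1.0, 4.0  (min of 4 and 5)
```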
Basic Example: Clustering the USArrests Dataset
To demonstrate the usage of the hclust function, let’s consider the well-known USArrests dataset in R. This dataset contains various crime statistics for each state in the United States.
# Load the datasets library
library(datasets)
# Load the USArrests dataset
data(USArrests)
# Calculate the distance matrix using the dist() function
distance_matrix <- dist(USArrests)
# Perform hierarchical clustering with average linkage
hc <- hclust(distance_matrix, method = "average")
Understanding the Results
The hclust function returns an object of class hclust, which contains information about the clustering, such as the merge order and the height of each merge. The result can be displayed as a dendrogram (a hierarchical tree diagram) using the plot() function.
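Continuing from the example above, the dendrogram can be drawn and the eventual two-cluster cut outlined directly on it with rect.hclust() (the title and text size below are just presentation choices):

```r
library(datasets)
data(USArrests)

# Recreate the clustering from the example above
hc <- hclust(dist(USArrests), method = "average")

# Draw the dendrogram; cex shrinks the state labels so they remain readable
plot(hc, main = "Average-Linkage Clustering of USArrests",
     xlab = "State", sub = "", cex = 0.6)

# Outline the k = 2 clusters directly on the dendrogram
rect.hclust(hc, k = 2, border = "red")
```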
However, we are interested in obtaining a list of clusters instead of a tree plot. This is where the cutree() function comes into play.
Cutting the Tree: Obtaining Cluster Labels
The cutree() function is used to cut the tree into a specified number of branches or clusters. We can use this function to obtain cluster labels for each data point.
# Cut the tree into 2 clusters
cluster_labels <- cutree(hc, k = 2)
# Print the cluster labels
print(cluster_labels)
Customizing the Cutting Process
As an alternative to specifying the number of clusters, we can cut the tree at a given height using the h parameter: every merge that occurs above that height is undone. Note that k and h are alternatives, not complements — if both are supplied, k takes precedence and h is ignored.
# Cut the tree at a height of 110
custom_cluster_labels <- cutree(hc, h = 110)
# Print the custom cluster labels
print(custom_cluster_labels)
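A quick way to see how a height-based cut relates to a k-based cut is to cross-tabulate the two partitions. The sketch below uses the same clustering as before and the illustrative threshold of 110 from the text:

```r
library(datasets)
data(USArrests)
hc <- hclust(dist(USArrests), method = "average")

# Cut at a fixed height: every merge above h = 110 is undone
height_labels <- cutree(hc, h = 110)

# Cut into a fixed number of clusters for comparison
k_labels <- cutree(hc, k = 2)

# Cross-tabulate the two partitions to see how their clusters overlap
table(height_labels, k_labels)
```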
Output Format: A Cluster List
The cutree() function returns a named vector whose length equals the number of data points. Each element is the cluster label assigned to the corresponding observation, and the names are taken from the row names of the original data — here, the state names.
# Example output:
# Alabama Alaska Arizona Arkansas California
# 1 1 1 2 1
# Colorado Connecticut Delaware Florida Georgia
# 2 2 1 1 2
# Hawaii Idaho Illinois Indiana Iowa
# 2 2 1 2 2
# Kansas Kentucky Louisiana Maine Maryland
# 2 2 1 2 1
# Massachusetts Michigan Minnesota Mississippi Missouri
# 2 1 2 1 2
# Montana Nebraska Nevada New Hampshire New Jersey
# 2 2 1 2 2
# New Mexico New York North Carolina North Dakota Ohio
# 1 1 1 2 2
# Oklahoma Oregon Pennsylvania Rhode Island South Carolina
# 2 2 2 2 1
# South Dakota Tennessee Texas Utah Vermont
# 2 2 2 2 2
# Virginia Washington West Virginia Wisconsin Wyoming
# 2 2 2 2 2
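Because the result is a named vector, ordinary vector operations apply — for example, counting the cluster sizes or extracting the members of one cluster:

```r
library(datasets)
data(USArrests)
hc <- hclust(dist(USArrests), method = "average")
cluster_labels <- cutree(hc, k = 2)

# Count how many states fall in each cluster
table(cluster_labels)

# List the states assigned to cluster 1
names(cluster_labels)[cluster_labels == 1]
```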
Alternative Methods: Vector or Matrix Output
The default output is a vector of cluster labels; cutree() has no labels or matrix arguments. The labels come automatically as the names of the returned vector, and a matrix is produced by passing a vector of values to k (or h) — one column per cut.
# The labels are already attached as names of the result vector
cluster_labels <- cutree(hc, k = 2)
names(cluster_labels)  # the state names
# Matrix output: one column per requested number of clusters
cluster_matrix <- cutree(hc, k = 2:4)
head(cluster_matrix)
Conclusion
In this article, we have explored hierarchical clustering with the hclust function in R, covering its usage, customization options, and output formats. Applying these concepts to your own datasets can reveal structure and groupings that summary statistics alone would miss.
Additional Resources
For further learning on hierarchical clustering and related techniques, please refer to the following resources:
- “Hierarchical Clustering” by IBM Data Science Experience
- “Clustering Algorithms in R” by DataCamp
- “R Programming: A Comprehensive Guide” by DataCamp
Last modified on 2024-09-30