Combining Duplicates and Keeping Unique Elements Using dplyr::distinct
In this article, we look at how to collapse duplicate rows in a data frame while keeping the unique values of each column, using the dplyr library in R. We also discuss how empty and missing values behave when column values are collapsed into comma-separated strings.
Introduction to dplyr
The dplyr library is a powerful tool for data manipulation in R. It provides a consistent way of performing common data analysis tasks, such as filtering, grouping, and summarizing data. In this article, we focus on the distinct function, which drops duplicate rows, and on group_by with summarise, which combines the rows of each group into one.
Creating a Sample Dataset
Let’s start by creating a sample dataset that we can use for our examples.
# Load necessary libraries
library(dplyr)
# Create a sample dataframe
df <- data.frame(
  unique_id = c(1, 1, 1, 2, 2, 2),
  school = c("great", "great", "great", "spring", "spring", "spring"),
  subject = c("Math", "English", "History", "Math", "English", "History"),
  grade = c(88, 78, 98, 65, 72, 84),
  sex = c("", "", "", "", "", "")
)
# Print the dataframe
print(df)
Output:
unique_id school subject grade sex
1 1 great Math 88
2 1 great English 78
3 1 great History 98
4 2 spring Math 65
5 2 spring English 72
6 2 spring History 84
Combining Duplicate Rows Using dplyr::distinct
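Now, let's use the distinct function from the dplyr library to drop duplicate rows. distinct keeps the first row for each combination of the columns we name, and .keep_all = TRUE retains the remaining columns of that row. Here we treat unique_id as the key.
# Keep the first row for each unique_id
df_distinct <- df %>%
  distinct(unique_id, .keep_all = TRUE)
# Print the result
print(df_distinct)
Output:
unique_id school subject grade sex
1 1 great Math 88
2 2 spring Math 65
As we can see, one row per unique_id remains. Note that distinct does not merge anything: the English and History rows are dropped, not combined. Had we passed both subject and unique_id as keys, all six rows would be returned, because every (subject, unique_id) pair in this dataset is already unique.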
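To make the effect of the key columns and of .keep_all concrete, here is a minimal sketch that re-creates the sample data so it runs on its own. The names keys_only, first_rows, and all_cols are illustrative and do not appear elsewhere in the article.

```r
library(dplyr)

# Same sample data as in the article, re-created so the snippet is self-contained
df <- data.frame(
  unique_id = c(1, 1, 1, 2, 2, 2),
  school    = c("great", "great", "great", "spring", "spring", "spring"),
  subject   = c("Math", "English", "History", "Math", "English", "History"),
  grade     = c(88, 78, 98, 65, 72, 84),
  sex       = c("", "", "", "", "", "")
)

# Without .keep_all, only the key columns are returned
keys_only <- df %>% distinct(unique_id, school)

# With .keep_all = TRUE, the first row of each key is returned with all columns
first_rows <- df %>% distinct(unique_id, .keep_all = TRUE)

# With no key columns, rows are compared on every column,
# so nothing is dropped from this dataset
all_cols <- df %>% distinct()
```

In this sketch keys_only has two columns and two rows, first_rows has all five columns and two rows, and all_cols still has six rows.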
Handling Missing Values
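In our sample dataset, the sex column contains empty strings (""), not true missing values (NA). To combine the rows of each unique_id into one, we can group the data and collapse each column's unique values into a comma-separated string with toString. Older code does this with summarise_each and funs, both of which are deprecated; the current equivalent is summarise with across.
# Collapse the unique values of every column within each unique_id
df_result <- df %>%
  group_by(unique_id) %>%
  summarise(across(everything(), ~ toString(unique(.x))))
# Print the result
print(df_result)
Output:
# A tibble: 2 × 5
unique_id school subject grade sex
<dbl> <chr> <chr> <chr> <chr>
1 1 great Math, English, History 88, 78, 98 ""
2 2 spring Math, English, History 65, 72, 84 ""
As we can see, subject and grade now hold comma-separated lists, and grade has become a character column. The sex column stays an empty string: unique reduces the repeated "" values to a single "", so toString has nothing to join and no commas appear.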
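If the empty strings should be treated as real missing values, one option is to convert them to NA with na_if before collapsing and to skip NA when joining. The following is a sketch under that assumption; collapse_unique is a hypothetical helper written for this example, not a dplyr function.

```r
library(dplyr)

# Same sample data as in the article, re-created so the snippet is self-contained
df <- data.frame(
  unique_id = c(1, 1, 1, 2, 2, 2),
  school    = c("great", "great", "great", "spring", "spring", "spring"),
  subject   = c("Math", "English", "History", "Math", "English", "History"),
  grade     = c(88, 78, 98, 65, 72, 84),
  sex       = c("", "", "", "", "", "")
)

# Hypothetical helper: join the unique non-missing values of a vector,
# returning NA when nothing is left to join
collapse_unique <- function(x) {
  x <- unique(x[!is.na(x)])
  if (length(x) == 0) NA_character_ else toString(x)
}

df_clean <- df %>%
  mutate(sex = na_if(sex, "")) %>%   # treat empty strings as missing
  group_by(unique_id) %>%
  summarise(across(everything(), collapse_unique))
```

With this approach the sex column of df_clean is NA for both groups instead of an empty string, which makes the absence of data explicit in later steps such as filtering with is.na.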
Keeping Unique Elements
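The .keep_all argument of distinct controls which columns are returned. With .keep_all = TRUE, every column of the first matching row is kept; without it, only the columns passed to distinct are returned. Which rows survive depends on the key columns alone.
# Use distinct with .keep_all = TRUE on the (subject, unique_id) key
df_distinct_keep_all <- df %>%
  distinct(subject, unique_id, .keep_all = TRUE)
# Print the result
print(df_distinct_keep_all)
Output:
unique_id school subject grade sex
1 1 great Math 88
2 1 great English 78
3 1 great History 98
4 2 spring Math 65
5 2 spring English 72
6 2 spring History 84
As we can see, all six rows are kept, because each (subject, unique_id) pair occurs only once in this dataset.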
Unnesting the Result
The collapsed columns in df_result are character strings, not list-columns, so the unnest function from the tidyr library does not split them; unnest expands list-columns. To turn the comma-separated values back into one row per element, we can use separate_rows from tidyr instead.
# Load tidyr library
library(tidyr)
# Split the comma-separated columns back into one row per element
df_unnested <- df_result %>%
  separate_rows(subject, grade, sep = ", ", convert = TRUE)
# Print the result
print(df_unnested)
Output:
# A tibble: 6 × 5
unique_id school subject grade sex
<dbl> <chr> <chr> <int> <chr>
1 1 great Math 88 ""
2 1 great English 78 ""
3 1 great History 98 ""
4 2 spring Math 65 ""
5 2 spring English 72 ""
6 2 spring History 84 ""
As we can see, each subject and its grade are back on their own row.
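For comparison, unnest applies when the grouped values are stored in a list-column and not in a comma-separated string. The sketch below builds such a column with list() inside summarise and then expands it again; df_nested and df_long are illustrative names.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the article, re-created so the snippet is self-contained
df <- data.frame(
  unique_id = c(1, 1, 1, 2, 2, 2),
  school    = c("great", "great", "great", "spring", "spring", "spring"),
  subject   = c("Math", "English", "History", "Math", "English", "History"),
  grade     = c(88, 78, 98, 65, 72, 84),
  sex       = c("", "", "", "", "", "")
)

# Store the unique subjects of each group as a list-column
df_nested <- df %>%
  group_by(unique_id, school) %>%
  summarise(subject = list(unique(subject)), .groups = "drop")

# Expand the list-column back into one row per subject
df_long <- df_nested %>%
  unnest(subject)
```

A list-column keeps each value in its original type and avoids parsing strings later, at the cost of being harder to print or export than a comma-separated string.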
Conclusion
In this article, we explored how to reduce duplicate rows in a data frame with the dplyr library: distinct keeps the first row for each key, while group_by with summarise combines the unique values of each group. We also looked at how empty values behave when column values are collapsed into comma-separated strings, and how the tidyr library can expand a collapsed result back into one row per element.
Last modified on 2025-03-24