Combining Duplicates and Keeping Unique Elements Using dplyr::distinct

In this article, we will explore how to combine duplicate rows in a dataframe while keeping unique elements using the dplyr library in R. We will also look at how empty and missing values behave when each group's values are collapsed into comma-separated strings.

Introduction to dplyr

The dplyr library is a powerful tool for data manipulation in R. It provides a consistent and elegant way of performing common data analysis tasks, such as filtering, grouping, and summarizing data. In this article, we will use the distinct function to drop duplicate rows, and grouped summaries to combine duplicate rows while keeping their unique elements.

Creating a Sample Dataset

Let’s start by creating a sample dataset that we can use for our examples.

# Load necessary libraries
library(dplyr)

# Create a sample dataframe
df <- data.frame(
  unique_id = c(1, 1, 1, 2, 2, 2),
  school = c("great", "great", "great", "spring", "spring", "spring"),
  subject = c("Math", "English", "History", "Math", "English", "History"),
  grade = c(88, 78, 98, 65, 72, 84),
  sex = c("", "", "", "", "", "")
)

# Print the dataframe
print(df)

Output:

  unique_id school subject grade sex
1         1  great    Math    88    
2         1  great English    78    
3         1  great History    98    
4         2 spring    Math    65    
5         2 spring English    72    
6         2 spring History    84    

Combining Duplicate Rows Using dplyr::distinct

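Now, let’s use the distinct function from the dplyr library to remove duplicate rows. Every combination of subject and unique_id in our data is already unique, so deduplicating on both columns would return all six rows unchanged. Instead, we deduplicate on unique_id alone.

# Use distinct to keep the first row for each unique_id
df_distinct <- df %>%
  distinct(unique_id, .keep_all = TRUE)

# Print the result
print(df_distinct)

Output:

  unique_id school subject grade sex
1         1  great    Math    88    
2         2 spring    Math    65    

As we can see, distinct keeps the first row for each unique_id and drops the others. It does not merge their values, so the English and History grades are lost.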
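To make the behaviour of distinct concrete when exact duplicates are present, here is a minimal sketch. The data frame dup_df and its values are made up for illustration and are not part of the sample dataset above.

```r
library(dplyr)

# Hypothetical data: rows 1 and 2 are exact duplicates of each other
dup_df <- data.frame(
  unique_id = c(1, 1, 1, 2),
  subject = c("Math", "Math", "English", "Math"),
  grade = c(88, 88, 78, 65),
  stringsAsFactors = FALSE
)

# No columns named: whole rows are compared, so only the exact duplicate is dropped
whole_rows <- dup_df %>% distinct()

# One column named: the first row for each unique_id is kept, the rest are discarded
first_per_id <- dup_df %>% distinct(unique_id, .keep_all = TRUE)
```

Here whole_rows has three rows and first_per_id has two. In both cases rows are dropped rather than merged, which is why a grouped summary is needed when the values themselves must be combined.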

Handling Missing Values

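In our sample dataset, the sex column holds empty strings (""), which R does not treat as missing values (NA). To combine the rows for each unique_id, we can collapse the unique values of every column into a single comma-separated string with toString(unique(...)). The summarise_each and funs helpers are deprecated in current dplyr releases, so we use summarise with across instead (available from dplyr 1.0.0).

# Collapse each group's unique values into a comma-separated string
df_result <- df %>%
  group_by(unique_id) %>%
  summarise(across(everything(), ~ toString(unique(.x))))

# Print the result
print(df_result)

Output (exact formatting depends on the tibble version):

# A tibble: 2 × 5
  unique_id school subject                grade      sex  
      <dbl> <chr>  <chr>                  <chr>      <chr>
1         1 great  Math, English, History 88, 78, 98 ""   
2         2 spring Math, English, History 65, 72, 84 ""   

As we can see, each unique_id now occupies a single row, and the unique values of every column are joined with commas. Note that grade is now a character column, and that sex collapses to a single empty string because "" was its only value; no comma is produced for it.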
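If empty strings should be treated as genuinely missing, one option is to convert them to NA with na_if from dplyr and drop them before collapsing. The following is a minimal sketch under that assumption; the data frame people and its values are hypothetical.

```r
library(dplyr)

# Hypothetical data: the sex column mixes a real value with empty strings
people <- data.frame(
  unique_id = c(1, 1, 2, 2),
  sex = c("F", "", "", ""),
  stringsAsFactors = FALSE
)

collapsed <- people %>%
  # Turn empty strings into proper missing values
  mutate(sex = na_if(sex, "")) %>%
  group_by(unique_id) %>%
  # Drop NA entries, then join the remaining unique values with commas
  summarise(sex = toString(unique(sex[!is.na(sex)])))
```

With this approach, a group that has at least one real value keeps only that value, and a group with nothing but empty strings ends up with an empty string rather than a stray separator.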

Keeping Unique Elements

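By default, distinct returns only the columns we name. Setting the .keep_all argument (note the leading dot) to TRUE retains every other column as well, taking its values from the first row of each group.

# Without .keep_all, only the named columns are returned
df_distinct_cols <- df %>%
  distinct(unique_id, school)

# With .keep_all = TRUE, all columns of the first matching row are kept
df_distinct_keep_all <- df %>%
  distinct(unique_id, school, .keep_all = TRUE)

# Print both results
print(df_distinct_cols)
print(df_distinct_keep_all)

Output:

  unique_id school
1         1  great
2         2 spring

  unique_id school subject grade sex
1         1  great    Math    88    
2         2 spring    Math    65    

As we can see, only one row for each unique combination of unique_id and school has been kept, and .keep_all = TRUE preserves the remaining columns of that row.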

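Splitting the Result Back into Rows

The collapsed columns in df_result are plain character strings, not list-columns, so unnest has nothing to expand. To split the comma-separated values back into individual rows, we can use the separate_rows function from the tidyr library (in tidyr 1.3.0 and later, separate_longer_delim is the recommended replacement).

# Load tidyr library
library(tidyr)

# Split the comma-separated subject and grade values back into rows
df_unnested <- df_result %>%
  separate_rows(subject, grade, sep = ", ", convert = TRUE)

# Print the result
print(df_unnested)

Output (exact formatting depends on the tibble version):

# A tibble: 6 × 5
  unique_id school subject grade sex  
      <dbl> <chr>  <chr>   <int> <chr>
1         1 great  Math       88 ""   
2         1 great  English    78 ""   
3         1 great  History    98 ""   
4         2 spring Math       65 ""   
5         2 spring English    72 ""   
6         2 spring History    84 ""   

As we can see, each subject and its grade are back on their own row, and convert = TRUE has turned grade back into a numeric column.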
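The unnest function applies when the combined values are stored in a list-column rather than in a comma-separated string. Here is a minimal sketch of that variant; the data frame scores and its values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: unique_id 2 has the same subject recorded twice
scores <- data.frame(
  unique_id = c(1, 1, 2, 2),
  subject = c("Math", "English", "Math", "Math"),
  stringsAsFactors = FALSE
)

# Store each group's unique subjects as a list-column instead of a string
nested <- scores %>%
  group_by(unique_id) %>%
  summarise(subject = list(unique(subject)))

# Expand the list-column back into one row per element
unnested <- nested %>%
  unnest(cols = subject)
```

Keeping the values in a list-column avoids parsing strings later, at the cost of a result that is harder to print or export to CSV.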

Conclusion

In this article, we explored how to combine duplicate rows in a dataframe while keeping unique elements using the dplyr library. We saw that distinct drops duplicate rows rather than merging them, and that a grouped summary can collapse each group's unique values into comma-separated strings, including how empty values behave in that process. Additionally, we showed how to split the combined values back into individual rows using the tidyr library.

Last modified on 2025-03-24