Selecting and Filtering on the Same Variables in dplyr

Introduction

The popular R package dplyr provides a powerful and flexible way to manipulate and analyze data. Two of its core verbs are select, which picks columns, and filter, which picks rows that satisfy a condition. In this article, we will explore how to combine the two to keep only the columns whose names match a pattern and only the values that meet a certain criterion.

Problem Statement

Suppose we have a data frame with 3 columns and 100 rows. The column names are a_dem, b_dem, and c_blah, and each cell holds a value between 0 and 100. We want to chain select and filter with the %>% operator so that we keep only the columns whose names end with “_dem” and only the values larger than 50.

Attempting Direct Selection

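When we attempt to write the code as shown in the question:

dat %>%
    select(ends_with("dem")) %>%
    filter(> 50) %>%
    summary()

We are met with an error. The expression > 50 has no left-hand side, so R cannot parse it: filter needs a complete logical condition that names the columns to test. Even with a valid condition, filter keeps or drops whole rows, so on its own it cannot blank out individual values in one column while keeping the values beside them.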
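For comparison, here is a minimal sketch of a chain in which filter receives a complete condition. It keeps only the rows where every “_dem” column exceeds 50, which may or may not be what the original question intended. The sample data and the name rows_above_50 are assumptions made for illustration, and if_all requires a reasonably recent dplyr release (1.0.4 or later).

```r
library(dplyr)

# Hypothetical sample data mirroring the problem statement
set.seed(2)
dat <- data.frame(
    a_dem  = runif(100, 0, 100),
    b_dem  = runif(100, 0, 100),
    c_blah = runif(100, 0, 100)
)

# filter() needs a full logical expression that refers to columns.
# if_all() applies the same test to every selected column and
# keeps a row only when all of them pass.
rows_above_50 <- dat %>%
    select(ends_with("_dem")) %>%
    filter(if_all(everything(), ~ .x > 50))

summary(rows_above_50)
```

Note that this drops entire rows. A row with a_dem above 50 and b_dem below 50 disappears completely, which is why a column-wise replacement is a better fit when each column should be judged independently.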

Solution: Using Separate Functions

To accomplish this task, we split the work into two steps: one for choosing columns and one for testing values. In this example, we first select the columns that end with “_dem”, then replace every value less than or equal to 50 with NA, so that each column is judged independently and no rows are dropped.

Here’s how you can do it:

library(dplyr)
set.seed(2)

a_dem <- runif(100, 0, 100)
b_dem <- runif(100, 0, 100)
c_blah <- runif(100, 0, 100)

dat <- data.frame(a_dem, b_dem, c_blah)

newdat1 <- dat %>%
    select(ends_with("_dem"))

filtered <- sapply(newdat1, function(x) ifelse(x > 50, x, NA))

head(filtered)

This code first selects the columns that end with “_dem”, then uses sapply to apply ifelse to each column, replacing every value less than or equal to 50 with NA. Note that sapply returns a numeric matrix here rather than a data frame. Finally, head prints the first six rows of the result.
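Because sapply simplifies its result, the output above is a matrix. If a data frame is needed for later dplyr steps, one possible approach is to assign the result of lapply back into the frame's columns. The sketch below uses only base R, and the names dem_cols, as_matrix, and as_frame are hypothetical.

```r
# Hypothetical data: just the two "_dem" columns
set.seed(2)
dem_cols <- data.frame(
    a_dem = runif(100, 0, 100),
    b_dem = runif(100, 0, 100)
)

# sapply() simplifies the per-column results into a matrix
as_matrix <- sapply(dem_cols, function(x) ifelse(x > 50, x, NA))
class(as_matrix)

# Assigning into dem_cols[] keeps the data frame structure
as_frame <- dem_cols
as_frame[] <- lapply(dem_cols, function(x) ifelse(x > 50, x, NA))
class(as_frame)
```

Both objects hold the same values; only the container differs.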

Alternative Solution: Using dplyr’s mutate_each Function

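Another way to achieve this task using only dplyr functions is the mutate_each function. It applies a function to each column of the data frame and returns a new data frame with the modified columns. Be aware that mutate_each and funs were deprecated in later dplyr releases in favour of across, so this version produces deprecation warnings on a current installation.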

Here’s how you can do it:

library(dplyr)
set.seed(2)

a_dem <- runif(100, 0, 100)
b_dem <- runif(100, 0, 100)
c_blah <- runif(100, 0, 100)

dat <- data.frame(a_dem, b_dem, c_blah)

newdat2 <- dat %>%
    select(ends_with("_dem")) %>%
    mutate_each(funs(ifelse(. > 50, ., NA)))

head(newdat2)

This code first selects the columns that end with “_dem”, then uses the mutate_each function to apply a function to each column of these selected columns. The function checks if the value in the column is greater than 50; if it is, the value is kept; otherwise, it’s replaced with NA. Finally, we print out the first six rows of the resulting dataset.
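On current dplyr releases (1.0.0 and later), the same column-wise replacement is usually written with mutate and across. The following is a sketch of that equivalent rather than the original author's code; the name newdat3 is an assumption chosen to follow the article's naming.

```r
library(dplyr)

# Same sample data as in the examples above
set.seed(2)
dat <- data.frame(
    a_dem  = runif(100, 0, 100),
    b_dem  = runif(100, 0, 100),
    c_blah = runif(100, 0, 100)
)

# across() applies the lambda to every column that survived select();
# values of 50 or less become NA, larger values are kept unchanged.
newdat3 <- dat %>%
    select(ends_with("_dem")) %>%
    mutate(across(everything(), ~ ifelse(.x > 50, .x, NA)))

head(newdat3)
summary(newdat3)
```

The result stays a data frame, so summary reports the distribution of the remaining values along with a count of NA entries per column.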

Conclusion

In this article, we have explored how to combine dplyr’s select with a value condition on the selected columns. Because filter works on whole rows, we instead replaced values of 50 or less with NA, first with base R’s sapply and then with dplyr’s mutate_each. With these techniques, you can restrict an analysis to the columns and values you care about.

Additional Advice

  • Always use meaningful variable names when working with your data.
  • Make sure to check the documentation for each dplyr function to understand its usage and capabilities.
  • Practice using different dplyr functions on small datasets before applying them to larger datasets.

Example Use Cases

Here’s an example of how you can use these techniques in a real-world scenario:

# Load necessary libraries
library(dplyr)
library(ggplot2)

# Create sample data frame
set.seed(123)
df <- data.frame(
    name = c("Alice", "Bob", "Charlie"),
    age = runif(3, 20, 60),
    score = runif(3, 50, 100)
)

# Keep name for the plot, select columns ending in "score",
# then keep only the rows with a score above 70
df %>%
    select(name, ends_with("score")) %>%
    filter(score > 70) %>%
    ggplot(aes(x = name, y = score)) +
    geom_point() +
    labs(title = "Score Points", x = "Name", y = "Score")

This code first selects the name column together with the columns that end with “score”, then keeps the rows where score is greater than 70. Finally, it uses ggplot2 to visualize the scores. The name column has to be selected explicitly because the plot uses it on the x axis.

In this example, we have shown how you can use dplyr’s select and filter functions in combination with ggplot2 to create a useful visualization of your data.


Last modified on 2023-08-25