R Function to Search in Character String
Problem Statement
We are given a dataframe with two columns: NAICS_CD, an integer industry code, and top_3, a character string holding three codes in a bracketed, quoted list. The task is to check, row by row, whether the code in NAICS_CD appears among the three codes in top_3. If it does, we assign 1 to a new is_present column; otherwise, we assign 0.
Solution
To accomplish this task, we can use the str_detect function from the stringr package together with ifelse inside a dplyr mutate call. Here’s how you can do it:
library(dplyr)
library(stringr)
df %>%
  mutate(is_present = ifelse(str_detect(top_3, as.character(NAICS_CD)), 1, 0))
Explanation
Let’s break down the solution step by step:
str_detect is a function from the stringr package that checks whether a pattern occurs in a character string. It returns a logical vector with one TRUE or FALSE per element, and it is vectorized over both arguments, so the top_3 string in each row is tested against the pattern from the same row.
In our case, we want to check whether the code in NAICS_CD appears anywhere in the top_3 string. NAICS_CD is stored as an integer, while str_detect works with character strings, so we convert it with as.character() before using it as the pattern.
ifelse assigns 1 to is_present where str_detect returns TRUE, that is, where the NAICS_CD code is found in top_3, and 0 otherwise.
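To make the row-wise pairing concrete, here is a minimal sketch that uses the first two rows of the sample data. The variable names codes, top_3 and hits are illustrative only and are not part of the solution above.

```r
library(stringr)

# Two illustrative rows taken from the sample data in this post
codes <- c(541611L, 812990L)
top_3 <- c("[\"541611\",\"541618\",\"611430\"]",
           "[\"561720\",\"561740\",\"561790\"]")

# str_detect() pairs the i-th string with the i-th pattern
hits <- str_detect(top_3, as.character(codes))
hits
# [1]  TRUE FALSE

# ifelse() maps TRUE to 1 and FALSE to 0
ifelse(hits, 1, 0)
# [1] 1 0
```

Since the logical vector already carries the answer, as.integer(hits) would give the same 1/0 coding without ifelse.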
Data
To demonstrate this solution, let’s create a sample dataframe:
df <- structure(list(NAICS_CD = c(541611L, 812990L, 424950L, 722330L,
722320L, 531180L, 484121L, 531311L), top_3 = c("[\"541611\",\"541618\",\"611430\"]",
"[\"561720\",\"561740\",\"561790\"]", "[\"444120\",\"711510\",\"811121\"]",
"[\"311991\",\"722310\",\"722320\"]", "[\"722320\",\"722330\",\"722310\"]",
"[\"531110\",\"531190\",\"531111\"]", "[\"484121\",\"484110\",\"484230\"]",
"[\"531110\",\"531311\",\"531111\"]")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
Example Use Case
Here’s a complete example that applies the solution to the sample data:
# First, load the necessary libraries
library(dplyr)
library(stringr)
# Create the dataframe
df <- structure(list(NAICS_CD = c(541611L, 812990L, 424950L, 722330L,
722320L, 531180L, 484121L, 531311L), top_3 = c("[\"541611\",\"541618\",\"611430\"]",
"[\"561720\",\"561740\",\"561790\"]", "[\"444120\",\"711510\",\"811121\"]",
"[\"311991\",\"722310\",\"722320\"]", "[\"722320\",\"722330\",\"722310\"]",
"[\"531110\",\"531190\",\"531111\"]", "[\"484121\",\"484110\",\"484230\"]",
"[\"531110\",\"531311\",\"531111\"]")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
# Add the is_present column and store the result back in df
df <- df %>%
  mutate(is_present = ifelse(str_detect(top_3, as.character(NAICS_CD)), 1, 0))
# Print the resulting dataframe
print(df)
When you run this code, df gains a new column is_present that records whether the code in NAICS_CD is found in top_3. The result of mutate must be assigned back to df; otherwise print(df) shows the original dataframe without the new column. For the sample data, rows 1, 5, 7 and 8 receive 1 and the remaining rows receive 0.
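One caveat is that str_detect matches substrings, so a shorter code such as 5416 would also be reported as present in a string containing "541611". If exact matches are required, one possible sketch is to extract the codes from top_3 and test membership instead. The helper name flag_naics_in_top3 is hypothetical, and the sketch assumes the codes in top_3 are plain runs of digits, as in the sample data.

```r
library(stringr)

# Hypothetical helper: exact membership instead of substring matching.
# Assumes every code in top_3 is a run of digits, as in the sample data.
flag_naics_in_top3 <- function(naics_cd, top_3) {
  # One character vector of codes per row, e.g. c("541611", "541618", "611430")
  codes_per_row <- str_extract_all(top_3, "\\d+")
  # Compare each NAICS code only with the codes from its own row
  found <- mapply(function(code, codes) code %in% codes,
                  as.character(naics_cd), codes_per_row,
                  USE.NAMES = FALSE)
  as.integer(found)
}

# Same top_3 string twice: a full code and a shorter prefix of it
example_top_3 <- rep("[\"541611\",\"541618\",\"611430\"]", 2)
exact <- flag_naics_in_top3(c(541611L, 5416L), example_top_3)
exact
# [1] 1 0

# The substring approach flags both rows
str_detect(example_top_3, as.character(c(541611L, 5416L)))
# [1] TRUE TRUE
```

With the sample dataframe, the helper could be used inside mutate as is_present = flag_naics_in_top3(NAICS_CD, top_3). Because all codes in the sample data have six digits, both approaches give the same result there.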
Last modified on 2025-03-29