R Function to Search in Character String
Problem Statement
We are given a data frame with two columns: NAICS_CD, an integer industry code, and top_3, a string holding a JSON-style list of three candidate codes. The task is to check, row by row, whether the value in NAICS_CD appears among the codes listed in top_3. If it does, a new is_present column gets the value 1; otherwise it gets 0.
Solution
To accomplish this task, we can combine the str_detect function from the stringr package with ifelse inside a dplyr mutate call. Here’s how you can do it:
library(dplyr)
library(stringr)
df %>% 
  mutate(is_present = ifelse(str_detect(top_3, as.character(NAICS_CD)), 1, 0))
Explanation
Let’s break down the solution step by step:
- str_detect is a function from the stringr package that checks whether a pattern occurs in a character string. It is vectorized over both arguments, so each top_3 string is tested against the NAICS_CD value from the same row, and the result is a logical vector with one element per row.
- In our case, we want to check whether the NAICS_CD value appears in top_3. str_detect expects the pattern as a character vector, and NAICS_CD is an integer column, so we convert it with as.character() first.
- ifelse then assigns 1 to is_present where str_detect returned TRUE and 0 where it returned FALSE.
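The row-by-row pairing described above can be seen on a two-element example. This is a minimal sketch; the vectors candidates and codes are made up for illustration and are not part of the original data.

```r
library(stringr)

# Two candidate strings and two codes, compared pairwise: element 1 with
# element 1, element 2 with element 2.
candidates <- c("[\"541611\",\"541618\"]", "[\"561720\",\"561740\"]")
codes <- c(541611L, 812990L)

# The integer codes are converted to character before being used as patterns.
hits <- str_detect(candidates, as.character(codes))
hits

# Turn the logical result into 1/0, as the ifelse() call in the solution does.
ifelse(hits, 1, 0)
```

The first code occurs in the first string and the second code does not occur in the second string, so hits is TRUE followed by FALSE.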
Data
To demonstrate this solution, let’s create a sample dataframe:
df <- structure(list(
  NAICS_CD = c(541611L, 812990L, 424950L, 722330L,
               722320L, 531180L, 484121L, 531311L),
  top_3 = c("[\"541611\",\"541618\",\"611430\"]", "[\"561720\",\"561740\",\"561790\"]",
            "[\"444120\",\"711510\",\"811121\"]", "[\"311991\",\"722310\",\"722320\"]",
            "[\"722320\",\"722330\",\"722310\"]", "[\"531110\",\"531190\",\"531111\"]",
            "[\"484121\",\"484110\",\"484230\"]", "[\"531110\",\"531311\",\"531111\"]")
), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
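One caveat about this kind of data: str_detect looks for the code anywhere inside the string, so a shorter code such as 5416 would also match inside 541611. Every code in the sample has six digits, so the issue does not arise here. If codes of mixed length are possible, one option is to pull the individual codes out of each string and compare them as whole values. The sketch below does this in base R; the helper name has_exact_code is hypothetical and not part of the original solution.

```r
# Hypothetical helper: TRUE when `code` equals one of the codes listed in
# the JSON-style string `candidates`, compared as whole values.
has_exact_code <- function(code, candidates) {
  # Pull every run of digits out of each candidate string (one list
  # element per input string).
  parsed <- regmatches(candidates, gregexpr("[0-9]+", candidates))
  # Compare each code only against the codes parsed from its own row.
  mapply(function(x, set) x %in% set, as.character(code), parsed,
         USE.NAMES = FALSE)
}

# The full code matches; the four-digit prefix does not.
exact <- has_exact_code(
  c(541611L, 5416L),
  c("[\"541611\",\"541618\"]", "[\"541611\",\"541618\"]")
)
exact
```

The trade-off is a little more code in exchange for not depending on all codes having the same length.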
Example Use Case
Here’s a complete script that applies the solution to the sample data:
# First, load the necessary libraries
library(dplyr)
library(stringr)
# Create the dataframe
df <- structure(list(
  NAICS_CD = c(541611L, 812990L, 424950L, 722330L,
               722320L, 531180L, 484121L, 531311L),
  top_3 = c("[\"541611\",\"541618\",\"611430\"]", "[\"561720\",\"561740\",\"561790\"]",
            "[\"444120\",\"711510\",\"811121\"]", "[\"311991\",\"722310\",\"722320\"]",
            "[\"722320\",\"722330\",\"722310\"]", "[\"531110\",\"531190\",\"531111\"]",
            "[\"484121\",\"484110\",\"484230\"]", "[\"531110\",\"531311\",\"531111\"]")
), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
# Add the is_present column and assign the result back to df
df <- df %>% 
  mutate(is_present = ifelse(str_detect(top_3, as.character(NAICS_CD)), 1, 0))
# Print the resulting dataframe
print(df)
Because the result of mutate is assigned back to df, the printed data frame includes the new is_present column. For the sample data, is_present is 1 in rows 1, 5, 7, and 8, where the NAICS_CD value appears in top_3, and 0 in the remaining rows.
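If the same check is needed in several places, the mutate call can be wrapped in a small function. The following is a sketch under assumptions: the name add_is_present and its code_col and list_col arguments are hypothetical, and it uses as.integer() instead of ifelse() to turn TRUE/FALSE into 1/0.

```r
library(dplyr)
library(stringr)

# Hypothetical wrapper around the mutate() call. The column names are
# passed as strings so the helper is not tied to NAICS_CD and top_3.
add_is_present <- function(data, code_col = "NAICS_CD", list_col = "top_3") {
  data %>%
    mutate(is_present = as.integer(
      str_detect(.data[[list_col]], as.character(.data[[code_col]]))
    ))
}

# Small illustrative input (the first two rows of the sample data).
small <- data.frame(
  NAICS_CD = c(541611L, 812990L),
  top_3 = c("[\"541611\",\"541618\",\"611430\"]",
            "[\"561720\",\"561740\",\"561790\"]")
)

result <- add_is_present(small)
result$is_present
```

The function returns a new data frame and leaves its input unchanged, so the result has to be assigned, for example df <- add_is_present(df).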
Last modified on 2023-07-07