Understanding Web Scraping and jsonlite
Web scraping is the process of extracting data from websites using automated tools. In this article, we will explore how to scrape data from a specific website with R, using the rvest package for the scraping itself and the jsonlite package for working with the results.
jsonlite is an R package for working with JSON (JavaScript Object Notation) data. It provides functions such as toJSON() and fromJSON() for converting between R objects (vectors, lists, and data frames) and JSON, as well as for reading and writing JSON files.
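As a quick illustration, here is a minimal round trip between an R data frame and JSON (the names and values are invented for the example):
library(jsonlite)
# Convert an R data frame to a JSON string
people <- data.frame(name = c("Anna", "Erik"),
                     university = c("Stockholm University", "Uppsala University"))
json_string <- toJSON(people, pretty = TRUE)
# Convert the JSON string back into an R data frame
people_again <- fromJSON(json_string)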
Setting Up Our Project
To begin, we set up our project by installing the necessary packages: rvest, a popular package for web scraping in R, and jsonlite for working with JSON data.
install.packages("rvest")
install.packages("jsonlite")
Understanding the Website Structure
Our target website has a specific structure that we need to understand before we can extract the desired data. The page is divided into sections, each containing a list of people together with their email addresses and universities; inspecting the page in the browser's developer tools reveals which HTML elements hold each piece of information.
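The exact markup will differ from site to site, but a small stand-in document built with rvest's minimal_html() lets us test XPath expressions before touching the live page (the tags and values below are invented for illustration):
library(rvest)
# A toy document mimicking the assumed section layout
toy <- minimal_html('
  <h3>Section A</h3>
  <table><tr><td>
    <a href="mailto:anna@example.com">anna@example.com</a>
    <span>Stockholm University</span>
  </td></tr></table>')
# The email link in the first cell of the table
html_text(html_nodes(toy, xpath = "//td[1]//a"))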
Extracting Data from the Website
We will use the rvest package to download and parse the HTML content of the website with read_html(), and then select specific elements from the parsed document with html_nodes() and XPath expressions.
library(rvest)
# Read and parse the HTML content of the website
url <- "http://www.example.com"
html_doc <- read_html(url)
# Extract the list of people (each entry is assumed to start with an <h3> heading)
people_list <- html_nodes(html_doc, xpath = "//h3")
# Extract the emails and universities; note the leading "." in each XPath,
# which makes the search relative to the current node instead of the whole document
emails <- sapply(people_list, function(x) {
  html_text(html_nodes(x, xpath = ".//td[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//a"))
})
universities <- sapply(people_list, function(x) {
  html_text(html_nodes(x, xpath = ".//td[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//span"))
})
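Before going further, a quick sanity check confirms that the selectors matched anything at all:
# How many entries did we find, and what do the first emails look like?
length(people_list)
head(emails)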
Creating a Data Frame
We will create a data frame to store the extracted data. The data frame will have three columns: people_name, emails, and university.
# Combine the extracted pieces, assuming one name, email, and university per entry
df <- data.frame(people_name = html_text(people_list),
                 emails = unlist(emails),
                 university = unlist(universities))
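A quick look at the result shows whether the three columns line up as expected:
# Inspect the structure and the first few rows
str(df)
head(df)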
Handling Special Cases
Some sections of the website follow a different layout, so we need to handle them separately: we select each section by its position on the page and extract the names and emails with section-specific XPath expressions.
for(i in 17:19) {
  # Select section i, assuming each section corresponds to one <table> on the page
  section <- html_nodes(html_doc, xpath = sprintf("(//table)[%d]", i))
  # Extract the email links in the first cell of each row; the href
  # attribute carries the address with a "mailto:" prefix
  links <- html_nodes(section, xpath = ".//td[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//a")
  # Add these people to the data frame
  df <- rbind(df, data.frame(people_name = html_text(links),
                             emails = html_attr(links, "href"),
                             university = "Stockholm University"))
}
for(i in 20:22) {
  # Select section i, assuming these sections are headed by the i-th <h3> on the page
  section <- html_nodes(html_doc, xpath = sprintf("(//h3)[%d]", i))
  # The email links sit in the table that follows the heading
  links <- html_nodes(section, xpath = "./following-sibling::table[1]//td[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//a")
  # Add these people to the data frame
  df <- rbind(df, data.frame(people_name = html_text(links),
                             emails = html_attr(links, "href"),
                             university = "Stockholm University"))
}
Cleaning and Saving the Data
Finally, we need to clean and save the extracted data. The addresses taken from href attributes still carry a mailto: prefix, which we remove with gsub(); cells holding several comma-separated addresses are reduced to their first entry with strsplit().
# Strip the "mailto:" prefix from the email addresses
df$emails <- gsub("mailto:", "", df$emails, fixed = TRUE)
# If a cell contains several comma-separated addresses, keep the first one
df$emails <- sapply(strsplit(df$emails, ","), `[[`, 1)
# Drop duplicate rows and keep the three columns of interest
df <- unique(df[, c("people_name", "emails", "university")])
# Save the data frame to a file (read it back with readRDS("data.rds"))
saveRDS(df, "data.rds")
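Because this article is built around jsonlite, the same cleaned data frame can also be exported as JSON for use outside R; write_json() is a thin wrapper around toJSON() (the file name here is arbitrary):
library(jsonlite)
# Export the cleaned data frame as a JSON file
write_json(df, "data.json", pretty = TRUE)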
Conclusion
In this article, we have demonstrated how to scrape data from a specific website with rvest, including special cases such as sections with different layouts that required their own XPath expressions. Finally, we cleaned the extracted data and saved it both as an R object and, via jsonlite, as JSON for further analysis.
Last modified on 2024-02-16