14  Web scraping

The data we need to answer a question is not always in a spreadsheet ready for us to read. For example, the US murders dataset we used in the R Basics chapter originally comes from this Wikipedia page:

url <- "https://en.wikipedia.org/w/index.php?title=Gun_violence_in_the_United_States_by_state&direction=prev&oldid=810166167"

Web scraping, or web harvesting, is the term we use to describe the process of extracting data from a website. The reason we can do this is because the information used by a browser to render webpages is received as a text file from a server. The text is code written in hyper text markup language (HTML). Every browser has a way to show the html source code for a page, each one different. On Chrome, you can use Control-U on a PC and command+alt+U on a Mac.

14.1 HTML

Here is the key piece of code:

<table class="wikitable sortable">
<tr>
<th>State</th>
<th><a href="/wiki/List_of_U.S._states_and_territories_by_population" 
title="List of U.S. states and territories by population">Population</a><br />
<small>(total inhabitants)</small><br />
<small>(2015)</small> <sup id="cite_ref-1" class="reference">
<a href="#cite_note-1">[1]</a></sup></th>
<th>Murders and Nonnegligent
<p>Manslaughter<br />
<small>(total deaths)</small><br />
<small>(2015)</small> <sup id="cite_ref-2" class="reference">
<a href="#cite_note-2">[2]</a></sup></p>
</th>
<th>Murder and Nonnegligent
<p>Manslaughter Rate<br />
<small>(per 100,000 inhabitants)</small><br />
<small>(2015)</small></p>
</th>
</tr>
<tr>
<td><a href="/wiki/Alabama" title="Alabama">Alabama</a></td>
<td>4,853,875</td>
<td>348</td>
<td>7.2</td>
</tr>
<tr>
<td><a href="/wiki/Alaska" title="Alaska">Alaska</a></td>
<td>737,709</td>
<td>59</td>
<td>8.0</td>
</tr>
<tr>

14.2 The rvest package

The tidyverse provides a web harvesting package called rvest. The first step using this package is to import the webpage into R. The package makes this quite simple:

library(tidyverse)
library(rvest)
h <- read_html(url)

Note that the entire Murders in the US Wikipedia webpage is now contained in h. The class of this object is:

class(h)
[1] "xml_document" "xml_node"    

The rvest package is actually more general; it handles XML documents. XML is a general markup language (that’s what the ML stands for) that can be used to represent any kind of data. HTML is a specific type of XML specifically developed for representing webpages. Here we focus on HTML documents.

Now, how do we extract the table from the object h? If we print h, we don’t really see much:

h
{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr ...

We can see all the code that defines the downloaded webpage using the html_text function like this:

html_text(h)

If we look at it, we can see the data we are after are stored in an HTML table: you can see this in this line of the HTML code above <table class="wikitable sortable">. The different parts of an HTML document, often defined with a message in between < and > are referred to as nodes. The rvest package includes functions to extract nodes of an HTML document: html_nodes extracts all nodes of different types and html_node extracts the first one. To extract the tables from the html code we use:

tab <- h |> html_nodes("table")

Now, instead of the entire webpage, we just have the html code for the tables in the page:

tab
{xml_nodeset (2)}
[1] <table class="wikitable sortable"><tbody>\n<tr>\n<th>State\n</th>\n<th>\n ...
[2] <table class="nowraplinks hlist mw-collapsible mw-collapsed navbox-inner" ...

The table we are interested is the first one:

tab[[1]]
{html_node}
<table class="wikitable sortable">
[1] <tbody>\n<tr>\n<th>State\n</th>\n<th>\n<a href="/wiki/List_of_U.S._states ...

This is clearly not a tidy dataset, not even a data frame. In the code above, you can definitely see a pattern and writing code to extract just the data is very doable. In fact, rvest includes a function just for converting HTML tables into data frames:

tab <- read_html(url) |> html_nodes("table") %>% .[[1]] |> html_table()
class(tab)
[1] "tbl_df"     "tbl"        "data.frame"

We are now much closer to having a usable data table:

tab <- tab |> setNames(c("state", "population", "total", "murder_rate")) 
head(tab)
# A tibble: 6 × 4
  state      population total murder_rate
  <chr>      <chr>      <chr>       <dbl>
1 Alabama    4,853,875  348           7.2
2 Alaska     737,709    59            8  
3 Arizona    6,817,565  309           4.5
4 Arkansas   2,977,853  181           6.1
5 California 38,993,940 1,861         4.8
6 Colorado   5,448,819  176           3.2

We still have some wrangling to do. For example, we need to remove the commas and turn characters into numbers. Before continuing with this, we will learn a more general approach to extracting information from web sites.

14.3 CSS selectors

The default look of a webpage made with the most basic HTML is quite unattractive. The aesthetically pleasing pages we see today are made using CSS to define the look and style of webpages. The fact that all pages for a company have the same style usually results from their use of the same CSS file to define the style. The general way these CSS files work is by defining how each of the elements of a webpage will look. The title, headings, itemized lists, tables, and links, for example, each receive their own style including font, color, size, and distance from the margin. CSS does this by leveraging patterns used to define these elements, referred to as selectors. An example of such a pattern, which we used above, is table, but there are many, many more.

If we want to grab data from a webpage and we happen to know a selector that is unique to the part of the page containing this data, we can use the html_nodes function. However, knowing which selector can be quite complicated. In fact, the complexity of webpages has been increasing as they become more sophisticated. For some of the more advanced ones, it seems almost impossible to find the nodes that define a particular piece of data. However, selector gadgets actually make this possible.

SelectorGadget1 is piece of software that allows you to interactively determine what CSS selector you need to extract specific components from the webpage. If you plan on scraping data other than tables from html pages, we highly recommend you install it. A Chrome extension is available which permits you to turn on the gadget and then, as you click through the page, it highlights parts and shows you the selector you need to extract these parts. There are various demos of how to do this including rvest author Hadley Wickham’s vignette2 and other tutorials based on the vignette3 4.

14.4 JSON

Sharing data on the internet has become more and more common. Unfortunately, providers use different formats, which makes it harder for data scientists to wrangle data into R. Yet there are some standards that are also becoming more common. Currently, a format that is widely being adopted is the JavaScript Object Notation or JSON. Because this format is very general, it is nothing like a spreadsheet. This JSON file looks more like the code you use to define a list. Here is an example of information stored in a JSON format:


Attaching package: 'jsonlite'
The following object is masked from 'package:purrr':

    flatten
[
  {
    "name": "Miguel",
    "student_id": 1,
    "exam_1": 85,
    "exam_2": 86
  },
  {
    "name": "Sofia",
    "student_id": 2,
    "exam_1": 94,
    "exam_2": 93
  },
  {
    "name": "Aya",
    "student_id": 3,
    "exam_1": 87,
    "exam_2": 88
  },
  {
    "name": "Cheng",
    "student_id": 4,
    "exam_1": 90,
    "exam_2": 91
  }
] 

The file above actually represents a data frame. To read it, we can use the function fromJSON from the jsonlite package. Note that JSON files are often made available via the internet. Several organizations provide a JSON API or a web service that you can connect directly to and obtain data. Here is an example providing information Nobel prize winners:

nobel <- fromJSON("http://api.nobelprize.org/v1/prize.json")

This downloads a list. The first argument is a table with information about Nobel prize winners:

filter(nobel$prize, category == "literature" & year == "1971") |> pull(laureates)
[[1]]
   id firstname surname
1 645     Pablo  Neruda
                                                                                               motivation
1 "for a poetry that with the action of an elemental force brings alive a continent's destiny and dreams"
  share
1     1

You can learn much more by examining tutorials and help files from the jsonlite package. This package is intended for relatively simple tasks such as converting data into tables. For more flexibility, we recommend rjson.

14.5 APIs

An API, or Application Programming Interface, is a set of rules and protocols that allows different software entities to communicate with each other. It defines methods and data formats that software components should use when requesting and exchanging information.

APIs can be understood as middlemen between different software systems, facilitating their interactions. They play a crucial role in enabling integration that make today’s software so interconnected and versatile.

There are several types of APIs. The main ones related to retrieving data are:

  • Web Services:
    • Often built using protocols like HTTP/HTTPS.
    • Commonly used to enable applications to communicate with each other over the web. For instance, a weather application for a smartphone may use a web API to request weather data from a remote server.
  • Databases:
    • Enable communication between an application and a database.
    • Examples include SQL-based calls or more abstracted ORM (Object-Relational Mapping) frameworks.

Other APIs include: * Hardware APIs: Used for applications to interact with hardware, such as printers, cameras, or graphics cards.

  • Remote Procedure Calls: Allows a protocol to execute a program or procedure on a remote server rather than on a local system.

  • Library or Framework APIs: Define how to interact with a particular programming library or framework.

  • Operating Systems:Define how software applications request and perform lower-level services performed by an operating system.

Key concepts associated with APIs:

  • Endpoints: Specific functions available through the API. For web APIs, an endpoint is usually a specific URL where the API can be accessed.

  • Methods: Actions that can be performed. In web APIs, these often correspond to HTTP methods like GET, POST, PUT, DELETE.

  • Requests and Responses: The act of asking the API to perform its function is a request, and the data it returns to you is the response.

  • Rate Limits: Restrictions on how often you can call the API, often used to prevent abuse or overloading of the service.

  • Authentication and Authorization: Mechanisms to ensure that only approved users or applications can use the API. Common methods include API keys, OAuth, or JWT.

  • Data Formats: Many web APIs exchange data in a specific format, often JSON or XML.

In today’s digital age, APIs are foundational. They enable the creation of complex, feature-rich applications by allowing different software components to leverage each other’s capabilities seamlessly.

14.6 httr2 package

The httr2 package provides functions to work with HTTP requests. One of the core functions in this package is request, which is used to form request to send to web services. The req_perform function sends the request.

This request function formas a HTTP GET request to the specified URL. Typically, HTTP GET requests are used to retrieve information from a server based on the provided URL.

The function returns an object of class response. This object contains all the details of the server’s response, including status code, headers, and content. You can then use other httr2 functions to extract or interpret information from this response.

Let’s say you want to retrieve a webpage or API endpoint:

library(readr)
library(httr2)
url <- "https://data.cdc.gov/resource/gvsb-yw6g.csv"
response <- request(url) |> req_perform()

We can see the results of the request by looking at the returned object.

response
<httr2_response>
GET https://data.cdc.gov/resource/gvsb-yw6g.csv
Status: 200 OK
Content-Type: text/csv
Body: In memory (111411 bytes)

To extract the body, which is where the data are, we can use resp_body_string and send the result, a comma delimited string, to read_csv.

tab <- response |> resp_body_string() |> read_csv()
Rows: 1000 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): level
dbl  (7): perc_diff, percent_pos, percent_pos_2_week, percent_pos_4_week, nu...
dttm (2): posted, mmwrweek_end

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note it is only 1000 entries. API often limit how much you can download. We can change this by adding a query parameter to thr request. Here we use the req_url_path_append.

tab <- request(url) |> 
   req_url_path_append("?$limit=10000000") |> 
   req_perform() |>
   resp_body_string() |> 
   read_csv()
Rows: 34361 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): level
dbl  (7): perc_diff, percent_pos, percent_pos_2_week, percent_pos_4_week, nu...
dttm (2): posted, mmwrweek_end

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(tab)
[1] 34361    10

The CDC service returns data in csv format. A more common format used by web services is JSON. The CDC also provides data in json format through a the url:

url <- "https://data.cdc.gov/resource/gvsb-yw6g.json"

To extract the data table we use the fromJSON function.

tab <- request(url) |> 
   req_perform() |> 
   resp_body_string() |> 
   fromJSON(flatten = TRUE)

When working with APIs, it’s essential to check the API’s documentation for rate limits, required headers, or authentication methods. The httr package provides tools to handle these requirements, such as setting headers or using OAuth authentication.

14.7 Exercises (will not be included in midterm)

  1. Visit the following web page: https://web.archive.org/web/20181024132313/http://www.stevetheump.com/Payrolls.htm

Notice there are several tables. Say we are interested in comparing the payrolls of teams across the years. The next few exercises take us through the steps needed to do this.

Start by applying what you learned to read in the website into an object called h.

  1. Note that, although not very useful, we can actually see the content of the page by typing:
html_text(h)

The next step is to extract the tables. For this, we can use the html_nodes function. We learned that tables in html are associated with the table node. Use the html_nodes function and the table node to extract the first table. Store it in an object nodes.

  1. The html_nodes function returns a list of objects of class xml_node. We can see the content of each one using, for example, the html_text function. You can see the content for an arbitrarily picked component like this:
html_text(nodes[[8]])

If the content of this object is an html table, we can use the html_table function to convert it to a data frame. Use the html_table function to convert the 8th entry of nodes into a table.

  1. Repeat the above for the first 4 components of nodes. Which of the following are payroll tables:
  1. All of them.
  2. 1
  3. 2
  4. 2-4
  1. Repeat the above for the first last 3 components of nodes. Which of the following is true:
  1. The last entry in nodes shows the average across all teams through time, not payroll per team.
  2. All three are payroll per team tables.
  3. All three are like the first entry, not a payroll table.
  4. All of the above.
  1. We have learned that the first and last entries of nodes are not payroll tables. Redefine nodes so that these two are removed.

  2. We saw in the previous analysis that the first table node is not actually a table. This happens sometimes in html because tables are used to make text look a certain way, as opposed to storing numeric values. Remove the first component and then use sapply and html_table to convert each node in nodes into a table. Note that in this case, sapply will return a list of tables. You can also use lapply to assure that a list is applied.

  3. Look through the resulting tables. Are they all the same? Could we just join them with bind_rows?

  4. Create two tables, call them tab_1 and tab_2 using entries 10 and 19.

  5. Use a full_join function to combine these two tables. Before you do this you will have to fix the missing header problem. You will also need to make the names match.

  6. After joining the tables, you see several NAs. This is because some teams are in one table and not the other. Use the anti_join function to get a better idea of why this is happening.

  7. We see see that one of the problems is that Yankees are listed as both N.Y. Yankees and NY Yankees. In the next section, we will learn efficient approaches to fixing problems like this. Here we can do it “by hand” as follows:

tab_1 <- tab_1 |>
  mutate(Team = ifelse(Team == "N.Y. Yankees", "NY Yankees", Team))

Now join the tables and show only Oakland and the Yankees and the payroll columns.


  1. http://selectorgadget.com/↩︎

  2. https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html↩︎

  3. https://stat4701.github.io/edav/2015/04/02/rvest_tutorial/↩︎

  4. https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/↩︎