library(httr2)
## Download the cases data from the CDC API
api <- "https://data.cdc.gov/resource/pwn4-m3yp.json"
cases_raw <- request(api) |>
  req_url_query("$limit" = 10000000) |>
  req_perform() |>
  resp_body_json(simplifyVector = TRUE)
Problem set 5
Introduction
In this problem set, we aim to use data visualization to explore the following questions:
- Based on SARS-CoV-2 cases, COVID-19 deaths, and hospitalizations, what periods defined the worst two waves of 2020-2021?
- Did states with higher vaccination rates experience lower COVID-19 death rates?
- Were there regional differences in vaccination rates?
We are not providing definitive answers to these questions but rather generating visualizations that may offer insights.
Objective
We will create a single data frame that contains relevant observations for each jurisdiction, for each Morbidity and Mortality Weekly Report (MMWR) period in 2020 and 2021. The key outcomes of interest are:
- SARS-CoV-2 cases
- COVID-19 hospitalizations
- COVID-19 deaths
- Individuals receiving their first COVID-19 vaccine dose
- Individuals receiving a booster dose
Task Breakdown
Your task is divided into three parts:
- Download the data: Retrieve population data from the US Census API and COVID-19 statistics from the CDC API.
- Wrangle the data: Clean and join the datasets to create a final table containing all the necessary information.
- Create visualizations: Generate graphs to explore potential insights into the questions posed above.
Instructions
- Create a Git repository that includes the following directories:
  - data
  - code
  - figs
- Inside the code directory, include the following files:
  - funcs.R
  - wrangle.R
  - analysis.qmd
- The figs directory should contain three PNG files, with each file corresponding to one of the figures you are asked to create.
Detailed instructions follow for each of the tasks.
Download data
For this part we want the following:
- Save all the code that produces the final data frame in a file called wrangle.R.
- When executed, this code should save the final data frame in an RDS file in the data directory.
Copy the relevant code from the previous homework to create the population data frame. Put this code in the wrangle.R file in the code directory. Comment the code so we know where the population is created, where the regions are read in, and where we combine these.
In the previous problem set we wrote the script shown at the top of this page to download the cases data.
We are now going to download three other datasets from CDC that provide hospitalization, provisional COVID-19 deaths, and vaccine data. A different endpoint is provided for each one, but the requests are otherwise the same. To avoid rewriting the same code more than once, write a function called get_cdc_data that receives an endpoint and returns a data frame. Save this code in a file called funcs.R.
- Use the get_cdc_data function to download the cases, hospitalization, deaths, and vaccination data from the following endpoints and save the data frames:
  - cases: https://data.cdc.gov/resource/pwn4-m3yp.json
  - hospitalizations: https://data.cdc.gov/resource/39z2-9zu6.json
  - deaths: https://data.cdc.gov/resource/r8kw-7aab.json
  - vaccinations: https://data.cdc.gov/resource/rh2h-3yt2.json
We recommend saving them into objects called cases_raw, hosp_raw, deaths_raw, and vax_raw. Add the code to the wrangle.R file. Add comments to indicate where the data is read in.
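A minimal sketch of what such a helper might look like, assuming the httr2 package and reusing the request pattern from the cases script of the previous problem set. The argument name limit and its default are our own choices for illustration, not part of the assignment.

```r
library(httr2)

# Hypothetical helper: `endpoint` is the URL of a CDC JSON endpoint.
# `limit` is an assumed argument name that caps the number of rows requested.
get_cdc_data <- function(endpoint, limit = 10000000) {
  request(endpoint) |>
    req_url_query("$limit" = limit) |>
    req_perform() |>
    resp_body_json(simplifyVector = TRUE)
}

# Example call (requires network access), left commented out:
# cases_raw <- get_cdc_data("https://data.cdc.gov/resource/pwn4-m3yp.json")
```

Keeping the row limit as an argument makes it easy to request a handful of rows while exploring column names and the full table when running wrangle.R.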
Wrangling Challenge
In this section, you will wrangle the files downloaded in the previous step into a single data frame containing all the necessary information. We recommend using the following column names: date, state, cases, hosp, deaths, vax, booster, and population.
Key Considerations
Align reporting periods: Ensure that the time periods for which each outcome is reported are consistent. Specifically, calculate the totals for each Morbidity and Mortality Weekly Report (MMWR) period.
Harmonize variable names: To facilitate the joining of datasets, rename variables so that they match across all datasets.
- One challenge is that the data frames use different column names to represent the same variable. Examine each data frame and report back 1) the name of the column with state abbreviations, 2) whether the data is yearly, monthly, weekly, or daily, and 3) all the column names that provide date information.
Outcome | Jurisdiction variable name | Rate (yearly, monthly, weekly, or daily) | Time variable names |
---|---|---|---|
cases | | | |
hospitalizations | | | |
deaths | | | |
vaccines | | | |
Wrangle the cases data frame to keep state, MMWR year, MMWR week, and the total number of cases for that week in that state. Keep only states for which we have population estimates. Hint: Use the as_date, ymd_hms, epiweek, and epiyear functions in the lubridate package. Comment appropriately.
Now repeat the same exercise for hospitalizations. Note that you will have to collapse the data into weekly data and keep the same columns as in the cases dataset, except keep total weekly hospitalizations instead of cases. Remove weeks with fewer than 7 days of reporting. Add this code to wrangle.R and comment appropriately.
Repeat what you did in the previous two exercises for provisional COVID-19 deaths. Add this code to wrangle.R and comment appropriately.
Repeat this now for vaccination data. Keep the variables series_complete and booster along with state and date. Add this code to wrangle.R and comment appropriately.
Now we are ready to join the tables. We will only consider 2020 and 2021 as we don’t have population sizes for 2022. However, because we want to guarantee that all dates are included, we will create a data frame with all possible weeks. Add this code to your wrangle.R file. We can use this:
## Make dates data frame
all_dates <- data.frame(date = seq(make_date(2020, 1, 25),
                                   make_date(2021, 12, 31),
                                   by = "week")) |>
  mutate(date = ceiling_date(date, unit = "week", week_start = 7) - days(1)) |>
  mutate(mmwr_year = epiyear(date), mmwr_week = epiweek(date))
dates_and_pop <- cross_join(all_dates, data.frame(state = unique(population$state))) |>
  left_join(population, by = c("state", "mmwr_year" = "year"))
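For the daily series in the exercises above, such as hospitalizations, the collapse into MMWR weeks could be sketched as follows. This is an illustration on toy data only: the column names date and new_hosp are placeholders, not the names used in the CDC tables, and the values are made up.

```r
library(dplyr)
library(lubridate)

# Toy daily data for one state; column names and counts are placeholders.
hosp_toy <- data.frame(
  state = "MA",
  date = seq(make_date(2021, 1, 3), make_date(2021, 1, 19), by = "day"),
  new_hosp = 10
)

hosp_weekly <- hosp_toy |>
  # Label each day with its MMWR year and week
  mutate(mmwr_year = epiyear(date), mmwr_week = epiweek(date)) |>
  group_by(state, mmwr_year, mmwr_week) |>
  # Weekly total, plus the number of days that reported in that week
  summarize(hosp = sum(new_hosp), n_days = n(), .groups = "drop") |>
  # Drop weeks with fewer than 7 days of reporting
  filter(n_days == 7) |>
  select(-n_days)
```

In this toy example the last three days fall in a partial week, so only the two complete weeks remain.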
Now join all the tables to create your final table. Make sure it is ordered by date within each state. Call it dat and save it as an RDS file in the data directory. Add this code to wrangle.R and comment appropriately.
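One possible shape for the join, sketched on toy tables with made-up values. It assumes each weekly table has been reduced to the keys state, mmwr_year, and mmwr_week plus one outcome column. The object names ending in _toy are hypothetical.

```r
library(dplyr)

# Toy stand-ins for the real tables; all values are made up for illustration.
dates_and_pop_toy <- data.frame(
  state = c("MA", "MA", "CT", "CT"),
  date = as.Date(c("2021-01-16", "2021-01-09", "2021-01-16", "2021-01-09")),
  mmwr_year = 2021,
  mmwr_week = c(2, 1, 2, 1),
  population = c(7000000, 7000000, 3600000, 3600000)
)
cases_toy <- data.frame(state = c("MA", "CT"), mmwr_year = 2021,
                        mmwr_week = c(1, 1), cases = c(500, 200))
deaths_toy <- data.frame(state = "MA", mmwr_year = 2021,
                         mmwr_week = 2, deaths = 12)

keys <- c("state", "mmwr_year", "mmwr_week")

# Start from the complete grid of weeks so that weeks without reports are kept as NA
dat_toy <- dates_and_pop_toy |>
  left_join(cases_toy, by = keys) |>
  left_join(deaths_toy, by = keys) |>
  arrange(state, date)

# Written to a temporary file here; in wrangle.R the path would point into data/
out_file <- tempfile(fileext = ".rds")
saveRDS(dat_toy, out_file)
```

Using left joins from the table of all weeks, rather than inner joins between the outcome tables, is what guarantees that every date appears in the result.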
Data visualization: generate some plots
We are now ready to create some figures. In the analysis.qmd file, create a section for each figure. You should load the dat object stored in the RDS file in the data directory.
You can call these sections Figure 1, Figure 2, and so on. Include a short description of what the figure is before the code chunk. The rendered file should show both the code and the figure.
Plot a trend plot for cases, hospitalizations, and deaths. Plot rates per \(100,000\) people. Place the plots on top of each other. Hint: Use pivot_longer and facet_wrap.
To determine when vaccination started and when most of the population was vaccinated, compute the percent of the US population (including DC and Puerto Rico) that was vaccinated by date. Do the same for the booster. Then plot both percentages.
Describe the distribution of vaccination rates on July 1, 2021.
Is there a difference across regions? Discuss what the plot shows.
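For the stacked trend plot in the first exercise of this section, the pivot_longer and facet_wrap hint could be applied roughly as follows. This is a sketch on a toy stand-in for dat with made-up values, and the plot styling is our own choice.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for dat; values are made up for illustration only.
dat_toy <- data.frame(
  state = rep(c("MA", "CT"), each = 3),
  date = rep(as.Date(c("2021-01-09", "2021-01-16", "2021-01-23")), 2),
  cases = c(500, 650, 400, 200, 260, 180),
  hosp = c(50, 70, 45, 20, 30, 18),
  deaths = c(5, 8, 4, 2, 3, 1),
  population = rep(c(7000000, 3600000), each = 3)
)

# One row per state, week, and outcome, with the weekly rate per 100,000 people
rates_long <- dat_toy |>
  pivot_longer(c(cases, hosp, deaths), names_to = "outcome", values_to = "count") |>
  mutate(rate = count / population * 100000)

# One panel per outcome, stacked vertically, each with its own y-axis scale
p <- ggplot(rates_long, aes(date, rate, group = state)) +
  geom_line() +
  facet_wrap(~outcome, ncol = 1, scales = "free_y") +
  labs(x = "Date", y = "Rate per 100,000 people")
```

Setting ncol = 1 places the panels on top of each other, and free y-axis scales keep the deaths panel readable next to the much larger case rates.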
Using the two previous figures, identify two time periods that meet the following criteria:
- A significant COVID-19 wave occurred across the United States.
- A sufficient number of people had been vaccinated.
Next, follow these steps:
- For each state, calculate the COVID-19 deaths per day per 100,000 people during the selected time period.
- Determine the vaccination rate (primary series) in each state as of the last day of the period.
- Create a scatter plot to visualize the relationship between these two variables:
  - The x-axis should represent the vaccination rate.
  - The y-axis should represent the deaths per day per 100,000 people.
- Repeat the exercise for the booster.
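The per-state summary described in these steps could be computed along the following lines. This is a sketch on toy data: the period, the counts, and the assumption that vax holds the cumulative number of people with a completed primary series are all ours, not taken from the assignment.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for dat; the period and all values are made up for illustration.
dat_toy <- data.frame(
  state = rep(c("MA", "CT"), each = 2),
  date = rep(as.Date(c("2021-08-07", "2021-08-14")), 2),
  deaths = c(14, 21, 7, 7),
  vax = c(4500000, 4550000, 2160000, 2196000),  # assumed cumulative counts
  population = rep(c(7000000, 3600000), each = 2)
)

# Assumed period, chosen only for this example
start_date <- as.Date("2021-08-01")
end_date <- as.Date("2021-08-14")
n_days <- as.numeric(end_date - start_date) + 1

summary_toy <- dat_toy |>
  filter(date >= start_date, date <= end_date) |>
  group_by(state) |>
  summarize(
    # Total deaths in the period, per day, per 100,000 people
    death_rate = sum(deaths) / n_days / population[1] * 100000,
    # Percent vaccinated as of the last date in the period
    vax_rate = vax[which.max(date)] / population[which.max(date)] * 100,
    .groups = "drop"
  )

p <- ggplot(summary_toy, aes(vax_rate, death_rate, label = state)) +
  geom_point() +
  geom_text(vjust = -1) +
  labs(x = "Vaccination rate (%)", y = "Deaths per day per 100,000 people")
```

Repeating the exercise for the booster would only require swapping the vax column for booster in the vax_rate calculation.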