library(HistData)
library(dplyr)
library(ggplot2)
names(GaltonFamilies)
set.seed(2007)
heights <- GaltonFamilies |>
## your code hereProblem set 8
Overview
This problem set consists of two main parts:
Part I (Problems 1-6): Linear Regression Fundamentals
You’ll work with the famous Galton height data to understand regression lines, centering predictors, and the relationship between correlation and regression coefficients.
Part II (Problems 7-16): Excess Mortality Analysis
You’ll analyze mortality data from Puerto Rico to estimate excess deaths following Hurricane María in 2017. The goal is to build a statistical model using historical data (2002-2016) to predict how many deaths we would have expected in 2017 without the hurricane, then compare this to what actually happened.
Required Packages
Make sure you have installed: HistData, dplyr, ggplot2, lubridate, and excessmort.
- Load the HistData package. Create a
heightsdataset with the father’s height and one randomly selected daughter from each family. Exclude families with no female children. Set the seed at 2007 and use the functionsample_nto select the random child. You should end up with a dataset with two columns:fatheranddaughter.
- Estimate the intercept and slope of the regression line for predicting daughter height \(Y\) using father height \(X\). Use the following regression line formula (which relates standardized variables through the correlation coefficient \(\rho\)):
\[ \frac{\hat{Y} - \mu_Y}{\sigma_Y} = \rho \frac{x - \mu_x}{\sigma_x} \]
## your code here- Make a plot to confirm the regression line goes through the data.
heights |> ggplot(aes(father, daughter)) + ## your code here- Recompute the slope and intercept coefficients, this time using
lmand confirm you get the same answer as with the formula used in problem 2.
## your code here- Note that the interpretation of the intercept is: the height prediction for the daughter whose father is 0 inches tall. This is not a very useful interpretation. Re-run the regression but instead of father height use inches above average for each father: instead of using the \(x_i\)s use \(x_i - \bar{x}\). What is the interpretation of the intercept now? Does the slope estimate change?
## your code here- When using the centered father heights as a predictor, is the intercept the same as the average daughter height? Check if this is the case with the values you computed and then show that mathematically this has to be the case.
## your code herePart II: Excess Mortality Analysis
For the next exercises install the excessmort package. For the latest version use
library(devtools)
install_github("rafalab/excessmort")- Load the
excessmortpackage and define an object calledcountsby filteringpuerto_rico_countsto include only data from 2002-2017 for people aged 60 or over. We focus on this older population because they are most vulnerable to the health impacts of natural disasters.
library(excessmort)
library(lubridate)
counts <- puerto_rico_counts |>
## your code here- When comparing mortality over time, simply counting deaths can be misleading if the population size changes. It’s particularly important to consider mortality rates (deaths per person) rather than totals because Puerto Rico’s demographics changed dramatically over this period. To see this, use
puerto_rico_countsto plot the population sizes by age group and gender for people aged 60+. Provide a two-sentence description of what you observe.
puerto_rico_counts |>
## your code here- Hurricane María made landfall in Puerto Rico on September 20, 2017. Use R to determine what day of the week that was.
## your code here- To reduce day-to-day noise in the data, we’ll aggregate deaths into weekly totals. Redefine the date column to represent the start of the week that each date falls into, using Wednesday as the first day of the week (since María made landfall on a Wednesday). Use
floor_date()to round each date down to the nearest Wednesday. Save the result asweekly_counts.
weekly_counts <- counts |>
## your code here- Now, collapse the data to get one row per week, sex, and age group combination. Sum up the
outcome(total deaths that week) and average thepopulation. Remove any weeks with fewer than 7 days of data. Also add a column for the “MMWR week” - a standard epidemiological week numbering system (1-53) used by the CDC. Overwrite theweekly_countsobject with this new, summarized data.
weekly_counts <- weekly_counts |>
## your code here- Create a boxplot showing the distribution of weekly mortality rates for each MMWR week number, using only the 2002-2016 data (each boxplot will have 15 points, one per year). Then add the 2017 data as red points to see how it compares to the historical pattern.
## your code here- You should notice that 2017 mortality rates are often lower than the historical average, especially before the hurricane. To understand why, plot the overall annual mortality rate for each year from 2002-2016.
weekly_counts |>
filter(year(date) < 2017) |>
## your code here- The previous plot reveals a decreasing trend in mortality rates over time, which explains why 2017 starts lower than expected. Now, fit a linear model to predict the weekly mortality
rateusing only 2002-2016 data. Your model should account for:
- A long-term time trend (year).
- Seasonal patterns (MMWR week).
- Age differences.
- Sex differences.
## your code here- Use your model to predict expected mortality rates for all weeks in the dataset (including 2017). Convert these rates back to expected death counts, then calculate “excess deaths” as observed deaths minus expected deaths. Plot the weekly total excess deaths with a confidence interval.
## your code here- Finally, evaluate your model by plotting observed versus predicted mortality rates for each age group and sex combination. Comment on how well the model fits the pre-hurricane data and suggest one improvement you could make.
## your code here