Problem set 8

Published

November 16, 2025

Overview

This problem set consists of two main parts:

Part I (Problems 1-6): Linear Regression Fundamentals
You’ll work with the famous Galton height data to understand regression lines, centering predictors, and the relationship between correlation and regression coefficients.

Part II (Problems 7-16): Excess Mortality Analysis
You’ll analyze mortality data from Puerto Rico to estimate excess deaths following Hurricane María in 2017. The goal is to build a statistical model using historical data (2002-2016) to predict how many deaths we would have expected in 2017 without the hurricane, then compare this to what actually happened.

Required Packages

Make sure you have installed: HistData, dplyr, ggplot2, lubridate, and excessmort.

Load the HistData package. Create a heights dataset with the father’s height and one randomly selected daughter from each family. Exclude families with no female children. Set the seed at 2007 and use the function sample_n to select the random child. You should end up with a dataset with two columns: father and daughter.

library(HistData)
library(dplyr)
library(ggplot2)
names(GaltonFamilies)
set.seed(2007)
heights <- GaltonFamilies |> 
  ## your code here

Estimate the intercept and slope of the regression line for predicting daughter height \(Y\) using father height \(X\). Use the following regression line formula (which relates standardized variables through the correlation coefficient \(\rho\)):

\[ \frac{\hat{Y} - \mu_Y}{\sigma_Y} = \rho \frac{x - \mu_x}{\sigma_x} \]

## your code here

Make a plot to confirm the regression line goes through the data.

heights |> ggplot(aes(father, daughter)) + ## your code here

Recompute the slope and intercept coefficients, this time using lm and confirm you get the same answer as with the formula used in problem 2.

## your code here

Note that the interpretation of the intercept is: the height prediction for the daughter whose father is 0 inches tall. This is not a very useful interpretation. Re-run the regression but instead of father height use inches above average for each father: instead of using the \(x_i\)s use \(x_i - \bar{x}\). What is the interpretation of the intercept now? Does the slope estimate change?

## your code here

When using the centered father heights as a predictor, is the intercept the same as the average daughter height? Check if this is the case with the values you computed and then show that mathematically this has to be the case.

## your code here

Part II: Excess Mortality Analysis

For the next exercises install the excessmort package. For the latest version use

library(devtools)
install_github("rafalab/excessmort")

Load the excessmort package and define an object called counts by filtering puerto_rico_counts to include only data from 2002-2017 for people aged 60 or over. We focus on this older population because they are most vulnerable to the health impacts of natural disasters.

library(excessmort) 
library(lubridate)
counts <- puerto_rico_counts |>
  ## your code here

When comparing mortality over time, simply counting deaths can be misleading if the population size changes. It’s particularly important to consider mortality rates (deaths per person) rather than totals because Puerto Rico’s demographics changed dramatically over this period. To see this, use puerto_rico_counts to plot the population sizes by age group and gender for people aged 60+. Provide a two-sentence description of what you observe.

puerto_rico_counts |> 
  ## your code here

Hurricane María made landfall in Puerto Rico on September 20, 2017. Use R to determine what day of the week that was.

## your code here

To reduce day-to-day noise in the data, we’ll aggregate deaths into weekly totals. Redefine the date column to represent the start of the week that each date falls into, using Wednesday as the first day of the week (since María made landfall on a Wednesday). Use floor_date() to round each date down to the nearest Wednesday. Save the result as weekly_counts.

weekly_counts <- counts |>
  ## your code here

Now, collapse the data to get one row per week, sex, and age group combination. Sum up the outcome (total deaths that week) and average the population. Remove any weeks with fewer than 7 days of data. Also add a column for the “MMWR week” - a standard epidemiological week numbering system (1-53) used by the CDC. Overwrite the weekly_counts object with this new, summarized data.

weekly_counts <- weekly_counts |> 
  ## your code here

Create a boxplot showing the distribution of weekly mortality rates for each MMWR week number, using only the 2002-2016 data (each boxplot will have 15 points, one per year). Then add the 2017 data as red points to see how it compares to the historical pattern.

## your code here

You should notice that 2017 mortality rates are often lower than the historical average, especially before the hurricane. To understand why, plot the overall annual mortality rate for each year from 2002-2016.

weekly_counts |> 
  filter(year(date) < 2017) |>
 ## your code here

The previous plot reveals a decreasing trend in mortality rates over time, which explains why 2017 starts lower than expected. Now, fit a linear model to predict the weekly mortality rate using only 2002-2016 data. Your model should account for:

A long-term time trend (year).
Seasonal patterns (MMWR week).
Age differences.
Sex differences.

## your code here

Use your model to predict expected mortality rates for all weeks in the dataset (including 2017). Convert these rates back to expected death counts, then calculate “excess deaths” as observed deaths minus expected deaths. Plot the weekly total excess deaths with a confidence interval.

## your code here

Finally, evaluate your model by plotting observed versus predicted mortality rates for each age group and sex combination. Comment on how well the model fits the pre-hurricane data and suggest one improvement you could make.

## your code here