library(HistData)
names(GaltonFamilies)
set.seed(2007)
<- GaltonFamilies |> ## your code here heights
Problem set 8
- Load the HistData package. Create a
galton_height
data with the father’s height and one randomly selected daughter from each family. Exclude families with no female children. Set the seed at 2007 and use the functionsample_n
to select the random child. You should end up with aheights
dataset with two columns:father
anddaughter
.
- Estimate the intercept and slope of the regression line for predicting daughter height \(Y\) using father height \(X\). Use the following regression line formula:
\[ \frac{\hat{Y} - \mu_Y}{\sigma_Y} = \rho \frac{x - \mu_x}{\sigma_x} \]
## your code here
- Make a plot to confirm the regression line goes through the data.
|> ggplot(aes(father, daughter)) + ## your code here heights
- Recompute the slope and intercept coefficients, this time using
lm
and confirm you get the same answer as with the formula used in problem 2.
## your code here
- Note that the interpretation of the intercept is: the height prediction for the daughter whose father is 0 inches tall. This is not a very useful interpretation. Re-run the regression but instead of father height use inches above average for each father: instead of using the \(x_i\)s use \(x_i - \bar{x}\). What is the interpretation of the intercept now? Does the slope estimate change?
##your code here
- When using the centered father heights as a predictor, is the intercept the same as the average daughter height? Check if this is the case with the values you computed and then show that mathematically this has to be the case.
##your code here
For the next exercises install the excessmort package. For the latest version use
library(devtools)
install_github("rafalab/excessmort")
- Define an object
counts
by wranglingpuerto_rico_counts
to 1) include data only from 2002-2017 and counts for people 60 or over. We will focus in this older subset throughout the rest of the problem set.
library(excessmort)
- Use R to determine what day of the week María made landfall in PR (September 20, 2017).
##your code here
- Redefine the date column to be the start of the week that date is part of: in other words, round the date down to the nearest week. Use the day of the week María made landfall as the first day. So, for example, 2017-09-20, 2017-09-21, 2017-09-22 should all be rounded down to 2017-09-20, while 2017-09-19 should be rounded down to 2017-09-13. Save the resulting table in
weekly_counts
.
##your code here
- Now collapse the
weekly_count
data frame to store only one mortality value for each week, for eachsex
andagegroup
. To this by by redefiningoutcome
to have the total deaths that week for eachsex
andagegroup
. Remove weeks that have less the 7 days of data. Finally, add a column with the MMWR week. Name the resulting data frameweekly_counts
.
##your code here
- Comparing mortality totals is often unfair because the two groups begin compared have different population sizes. It is particularly important we consider rates rather than totals in this dataset because the demographics in Puerto Rico changed dramatically in the last 20 years. To see this use
puerto_rico_counts
to plot the population sizes by age group and gender. Provide a two sentence description of what you see.
|> ## your code here puerto_rico_counts
- Make a boxplot for each MMWR week’s mortality rate based on the 2002-2016 data. Each week has 15 data points, one for each year. Then add the 2017 data as red points.
###your code here
- Note twp things: 1) there is a strong week effect and 2) 2017 is lower than expected. Plot the yearly rates (per 1,000) for 2002-2016:
|>
weekly_counts filter(year(date) < 2017) |>
## your code here
- The plot made in 14 explains why 2017 is below what is expected: there appears to be a general decrease in mortality with time. A possible explanation is that medical care is improving and people are living more healthy lives.
Fit a linear model to the weekly data for the 65 and older to the 2002-2016 data that accounts for:
- A changing population.
- The trend observed in 12.
- The week effect.
- Age effect.
- A sex effect.
Use rate as the outcome in the model.
##your code here
- Now obtain expected counts for the entire dataset, including 2017. Compute the difference between the observed count and expected count and plot the total excess death for each week. Construct a confidence interval for the excess mortality estimate for each week. Hint: use the
predict
function.
##your code here
- Finally, plot the observed rates and predicted rates from the model for each
agegroup
andsex
. Comment on how well the model fits and what you might do differently.
##your code here