Problem set 3

Published

September 27, 2024

In these exercises, we will explore a subset of the NHANES dataset to investigate potential differences in systolic blood pressure across groups defined by self reported race.

Instructions

For each exercise, we want you to write a single line of code using the pipe (|>) to chain together multiple operations. This doesn’t mean the code must fit within 80 characters or be written on a single physical line, but rather that the entire sequence of operations can be executed as one continuous line of code without needing to assign intermediate values or create new variables.

For example, these are three separate lines of code:

x <- 100; x <- sqrt(x); log10(x)

Whereas this is considered one line of code using the pipe:

100 |> 
  sqrt() |> 
  log10()

Generate an html document that shows the code for each exercise.
For the exercises that ask to generate a graph, show the graph as well.
For exercises that require you to display tabular results, use the kable function to format the output as a clean, readable table. Do not display the raw dataframe directly—only show the nicely formatted table using kable.
Use only two significant digits for the numbers displayed in the tables.
Submit both the html and the qmd files using Git.
You will need the following libraries:

library(dplyr)
library(tidyr)
library(forcats)
library(ggplot2)
library(knitr)
library(NHANES)
options(digits = 2)

The .qmd file must be able to render properly on the TFs’ computers. They will already have the necessary packages installed, so there is no need to include code for installing packages. Just focus on writing the code that uses these packages.

Exercises

Filter the NHANES data to only include survey year 2011-2012. Save the resulting table in dat. This table should have 5,000 rows and 76 columns.

## code here

Compute the average and standard deviation (SD) for the combined systolic blood pressure (SBP) reading for males and females separately. Show us a data frame with two rows (female and male) and two columns (average and SD).

## code here

Because of the large difference in the average between males and females, we will perform the rest of the analysis separately for males and females.

Compute the average and SD for SBP for each race variable in column Race3 for females and males separately. The resulting table should have four columns for sex, race, average, and SD, respectively, and 12 rows (one for each strata). Arrange the result from highest to lowest average.

## code here

Repeat the previous exercise but add two columns to the final table to show a 95% confidence interval. Specifically, add columns with the lower and upper bounds of the interval with names lower and upper, respectively. The formula for these values is

\[ \bar{X} \pm 1.96 \, s / \sqrt{n} \] with \(\bar{X}\) the sample average and \(s\) the sample standard deviation. This table will simply add two more columns to the table generated in the previous exercise: one column for the lower and upper bound, respectively.

## code here

Make a graph of showing the results from the previous exercise. Specifically, plot the averages for each group as points and confidence intervals as error bars (use the geometry geom_errorbar). Order the groups from lowest to highest average (the average of the males and females averages). Use facet_wrap to make a separate plot for females and males. Label your axes with Race and Average respectively, add the title Comparing systolic blood pressure across groups, and the caption Bars represent 95% confidence intervals.

## code here

In the plot above we see that the confidence intervals don’t overlap when comparing the White and Mexican groups. We also see a substantial difference between Mexican and Hispnanic. Before concluding that there is a difference between groups, we will explore if differences in age, a very common confounder, explain the differences.

Create table like the one in the previous exercise but show the average SBP by sex and age group (AgeDecade). The the groups are order chronologically. As before make a separate plot for males and females. Make sure to filter our observations with no AgeDecade listed.

## code here

We note that for both males and females the SBP increases with age. To explore if age is indeed a confounder we need to check if the groups have different age distributions.

Explore the age distributions of each Race3 group to determine if the groups are comparable. Make a histogram of Age for each Race3 group and stack them vertically. Generate two columns of graphs for males and females, respectively. In the histograms, create bins increments of 5 years up to 80.

Below the graph, comment on what notice about the age distributions and how this can explain the difference between the White and Mexican groups.

## code here

Summarize the results shown in the graph by compute the median age for each Race3 group and the percent of individuals that are younger than 18. Order the rows by median age. The resulting data frame should have 6 rows (one for each group) and three columns to denote group, median age, and children respectively.

## code here

Given these results provide an explanation for the difference in systolic pressure is lower when comparing the White and Mexican groups.

When the age distribution between two populations we can’t conclude that there are differences in SBP based just on the population averages. The observed differences are likely due to age differences rather than genetic differences. We will therefore stratify by group and then compare SBP. But before we do this, we might need redefine dat to avoid small groups.

Write a function that computes the number of observations in each gender, age group and race combination. Show the groups with less than five observations. Make sure to remove the rows with no BPSysAve measurments before calculating the number of observations. Show a table with four columns representing gender, age strate, group, and the number of individuals in that group. Make sure to include combinations with 0 individuals (hint: use complete).

## code here

Based on the observations made in the previous exercise, we will redefine dat but with the following:
- As before, include only survey year 2011-2012.
- Remove the observations with no age group reported.
- Remove the 0-9 age group.
- Combine the 60-69 and 70+ ageroups into a 60+ group.
- Remove observations reporting Other in Race3.
- Rename the variable Race3 to Race.
Hints:
- Note that the levels in AgeDecade start with a space.
- You can use the fct_collapse function in the forcats to combine factors.

## code here

Crete a plot that shows the averege BPS for each age decade. Show the different race groups with color and lines joining them. Generate a two plots, one for males and one for females.

## code here

Based on the plot above pick two groups that you think are consistently different and remake the plot from the previous exercise but just for these two groups, add confidence intervals, and remove the lines. Put the confidence intervals for each age strata next to each other and use color to represent the two groups. Comment on your finding.

## code here

For the two groups that you selected above compute the difference in average BPS between the two groups for each age strata. Show a table with three columns representing age strata, difference for females, difference for males.

## code here