Problem set 3

Published

September 28, 2025

In these exercises, we will explore a subset of the NHANES dataset to investigate potential differences in systolic blood pressure across groups defined by self-reported race. You can find more information about the NHANES survey data in the reference manual of the NHANES CRAN package.

Grading Information

Problems 1-15 are mandatory and will be included in your grade. Problems 16-18 are optional and provided for additional practice.

Instructions

For each exercise, we want you to write a single line of code using the pipe (|>) to chain together multiple operations. This doesn’t mean the code must fit within 80 characters or be written on a single physical line, but rather that the entire sequence of operations can be executed as one continuous line of code without needing to assign intermediate values or create new variables.

For example, these are three separate lines of code:

x <- 100; x <- sqrt(x); log10(x)

Whereas this is considered one line of code using the pipe:

100 |> 
  sqrt() |> 
  log10()

- Generate an html document that shows the code for each exercise.
- For the exercises that ask you to generate a graph, show the graph as well.
- For exercises that require you to display tabular results, use the kable function to format the output as a clean, readable table. Do not display the raw data frame directly. Show only the nicely formatted table produced by kable.
- Use only two significant digits for the numbers displayed in the tables.
- Submit both the html and the qmd files using Git.
You will need the following libraries:

library(dplyr)
library(tidyr)
library(forcats)
library(ggplot2)

library(knitr)

library(NHANES)

options(digits = 2)

The .qmd file must be able to render properly on the TFs’ computers. They will already have the necessary packages installed, so there is no need to include code for installing packages. Just focus on writing the code that uses these packages.
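As an illustration of the table formatting requested in the instructions, here is a minimal sketch on a made-up data frame. The object toy and its column names are hypothetical and are not part of NHANES. It assumes that rounding with signif before calling kable is an acceptable way to show two significant digits; note that the digits argument of kable rounds to decimal places instead.

```r
library(dplyr)
library(knitr)

# Toy data, made up for illustration (not NHANES values).
toy <- data.frame(
  group = c("a", "b"),
  average = c(118.4567, 121.9876),
  sd = c(15.321, 17.654)
)

# Round every numeric column to two significant digits, then format with kable.
tab <- toy |>
  mutate(across(where(is.numeric), \(x) signif(x, 2))) |>
  kable()

tab
```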

Exercises

Filter the NHANES data to only include survey year 2011-2012. Save the resulting data.frame (in tbl or tbl_df format) in dat. This data.frame should have 5,000 rows and 76 columns.

## code here

Compute the average and standard deviation (SD) for the combined systolic blood pressure (SBP) reading for males and females separately. Show us a data frame with two rows (female and male) and two columns (average and SD).

## code here

Because of the large difference in the average between males and females, we will perform the rest of the analysis separately for males and females.

Compute the average and SD of SBP for each race group in column Race3, for females and males separately. The resulting table should have four columns for sex, race, average, and SD, respectively, and 12 rows (one for each stratum). Arrange the result from highest to lowest average.

## code here

Repeat the previous exercise but add two columns to the final table to show a 95% confidence interval. Specifically, add columns with the lower and upper bounds of the interval with names lower and upper, respectively. The formula for these values is

\[ \bar{X} \pm 1.96 \, s / \sqrt{n} \]

with \(\bar{X}\) the sample average, \(s\) the sample standard deviation, and \(n\) the number of observations in the group. This table simply adds two columns to the table generated in the previous exercise: one for the lower bound and one for the upper bound.
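To make the interval formula concrete, the following sketch applies it to a small made-up dataset. The tibble toy and its columns group and value are hypothetical, and the numbers are not NHANES measurements.

```r
library(dplyr)

# Toy measurements, made up for illustration.
toy <- tibble(
  group = c("a", "a", "a", "a", "b", "b", "b", "b"),
  value = c(110, 120, 130, 140, 100, 105, 110, 115)
)

# One row per group: average, sample SD (s), group size (n),
# and the bounds average -/+ 1.96 * s / sqrt(n).
ci <- toy |>
  group_by(group) |>
  summarize(
    average = mean(value),
    s = sd(value),
    n = n(),
    lower = average - 1.96 * s / sqrt(n),
    upper = average + 1.96 * s / sqrt(n)
  )

ci
```

Inside summarize, later expressions can refer to columns created earlier in the same call, which is why lower and upper can reuse average, s, and n.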

## code here

Make a graph showing the results from the previous exercise. Specifically, plot the averages for each group as points and the confidence intervals as error bars (use the geometry geom_errorbar). Order the groups from lowest to highest average (the average of the male and female averages). Use facet_wrap to make a separate plot for females and males. Label your axes with Race and Average, respectively, add the title Comparing systolic blood pressure across groups, and the caption Bars represent 95% confidence intervals.
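For reference, here is a minimal sketch of points with error bars on made-up summary values. The tibble toy and its columns are hypothetical. It uses base R's reorder to sort the groups by their average, which is one of several possible ways to order a categorical axis.

```r
library(dplyr)
library(ggplot2)

# Toy summary table, made up for illustration (not NHANES results).
toy <- tibble(
  group = c("a", "b", "c"),
  average = c(120, 115, 125),
  lower = c(117, 111, 122),
  upper = c(123, 119, 128)
)

# Order groups by average, then draw points and interval bars.
p <- toy |>
  mutate(group = reorder(group, average)) |>
  ggplot(aes(x = group, y = average, ymin = lower, ymax = upper)) +
  geom_point() +
  geom_errorbar(width = 0.2)

p
```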

## code here

In the plot above, we see that the confidence intervals don’t overlap when comparing the White and Mexican groups. We also see a substantial difference between Mexican and Hispanic. Before concluding that there is a difference between groups, we will explore whether differences in age, a very common confounder, explain the differences.

Create a plot like the one in the previous exercise, but show the average SBP by sex and age group (AgeDecade). Order the groups chronologically. As before, make a separate plot for males and females. Make sure to filter out observations with no AgeDecade listed.

## code here

We note that for both males and females the SBP increases with age. To explore if age is indeed a confounder we need to check if the groups have different age distributions.

Explore the age distributions of each Race3 group to determine if the groups are comparable. Make a histogram of Age for each Race3 group and stack them vertically. Generate two columns of graphs for males and females, respectively. In the histograms, create bins in increments of 5 years up to 80.

Below the graph, comment on what you notice about the age distributions and how this can explain the difference between the White and Mexican groups.

## code here

Summarize the results shown in the graph by computing the median age for each Race3 group and the percent of individuals who are younger than 18. Order the rows by median age. The resulting data frame should have 6 rows (one for each group) and three columns for group, median age, and percent of children, respectively.

## code here

Given these results, provide an explanation for the difference in systolic blood pressure between the White and Mexican groups.

When the age distributions differ between two populations, we can’t conclude that there are differences in SBP based just on the population averages. The observed differences are likely due to age differences rather than genetic differences. We will therefore stratify by age group and then compare SBP. But before we do this, we might need to redefine dat to avoid small groups.

Write a function that computes the number of observations in each gender, age group, and race combination. Show the groups with fewer than five observations. Make sure to remove the rows with no BPSysAve measurements before calculating the number of observations. Show a table with four columns representing gender, age stratum, group, and the number of individuals in that group. Make sure to include combinations with 0 individuals (hint: use complete).
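The hint about complete can be illustrated on a made-up dataset. The tibble toy and its columns sex and age are hypothetical. The sketch shows how combinations that never occur in the data get an explicit row with a count of 0.

```r
library(dplyr)
library(tidyr)

# Toy data, made up for illustration: no males aged 30-39 are present.
toy <- tibble(
  sex = c("female", "female", "male"),
  age = c("20-29", "30-39", "20-29")
)

# count() returns only observed combinations; complete() adds the missing
# ones and fills their count with 0 instead of NA.
counts <- toy |>
  count(sex, age) |>
  complete(sex, age, fill = list(n = 0L))

counts
```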

## code here

Based on the observations made in the previous exercise, we will redefine dat with the following changes:
- As before, include only survey year 2011-2012.
- Remove the observations with no age group reported.
- Remove the 0-9 age group.
- Combine the 60-69 and 70+ age groups into a 60+ group.
- Remove observations reporting Other in Race3.
- Rename the variable Race3 to Race.
Hints:
- Note that the levels in AgeDecade start with a space.
- You can use the fct_collapse function in the forcats package to combine factor levels.
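As a small illustration of both hints, the sketch below collapses two levels of a made-up factor whose levels start with a space. The vector age is hypothetical and only mimics the format described in the hint.

```r
library(forcats)

# Toy factor, made up for illustration. Note the leading space in each level.
age <- factor(c(" 50-59", " 60-69", " 70+", " 60-69"))

# Merge the two oldest levels into a single "60+" level.
# The old level names must match exactly, including the leading space.
collapsed <- age |>
  fct_collapse("60+" = c(" 60-69", " 70+"))

levels(collapsed)
```

If the leading space is omitted from the old level names, they will not match any existing level, so checking levels() first is a useful habit.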

## code here

Create a plot that shows the average SBP for each age decade. Show the different race groups with color and lines joining them. Generate two plots, one for males and one for females.

## code here

Based on the plot above, pick two groups that you think are consistently different and remake the plot from the previous exercise for just these two groups, adding confidence intervals and removing the lines. Put the confidence intervals for each age stratum next to each other and use color to represent the two groups. Comment on your findings.

## code here

For the two groups that you selected above, compute the difference in average SBP between the two groups for each age stratum. Show a table with three columns representing age stratum, difference for females, and difference for males.

## code here

We want to explore if BMI might also be a confounder. Create a table showing the average BMI (BMI) for each Race and Gender combination using the redefined dat from exercise 10. Include the number of observations, average BMI, and standard deviation of BMI for each group. Remove observations with missing BMI values before calculating. Order the results by average BMI from lowest to highest.

## code here

Smoking is another potential confounder for blood pressure. Create a summary table that shows the average SBP by smoking status (Smoke100) for each Race and Gender combination. The table should have columns for race, gender, smoking status (Yes/No), number of observations, average SBP, and standard deviation of SBP. Filter out observations with missing Smoke100 or BPSysAve values, and arrange by race, then gender, then smoking status.

## code here

Optional Exercises

Create a visualization comparing SBP distributions across the race groups using box plots. Make separate plots for smokers (Smoke100 == "Yes") and non-smokers (Smoke100 == "No"), and use facet_wrap to create separate panels for males and females. Filter out missing values for both BPSysAve and Smoke100. Add appropriate axis labels and a title: “Systolic Blood Pressure by Race and Smoking Status”.

## code here

We want to identify which age-race-gender combinations have insufficient sample sizes for reliable analysis. Create a table showing the count of observations for each combination of AgeDecade, Race, and Gender, but only for combinations that have fewer than 10 observations. Include only observations with non-missing BPSysAve values. The resulting table should have four columns: age decade, race, gender, and count.

## code here

For your final analysis, create a comprehensive summary table that combines multiple factors. For each combination of Race and Gender, calculate the overall average SBP, the average age, the percentage of smokers, and the average BMI. Include the number of observations for each group. Filter out rows with missing values for BPSysAve, Age, Smoke100, or BMI. Round all numeric values to 1 decimal place and arrange by race then gender.

## code here