<- 100; x <- sqrt(x); log10(x) x
Problem set 3
In these exercises, we will explore a subset of the NHANES dataset to investigate potential differences in systolic blood pressure across groups defined by self reported race.
Instructions
For each exercise, we want you to write a single line of code using the pipe (
|>
) to chain together multiple operations. This doesn’t mean the code must fit within 80 characters or be written on a single physical line, but rather that the entire sequence of operations can be executed as one continuous line of code without needing to assign intermediate values or create new variables.For example, these are three separate lines of code:
Whereas this is considered one line of code using the pipe:
100 |>
sqrt() |>
log10()
Generate an html document that shows the code for each exercise.
For the exercises that ask to generate a graph, show the graph as well.
For exercises that require you to display tabular results, use the
kable
function to format the output as a clean, readable table. Do not display the raw dataframe directly—only show the nicely formatted table usingkable
.Use only two significant digits for the numbers displayed in the tables.
Submit both the html and the qmd files using Git.
You will need the following libraries:
library(dplyr)
library(tidyr)
library(forcats)
library(ggplot2)
library(knitr)
library(NHANES)
options(digits = 2)
- The
.qmd
file must be able to render properly on the TFs’ computers. They will already have the necessary packages installed, so there is no need to include code for installing packages. Just focus on writing the code that uses these packages.
Exercises
- Filter the
NHANES
data to only include survey year 2011-2012. Save the resulting table indat
. This table should have 5,000 rows and 76 columns.
## code here
- Compute the average and standard deviation (SD) for the combined systolic blood pressure (SBP) reading for males and females separately. Show us a data frame with two rows (female and male) and two columns (average and SD).
## code here
- Because of the large difference in the average between males and females, we will perform the rest of the analysis separately for males and females.
Compute the average and SD for SBP for each race variable in column Race3
for females and males separately. The resulting table should have four columns for sex, race, average, and SD, respectively, and 12 rows (one for each strata). Arrange the result from highest to lowest average.
## code here
- Repeat the previous exercise but add two columns to the final table to show a 95% confidence interval. Specifically, add columns with the lower and upper bounds of the interval with names
lower
andupper
, respectively. The formula for these values is
\[ \bar{X} \pm 1.96 \, s / \sqrt{n} \] with \(\bar{X}\) the sample average and \(s\) the sample standard deviation. This table will simply add two more columns to the table generated in the previous exercise: one column for the lower and upper bound, respectively.
## code here
- Make a graph of showing the results from the previous exercise. Specifically, plot the averages for each group as points and confidence intervals as error bars (use the geometry
geom_errorbar
). Order the groups from lowest to highest average (the average of the males and females averages). Usefacet_wrap
to make a separate plot for females and males. Label your axes with Race and Average respectively, add the title Comparing systolic blood pressure across groups, and the caption Bars represent 95% confidence intervals.
## code here
- In the plot above we see that the confidence intervals don’t overlap when comparing the
White
andMexican
groups. We also see a substantial difference betweenMexican
andHispnanic
. Before concluding that there is a difference between groups, we will explore if differences in age, a very common confounder, explain the differences.
Create table like the one in the previous exercise but show the average SBP by sex and age group (AgeDecade
). The the groups are order chronologically. As before make a separate plot for males and females. Make sure to filter our observations with no AgeDecade
listed.
## code here
- We note that for both males and females the SBP increases with age. To explore if age is indeed a confounder we need to check if the groups have different age distributions.
Explore the age distributions of each Race3
group to determine if the groups are comparable. Make a histogram of Age
for each Race3
group and stack them vertically. Generate two columns of graphs for males and females, respectively. In the histograms, create bins increments of 5 years up to 80.
Below the graph, comment on what notice about the age distributions and how this can explain the difference between the White and Mexican groups.
## code here
- Summarize the results shown in the graph by compute the median age for each
Race3
group and the percent of individuals that are younger than 18. Order the rows by median age. The resulting data frame should have 6 rows (one for each group) and three columns to denote group, median age, and children respectively.
## code here
Given these results provide an explanation for the difference in systolic pressure is lower when comparing the White
and Mexican
groups.
- When the age distribution between two populations we can’t conclude that there are differences in SBP based just on the population averages. The observed differences are likely due to age differences rather than genetic differences. We will therefore stratify by group and then compare SBP. But before we do this, we might need redefine
dat
to avoid small groups.
Write a function that computes the number of observations in each gender, age group and race combination. Show the groups with less than five observations. Make sure to remove the rows with no BPSysAve measurments before calculating the number of observations. Show a table with four columns representing gender, age strate, group, and the number of individuals in that group. Make sure to include combinations with 0 individuals (hint: use complete
).
## code here
Based on the observations made in the previous exercise, we will redefine
dat
but with the following:- As before, include only survey year 2011-2012.
- Remove the observations with no age group reported.
- Remove the 0-9 age group.
- Combine the 60-69 and 70+ ageroups into a 60+ group.
- Remove observations reporting
Other
inRace3
. - Rename the variable
Race3
toRace
.
Hints:
- Note that the levels in
AgeDecade
start with a space. - You can use the
fct_collapse
function in the forcats to combine factors.
## code here
- Crete a plot that shows the averege BPS for each age decade. Show the different race groups with color and lines joining them. Generate a two plots, one for males and one for females.
## code here
- Based on the plot above pick two groups that you think are consistently different and remake the plot from the previous exercise but just for these two groups, add confidence intervals, and remove the lines. Put the confidence intervals for each age strata next to each other and use color to represent the two groups. Comment on your finding.
## code here
- For the two groups that you selected above compute the difference in average BPS between the two groups for each age strata. Show a table with three columns representing age strata, difference for females, difference for males.
## code here