6 Tidyverse

library(tidyverse)
library(dslabs)

The tidyverse makes data analysis simpler and code easier to read by sacrificing some flexibility. The general idea is to define functions that perform the most common tasks and to require that they take a data frame as the first argument and return a data frame: data frame in, data frame out.

Another way the tidyverse makes code more readable is through non-standard evaluation, which we will define when we look at examples.

6.1 Tidy data

This is tidy:

      country year fertility
1     Germany 1960      2.41
2 South Korea 1960      6.16
3     Germany 1961      2.44
4 South Korea 1961      5.99
5     Germany 1962      2.47
6 South Korea 1962      5.79

Originally, the data was in the following format:

      country 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
1     Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37 2.28 2.17 2.04
2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85 4.73 4.62 4.53

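This format is not tidy: each row holds several observations, and the year variable is stored in the column names.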

Part of what we learn in the data wrangling part of the class is to make data tidy.
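As a preview of that wrangling step, here is a minimal sketch of how the wide table could be reshaped into the tidy one. It assumes `pivot_longer` from the tidyr package and retypes only the first three years of the table by hand; the object names `wide` and `tidy` are made up for this illustration.

```r
library(tidyr)

# Wide format: first three years of the table shown above, typed by hand
wide <- data.frame(
  country = c("Germany", "South Korea"),
  `1960` = c(2.41, 6.16),
  `1961` = c(2.44, 5.99),
  `1962` = c(2.47, 5.79),
  check.names = FALSE  # keep the years as column names
)

# Move the year columns into two columns: year and fertility
tidy <- pivot_longer(wide, cols = -country,
                     names_to = "year", values_to = "fertility",
                     names_transform = list(year = as.integer))
tidy
```

The rows come out ordered by country rather than by year, but each row is now one observation: a country, a year, and a fertility value.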

6.2 Adding a column with `mutate`

murders <- mutate(murders, rate = total/population*100000)

Notice that inside the function we used total and population, which are objects that are not defined in our workspace. So why don’t we get an error? This is non-standard evaluation: mutate uses the context, here the data frame passed as the first argument, to know what the variable names mean.
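To make the idea concrete, the sketch below compares the `mutate` call with the base R version, where every column has to be reached through the data frame explicitly. The data frame `toy` is a made-up stand-in for `murders` with the same column names.

```r
library(dplyr)

# Hypothetical toy data frame; the column names mirror murders
toy <- data.frame(state = c("A", "B"),
                  total = c(10, 30),
                  population = c(1e6, 2e6))

# Non-standard evaluation: total and population are looked up inside toy
with_mutate <- mutate(toy, rate = total / population * 100000)

# Standard evaluation: we must say where each column lives
with_base <- toy
with_base$rate <- toy$total / toy$population * 100000

all.equal(with_mutate$rate, with_base$rate)
```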

6.3 Subsetting with `filter`

filter(murders, rate <= 0.71)

          state abb        region population total      rate
1        Hawaii  HI          West    1360301     7 0.5145920
2          Iowa  IA North Central    3046355    21 0.6893484
3 New Hampshire  NH     Northeast    1316470     5 0.3798036
4  North Dakota  ND North Central     672591     4 0.5947151
5       Vermont  VT     Northeast     625741     2 0.3196211

6.4 Selecting columns with `select`

new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)

          state        region      rate
1        Hawaii          West 0.5145920
2          Iowa North Central 0.6893484
3 New Hampshire     Northeast 0.3798036
4  North Dakota North Central 0.5947151
5       Vermont     Northeast 0.3196211

6.5 The pipe: `|>` or `%>%`

We use the pipe to chain a series of operations. For example, if we want to select columns and then filter rows, we chain like this:

\[ \mbox{original data } \rightarrow \mbox{ select } \rightarrow \mbox{ filter } \]

The code looks like this:

murders |> select(state, region, rate) |> filter(rate <= 0.71)

          state        region      rate
1        Hawaii          West 0.5145920
2          Iowa North Central 0.6893484
3 New Hampshire     Northeast 0.3798036
4  North Dakota North Central 0.5947151
5       Vermont     Northeast 0.3196211

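The object on the left of the pipe is used as the first argument of the function on the right.

The arguments we write inside the call on the right therefore shift by one position: what we type first is passed as the second argument, what we type second as the third, and so on.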

16 |> sqrt() |> log(base = 2)

[1] 2
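One way to check our understanding of the pipe is to write the same computation as nested function calls. This sketch uses only base R (the native pipe needs R 4.1 or later); the names `nested` and `piped` are just for illustration.

```r
# Each pair of lines is equivalent
sqrt(16)
16 |> sqrt()

log(4, base = 2)
4 |> log(base = 2)

# The chained call from above, written as nested calls
nested <- log(sqrt(16), base = 2)
piped  <- 16 |> sqrt() |> log(base = 2)
identical(nested, piped)
```

The nested version has to be read from the inside out, while the piped version reads left to right in the order the operations happen.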

6.6 Summarizing data

Here is how it works:

murders |> summarize(avg = mean(rate))

       avg
1 2.779125

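Let’s compute the murder rate for the US. Is the average above the answer?

No. The US rate is NOT the average of the state rates, because states have different population sizes. We need to divide the total number of murders by the total population: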

murders |> summarize(rate = sum(total)/sum(population)*100000)

      rate
1 3.034555
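A small made-up example shows why the two numbers differ. The data frame `toy` below is hypothetical: one small state with a low rate and one large state with a high rate.

```r
library(dplyr)

# Hypothetical data: a small state and a large state
toy <- data.frame(state = c("Small", "Large"),
                  total = c(1, 500),
                  population = c(1e5, 1e7)) |>
  mutate(rate = total / population * 100000)  # rates are 1 and 5

res <- toy |>
  summarize(avg_of_rates = mean(rate),
            overall_rate = sum(total) / sum(population) * 100000)
res
```

The average of the rates is 3, but the overall rate is about 4.96, because almost everyone lives in the large state. The average of rates gives each state the same weight regardless of its population.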

6.6.1 Multiple summaries

We want the median, minimum and max population size:

murders |> summarize(median = median(population), min = min(population), max = max(population))

   median    min      max
1 4339367 563626 37253956

Why don’t we use quantiles?

murders |> summarize(quantiles = quantile(population, c(0.5, 0, 1)))

Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.

  quantiles
1   4339367
2    563626
3  37253956

When a summary returns more than one value per group, we use reframe:

murders |> reframe(quantiles = quantile(population, c(0.5, 0, 1)))

  quantiles
1   4339367
2    563626
3  37253956

However, if we want a column per summary, as in the summarize call above, we have to define a function that returns a data frame like this:

median_min_max <- function(x){
  qs <- quantile(x, c(0.5, 0, 1))
  data.frame(median = qs[1], min = qs[2], max = qs[3])
}

Then we can call summarize as above:

murders |> summarize(median_min_max(population))

   median    min      max
1 4339367 563626 37253956

6.6.2 Group then summarize with `group_by`

Let’s compute murder rate by region.

Take a close look at this output. How is it different from the original?

murders |> group_by(region)

# A tibble: 51 × 6
# Groups:   region [4]
   state                abb   region    population total  rate
   <chr>                <chr> <fct>          <dbl> <dbl> <dbl>
 1 Alabama              AL    South        4779736   135  2.82
 2 Alaska               AK    West          710231    19  2.68
 3 Arizona              AZ    West         6392017   232  3.63
 4 Arkansas             AR    South        2915918    93  3.19
 5 California           CA    West        37253956  1257  3.37
 6 Colorado             CO    West         5029196    65  1.29
 7 Connecticut          CT    Northeast    3574097    97  2.71
 8 Delaware             DE    South         897934    38  4.23
 9 District of Columbia DC    South         601723    99 16.5 
10 Florida              FL    South       19687653   669  3.40
# ℹ 41 more rows

Note the Groups: region [4] when we print the object. Although not immediately obvious from its appearance, this is now a special data frame called a grouped data frame, and dplyr functions, in particular summarize, will behave differently when acting on this object.

murders |> group_by(region) |> summarize(rate = sum(total) / sum(population) * 100000)

# A tibble: 4 × 2
  region         rate
  <fct>         <dbl>
1 Northeast      2.66
2 South          3.63
3 North Central  2.73
4 West           2.66

The summarize function applies the summarization to each group separately.
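To see what "each group separately" means, we can compare one row of the grouped summary with the result of filtering to that group by hand. The data frame `toy` is a made-up example with two regions.

```r
library(dplyr)

# Hypothetical data with two regions
toy <- data.frame(region = c("West", "South", "West", "South"),
                  total = c(10, 20, 30, 40),
                  population = c(1e6, 1e6, 3e6, 1e6))

# One row per region
grouped <- toy |>
  group_by(region) |>
  summarize(rate = sum(total) / sum(population) * 100000)

# The same computation done by hand for a single region
west_only <- toy |>
  filter(region == "West") |>
  summarize(rate = sum(total) / sum(population) * 100000)

grouped
west_only
```

The West row of `grouped` matches `west_only`: the grouped call runs the same summary once per region and stacks the results.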

For another example, let’s compute the median, minimum, and maximum population in the four regions of the country using the median_min_max defined above:

murders |> group_by(region) |> summarize(median_min_max(population))

# A tibble: 4 × 4
  region          median    min      max
  <fct>            <dbl>  <dbl>    <dbl>
1 Northeast     3574097  625741 19378102
2 South         4625364  601723 25145561
3 North Central 5495456. 672591 12830632
4 West          2700551  563626 37253956

6.7 `ungroup`

You can also summarize a variable but not collapse the dataset. We use mutate instead of summarize. Here is an example where we add a column with the population in each region and the number of states in the region, shown for each state. When we do this, we usually want to ungroup before continuing our analysis.

murders |> group_by(region) |> 
  mutate(region_pop = sum(population), n = n()) |>
  ungroup()

# A tibble: 51 × 8
   state                abb   region    population total  rate region_pop     n
   <chr>                <chr> <fct>          <dbl> <dbl> <dbl>      <dbl> <int>
 1 Alabama              AL    South        4779736   135  2.82  115674434    17
 2 Alaska               AK    West          710231    19  2.68   71945553    13
 3 Arizona              AZ    West         6392017   232  3.63   71945553    13
 4 Arkansas             AR    South        2915918    93  3.19  115674434    17
 5 California           CA    West        37253956  1257  3.37   71945553    13
 6 Colorado             CO    West         5029196    65  1.29   71945553    13
 7 Connecticut          CT    Northeast    3574097    97  2.71   55317240     9
 8 Delaware             DE    South         897934    38  4.23  115674434    17
 9 District of Columbia DC    South         601723    99 16.5   115674434    17
10 Florida              FL    South       19687653   669  3.40  115674434    17
# ℹ 41 more rows
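To see why ungrouping matters, consider what happens if we keep computing on a data frame that is still grouped. In this sketch, `toy` is a made-up data frame and `share` is a hypothetical new column meant to hold each state's share of the total population.

```r
library(dplyr)

# Hypothetical data: two states in the West, one in the South
toy <- data.frame(region = c("West", "West", "South"),
                  population = c(1, 2, 3))

grouped <- toy |>
  group_by(region) |>
  mutate(region_pop = sum(population))

# Still grouped: sum(population) is computed within each region
still_grouped <- grouped |>
  mutate(share = population / sum(population))

# Ungrouped: sum(population) is computed over all rows
ungrouped <- grouped |>
  ungroup() |>
  mutate(share = population / sum(population))

still_grouped$share  # shares within each region
ungrouped$share      # shares of the overall total
```

Without `ungroup`, the shares are 1/3, 2/3, and 1 rather than 1/6, 2/6, and 3/6, and nothing in the code warns us that the grouping is still in effect.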

6.7.1 `pull`

dplyr functions such as summarize always return a data frame, even if the result is just one number.

murders |> 
  summarize(rate = sum(total)/sum(population)*100000) |>
  class()

[1] "data.frame"

To get the number itself, use pull:

murders |> 
  summarize(rate = sum(total)/sum(population)*100000) |>
  pull(rate)

[1] 3.034555
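For comparison, here is a small sketch showing that `pull` gives the same result as the base R accessors `$` and `[[`. The data frame `toy` is made up for the illustration.

```r
library(dplyr)

# Hypothetical data
toy <- data.frame(total = c(10, 30), population = c(1e6, 2e6))

res <- toy |>
  summarize(rate = sum(total) / sum(population) * 100000)

is.data.frame(res)       # the summary is a one-row data frame

r_pull   <- res |> pull(rate)  # fits at the end of a pipeline
r_dollar <- res$rate           # base R equivalents
r_double <- res[["rate"]]

identical(r_pull, r_dollar)
```

The advantage of `pull` is that it can be added as the last step of a pipe without first saving the intermediate data frame.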

6.8 Sorting data frames

States ordered by rate:

murders |> arrange(rate) |> head()

          state abb        region population total      rate
1       Vermont  VT     Northeast     625741     2 0.3196211
2 New Hampshire  NH     Northeast    1316470     5 0.3798036
3        Hawaii  HI          West    1360301     7 0.5145920
4  North Dakota  ND North Central     672591     4 0.5947151
5          Iowa  IA North Central    3046355    21 0.6893484
6         Idaho  ID          West    1567582    12 0.7655102

If we want decreasing order, we can either use the negative of the variable or, for more readability, use desc:

murders |> arrange(desc(rate)) |> head()

                 state abb        region population total      rate
1 District of Columbia  DC         South     601723    99 16.452753
2            Louisiana  LA         South    4533372   351  7.742581
3             Missouri  MO North Central    5988927   321  5.359892
4             Maryland  MD         South    5773552   293  5.074866
5       South Carolina  SC         South    4625364   207  4.475323
6             Delaware  DE         South     897934    38  4.231937
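The two approaches to decreasing order can be compared on a small made-up data frame. Besides readability, `desc` has the practical advantage that it also works on character columns, where the unary minus is not defined.

```r
library(dplyr)

# Hypothetical data
toy <- data.frame(state = c("A", "B", "C"),
                  rate = c(2.5, 0.3, 7.1))

by_minus <- toy |> arrange(-rate)
by_desc  <- toy |> arrange(desc(rate))
identical(by_minus$state, by_desc$state)

# desc() on a character column: reverse alphabetical order
by_name <- toy |> arrange(desc(state))
by_name$state
```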

We can use two variables as well:

murders |> arrange(region, desc(rate)) |> head(11)

                  state abb    region population total       rate
1          Pennsylvania  PA Northeast   12702379   457  3.5977513
2            New Jersey  NJ Northeast    8791894   246  2.7980319
3           Connecticut  CT Northeast    3574097    97  2.7139722
4              New York  NY Northeast   19378102   517  2.6679599
5         Massachusetts  MA Northeast    6547629   118  1.8021791
6          Rhode Island  RI Northeast    1052567    16  1.5200933
7                 Maine  ME Northeast    1328361    11  0.8280881
8         New Hampshire  NH Northeast    1316470     5  0.3798036
9               Vermont  VT Northeast     625741     2  0.3196211
10 District of Columbia  DC     South     601723    99 16.4527532
11            Louisiana  LA     South    4533372   351  7.7425810

6.9 Exercises

Let’s redo the exercises from the previous chapter, but now with the tidyverse:

Show the subset of murders with the states that have a murder rate below 1 per 100,000. Show all variables.

murders <- mutate(murders, rate = total/population*10^5)
filter(murders, rate < 1)

           state abb        region population total      rate
1         Hawaii  HI          West    1360301     7 0.5145920
2          Idaho  ID          West    1567582    12 0.7655102
3           Iowa  IA North Central    3046355    21 0.6893484
4          Maine  ME     Northeast    1328361    11 0.8280881
5      Minnesota  MN North Central    5303925    53 0.9992600
6  New Hampshire  NH     Northeast    1316470     5 0.3798036
7   North Dakota  ND North Central     672591     4 0.5947151
8         Oregon  OR          West    3831074    36 0.9396843
9   South Dakota  SD North Central     814180     8 0.9825837
10          Utah  UT          West    2763885    22 0.7959810
11       Vermont  VT     Northeast     625741     2 0.3196211
12       Wyoming  WY          West     563626     5 0.8871131

Show the subset of murders with the states that have a murder rate below 1 per 100,000 and are in the West of the US. Don’t show the region variable.

filter(murders, rate < 1 & region == "West") |> select(-region)

    state abb population total      rate
1  Hawaii  HI    1360301     7 0.5145920
2   Idaho  ID    1567582    12 0.7655102
3  Oregon  OR    3831074    36 0.9396843
4    Utah  UT    2763885    22 0.7959810
5 Wyoming  WY     563626     5 0.8871131

Show the largest state with a rate less than 1 per 100,000.

murders |> filter(rate < 1) |> slice_max(population)

      state abb        region population total    rate
1 Minnesota  MN North Central    5303925    53 0.99926

Show the state with a population of more than 10 million with the lowest rate.

murders |> filter(population > 10^7) |> slice_min(rate)

     state abb    region population total    rate
1 New York  NY Northeast   19378102   517 2.66796

Compute the rate for each region of the US.

murders |> group_by(region) |> summarize(rate = sum(total)/sum(population)*10^5)

# A tibble: 4 × 2
  region         rate
  <fct>         <dbl>
1 Northeast      2.66
2 South          3.63
3 North Central  2.73
4 West           2.66

For the next exercises we will use data from the survey collected by the United States National Center for Health Statistics (NCHS). This center has conducted a series of health and nutrition surveys since the 1960s. Starting in 1999, about 5,000 individuals of all ages have been interviewed every year and have completed the health examination component of the survey. Part of the data is made available via the NHANES package. Once you install the NHANES package, you can load the data like this:

library(NHANES)

Check for consistency between Race1 and Race3. Do any rows have different entries?

NHANES |> filter(!is.na(Race1) & !is.na(Race3)) |> 
  filter(as.character(Race1) != as.character(Race3)) |>
  count(Race1, Race3)

# A tibble: 1 × 3
  Race1 Race3     n
  <fct> <fct> <int>
1 Other Asian   288

Define a new race variable that has as few NA and Other as possible.

dat <- NHANES |> mutate(Race = Race3) |>
  mutate(Race = if_else(is.na(Race), Race1, Race))

Compute the proportion of individuals that smoked at the time of the survey, by race category and gender. Keep track of how many people answered the question. Order the result by the number that answered. Read the help file for NHANES carefully before doing this one. To be clear: what proportion of people who have smoked at any time in their life are smoking now?

dat |> group_by(Gender, Race) |> 
  summarize(n = sum(!is.na(Smoke100)), 
            smoke = sum(SmokeNow == "Yes", na.rm = TRUE)) |>
  mutate(smoke = smoke/n) |>
  arrange(desc(n))

`summarise()` has grouped output by 'Gender'. You can override using the
`.groups` argument.

# A tibble: 12 × 4
# Groups:   Gender [2]
   Gender Race         n  smoke
   <fct>  <fct>    <int>  <dbl>
 1 female White     2472 0.185 
 2 male   White     2377 0.219 
 3 female Black      442 0.204 
 4 male   Black      379 0.301 
 5 male   Mexican    339 0.212 
 6 female Mexican    262 0.111 
 7 female Hispanic   218 0.0963
 8 male   Hispanic   198 0.278 
 9 female Other      177 0.203 
10 male   Other      162 0.309 
11 female Asian      112 0.0357
12 male   Asian       97 0.175

Create a new dataset that combines the Mexican and Hispanic categories and removes the Other category. Hint: use the function forcats::fct_collapse().

dat <- dat |> 
  mutate(Race = forcats::fct_collapse(Race, Hispanic = c("Hispanic", "Mexican"))) |>
  filter(Race != "Other") |>
  mutate(Race = droplevels(Race))
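To see what each step of this solution does, here is a sketch on a small made-up factor. The vector `race` is hypothetical; the calls are the same `fct_collapse` and `droplevels` used above.

```r
library(forcats)

# Hypothetical factor with the relevant categories
race <- factor(c("White", "Mexican", "Hispanic", "Other", "Mexican"))

# Merge Mexican and Hispanic into one level called Hispanic
collapsed <- fct_collapse(race, Hispanic = c("Hispanic", "Mexican"))
table(collapsed)

# Removing the Other entries leaves an empty Other level behind...
filtered <- collapsed[collapsed != "Other"]
levels(filtered)

# ...which droplevels() removes
kept <- droplevels(filtered)
levels(kept)
```

Dropping the unused level matters later on: an empty Other category would otherwise still show up in tables and plot legends.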

Recompute the proportion of individuals that smoke now by race category and gender. Order by the rate of smokers.

dat |> group_by(Gender, Race) |> 
  summarize(n = sum(!is.na(Smoke100)), smoke = sum(SmokeNow == "Yes", na.rm = TRUE)) |>
  mutate(smoke = smoke/n) |>
  arrange(desc(smoke))

`summarise()` has grouped output by 'Gender'. You can override using the
`.groups` argument.

# A tibble: 8 × 4
# Groups:   Gender [2]
  Gender Race         n  smoke
  <fct>  <fct>    <int>  <dbl>
1 male   Black      379 0.301 
2 male   Hispanic   537 0.236 
3 male   White     2377 0.219 
4 female Black      442 0.204 
5 female White     2472 0.185 
6 male   Asian       97 0.175 
7 female Hispanic   480 0.104 
8 female Asian      112 0.0357

Compute the median age by race category and gender, ordered by age within each gender.

dat |> group_by(Gender, Race) |>
  summarize(Age = median(Age)) |>
  arrange(Gender, Age)

`summarise()` has grouped output by 'Gender'. You can override using the
`.groups` argument.

# A tibble: 8 × 3
# Groups:   Gender [2]
  Gender Race       Age
  <fct>  <fct>    <dbl>
1 female Hispanic    27
2 female Black       33
3 female Asian       37
4 female White       42
5 male   Hispanic    28
6 male   Black       30
7 male   Asian       31
8 male   White       40

Now redo the smoking rate calculation by age group. But first, remove individuals with no age group and remove any age groups for which fewer than 10 people answered the question. Within each age group and gender, order by the percent that smokes.

res <- dat |> 
  filter(!is.na(AgeDecade)) |>
  group_by(AgeDecade) |> 
  mutate(n = sum(!is.na(Smoke100))) |> 
  ungroup() |>
  filter(n >= 10) |>
  group_by(AgeDecade, Gender, Race) |>
  summarize(n = sum(!is.na(Smoke100)), 
            smoke = sum(SmokeNow == "Yes", na.rm = TRUE), 
            .groups = "drop") |> ## This is similar to running ungroup() in a next step
  mutate(smoke = smoke/n) |>
  arrange(AgeDecade, Gender, desc(smoke))

## Bonus: a plot
## AgeDecade is a factor, so we set group = Race to tell geom_line() which points to connect
res |> ggplot(aes(AgeDecade, smoke, color = Race, group = Race)) +
  geom_point() + 
  geom_line() + 
  facet_wrap(~Gender)
