library(tidyverse)
library(dslabs)
6 Tidyverse
The tidyverse makes data analysis simpler and code easier to read by sacrificing some flexibility. The general idea is to define functions that perform the most common tasks and to require that they take a data frame as their first argument and return a data frame: data frame in, data frame out.
Another way it makes code more readable is by using non-standard evaluation, which we will define when we look at examples.
6.1 Tidy data
This is tidy:
country year fertility
1 Germany 1960 2.41
2 South Korea 1960 6.16
3 Germany 1961 2.44
4 South Korea 1961 5.99
5 Germany 1962 2.47
6 South Korea 1962 5.79
Originally, the data was in the following format:
country 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
1 Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37 2.28 2.17 2.04
2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85 4.73 4.62 4.53
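This format is not tidy: each row holds many observations, and the values of the year variable are stored in the column names.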
Part of what we learn in the data wrangling part of the class is to make data tidy.
6.2 Adding a column with mutate
murders <- mutate(murders, rate = total/population*100000)
Notice that here we used total and population inside the function, which are objects that are not defined in our workspace. But why don’t we get an error? This is non-standard evaluation, where the context (here, the data frame passed as the first argument) is used to determine what the variable names mean.
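To make this concrete, here is a minimal sketch on a small made-up table (the data frame toy and its values are hypothetical, not part of dslabs). It compares the mutate call with the equivalent base R code, in which each column has to be reached explicitly through the data frame:

```r
library(dplyr)

# Hypothetical toy table standing in for murders
toy <- data.frame(
  state = c("A", "B"),
  population = c(1e6, 2e6),
  total = c(10, 50)
)

# Non-standard evaluation: total and population are looked up inside toy
with_nse <- mutate(toy, rate = total / population * 100000)

# Standard evaluation in base R: columns are referenced through toy$
with_base <- toy
with_base$rate <- toy$total / toy$population * 100000

# Both approaches compute the same column
identical(with_nse$rate, with_base$rate)
```

The two versions do the same arithmetic; mutate simply saves us from repeating the data frame name for every column.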
6.3 Subsetting with filter
filter(murders, rate <= 0.71)
state abb region population total rate
1 Hawaii HI West 1360301 7 0.5145920
2 Iowa IA North Central 3046355 21 0.6893484
3 New Hampshire NH Northeast 1316470 5 0.3798036
4 North Dakota ND North Central 672591 4 0.5947151
5 Vermont VT Northeast 625741 2 0.3196211
6.4 Selecting columns with select
new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)
state region rate
1 Hawaii West 0.5145920
2 Iowa North Central 0.6893484
3 New Hampshire Northeast 0.3798036
4 North Dakota North Central 0.5947151
5 Vermont Northeast 0.3196211
6.5 The pipe: |> or %>%
We use the pipe to chain a series of operations. For example, if we want to select columns and then filter rows, we chain like this:
\[ \mbox{original data } \rightarrow \mbox{ select } \rightarrow \mbox{ filter } \]
The code looks like this:
murders |> select(state, region, rate) |> filter(rate <= 0.71)
state region rate
1 Hawaii West 0.5145920
2 Iowa North Central 0.6893484
3 New Hampshire Northeast 0.3798036
4 North Dakota North Central 0.5947151
5 Vermont Northeast 0.3196211
The object on the left of the pipe is used as the first argument of the function on the right.
The arguments written inside the call shift over by one: what would have been the second argument is now written first, the third is written second, and so on.
16 |> sqrt() |> log(base = 2)
[1] 2
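As a small illustration of this rule, the sketch below writes the same computation as a nested call and as a pipeline. It uses only base R and assumes R 4.1 or later, where the native pipe |> is available; the object names nested and piped are made up for the example.

```r
# Nested call: read from the inside out
nested <- log(sqrt(16), base = 2)

# Piped call: read from left to right.
# 16 becomes the first argument of sqrt(), and the result of sqrt()
# becomes the first argument of log(), with base = 2 passed as usual.
piped <- 16 |> sqrt() |> log(base = 2)

# Both forms evaluate the same expression
identical(nested, piped)
```

The pipeline lists the steps in the order they are carried out, which is the main readability gain.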
6.6 Summarizing data
Here is how it works:
murders |> summarize(avg = mean(rate))
avg
1 2.779125
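Let’s compute the murder rate for the US. Is the average above the answer?
No, the US rate is NOT the average of the state rates, because states have different population sizes. We need the total number of murders divided by the total population:
murders |> summarize(rate = sum(total)/sum(population)*100000)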
rate
1 3.034555
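The difference between the two numbers is easiest to see on an exaggerated example. The sketch below uses a hypothetical two-state table (toy, with made-up values) and also shows that the overall rate equals the population-weighted average of the state rates, computed here with base R's weighted.mean:

```r
library(dplyr)

# Hypothetical data: one small state with a low rate, one large state with a high rate
toy <- data.frame(
  state = c("Small", "Large"),
  population = c(1e6, 9e6),
  total = c(10, 450)
) |>
  mutate(rate = total / population * 100000)

result <- toy |>
  summarize(
    avg_of_rates = mean(rate),                             # treats both states equally
    overall_rate = sum(total) / sum(population) * 100000,  # total murders over total population
    weighted_avg = weighted.mean(rate, population)         # rates weighted by population
  )
result
```

Here the unweighted average of the two state rates is 3, while the overall rate and the weighted average are both 4.6, because the large state contributes nine times as many people.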
6.6.1 Multiple summaries
We want the median, minimum and max population size:
murders |> summarize(median = median(population), min = min(population), max = max(population))
median min max
1 4339367 563626 37253956
Why don’t we use quantiles?
murders |> summarize(quantiles = quantile(population, c(0.5, 0, 1)))
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
quantiles
1 4339367
2 563626
3 37253956
For multiple summaries we use reframe:
murders |> reframe(quantiles = quantile(population, c(0.5, 0, 1)))
quantiles
1 4339367
2 563626
3 37253956
However, if we want a column per summary, as in the summarize call above, we have to define a function that returns a data frame like this:
median_min_max <- function(x){
  qs <- quantile(x, c(0.5, 0, 1))
  data.frame(median = qs[1], min = qs[2], max = qs[3])
}
Then we can call summarize as above:
murders |> summarize(median_min_max(population))
median min max
1 4339367 563626 37253956
6.6.2 Group then summarize with group_by
Let’s compute murder rate by region.
Take a close look at this output. How is it different from the original?
murders |> group_by(region)
# A tibble: 51 × 6
# Groups: region [4]
state abb region population total rate
<chr> <chr> <fct> <dbl> <dbl> <dbl>
1 Alabama AL South 4779736 135 2.82
2 Alaska AK West 710231 19 2.68
3 Arizona AZ West 6392017 232 3.63
4 Arkansas AR South 2915918 93 3.19
5 California CA West 37253956 1257 3.37
6 Colorado CO West 5029196 65 1.29
7 Connecticut CT Northeast 3574097 97 2.71
8 Delaware DE South 897934 38 4.23
9 District of Columbia DC South 601723 99 16.5
10 Florida FL South 19687653 669 3.40
# ℹ 41 more rows
Note the Groups: region [4] when we print the object. Although not immediately obvious from its appearance, this is now a special data frame called a grouped data frame, and dplyr functions, in particular summarize, will behave differently when acting on this object.
murders |> group_by(region) |> summarize(rate = sum(total) / sum(population) * 100000)
# A tibble: 4 × 2
region rate
<fct> <dbl>
1 Northeast 2.66
2 South 3.63
3 North Central 2.73
4 West 2.66
The summarize function applies the summarization to each group separately.
For another example, let’s compute the median, minimum, and maximum population in the four regions of the country using the median_min_max function defined above:
murders |> group_by(region) |> summarize(median_min_max(population))
# A tibble: 4 × 4
region median min max
<fct> <dbl> <dbl> <dbl>
1 Northeast 3574097 625741 19378102
2 South 4625364 601723 25145561
3 North Central 5495456. 672591 12830632
4 West 2700551 563626 37253956
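One way to picture what summarize does on a grouped data frame is as split, apply, combine: split the rows by group, apply the summary to each piece, and combine the results. The sketch below does this by hand with base R on a hypothetical table (toy, with made-up values) and compares it with the dplyr version. It is a conceptual illustration only, not a description of how dplyr is implemented internally.

```r
library(dplyr)

# Hypothetical toy table with two regions
toy <- data.frame(
  region = c("East", "East", "West"),
  population = c(1e6, 3e6, 2e6),
  total = c(10, 50, 40)
)

# By hand: split the rows by region, then compute the rate within each piece
pieces <- split(toy, toy$region)
by_hand <- sapply(pieces, function(d) sum(d$total) / sum(d$population) * 100000)

# With dplyr: the same summary applied to each group
by_dplyr <- toy |>
  group_by(region) |>
  summarize(rate = sum(total) / sum(population) * 100000)

by_hand
by_dplyr
```

Both approaches give one rate per region; the dplyr version returns the result as a data frame with the grouping variable as a column.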
6.7 ungroup
You can also summarize a variable without collapsing the dataset. We use mutate instead of summarize. Here is an example where we add a column with the population in each region and the number of states in the region, shown for each state. When we do this, we usually want to ungroup before continuing our analysis.
murders |> group_by(region) |>
  mutate(region_pop = sum(population), n = n()) |>
  ungroup()
# A tibble: 51 × 8
state abb region population total rate region_pop n
<chr> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <int>
1 Alabama AL South 4779736 135 2.82 115674434 17
2 Alaska AK West 710231 19 2.68 71945553 13
3 Arizona AZ West 6392017 232 3.63 71945553 13
4 Arkansas AR South 2915918 93 3.19 115674434 17
5 California CA West 37253956 1257 3.37 71945553 13
6 Colorado CO West 5029196 65 1.29 71945553 13
7 Connecticut CT Northeast 3574097 97 2.71 55317240 9
8 Delaware DE South 897934 38 4.23 115674434 17
9 District of Columbia DC South 601723 99 16.5 115674434 17
10 Florida FL South 19687653 669 3.40 115674434 17
# ℹ 41 more rows
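As an aside, dplyr 1.1.0 and later also accept a .by argument in mutate and summarize, which groups for a single operation and returns ungrouped output. The sketch below, on a hypothetical table (toy, with made-up values), compares it with the group_by and ungroup pattern used above; treat it as an optional alternative under the assumption that a recent dplyr is installed.

```r
library(dplyr)

# Hypothetical toy table with two regions
toy <- data.frame(
  state = c("A", "B", "C"),
  region = c("East", "East", "West"),
  population = c(1e6, 3e6, 2e6)
)

# Pattern from the text: group, mutate, then ungroup
via_group <- toy |>
  group_by(region) |>
  mutate(region_pop = sum(population), n = n()) |>
  ungroup()

# Per-operation grouping with .by (assumes dplyr >= 1.1.0)
via_by <- toy |>
  mutate(region_pop = sum(population), n = n(), .by = region)

# Neither result carries grouping forward
is_grouped_df(via_group)
is_grouped_df(via_by)
```

The added columns are the same in both cases; .by simply removes the need to remember the ungroup step.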
6.7.1 pull
Tidyverse functions such as summarize always return a data frame, even if the result is just one number.
murders |>
  summarize(rate = sum(total)/sum(population)*100000) |>
  class()
[1] "data.frame"
To get a number, use pull:
murders |>
  summarize(rate = sum(total)/sum(population)*100000) |>
  pull(rate)
[1] 3.034555
6.8 Sorting data frames
States ordered by rate:
murders |> arrange(rate) |> head()
state abb region population total rate
1 Vermont VT Northeast 625741 2 0.3196211
2 New Hampshire NH Northeast 1316470 5 0.3798036
3 Hawaii HI West 1360301 7 0.5145920
4 North Dakota ND North Central 672591 4 0.5947151
5 Iowa IA North Central 3046355 21 0.6893484
6 Idaho ID West 1567582 12 0.7655102
If we want decreasing order, we can either use the negative or, for more readability, use desc:
murders |> arrange(desc(rate)) |> head()
state abb region population total rate
1 District of Columbia DC South 601723 99 16.452753
2 Louisiana LA South 4533372 351 7.742581
3 Missouri MO North Central 5988927 321 5.359892
4 Maryland MD South 5773552 293 5.074866
5 South Carolina SC South 4625364 207 4.475323
6 Delaware DE South 897934 38 4.231937
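The two options for decreasing order can be compared on a small hypothetical table (toy, with made-up values). For a numeric column, negating the values and using desc give the same ordering; desc additionally works on character columns, where unary minus is not defined.

```r
library(dplyr)

# Hypothetical toy table
toy <- data.frame(state = c("A", "B", "C"), rate = c(2, 5, 1))

# Decreasing order by a numeric column: two equivalent spellings
by_minus <- arrange(toy, -rate)
by_desc <- arrange(toy, desc(rate))

# desc() also handles character columns
by_name <- arrange(toy, desc(state))

by_desc
by_name
```

In this example both numeric versions order the states as B, A, C, and sorting by desc(state) gives C, B, A.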
We can use two variables as well:
murders |> arrange(region, desc(rate)) |> head(11)
state abb region population total rate
1 Pennsylvania PA Northeast 12702379 457 3.5977513
2 New Jersey NJ Northeast 8791894 246 2.7980319
3 Connecticut CT Northeast 3574097 97 2.7139722
4 New York NY Northeast 19378102 517 2.6679599
5 Massachusetts MA Northeast 6547629 118 1.8021791
6 Rhode Island RI Northeast 1052567 16 1.5200933
7 Maine ME Northeast 1328361 11 0.8280881
8 New Hampshire NH Northeast 1316470 5 0.3798036
9 Vermont VT Northeast 625741 2 0.3196211
10 District of Columbia DC South 601723 99 16.4527532
11 Louisiana LA South 4533372 351 7.7425810
6.9 Exercises
Let’s redo the exercises from the previous chapter, but now with the tidyverse:
- Show the subset of murders containing the states with a rate of less than 1 murder per 100,000 people. Show all variables.
murders <- mutate(murders, rate = total/population*10^5)
filter(murders, rate < 1)
state abb region population total rate
1 Hawaii HI West 1360301 7 0.5145920
2 Idaho ID West 1567582 12 0.7655102
3 Iowa IA North Central 3046355 21 0.6893484
4 Maine ME Northeast 1328361 11 0.8280881
5 Minnesota MN North Central 5303925 53 0.9992600
6 New Hampshire NH Northeast 1316470 5 0.3798036
7 North Dakota ND North Central 672591 4 0.5947151
8 Oregon OR West 3831074 36 0.9396843
9 South Dakota SD North Central 814180 8 0.9825837
10 Utah UT West 2763885 22 0.7959810
11 Vermont VT Northeast 625741 2 0.3196211
12 Wyoming WY West 563626 5 0.8871131
- Show the subset of murders containing the states with a rate of less than 1 murder per 100,000 people that are in the West of the US. Don’t show the region variable.
murders |> filter(rate < 1 & region == "West") |> select(-region)
state abb population total rate
1 Hawaii HI 1360301 7 0.5145920
2 Idaho ID 1567582 12 0.7655102
3 Oregon OR 3831074 36 0.9396843
4 Utah UT 2763885 22 0.7959810
5 Wyoming WY 563626 5 0.8871131
- Show the largest state with a rate less than 1 per 100,000.
murders |> filter(rate < 1) |> slice_max(population)
state abb region population total rate
1 Minnesota MN North Central 5303925 53 0.99926
- Show the state with a population of more than 10 million with the lowest rate.
murders |> filter(population > 10^7) |> slice_min(rate)
state abb region population total rate
1 New York NY Northeast 19378102 517 2.66796
- Compute the rate for each region of the US.
murders |> group_by(region) |> summarize(rate = sum(total)/sum(population)*10^5)
# A tibble: 4 × 2
region rate
<fct> <dbl>
1 Northeast 2.66
2 South 3.63
3 North Central 2.73
4 West 2.66
For the next exercises we will be using data from the survey collected by the United States National Center for Health Statistics (NCHS). This center has conducted a series of health and nutrition surveys since the 1960s. Starting in 1999, about 5,000 individuals of all ages have been interviewed every year and they complete the health examination component of the survey. Part of the data is made available via the NHANES package. Once you install the NHANES package, you can load the data like this:
library(NHANES)
- Check for consistency between Race1 and Race3. Do any rows have different entries?
NHANES |> filter(!is.na(Race1) & !is.na(Race3)) |>
  filter(as.character(Race1) != as.character(Race3)) |>
  count(Race1, Race3)
# A tibble: 1 × 3
Race1 Race3 n
<fct> <fct> <int>
1 Other Asian 288
- Define a new race variable that has as few NA and Other as possible.
dat <- NHANES |> mutate(Race = Race3) |>
  mutate(Race = if_else(is.na(Race), Race1, Race))
- Compute the proportion of individuals that smoked at the time of the survey, by race category and gender. Keep track of how many people answered the question. Order the result by the number that answered. Read the help file for NHANES carefully before doing this one. To be clear: what proportion of people who have smoked at any time in their life are smoking now?
dat |> group_by(Gender, Race) |>
  summarize(n = sum(!is.na(Smoke100)),
            smoke = sum(SmokeNow == "Yes", na.rm = TRUE)) |>
  mutate(smoke = smoke/n) |>
  arrange(desc(n))
`summarise()` has grouped output by 'Gender'. You can override using the
`.groups` argument.
# A tibble: 12 × 4
# Groups: Gender [2]
Gender Race n smoke
<fct> <fct> <int> <dbl>
1 female White 2472 0.185
2 male White 2377 0.219
3 female Black 442 0.204
4 male Black 379 0.301
5 male Mexican 339 0.212
6 female Mexican 262 0.111
7 female Hispanic 218 0.0963
8 male Hispanic 198 0.278
9 female Other 177 0.203
10 male Other 162 0.309
11 female Asian 112 0.0357
12 male Asian 97 0.175
- Create a new dataset that combines the Mexican and Hispanic categories, and removes the Other category. Hint: use the function forcats::fct_collapse().
dat <- dat |>
  mutate(Race = forcats::fct_collapse(Race, Hispanic = c("Hispanic", "Mexican"))) |>
  filter(Race != "Other") |>
  mutate(Race = droplevels(Race))
- Recompute the proportion of individuals that smoke now by race category and gender. Order by rate of smokers.
dat |> group_by(Gender, Race) |>
  summarize(n = sum(!is.na(Smoke100)), smoke = sum(SmokeNow == "Yes", na.rm = TRUE)) |>
  mutate(smoke = smoke/n) |>
  arrange(desc(smoke))
`summarise()` has grouped output by 'Gender'. You can override using the
`.groups` argument.
# A tibble: 8 × 4
# Groups: Gender [2]
Gender Race n smoke
<fct> <fct> <int> <dbl>
1 male Black 379 0.301
2 male Hispanic 537 0.236
3 male White 2377 0.219
4 female Black 442 0.204
5 female White 2472 0.185
6 male Asian 97 0.175
7 female Hispanic 480 0.104
8 female Asian 112 0.0357
- Compute the median age by race category and gender. Order by gender, then age.
dat |> group_by(Gender, Race) |>
  summarize(Age = median(Age)) |>
  arrange(Gender, Age)
`summarise()` has grouped output by 'Gender'. You can override using the
`.groups` argument.
# A tibble: 8 × 3
# Groups: Gender [2]
Gender Race Age
<fct> <fct> <dbl>
1 female Hispanic 27
2 female Black 33
3 female Asian 37
4 female White 42
5 male Hispanic 28
6 male Black 30
7 male Asian 31
8 male White 40
- Now redo the smoking rate calculation by age group. But first, remove individuals with no age group and remove any age groups for which fewer than 10 people answered the question. Within each age group and gender, order by the percent that smokes.
res <- dat |>
  filter(!is.na(AgeDecade)) |>
  group_by(AgeDecade) |>
  mutate(n = sum(!is.na(Smoke100))) |>
  ungroup() |>
  filter(n >= 10) |>
  group_by(AgeDecade, Gender, Race) |>
  summarize(n = sum(!is.na(Smoke100)),
            smoke = sum(SmokeNow == "Yes", na.rm = TRUE),
            .groups = "drop") |> ## This is similar to running ungroup() in a next step
  mutate(smoke = smoke/n) |>
  arrange(AgeDecade, Gender, desc(smoke))
## Bonus: a plot
res |> ggplot(aes(AgeDecade, smoke, color = Race)) +
  geom_point() +
  geom_line() +
  facet_wrap(~Gender)
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?