Tidyverse

2024-09-23

Tidyverse

library(tidyverse)
  • The tidyverse is not a package but a group of packages developed to work with each other.

  • The tidyverse makes data analysis simpler and code easier to read, at the cost of some flexibility.

  • One way code is simplified is by ensuring that all functions take and return tidy data.

Tidy data

  • Stored in a data frame.

  • Each observation is exactly one row.

  • Variables are stored in columns.

  • Not all data can be represented this way, but a very large subset of data analysis challenges are based on tidy data.

  • Assuming data is tidy simplifies coding and frees up our minds for statistical thinking.

Tidy data

  • This is an example of a tidy dataset:
      country year fertility
1     Germany 1960      2.41
2 South Korea 1960      6.16
3     Germany 1961      2.44
4 South Korea 1961      5.99
5     Germany 1962      2.47
6 South Korea 1962      5.79

Tidy data

  • Originally, the data was in the following format:
      country 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
1     Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37 2.28 2.17 2.04
2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85 4.73 4.62 4.53
  • This is not tidy.
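
The wide table above can be reshaped into the tidy format with tidyr. The following is a minimal sketch, assuming only the first three year columns and a hand-built tibble named wide (a hypothetical name, not from the lecture):

```r
library(tidyr)
library(tibble)

# Wide format: one row per country, one column per year (first three years only)
wide <- tibble(
  country = c("Germany", "South Korea"),
  `1960` = c(2.41, 6.16),
  `1961` = c(2.44, 5.99),
  `1962` = c(2.47, 5.79)
)

# Gather the year columns into a year column and a fertility column
tidy <- pivot_longer(
  wide,
  cols = -country,                          # every column except country
  names_to = "year",                        # old column names become values of year
  values_to = "fertility",                  # old cell values become fertility
  names_transform = list(year = as.integer) # store year as an integer, not text
)

tidy
```

The result has one row per country and year, so it holds the same observations as the tidy example, although the row order may differ.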

Tidyverse packages

  • tibble - improves data frame class.

  • readr - import data.

  • dplyr - used to modify data frames.

  • ggplot2 - simplifies plotting.

  • tidyr - helps convert data into tidy format.

  • stringr - string processing.

  • forcats - utilities for categorical data.

  • purrr - tidy version of apply functions.

dplyr

  • In this lecture we focus mainly on dplyr.

  • In particular the following functions:

    • mutate

    • select

    • across

    • filter

    • group_by

    • summarize

Adding a column with mutate

murders <- mutate(murders, rate = total/population*100000)
  • Notice that here we used total and population inside the function, which are objects that are not defined in our workspace.

  • This is known as non-standard evaluation, where the context is used to determine what variable names mean.

  • Tidyverse extensively uses non-standard evaluation.

  • This can create confusion but it certainly simplifies code.
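
To make the contrast concrete, here is a small sketch comparing non-standard evaluation in mutate with the explicit base R equivalent. The data frame df is hypothetical and only mimics the column names of murders:

```r
library(dplyr)

# Hypothetical two-row data frame with the same column names as murders
df <- data.frame(state = c("A", "B"),
                 total = c(10, 450),
                 population = c(1e6, 9e6))

# Non-standard evaluation: total and population are looked up inside df
with_nse <- mutate(df, rate = total / population * 100000)

# Standard evaluation in base R: the columns must be referenced explicitly
with_base <- df
with_base$rate <- df$total / df$population * 100000

all.equal(with_nse, with_base)
```

Both approaches produce the same table; the mutate version just avoids repeating df$.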

Subsetting with filter

filter(murders, rate <= 0.71)
          state abb        region population total      rate
1        Hawaii  HI          West    1360301     7 0.5145920
2          Iowa  IA North Central    3046355    21 0.6893484
3 New Hampshire  NH     Northeast    1316470     5 0.3798036
4  North Dakota  ND North Central     672591     4 0.5947151
5       Vermont  VT     Northeast     625741     2 0.3196211

Selecting columns with select

new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)
          state        region      rate
1        Hawaii          West 0.5145920
2          Iowa North Central 0.6893484
3 New Hampshire     Northeast 0.3798036
4  North Dakota North Central 0.5947151
5       Vermont     Northeast 0.3196211

Transforming variables

  • The function mutate can also be used to transform variables.

  • For example, the following code replaces the population variable with its base-10 logarithm:

mutate(murders, population = log10(population))

Transforming variables

  • Often, we need to apply the same transformation to several variables.

  • The function across facilitates the operation.

  • For example, if we want to log transform both population and total murders we can use:

mutate(murders, across(c(population, total), log10))

Transforming variables

  • Selection helper functions, such as where, come in handy when using across.

  • An example is if we want to apply the same transformation to all numeric variables:

mutate(murders, across(where(is.numeric), log10))
  • or all character variables:
mutate(murders, across(where(is.character), tolower))
  • There are several other useful helper functions.
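
As an illustration of two further options, the sketch below uses the .names argument of across to keep the original columns, and the starts_with helper to pick columns by name. The data frame df is hypothetical:

```r
library(dplyr)

# Hypothetical data with round values so the logs are easy to read
df <- data.frame(state = c("A", "B"),
                 population = c(1e6, 1e7),
                 total = c(10, 100))

# .names keeps the original columns and adds transformed copies
out <- mutate(df, across(c(population, total), log10, .names = "log10_{.col}"))

# starts_with() selects every column whose name begins with "pop"
out2 <- mutate(df, across(starts_with("pop"), log10))

names(out)
```

Without .names, across overwrites the selected columns, as in the examples above.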

The pipe: |> or %>%

  • We use the pipe to chain a series of operations.

  • For example if we want to select columns and then filter rows we chain like this:

\[ \mbox{original data } \rightarrow \mbox{ select } \rightarrow \mbox{ filter } \]

The pipe: |> or %>%

  • The code looks like this:
murders |> select(state, region, rate) |> filter(rate <= 0.71)
          state        region      rate
1        Hawaii          West 0.5145920
2          Iowa North Central 0.6893484
3 New Hampshire     Northeast 0.3798036
4  North Dakota North Central 0.5947151
5       Vermont     Northeast 0.3196211
  • The object on the left of the pipe is used as the first argument of the function on the right.

  • In the piped call, what would normally be the function's second argument is written first, the third is written second, and so on.

The pipe: |> or %>%

  • Here is a simple example:
16 |> sqrt() |> log(base = 2)
[1] 2
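
For comparison, the same computation can be written as a nested call. This sketch assumes R 4.1 or later, where the native pipe |> is available:

```r
# Nested call: read from the inside out
nested <- log(sqrt(16), base = 2)

# Piped call: read from left to right
piped <- 16 |> sqrt() |> log(base = 2)

identical(nested, piped)
```

The pipe does not change what is computed, only the order in which the steps are written.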

Summarizing data

  • We use the dplyr summarize function, not to be confused with summary from R base.

  • Here is an example of how it works:

murders |> summarize(avg = mean(rate))
       avg
1 2.779125
  • Let’s compute the murder rate for the US. Is the value above the answer?

Summarizing data

  • No, the rate is NOT the average of rates.

  • It is the total murders divided by total population:

murders |> summarize(rate = sum(total)/sum(population)*100000)
      rate
1 3.034555
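
A toy example may help show why the two numbers differ. The sketch below uses a hypothetical data frame with one small low-rate state and one large high-rate state:

```r
library(dplyr)

# Hypothetical data: state A is small with a low rate, state B is large with a high rate
df <- data.frame(state = c("A", "B"),
                 total = c(10, 450),
                 population = c(1e6, 9e6)) |>
  mutate(rate = total / population * 100000)

res <- df |> summarize(
  avg_of_rates = mean(rate),                             # every state counts equally
  overall_rate = sum(total) / sum(population) * 100000,  # every person counts equally
  weighted_avg = weighted.mean(rate, population)         # rates weighted by population
)

res
```

Here the average of rates is 3 while the overall rate is 4.6, because the simple average gives the small state the same weight as the large one. Weighting each rate by population recovers the overall rate.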

Multiple summaries

  • Suppose we want the median, minimum, and maximum population size:
murders |> summarize(median = median(population), min = min(population), max = max(population))
   median    min      max
1 4339367 563626 37253956
  • Why don’t we use quantiles?
murders |> summarize(quantiles = quantile(population, c(0.5, 0, 1)))
  quantiles
1   4339367
2    563626
3  37253956

Multiple summaries

Warning

Using a function that returns more than one value per group within summarize is deprecated as of dplyr 1.1.0.

  • For multiple summaries we use reframe:
murders |> reframe(quantiles = quantile(population, c(0.5, 0, 1)))
  quantiles
1   4339367
2    563626
3  37253956
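
reframe also works on grouped data, returning several rows per group. The following is a sketch on a hypothetical data frame, assuming dplyr 1.1.0 or later; the extra prob column is added here only to label the rows:

```r
library(dplyr)

# Hypothetical data: three populations in each of two regions
df <- data.frame(region = c("East", "East", "East", "West", "West", "West"),
                 population = c(1, 2, 3, 10, 20, 30))

probs <- c(0.5, 0, 1)  # median, minimum, maximum

out <- df |>
  group_by(region) |>
  reframe(prob = probs,                                  # label for each quantile
          value = unname(quantile(population, probs)))   # three values per region

out
```

The result has three rows per region, and reframe returns it as an ungrouped data frame.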

Multiple summaries

  • However, if we want a column per summary, as when we called min, median, and max separately, we have to define a function that returns a data frame like this:
median_min_max <- function(x){
  qs <- quantile(x, c(0.5, 0, 1))
  data.frame(median = qs[1], min = qs[2], max = qs[3])
}
  • Then we can call summarize:
murders |> summarize(median_min_max(population))
   median    min      max
1 4339367 563626 37253956
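
An alternative that avoids writing a helper function, offered here as a sketch rather than as the approach used in this lecture, is to pass a named list of functions to across. The data frame df is hypothetical:

```r
library(dplyr)

# Hypothetical data
df <- data.frame(population = c(1, 2, 3, 10))

# Each function in the named list produces its own column
out <- df |>
  summarize(across(population, list(median = median, min = min, max = max)))

names(out)
```

By default the new columns are named by joining the column name and the function name with an underscore, for example population_median.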

Group then summarize

  • Let’s compute murder rate by region.

  • Take a close look at this output:

murders |> group_by(region) |> head(4)
# A tibble: 4 × 6
# Groups:   region [2]
  state    abb   region population total  rate
  <chr>    <chr> <fct>       <dbl> <dbl> <dbl>
1 Alabama  AL    South     4779736   135  2.82
2 Alaska   AK    West       710231    19  2.68
3 Arizona  AZ    West      6392017   232  3.63
4 Arkansas AR    South     2915918    93  3.19
  • Note the Groups: region [2] at the top; the count is 2 because only two regions appear in these four rows.

  • This is a special data frame called a grouped data frame.

Group then summarize

  • In particular, summarize will behave differently when acting on this object.
murders |> 
  group_by(region) |> 
  summarize(rate = sum(total) / sum(population) * 100000)
# A tibble: 4 × 2
  region         rate
  <fct>         <dbl>
1 Northeast      2.66
2 South          3.63
3 North Central  2.73
4 West           2.66
  • The summarize function applies the summarization to each group separately.
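
The difference is easy to check by hand on a tiny table. This sketch uses a hypothetical three-row data frame and runs the same summarize call with and without grouping:

```r
library(dplyr)

# Hypothetical data: two states in East, one in West
df <- data.frame(region = c("East", "East", "West"),
                 total = c(10, 30, 60),
                 population = c(1e6, 1e6, 2e6))

# Without grouping: one row for the whole table
ungrouped <- df |>
  summarize(rate = sum(total) / sum(population) * 100000, n = n())

# With grouping: one row per region
by_region <- df |>
  group_by(region) |>
  summarize(rate = sum(total) / sum(population) * 100000, n = n())

by_region
```

The ungrouped call returns a single row, while the grouped call returns one row per region with the sums taken within each region.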

Group then summarize

  • For another example, let’s compute the median, minimum, and maximum population in the four regions of the country using the median_min_max previously defined:
murders |> group_by(region) |> summarize(median_min_max(population))
# A tibble: 4 × 4
  region          median    min      max
  <fct>            <dbl>  <dbl>    <dbl>
1 Northeast     3574097  625741 19378102
2 South         4625364  601723 25145561
3 North Central 5495456. 672591 12830632
4 West          2700551  563626 37253956

Group then summarize

  • You can also summarize a variable but not collapse the dataset.

  • We use mutate instead of summarize.

  • Here is an example where we add a column with the population in each region and the number of states in the region, shown for each state.

murders |> group_by(region) |> 
  mutate(region_pop = sum(population), n = n())

ungroup

  • When we do this, we usually want to ungroup before continuing our analysis.
murders |> group_by(region) |> 
  mutate(region_pop = sum(population), n = n()) |>
  ungroup()
  • This avoids having a grouped data frame that we don’t need.
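
To see why this matters, the sketch below computes a population share with and without ungroup on a hypothetical data frame:

```r
library(dplyr)

# Hypothetical data: two states in East, one in West
df <- data.frame(region = c("East", "East", "West"),
                 population = c(1, 3, 6))

grouped <- df |>
  group_by(region) |>
  mutate(region_pop = sum(population))

# Still grouped: sum(population) is computed within each region
share_in_region <- grouped |>
  mutate(share = population / sum(population))

# Ungrouped: sum(population) is computed over all rows
share_overall <- grouped |>
  ungroup() |>
  mutate(share = population / sum(population))
```

The same mutate call gives each state's share of its region in the first case and its share of the total in the second, so forgetting ungroup can silently change later results.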

pull

  • Tidyverse functions such as summarize return a data frame, even if the result is just one number.
murders |> 
  summarize(rate = sum(total)/sum(population)*100000) |>
  class()
[1] "data.frame"

pull

  • To get a numeric vector, use pull:
murders |> 
  summarize(rate = sum(total)/sum(population)*100000) |>
  pull(rate) 
[1] 3.034555
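
As a quick check of what pull returns, this sketch on a hypothetical data frame compares the summarized data frame with the pulled value:

```r
library(dplyr)

# Hypothetical data with the same column names as murders
df <- data.frame(total = c(10, 450), population = c(1e6, 9e6))

us_rate <- df |>
  summarize(rate = sum(total) / sum(population) * 100000)

as_df  <- us_rate              # a data frame with one row and one column
as_num <- pull(us_rate, rate)  # a plain numeric vector of length one

is.data.frame(as_df)
is.numeric(as_num)
```

pull(us_rate, rate) gives the same value as us_rate$rate, but it fits naturally at the end of a pipeline.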

Sorting data frames

  • States ordered by rate:
murders |> arrange(rate) |> head()
          state abb        region population total      rate
1       Vermont  VT     Northeast     625741     2 0.3196211
2 New Hampshire  NH     Northeast    1316470     5 0.3798036
3        Hawaii  HI          West    1360301     7 0.5145920
4  North Dakota  ND North Central     672591     4 0.5947151
5          Iowa  IA North Central    3046355    21 0.6893484
6         Idaho  ID          West    1567582    12 0.7655102

Sorting data frames

  • To sort in decreasing order we can either negate the variable or, for better readability, use desc:
murders |> arrange(desc(rate)) |> head()
                 state abb        region population total      rate
1 District of Columbia  DC         South     601723    99 16.452753
2            Louisiana  LA         South    4533372   351  7.742581
3             Missouri  MO North Central    5988927   321  5.359892
4             Maryland  MD         South    5773552   293  5.074866
5       South Carolina  SC         South    4625364   207  4.475323
6             Delaware  DE         South     897934    38  4.231937
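
The two options can be compared side by side. In this sketch, on a hypothetical data frame, both give the same order for a numeric column, but only desc also works for a character column:

```r
library(dplyr)

# Hypothetical data
df <- data.frame(state = c("A", "B", "C"), rate = c(2.5, 0.3, 7.7))

by_neg  <- arrange(df, -rate)        # negate the numeric column
by_desc <- arrange(df, desc(rate))   # same order, stated explicitly
by_name <- arrange(df, desc(state))  # -state would fail because state is character

by_desc
```

Besides readability, desc has the practical advantage that it is not limited to numeric variables.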

Sorting data frames

  • We can use two variables as well:
murders |> arrange(region, desc(rate)) |> head(11)
                  state abb    region population total       rate
1          Pennsylvania  PA Northeast   12702379   457  3.5977513
2            New Jersey  NJ Northeast    8791894   246  2.7980319
3           Connecticut  CT Northeast    3574097    97  2.7139722
4              New York  NY Northeast   19378102   517  2.6679599
5         Massachusetts  MA Northeast    6547629   118  1.8021791
6          Rhode Island  RI Northeast    1052567    16  1.5200933
7                 Maine  ME Northeast    1328361    11  0.8280881
8         New Hampshire  NH Northeast    1316470     5  0.3798036
9               Vermont  VT Northeast     625741     2  0.3196211
10 District of Columbia  DC     South     601723    99 16.4527532
11            Louisiana  LA     South    4533372   351  7.7425810