Tidying data

2024-09-25

Tidying data

  • We will a wide format dataset as an example:
library(tidyverse) 
library(dslabs)
path <- system.file("extdata", package = "dslabs")
filename <- file.path(path, "fertility-two-countries-example.csv")
wide_data <- read_csv(filename)
select(wide_data, 1:10)
# A tibble: 2 × 10
  country     `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968`
  <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Germany       2.41   2.44   2.47   2.49   2.49   2.48   2.44   2.37   2.28
2 South Korea   6.16   5.99   5.79   5.57   5.36   5.16   4.99   4.85   4.73

pivot_longer

  • One of the most used functions in tidyr is pivot_longer.

  • The first argument is a data frame, the one that will be converted.

  • We want to reshape rows represents a fertility observation.

  • We need three columns to store the year, country, and the observed value.

pivot_longer

  • In its current form, data from different years are in different columns with the year values stored in the column names.

  • The names_to and values_to argument tell pivot_longer the column names we want to assign to the columns containing the current column names and observations, respectively.

  • The default names are name and value, in our case a better choice is year and fertility.

pivot_longer

  • Through cols, the second argument, we specify the columns containing observed values; these are the columns that will be pivoted.

  • The default is to pivot all columns so, in most cases, we have to specify the columns. In our example we want columns 1960, 1961 up to 2015.

pivot_longer

The code to pivot the fertility data therefore looks like this:

new_tidy_data <- wide_data |>
  pivot_longer(`1960`:`2015`, names_to = "year", values_to = "fertility")
head(new_tidy_data)
# A tibble: 6 × 3
  country year  fertility
  <chr>   <chr>     <dbl>
1 Germany 1960       2.41
2 Germany 1961       2.44
3 Germany 1962       2.47
4 Germany 1963       2.49
5 Germany 1964       2.49
6 Germany 1965       2.48
  • Data have been converted to tidy format with columns year and fertility.

pivot_longer

  • A quicker way to write this code is to specify which column will not include in the pivot:
new_tidy_data <- wide_data |>
  pivot_longer(-country, names_to = "year", values_to = "fertility")

pivot_longer

  • Now that the data is tidy, we can use it with other tidyverse functions, such ggplot2:
new_tidy_data |> 
  mutate(year = as.numeric(year)) |>
  ggplot(aes(year, fertility, color = country)) + 
  geom_line()

pivot_wider

  • It is sometimes useful for data wrangling purposes to convert tidy data into wide data.

  • We often use this as an intermediate step in tidying up data.

  • The pivot_wider function is basically the inverse of pivot_longer.

pivot_wider

  • The first argument is for the data, but since we are using the pipe, we don’t show it.

  • The names_from argument tells pivot_wider which variable will be used as the column names.

  • The values_from argument specifies which variable to use to fill out the cells.

pivot_wider

Here is some example code

new_wide_data <- new_tidy_data |> 
  pivot_wider(names_from = year, values_from = fertility)
select(new_wide_data, country, `1960`:`1967`)
# A tibble: 2 × 9
  country     `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
  <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Germany       2.41   2.44   2.47   2.49   2.49   2.48   2.44   2.37
2 South Korea   6.16   5.99   5.79   5.57   5.36   5.16   4.99   4.85

Similar to pivot_wider, names_from and values_from default to name and value.

Example

We now demonstrate with a more complex example in which we have to use both pivot_longer and pivot_wider to tidy the data.

path <- system.file("extdata", package = "dslabs")

filename <- "life-expectancy-and-fertility-two-countries-example.csv"
filename <-  file.path(path, filename)

raw_dat <- read_csv(filename)
select(raw_dat, 1:5)
# A tibble: 2 × 5
  country     `1960_fertility` `1960_life_expectancy` `1961_fertility`
  <chr>                  <dbl>                  <dbl>            <dbl>
1 Germany                 2.41                   69.3             2.44
2 South Korea             6.16                   53.0             5.99
# ℹ 1 more variable: `1961_life_expectancy` <dbl>

Example

  • Note that the data is in wide format.

  • Also that this table includes values for two variables, fertility and life expectancy, with the column name encoding which column represents which variable.

  • Encoding information in the column names is not recommended but, unfortunately, it is quite common.

Example

  • We start with the pivot_longer function, but we should no longer use the column name year for the new column since it also contains the variable type.

  • We will call it name, the default, for now:

raw_dat |> pivot_longer(-country) |> head()
# A tibble: 6 × 3
  country name                 value
  <chr>   <chr>                <dbl>
1 Germany 1960_fertility        2.41
2 Germany 1960_life_expectancy 69.3 
3 Germany 1961_fertility        2.44
4 Germany 1961_life_expectancy 69.8 
5 Germany 1962_fertility        2.47
6 Germany 1962_life_expectancy 70.0 

Example

  • The result is not exactly what we refer to as tidy since each observation is associated with two, not one, rows.

  • We want to have the values from the two variables, fertility and life expectancy, in two separate columns.

  • The first challenge to achieve this is to separate the name column into the year and the variable type.

Example

  • Encoding multiple variables in a column name is such a common problem that the tidyr package includes function to separate these columns into two or more:
raw_dat |> 
  pivot_longer(-country) |> 
  separate_wider_delim(name, delim = "_",  names = c("year", "name"), 
                       too_many = "merge") |>
  head()
# A tibble: 6 × 4
  country year  name            value
  <chr>   <chr> <chr>           <dbl>
1 Germany 1960  fertility        2.41
2 Germany 1960  life_expectancy 69.3 
3 Germany 1961  fertility        2.44
4 Germany 1961  life_expectancy 69.8 
5 Germany 1962  fertility        2.47
6 Germany 1962  life_expectancy 70.0 

Example

  • But we are not done yet.

  • We need to create a column for each variable and change year to a number.

  • As we learned, the pivot_wider function can do this.

Example

dat <- raw_dat |> 
  pivot_longer(-country) |> 
  separate_wider_delim(name, delim = "_", 
                       names = c("year", "name"), 
                       too_many = "merge") |>
  pivot_wider() |>
  mutate(year = as.integer(year))
head(dat)
# A tibble: 6 × 4
  country  year fertility life_expectancy
  <chr>   <int>     <dbl>           <dbl>
1 Germany  1960      2.41            69.3
2 Germany  1961      2.44            69.8
3 Germany  1962      2.47            70.0
4 Germany  1963      2.49            70.1
5 Germany  1964      2.49            70.7
6 Germany  1965      2.48            70.6

Example

The data is now is now in tidy form and we can use with other packages:

p1 <- dat |> ggplot(aes(year, fertility, color = country)) + geom_line(show.legend = FALSE)
p2 <- dat |> ggplot(aes(year, life_expectancy, color = country)) + geom_line()
gridExtra::grid.arrange(p1, p2, ncol = 2, widths = c(3,4))