Dataviz In Practice

2024-10-09

Data visualization in practice

  • In this chapter, we will demonstrate how relatively simple ggplot2 code can create insightful and aesthetically pleasing plots.

  • As motivation we will create plots that help us better understand trends in world health and economics.

  • As we go through our case study, we will describe relevant general data visualization principles and learn concepts such as faceting, time series plots, transformations, and ridge plots.

Case study 1

  • Hans Rosling was the co-founder of the Gapminder Foundation, an organization dedicated to educating the public by using data to dispel common myths about the so-called developing world.

  • The organization uses data to show how actual trends in health and economics contradict the narratives that emanate from sensationalist media coverage of catastrophes, tragedies, and other unfortunate events.

Case study 1

  • As stated in the Gapminder Foundation’s website:

Journalists and lobbyists tell dramatic stories. That’s their job. They tell stories about extraordinary events and unusual people. The piles of dramatic stories pile up in peoples’ minds into an over-dramatic worldview and strong negative stress feelings: “The world is getting worse!”, “It’s we vs them!”, “Other people are strange!”, “The population just keeps growing!” and “Nobody cares!”.

Case study 1

  • This talk is based on two talks by Hans Rosling, one on education and The Best Stats You’ve Ever Seen

  • We use data to attempt to answer the following two questions:

    1. Is it a fair characterization of today’s world to say it is divided into western rich nations and the developing world in Africa, Asia, and Latin America?

    2. Has income inequality across countries worsened during the last 40 years?

Case study 1

  • To answer these questions, we will be using the gapminder dataset provided in dslabs.
library(tidyverse) 
library(dslabs) 
ds_theme_set()
options(digits = 3)
gapminder |> as_tibble() |> head()
# A tibble: 6 × 9
  country    year infant_mortality life_expectancy fertility population      gdp
  <fct>     <int>            <dbl>           <dbl>     <dbl>      <dbl>    <dbl>
1 Albania    1960            115.             62.9      6.19    1636054 NA      
2 Algeria    1960            148.             47.5      7.65   11124892  1.38e10
3 Angola     1960            208              36.0      7.32    5270844 NA      
4 Antigua …  1960             NA              63.0      4.43      54681 NA      
5 Argentina  1960             59.9            65.4      3.11   20619075  1.08e11
6 Armenia    1960             NA              66.9      4.55    1867396 NA      
# ℹ 2 more variables: continent <fct>, region <fct>

Case study 1

  • As done in the New Insights on Poverty video, we start by testing our knowledge.

  • Which country do you think had the highest child mortality rates in 2015?

    1. Sri Lanka or Turkey.

    2. Poland or South Korea.

    3. Malaysia or Russia.

    4. Pakistan or Vietnam.

    5. Thailand or South Africa.

Case study 1

Here is the data:

country inf_mort country inf_mort
Sri Lanka 8.4 Turkey 11.6
Poland 4.5 South Korea 2.9
Malaysia 6.0 Russia 8.2
Pakistan 65.8 Vietnam 17.3
Thailand 10.5 South Africa 33.6

Scatterplots

  • We start by looking at data from about 50 years ago, when perhaps this view was first cemented in our minds.
filter(gapminder, year == 1962) |> 
  ggplot(aes(fertility, life_expectancy)) + 
  geom_point() 

Scatterplots

Scatterplots

  • To confirm that indeed these countries are from the regions we expect, we can use color to represent continent.
filter(gapminder, year == 1962) |> 
  ggplot( aes(fertility, life_expectancy, color = continent)) + 
  geom_point()  

Scatterplots

Faceting

  • To make comparisons, however, side by side plots are preferable.
filter(gapminder, year %in% c(1962, 2012)) |> 
  ggplot(aes(fertility, life_expectancy, col = continent)) + 
  geom_point() + 
  facet_grid(year~continent) 

Faceting

Faceting

  • In this case, there is just one variable and we use . to let facet know that we are not using a second variable:
filter(gapminder, year %in% c(1962, 2012)) |> 
  ggplot(aes(fertility, life_expectancy, col = continent)) + 
  geom_point() + 
  facet_grid(. ~ year) 

Faceting

This plot clearly shows that the majority of countries have moved from the developing world cluster to the western world one.

facet_wrap

The function facet_wrap permits automatically wrap the series of plots so that each display has viewable dimensions:

years <- c(1962, 1980, 1990, 2000, 2012) 
continents <- c("Europe", "Asia") 
gapminder |>  
  filter(year %in% years & continent %in% continents) |> 
  ggplot( aes(fertility, life_expectancy, col = continent)) + 
  geom_point() + 
  facet_wrap(~year)  

facet_wrap

Faceting

  • By default axes ranges are the same across panes.

  • We can make the axes adapt using scales = "free".

years <- c(1962, 1980, 1990, 2000, 2012) 
continents <- c("Europe", "Asia") 
gapminder |>  
  filter(year %in% years & continent %in% continents) |> 
  ggplot( aes(fertility, life_expectancy, col = continent)) + 
  geom_point() + 
  facet_wrap(~year, scales = "free")  
  • You will notice that it is harder to see the pattern.

Faceting

gganimate

library(gganimate)
years <- c(1960:2016)
p <- filter(gapminder, year %in% years & !is.na(group) & 
              !is.na(fertility) & !is.na(life_expectancy)) |>
  mutate(population_in_millions = population/10^6) |>
  ggplot(aes(fertility, y = life_expectancy, col = group, size = population_in_millions)) +
  geom_point(alpha = 0.8) +
  guides(size = "none") +
  theme(legend.title = element_blank()) + 
  coord_cartesian(ylim = c(30, 85)) + 
  labs(title = 'Year: {frame_time}', 
       x = 'Fertility rate (births per woman)', y = 'Life expectancy') +
  transition_time(year) +
  ease_aes('linear')

animate(p, end_pause = 15)
west <- c("Western Europe","Northern Europe","Southern Europe",
          "Northern America","Australia and New Zealand")

gapminder <- gapminder |> 
  mutate(group = case_when(
    region %in% west ~ "The West",
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
    TRUE ~ "Others"))
gapminder <- gapminder |> 
  mutate(group = factor(group, levels = rev(c("Others", "Latin America", "East Asia","Sub-Saharan Africa", "The West"))))

Time series plots

Here is a trend plot of United States fertility rates:

gapminder |>  
  filter(country == "United States") |>  
  ggplot(aes(year, fertility)) + 
  geom_point() 

Time series plots

Time series plots

  • When the points are regularly and densely spaced, as they are here, we create curves by joining the points with lines, to convey that these data are from a single series, here a country:
gapminder |>  
  filter(country == "United States") |>  
  ggplot(aes(year, fertility)) + 
  geom_line() 

Time series plots

Time series plots

  • This is particularly helpful when we look at two countries.
countries <- c("South Korea", "Germany") 
gapminder |> filter(country %in% countries) |>  
  ggplot(aes(year,fertility)) + 
  geom_line() 

Time series plots

Time series plots

  • This is not the plot that we want.

  • To let ggplot know that there are two curves that need to be made separately, we assign each point to a group, one for each country:

countries <- c("South Korea","Germany") 
gapminder |> filter(country %in% countries & !is.na(fertility)) |>  
  ggplot(aes(year, fertility, group = country)) + 
  geom_line() 

Time series plots

Time series plots

  • But which line goes with which country? We can assign colors to make this distinction.
countries <- c("South Korea","Germany") 
gapminder |> filter(country %in% countries & !is.na(fertility)) |>  
  ggplot(aes(year,fertility, col = country)) + 
  geom_line() 

Time series plots

Time series plots

  • We define a data table with the label locations and then use a second mapping just for these labels:
library(geomtextpath) 
gapminder |>  
  filter(country %in% countries) |>  
  ggplot(aes(year, life_expectancy, col = country, label = country)) + 
  geom_textpath() + 
  theme(legend.position = "none") 

Time series plots

Data transformations

  • We now shift our attention to the the commonly held notion that wealth distribution across the world has become worse during the last decades.

  • When general audiences are asked if poor countries have become poorer and rich countries become richer, the majority answers yes.

  • By using stratification, histograms, smooth densities, and boxplots, we will be able to understand if this is in fact the case.

  • First we learn how transformations can sometimes help provide more informative summaries and plots.

Data transformations

  • GDP measures the market value of goods and services produced by a country in a year.

  • The GDP per person is often used as a rough summary of a country’s wealth.

  • Gapmider adjusts GDP values for inflation and represent current US dollars.

  • We divide this quantity by 365 to obtain the more interpretable measure dollars per day.

gapminder <- gapminder |>   
  mutate(dollars_per_day = gdp/population/365) 

Log transformation

  • Here is a histogram of per day incomes from 1970:
past_year <- 1970 
gapminder |>  
  filter(year == past_year & !is.na(gdp)) |> 
  ggplot(aes(dollars_per_day)) +  
  geom_histogram(binwidth = 1, color = "black") 

Log transformation

Log transformation

  • It might be more informative to quickly be able to see how many countries have average daily incomes of about $1 (extremely poor), $2 (very poor), $4 (poor), $8 (middle), $16 (well off), $32 (rich), $64 (very rich) per day.

  • Here is the distribution if we apply a log base 2 transform:

gapminder |>  
  filter(year == past_year & !is.na(gdp)) |> 
  ggplot(aes(log2(dollars_per_day))) +  
  geom_histogram(binwidth = 1, color = "black") 

Log transformation

Which base?

  • In the dollars per day example, we used base 2 instead of base 10 because the resulting range is easier to interpret.

  • The range of the untransformed values is 0.327, 48.885.

  • In base 10, this turns into a range that includes very few integers: just 0 and 1.

  • With base 2, our range includes -2, -1, 0, 1, 2, 3, 4, and 5.

Which base?

  • It is easier to compute \(2^x\) and \(10^x\) when \(x\) is an integer and between -10 and 10, so we prefer to have smaller integers in the scale.

  • Another consequence of a limited range is that choosing the binwidth is more challenging.

  • With log base 2, we know that a binwidth of 1 will translate to a bin with range \(x\) to \(2x\).

Which base?

  • For an example in which base 10 makes more sense, consider population sizes:
gapminder |>  
  filter(year == past_year) |> 
  ggplot(aes(log10(population))) + 
  geom_histogram(binwidth = 0.5, color = "black") 

Which base?

Transform values or scale?

Suppose we define \(z = \log10(x)\) and want to know what is the value of \(z\) if we see it at the mark ^:

----1----^----2--------3----.

We know that the \(z\) is about 1.5.

But what If the scales are logged, what value of \(x\) marked by ^?

----10---^---100------1000---

We need to compute \(10^{1.5}\), which is a bit of extra work.

Transform values or scale?

  • However, the advantage of showing logged scales is that the original values are displayed in the plot, which are easier to interpret.

  • For example, we would see “32 dollars a day” instead of “5 log base 2 dollars a day”.

Transform values or scale?

  • To show logged sclaes, instead of logging the values first, we apply this layer:
gapminder |>  
  filter(year == past_year & !is.na(gdp)) |> 
  ggplot(aes(dollars_per_day)) +  
  geom_histogram(binwidth = 1, color = "black") + 
  scale_x_continuous(trans = "log2") 

Transform values or scale?

Comparing distributions

  • We reorder the regions by the median value and use a log scale.
gapminder |>  
  filter(year == past_year & !is.na(gdp)) |> 
  mutate(region = reorder(region, dollars_per_day, FUN = median)) |> 
  ggplot(aes(dollars_per_day, region)) + 
  geom_point() + 
  scale_x_continuous(trans = "log2")   

Comparing distributions

Comparing distributions

gapminder <- gapminder |>  
  mutate(group = case_when( 
    region %in% c("Western Europe", "Northern Europe","Southern Europe",  
                    "Northern America",  
                  "Australia and New Zealand") ~ "West", 
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia", 
    region %in% c("Caribbean", "Central America",  
                  "South America") ~ "Latin America", 
    continent == "Africa" &  
      region != "Northern Africa" ~ "Sub-Saharan", 
    TRUE ~ "Others")) |>  
  mutate(group = factor(group, levels = c("Others", "Latin America",  
                                          "East Asia", "Sub-Saharan", 
                                          "West"))) 

Boxplots

p <- gapminder |>  
  filter(year == past_year & !is.na(gdp)) |> 
  mutate(group = reorder(group, dollars_per_day, FUN = median)) |> 
  ggplot(aes(group, dollars_per_day)) + 
  geom_boxplot() + 
  scale_y_continuous(trans = "log2") + 
  xlab("") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  
p 

Boxplots

Boxplots

  • Boxplots have the limitation that by summarizing the data into five numbers, we might miss important characteristics of the data.

  • One way to avoid this is by showing the data.

p + geom_point(alpha = 0.5) 

Boxplots

Ridge plots

  • Showing each individual point does not always reveal important characteristics of the distribution.

  • Boxplots help with this by providing a five-number summary, but this has limitations too.

  • For example, boxplots will not permit us to discover bimodal distributions.

Ridge plots

Here is an simulated example:

Ridge plots

  • Here is the income data shown above with boxplots but with a ridge plot.
library(ggridges) 
p <- gapminder |>  
  filter(year == past_year & !is.na(dollars_per_day)) |> 
  mutate(group = reorder(group, dollars_per_day, FUN = median)) |> 
  ggplot(aes(dollars_per_day, group)) +  
  scale_x_continuous(trans = "log2")  
p  + geom_density_ridges()  

Ridge plots

Ridge plots

  • If the number of data points is small enough, we can add them to the ridge plot using the following code:
p + geom_density_ridges(jittered_points = TRUE) 

Ridge plots

Ridge plots

  • To show data points, but without using jitter we can use the following code to add what is referred to as a rug representation of the data.
p + geom_density_ridges(jittered_points = TRUE,  
                        position = position_points_jitter(height = 0), 
                        point_shape = '|', point_size = 3,  
                        point_alpha = 1, alpha = 0.7) 

Ridge plots

Case study 1

We keep countries that exist both in 1970 and 2010:

past_year <- 1970 
present_year <- 2010 
years <- c(past_year, present_year) 
country_list <- gapminder |>  
  filter(year %in% c(present_year, past_year)) |> 
  group_by(country) |> 
  summarize(n = sum(!is.na(dollars_per_day)), .groups = "drop") |> 
  filter(n == 2) |> 
  pull(country) 

Case study 1

  • We can compare the distributions using this code:
gapminder |>  
  filter(year %in% years & country %in% country_list) |> 
  mutate(west = ifelse(group == "West", "West", "Developing")) |> 
  ggplot(aes(dollars_per_day)) + 
  geom_histogram(binwidth = 1, color = "black") + 
  scale_x_continuous(trans = "log2") +  
  facet_grid(year ~ west) 

Case study 1

Case study 1

To see which specific regions improved the most, we can remake the boxplots we made above, but now adding the year 2010 and then using facet to compare the two years.

gapminder |>  
  filter(year %in% years & country %in% country_list) |> 
  mutate(group = reorder(group, dollars_per_day, FUN = median)) |> 
  ggplot(aes(group, dollars_per_day)) + 
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  scale_y_continuous(trans = "log2") + 
  xlab("") + 
  facet_grid(. ~ year) 

Case study 1

Case study 1

To add color:

gapminder |>  
  filter(year %in% years & country %in% country_list) |> 
  mutate(group = reorder(group, dollars_per_day, FUN = median),
         year = factor(year)) |> 
  ggplot(aes(group, dollars_per_day, fill = year)) + 
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  scale_y_continuous(trans = "log2") + 
  xlab("")  

Case study 1

Case study 1

Let’s start by noting that density plots for income distribution in 1970 and 2010 deliver the message that the gap is closing:

gapminder |>  
  filter(year %in% years & country %in% country_list) |> 
  ggplot(aes(dollars_per_day)) + 
  geom_density(fill = "grey") +  
  scale_x_continuous(trans = "log2") +  
  facet_grid(. ~ year) 

Case study 1

Accessing computed variables

Use after_stat to access density.

p <- gapminder |>  
  filter(year %in% years & country %in% country_list) |> 
  mutate(group = ifelse(group == "West", "West", "Developing")) |> 
  ggplot(aes(dollars_per_day, y = after_stat(count), fill = group)) + 
  scale_x_continuous(trans = "log2", limits = c(0.125, 300)) 
p + geom_density(alpha = 0.2) + facet_grid(year ~ .) 

Accessing computed variables

Changing smoothness

We can change the smoothness. We selected 0.75 after trying out several values.

p + geom_density(alpha = 0.2, bw = 0.75) + facet_grid(year ~ .) 

Multiple ridge plot

To visualize if any of the groups defined above are driving this we can quickly make a ridge plot:

gapminder |>  
  filter(year %in% years & !is.na(dollars_per_day)) |> 
  mutate(group = reorder(group, dollars_per_day, FUN = median)) |>
  ggplot(aes(dollars_per_day, group)) +  
  scale_x_continuous(trans = "log2") +  
  geom_density_ridges(bandwidth = 1.5) + 
  facet_grid(. ~ year) 

Multiple ridge plot

Stacking densities

  • Another way to achieve this is by stacking the densities on top of each other:
gapminder |>  
    filter(year %in% years & country %in% country_list) |> 
  group_by(year) |> 
  mutate(weight = population/sum(population)*2) |> 
  ungroup() |> 
  ggplot(aes(dollars_per_day, fill = group)) + 
  scale_x_continuous(trans = "log2", limits = c(0.125, 300)) +  
  geom_density(alpha = 0.2, bw = 0.75, position = "stack") +  
  facet_grid(year ~ .)  

Stacking densities

Weighted densities

Here we weigh the countries by size:

gapminder |>  
  filter(year %in% years & country %in% country_list) |> 
  group_by(year) |> 
  mutate(weight = population/sum(population)*2) |> 
  ungroup() |> 
  ggplot(aes(dollars_per_day, fill = group, weight = weight)) + 
  scale_x_continuous(trans = "log2", limits = c(0.125, 300)) +  
  geom_density(alpha = 0.2, bw = 0.75, position = "stack") + 
  facet_grid(year ~ .)  

Weighted densities

The ecological fallacy

Show the data

Case study 2

the_disease <- "Measles" 
dat <- us_contagious_diseases |> 
  filter(!state %in% c("Hawaii","Alaska") & disease == the_disease) |> 
  mutate(rate = count / population * 10000 * 52 / weeks_reporting) |>  
  mutate(state = reorder(state, ifelse(year <= 1963, rate, NA),  
                         median, na.rm = TRUE))  

Heatmaps

library(RColorBrewer)
dat |> ggplot(aes(year, state, fill = rate)) + 
  geom_tile(color = "grey50") + 
  scale_x_continuous(expand = c(0,0)) + 
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") + 
  geom_vline(xintercept = 1963, col = "blue") + 
  theme_minimal() +   
  theme(panel.grid = element_blank(),  
        legend.position = "bottom",  
        text = element_text(size = 8)) + 
  labs(title = the_disease, x = "", y = "") 

Heatmaps

Using position

We will show each state and the avearge:

avg <- us_contagious_diseases |> 
  filter(disease == the_disease) |> group_by(year) |> 
  summarize(us_rate = sum(count, na.rm = TRUE) /  
              sum(population, na.rm = TRUE) * 10000) 

Using position

dat |>  
  filter(!is.na(rate)) |> 
    ggplot() + 
  geom_line(aes(year, rate, group = state),  color = "grey50",  
            show.legend = FALSE, alpha = 0.2, linewidth = 1) + 
  geom_line(mapping = aes(year, us_rate),  data = avg, linewidth = 1) + 
  scale_y_continuous(trans = "sqrt", breaks = c(5, 25, 125, 300)) +  
  ggtitle("Cases per 10,000 by state") +  
  xlab("") + ylab("") + 
  geom_text(data = data.frame(x = 1955, y = 50),  
            mapping = aes(x, y, label = "US average"),  
            color = "black") +  
  geom_vline(xintercept = 1963, col = "blue") 

Using position