This homework is due Sunday February 14, 2016 at 11:59 PM. When complete, submit your code in the R Markdown file and the knitted HTML file on Canvas.
This HW is based on Hans Rosling talks New Insights on Poverty and The Best Stats You’ve Ever Seen.
The assignment uses data to answer specific question about global health and economics. The data contradicts commonly held preconceived notions. For example, Hans Rosling starts his talk by asking: (paraphrased) “for each of the six pairs of countries below, which country do you think had the highest child mortality in 2015?”
Most people get them wrong. Why is this? In part it is due to our preconceived notion that the world is divided into two groups: the Western world versus the third world, characterized by “long life,small family” and “short life, large family” respectively. In this homework we will use data visualization to gain insights on this topic.
The first step in our analysis is to download and organize the data. The necessary data to answer these question is available on the gapminder website.
We will use the following datasets:
Create five tbl_df
table objects, one for each of the tables provided in the above files. Hints: Use the read_csv
function. Because these are only temporary files, give them short names.
Write a function called my_func
that takes a table as an argument and returns the column name. For each of the five tables, what is the name of the column containing the country names? Print out the tables or look at them with View
to determine the column.
my_func <- function(x){
names(x)[1]
}
answer <- c( my_func(t1), my_func(t2), my_func(t3), my_func(t4), my_func(t5) )
answer
## [1] "Under five mortality"
## [2] "Life expectancy with projections. Yellow is IHME"
## [3] "Total fertility rate"
## [4] "Total population"
## [5] "GDP (constant 2000 US$)"
In the previous problem we noted that gapminder is inconsistent in naming their country column. Fix this by assigning a common name to this column in the various tables.
the_name <- "country"
names(t1)[1] <- the_name
names(t2)[1] <- the_name
names(t3)[1] <- the_name
names(t4)[1] <- the_name
names(t5)[1] <- the_name
Notice that in these tables, years are represented by columns. We want to create a tidy dataset in which each row is a unit or observation and our 5 values of interest, including the year for that unit, are in the columns. The unit here is a country/year pair and each unit gets values:
value_names
## [1] "child_mortality" "life_expectancy" "fertility" "population"
## [5] "gdp"
We call this the long format. Use the gather
function from the tidyr
package to create a new table for childhood mortality using the long format. Call the new columns year
and child_mortality
gather(t1, key=year, value=child_mortality, -country, convert = TRUE)
## Source: local data frame [59,400 x 3]
##
## country year child_mortality
## (chr) (int) (dbl)
## 1 Abkhazia 1800 NA
## 2 Afghanistan 1800 468.58
## 3 Akrotiri and Dhekelia 1800 NA
## 4 Albania 1800 375.20
## 5 Algeria 1800 460.21
## 6 American Samoa 1800 NA
## 7 Andorra 1800 NA
## 8 Angola 1800 485.68
## 9 Anguilla 1800 NA
## 10 Antigua and Barbuda 1800 473.60
## .. ... ... ...
Now redefine the remaining tables in this way.
t1 <- gather(t1, key=year, value=child_mortality, -country, convert = TRUE)
t2 <- gather(t2, key=year, value=life_expectancy, -country, convert = TRUE)
t3 <- gather(t3, key=year, value=fertility, -country, convert = TRUE)
t4 <- gather(t4, key=year, value=population, -country, convert = TRUE)
t5 <- gather(t5, key=year, value=gdp, -country, convert = TRUE)
Now we want to join all these files together. Make one consolidated table containing all the columns
dat <- t1 %>% full_join(t2) %>% full_join(t3) %>% full_join(t4) %>% full_join(t5)
## Joining by: c("country", "year")
## Joining by: c("country", "year")
## Joining by: c("country", "year")
## Joining by: c("country", "year")
Add a column to the consolidated table containing the continent for each country. Hint: We have created a file that maps countries to continents here. Hint: Learn to use the left_join
function.
map <- read_delim("https://raw.githubusercontent.com/datasciencelabs/data/master/homework_data/continent-info.tsv",col_names=c("country","continent"),delim ="\t")
dat <- left_join(dat, map, continent = continents)
## Joining by: "country"
Advanced solution:
t1 <-read_csv("http://spreadsheets.google.com/pub?key=0ArfEDsV3bBwCcGhBd2NOQVZ1eWowNVpSNjl1c3lRSWc&output=csv")
t2 <- read_csv("http://spreadsheets.google.com/pub?key=phAwcNAVuyj2tPLxKvvnNPA&output=csv")
t3 <- read_csv("http://spreadsheets.google.com/pub?key=phAwcNAVuyj0TAlJeCEzcGQ&output=csv")
t4 <- read_csv("http://spreadsheets.google.com/pub?key=phAwcNAVuyj0XOoBL_n5tAQ&output=csv")
t5 <- read_csv("http://spreadsheets.google.com/pub?key=pyj6tScZqmEfI4sLVvEQtHw&output=csv")
### To keep track of what these are we will define a vector of values
value_names <- c("child_mortality","life_expectancy","fertility","population","gdp")
l <- list(t1,t2,t3,t4,t5)
rm(t1,t2,t3,t4,t5)
gc()
fix_table <- function(tab, val_name){
names(tab)[1] <- "country"
tab <- gather(tab, key=year, value=y, -country, convert = TRUE)
names(tab)[which(names(tab)=="y")] <- val_name
return(tab)
}
for(i in seq_along(l) ) l[[i]] <-fix_table(l[[i]], value_names[i])
dat <- Reduce(full_join, l)
dat <- left_join(dat, map, continent = continents)
Report the child mortalilty rate in 2015 for these 5 pairs:
countries <- c("Sri Lanka","Turkey", "Poland", "South Korea", "Malaysia", "Russia","Pakistan","Vietnam","Thailand","South Africa")
answer <- filter(dat, country%in%countries & year==2015) %>% select(country, child_mortality)
### optionally we can plot them next to each other
answer <- bind_cols( slice( answer, match(countries[ seq(1,9,2)], country)),
slice( answer, match(countries[ seq(2,10,2)], country)))
answer
## Source: local data frame [5 x 4]
##
## country child_mortality country child_mortality
## (chr) (dbl) (chr) (dbl)
## 1 Sri Lanka 8.7 Turkey 13.5
## 2 Poland 5.2 South Korea 3.5
## 3 Malaysia 8.2 Russia 9.6
## 4 Pakistan 81.1 Vietnam 21.7
## 5 Thailand 12.3 South Africa 42.1
To examine if in fact there was a long-life-in-a-small-family and short-life-in-a-large-family dichotomy, we will visualize the average number of children per family (fertility) and the life expectancy for each country.
Use ggplot2
to create a plot of life expectancy versus fertility for 1962 for Africa, Asia, Europe, and the Americas. Use color to denote continent and point size to denote population size:
p <- ggplot( filter(dat, year==1962 & continent != "Oceania"), aes(x=fertility, y=life_expectancy))
p + geom_point( aes( color=continent, size=population))
## Warning: Removed 44 rows containing missing values (geom_point).
Do you see a dichotomy? Explain.
Now we will annotate the plot to show different types of countries.
Learn about OECD and OPEC. Add a couple of columns to your consolidated tables containing a logical vector that tells if a country is OECD and OPEC respectively. It is ok to base membership on 2015.
oecd <- c("Australia","Austria","Belgium","Canada","Chile","Country","Czech Republic","Denmark","Estonia","Finland","France","Germany","Greece","Hungary",
"Iceland","Ireland","Israel","Italy","Japan","Korea","Luxembourg","Mexico","Netherlands","New Zealand","Norway","Poland","Portugal",
"Slovak Republic","Slovenia","Spain","Sweden","Switzerland","Turkey","United Kingdom","United States")
opec <- c("Algeria", "Angola", "Ecuador", "Iran", "Iraq", "Kuwait", "Libya","Nigeria",
"Qatar", "Saudi Arabia", "United Arab Emirates", "Venezuela")
dat <- mutate(dat, oecd = country %in% oecd, opec = country %in% opec)
Make the same plot as in Problem 3.1, but this time use color to annotate the OECD countries and OPEC countries. For countries that are not part of these two organization annotate if they are from Africa, Asia, or the Americas.
group <- ifelse( dat$country%in%oecd , "OECD", dat$continent )
group <- ifelse( dat$country%in%opec , "OPEC", group )
group <- ifelse( group%in%c("Europe","Oceania"), NA, group)
##add group and avoid NAs
dat2 <- mutate(dat, group=group) %>%
filter(!is.na(group) & !is.na(population) & !is.na(fertility) & !is.na(life_expectancy))
p <- ggplot( mutate(dat2, group = group) %>% filter(year==1962),
aes(x=fertility, y=life_expectancy))
p + geom_point( aes( color=group, size=population))
How would you describe the dichotomy?
Explore how this figure changes across time. Show us 4 figures that demonstrate how this figure changes through time.
years <- c(1960, 1980, 2000, 2015)
p <- ggplot( mutate(dat2, group = group) %>% filter(year%in%years),
aes(x=fertility, y=life_expectancy))
p + geom_point( aes( color=group, size=population)) + facet_wrap(~year)
Would you say that the same dichotomy exists today? Explain:
Answer: The dichotomy today is not as clear. If there is a dichotomy it is mainly between sub-saharan Africa and the rest of the world. The transition has been relatively smooth. Extra points: You can see the effects of the AIDS epidemic which may explain why life expectancy is down for some countries in 2000 compared to 1980.
Make an animation with the gganimate
package.
library(gganimate)
p <- ggplot( filter(dat2, year>1950),
aes(x=fertility, y=life_expectancy)) + coord_cartesian(ylim = c(30, 85)) +
geom_point( aes( color=group, size=population, frame=year))
gg_animate(p, "output.mp4")
Having time as a third dimension made it somewhat difficult to see specific country trends. Let’s now focus on specific countries.
Let’s compare France and its former colony Tunisia. Make a plot of fertility versus year with color denoting the country. Do the same for life expecancy. How would you compare Tunisia’s improvement compared to France’s in the past 60 years? Hint: use geom_line
countries <- c("Tunisia","France")
p1 <- ggplot(filter(dat2, country %in% countries & year > 1959),
aes(year,fertility,group=country,color=group)) + geom_line() + theme(legend.position="none")
p2 <- ggplot(filter(dat2, country %in% countries & year > 1959),
aes(year,life_expectancy,group=country,color=group)) + geom_line() + theme(legend.position="none")
library(cowplot) ##this is optional
## Warning: package 'cowplot' was built under R version 3.2.3
##
## Attaching package: 'cowplot'
##
## The following object is masked from 'package:ggplot2':
##
## ggsave
g_legend<-function(a.gplot){
tmp <- ggplot_gtable(ggplot_build(a.gplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
return(legend)
}
pleg <- ggplot( filter(dat2,country%in%countries & year > 1959), aes(year,fertility,group=country,color=group)) +
geom_line()
pleg = g_legend(pleg)
plot_grid(p1, p2, pleg, ncol = 3)
Do the same, but this time compare Vietnam to the OECD countries.
countries <- c("Vietnam",oecd)
p1 <- ggplot( filter(dat2, country%in%countries & year > 1959),
aes(year,fertility,group=country,color=group)) + geom_line() + theme(legend.position="none")
p2 <- ggplot( filter(dat2, country%in%countries & year > 1959),
aes(year,life_expectancy,group=country,color=group)) + geom_line() + theme(legend.position="none")
library(cowplot) ##this is optional
g_legend<-function(a.gplot){
tmp <- ggplot_gtable(ggplot_build(a.gplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
return(legend)
}
pleg <- ggplot( filter(dat2,country%in%countries & year > 1959), aes(year,fertility,group=country,color=group)) + geom_line()
pleg = g_legend(pleg)
plot_grid(p1, p2, pleg,ncol = 3)
We are now going to examine GDP per capita per day.
Create a smooth density estimate of the distribution of GDP per capita per day across countries in 1970. Include OECD, OPEC, Asia, Africa, and the Americas in the computation. When doing this we want to weigh countries with larger populations more. We can do this using the “weight” argument in geom_density
.
dat2 <- mutate(dat, group=group) %>%
filter(!is.na(group) & !is.na(population) & !is.na(gdp))
ggplot(filter(dat2, year==1970 & group%in%c("OECD","OPEC","Asia","Africa","Americas")),
aes(x = gdp/population/365)) +
geom_density(aes(weight=population/sum(population))) + scale_x_log10()
Now do the same but show each of the five groups separately.
dat2 <- mutate(dat, group=group) %>%
filter(!is.na(group) & !is.na(population) & !is.na(gdp))
filter(dat2, year==1970 & group%in%c("OECD","OPEC","Asia","Africa","Americas")) %>%
ggplot(aes(x = gdp/population/365)) +
geom_density(aes(weight=population/sum(population), fill=group), alpha=0.3) + scale_x_log10()
Visualize these densities for several years. Show a couple of of them. Summarize how the distribution has changed through the years.
ggplot(filter(dat2, year%in%c(1970,1990,2010) & group%in%c("OECD","OPEC","Asia","Africa","Americas")),
aes(x = gdp/population/365)) +
geom_density(aes(weight=population/sum(population), fill=group), alpha=0.3) + scale_x_log10() +
facet_grid(year~.)