Joining Tables

2024-09-30

Joining tables

The information we need for a given analysis may not be just in one table.
Here we use a simple examples to illustrate the general challenge of combining tables.

Joining tables

Suppose we want to explore the relationship between population size for US states and electoral votes.
We have the population size in this table:

library(tidyverse) 
library(dslabs) 
head(murders)

       state abb region population total
1    Alabama  AL  South    4779736   135
2     Alaska  AK   West     710231    19
3    Arizona  AZ   West    6392017   232
4   Arkansas  AR  South    2915918    93
5 California  CA   West   37253956  1257
6   Colorado  CO   West    5029196    65

Joining tables

and electoral votes in this one:

head(results_us_election_2016)

         state electoral_votes clinton trump others
1   California              55    61.7  31.6    6.7
2        Texas              38    43.2  52.2    4.5
3      Florida              29    47.8  49.0    3.2
4     New York              29    59.0  36.5    4.5
5     Illinois              20    55.8  38.8    5.4
6 Pennsylvania              20    47.9  48.6    3.6

Joining tables

Just concatenating these two tables together will not work since the order of the states is not the same.

identical(results_us_election_2016$state, murders$state)

[1] FALSE

Joins

The join functions are designed to handle this challenge.
The join functions in the dplyr package make sure that the tables are combined so that matching rows are together.
The general idea is that one needs to identify one or more columns that will serve to match the two tables.
A new table with the combined information is returned.

Joins

Notice what happens if we join the two tables above by state using left_join:

tab <- left_join(murders, results_us_election_2016, by = "state") |> 
  select(-others) |> rename(ev = electoral_votes) 
head(tab)

       state abb region population total ev clinton trump
1    Alabama  AL  South    4779736   135  9    34.4  62.1
2     Alaska  AK   West     710231    19  3    36.6  51.3
3    Arizona  AZ   West    6392017   232 11    45.1  48.7
4   Arkansas  AR  South    2915918    93  6    33.7  60.6
5 California  CA   West   37253956  1257 55    61.7  31.6
6   Colorado  CO   West    5029196    65  9    48.2  43.3

Joins

The data has been successfully joined and we can now, for example, make a plot to explore the relationship:

Joins

We see the relationship is close to linear with about 2 electoral votes for every million persons, but with very small states getting higher ratios.
In practice, it is not always the case that each row in one table has a matching row in the other.
For this reason, we have several versions of join.
To illustrate this challenge, we will take subsets of the tables above.

Joins

Note: These names are based on SQL.

Joins

We create the tables tab1 and tab2 so that they have some states in common but not all:

tab_1 <- slice(murders, 1:6) |> select(state, population) 
tab_2 <- results_us_election_2016 |>  
  filter(state %in% c("Alabama", "Alaska", "Arizona",  
                    "California", "Connecticut", "Delaware")) |>  
  select(state, electoral_votes) |> rename(ev = electoral_votes)

We will use these two tables as examples in the next sections.

Left join

Suppose we want a table like tab_1, but adding electoral votes to whatever states we have available.
For this, we use left_join with tab_1 as the first argument.
We specify which column to use to match with the by argument.

left_join(tab_1, tab_2, by = "state")

       state population ev
1    Alabama    4779736  9
2     Alaska     710231  3
3    Arizona    6392017 11
4   Arkansas    2915918 NA
5 California   37253956 55
6   Colorado    5029196 NA

Note that NAs are added to the two states not appearing in tab_2.

Left join

Also, notice that this function, as well as all the other joins, can receive the first arguments through the pipe:

tab_1 |> left_join(tab_2, by = "state")

Right join

If instead of a table with the same rows as first table, we want one with the same rows as second table, we can use right_join:

tab_1 |> right_join(tab_2, by = "state")

        state population ev
1     Alabama    4779736  9
2      Alaska     710231  3
3     Arizona    6392017 11
4  California   37253956 55
5 Connecticut         NA  7
6    Delaware         NA  3

Now the NAs are in the column coming from tab_1.

Inner join

If we want to keep only the rows that have information in both tables, we use inner_join.
You can think of this as an intersection:

inner_join(tab_1, tab_2, by = "state")

       state population ev
1    Alabama    4779736  9
2     Alaska     710231  3
3    Arizona    6392017 11
4 California   37253956 55

Full join

If we want to keep all the rows and fill the missing parts with NAs, we can use full_join.
You can think of this as a union:

full_join(tab_1, tab_2, by = "state")

        state population ev
1     Alabama    4779736  9
2      Alaska     710231  3
3     Arizona    6392017 11
4    Arkansas    2915918 NA
5  California   37253956 55
6    Colorado    5029196 NA
7 Connecticut         NA  7
8    Delaware         NA  3

Semi join

The semi_join function lets us keep the part of first table for which we have information in the second.
It does not add the columns of the second:

semi_join(tab_1, tab_2, by = "state")

       state population
1    Alabama    4779736
2     Alaska     710231
3    Arizona    6392017
4 California   37253956

Anti join

The function anti_join is the opposite of semi_join.
It keeps the elements of the first table for which there is no information in the second:

anti_join(tab_1, tab_2, by = "state")

     state population
1 Arkansas    2915918
2 Colorado    5029196

Binding

Although we have yet to use it in this book, another common way in which datasets are combined is by binding them.
Unlike the join function, the binding functions do not try to match by a variable, but instead simply combine datasets.
If the datasets don’t match by the appropriate dimensions, one obtains an error.

Binding columns

The dplyr function bind_cols binds two objects by making them columns in a tibble.

bind_cols(a = 1:3, b = 4:6)

# A tibble: 3 × 2
      a     b
  <int> <int>
1     1     4
2     2     5
3     3     6

This function requires that we assign names to the columns. Here we chose a and b.
Note that there is an R-base function cbind with the exact same functionality.

Binding columns

An important difference is that cbind can create different types of objects, while bind_cols always produces a data frame.
bind_cols can also bind two different data frames.

Binding columns

For example, here we break up the tab data frame and then bind them back together:

tab_1 <- tab[, 1:3] 
tab_2 <- tab[, 4:6] 
tab_3 <- tab[, 7:8] 
new_tab <- bind_cols(tab_1, tab_2, tab_3) 
head(new_tab)

       state abb region population total ev clinton trump
1    Alabama  AL  South    4779736   135  9    34.4  62.1
2     Alaska  AK   West     710231    19  3    36.6  51.3
3    Arizona  AZ   West    6392017   232 11    45.1  48.7
4   Arkansas  AR  South    2915918    93  6    33.7  60.6
5 California  CA   West   37253956  1257 55    61.7  31.6
6   Colorado  CO   West    5029196    65  9    48.2  43.3

Binding by rows

The bind_rows function is similar to bind_cols, but binds rows instead of columns:

tab_1 <- tab[1:2,] 
tab_2 <- tab[3:4,] 
bind_rows(tab_1, tab_2)

     state abb region population total ev clinton trump
1  Alabama  AL  South    4779736   135  9    34.4  62.1
2   Alaska  AK   West     710231    19  3    36.6  51.3
3  Arizona  AZ   West    6392017   232 11    45.1  48.7
4 Arkansas  AR  South    2915918    93  6    33.7  60.6

This is based on an R-base function rbind.

Set operators

Another set of commands useful for combining datasets are the set operators.
When applied to vectors, these behave as their names suggest.
Examples are intersect, union, setdiff, and setequal.
However, if the tidyverse, or more specifically dplyr, is loaded, these functions can be used on data frames as opposed to just on vectors.

Intersect

You can take intersections of vectors of any type, such as numeric:

intersect(1:10, 6:15)

[1]  6  7  8  9 10

or characters:

intersect(c("a","b","c"), c("b","c","d"))

[1] "b" "c"

dplyr includes an intersect function that can be applied to tables with the same column names.
This function returns the rows in common between two tables.

Intersect

To make sure we use the dplyr version of intersect rather than the base R version, we can use dplyr::intersect like this:

tab_1 <- tab[1:5,] 
tab_2 <- tab[3:7,] 
dplyr::intersect(tab_1, tab_2)

       state abb region population total ev clinton trump
1    Arizona  AZ   West    6392017   232 11    45.1  48.7
2   Arkansas  AR  South    2915918    93  6    33.7  60.6
3 California  CA   West   37253956  1257 55    61.7  31.6

Union

Similarly union takes the union of vectors.
For example:

union(1:10, 6:15)

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

union(c("a","b","c"), c("b","c","d"))

[1] "a" "b" "c" "d"

Union

dplyr includes a version of union that combines all the rows of two tables with the same column names.

tab_1 <- tab[1:5,] 
tab_2 <- tab[3:7,] 
dplyr::union(tab_1, tab_2)

        state abb    region population total ev clinton trump
1     Alabama  AL     South    4779736   135  9    34.4  62.1
2      Alaska  AK      West     710231    19  3    36.6  51.3
3     Arizona  AZ      West    6392017   232 11    45.1  48.7
4    Arkansas  AR     South    2915918    93  6    33.7  60.6
5  California  CA      West   37253956  1257 55    61.7  31.6
6    Colorado  CO      West    5029196    65  9    48.2  43.3
7 Connecticut  CT Northeast    3574097    97  7    54.6  40.9

`setdiff`

The set difference between a first and second argument can be obtained with setdiff.
Unlike intersect and union, this function is not symmetric:

setdiff(1:10, 6:15)

[1] 1 2 3 4 5

setdiff(6:15, 1:10)

[1] 11 12 13 14 15

`setdiff`

As with the functions shown above, dplyr has a version for data frames:

tab_1 <- tab[1:5,] 
tab_2 <- tab[3:7,] 
dplyr::setdiff(tab_1, tab_2)

    state abb region population total ev clinton trump
1 Alabama  AL  South    4779736   135  9    34.4  62.1
2  Alaska  AK   West     710231    19  3    36.6  51.3

`setequal`

Finally, the function setequal tells us if two sets are the same, regardless of order.
So notice that:

setequal(1:5, 1:6)

[1] FALSE

but:

setequal(1:5, 5:1)

[1] TRUE

`setequal`

The dplyr version checks whether data frames are equal, regardless of order of rows or columns:

dplyr::setequal(tab_1, tab_2)

[1] FALSE

Joining with data.table

The data.table package includes merge, a very efficient function for joining tables.
In tidyverse we joined two tables with left_join:

tab <- left_join(murders, results_us_election_2016, by = "state")

Joining with data.table

In data.table the merge functions works similarly:

library(data.table) 
tab <- merge(murders, results_us_election_2016, by = "state", all.x = TRUE)

Joining with data.table

Instead of defining different functions for the different type of joins, merge uses the the logical arguments all (full join), all.x (left join), and all.y (right join).