9 data.table
The data.table package is very powerful and the best package for dealing with large datasets in R. In this course we use the tidyverse because it is easier for beginners. But if you plan to analyze large datasets in the future, we highly recommend learning this package.
This guide is an excellent place to start.
This blog post is really useful for those who already know the tidyverse. Here we just show some examples.
9.1 Manipulating data tables
data.table is a separate package that needs to be installed. Once installed, we then need to load it:

library(data.table)
We will provide example code showing the data.table approaches to dplyr's mutate, filter, select, group_by, and summarize shown in the tidyverse chapter. As in that chapter, we will use the murders dataset:
The first step when using data.table is to convert the data frame into a data.table object using the as.data.table function:

murders_dt <- as.data.table(murders)
Without this initial step, most of the approaches shown below will not work.
9.1.1 Selecting
Selecting with data.table is done in a similar way to subsetting matrices. While with dplyr we write:
select(murders, state, region)
In data.table, we use notation similar to what is used with matrices:
murders_dt[, c("state", "region")] |> head()
state region
1: Alabama South
2: Alaska West
3: Arizona West
4: Arkansas South
5: California West
6: Colorado West
We can also use the .() data.table notation to alert R that variables inside the parentheses are column names, not objects in the R environment. So the above can also be written like this:

murders_dt[, .(state, region)] |> head()
state region
1: Alabama South
2: Alaska West
3: Arizona West
4: Arkansas South
5: California West
6: Colorado West
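Both selection styles can be tried on a small made-up table (the data below is illustrative, not the murders dataset):

```r
library(data.table)

# Small illustrative table (not the murders dataset)
dt <- data.table(state = c("Alabama", "Alaska", "Arizona"),
                 region = c("South", "West", "West"),
                 total = c(135, 19, 232))

# Character vector of column names, as with matrices
dt[, c("state", "region")]

# The .() notation: unquoted names are read as column names
dt[, .(state, region)]
```

Both calls return the same two-column data.table.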
9.1.2 Adding a column or changing columns
We learned to use the dplyr mutate function with this example:

murders <- mutate(murders, rate = total / population * 100000)
data.table uses an approach that avoids a new assignment (update by reference). This can help with large datasets that take up most of your computer's memory. The data.table := function permits us to do this:

murders_dt[, rate := total / population * 100000]
This adds a new column, rate, to the table. Notice that, as in dplyr, we used total and population without quotes.
We can see that the new column is added:
head(murders_dt)
state abb region population total rate
1: Alabama AL South 4779736 135 2.824424
2: Alaska AK West 710231 19 2.675186
3: Arizona AZ West 6392017 232 3.629527
4: Arkansas AR South 2915918 93 3.189390
5: California CA West 37253956 1257 3.374138
6: Colorado CO West 5029196 65 1.292453
To define multiple new columns, we can use the := function with multiple arguments:

murders_dt[, ":="(rate = total / population * 100000, rank = rank(population))]
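The update-by-reference behavior of := can be seen on a small made-up table (the numbers below are illustrative):

```r
library(data.table)

# Illustrative data, not the murders dataset
dt <- data.table(total = c(135, 19), population = c(4779736, 710231))

# := modifies dt in place; no assignment with <- is needed
dt[, rate := total / population * 100000]

# The functional form ":="() adds several columns at once
dt[, ":="(rate = total / population * 100000,
          rank = rank(population))]
```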
9.1.3 Technical detail: reference versus copy
The data.table package is designed to avoid wasting memory. So if you make a copy of a table, like this:
x <- data.table(a = 1)
y <- x

y is actually referencing x; it is not a new object, just another name for x. Until you change y, a new object will not be made. However, the := function changes by reference, so if you change x, a new object is not made and y continues to be just another name for x:
x[, a := 2]
y
a
1: 2
You can also change x like this:

y[, a := 1]
x
a
1: 1
To avoid this, you can use the copy function, which forces the creation of an actual copy:

x <- data.table(a = 1)
y <- copy(x)
x[, a := 2]
y
a
1: 1
Note that the function as.data.table creates a copy of the data frame being converted. However, when working with a large data frame it is helpful to avoid this, and you can do so by using setDT:

x <- data.frame(a = 1)
setDT(x)
However, because no copy is being made, be aware that the following code does not create a new object:

x <- data.frame(a = 1)
y <- setDT(x)
The objects x and y are referencing the same data table:

x[, a := 2]
y
a
1: 2
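One way to check whether two names refer to the same table is data.table's address() function, which reports an object's memory location; a minimal sketch with made-up data:

```r
library(data.table)

x <- data.table(a = 1)
y <- x        # another name for the same table
z <- copy(x)  # an actual, independent copy

# address() reports where an object lives in memory
address(x) == address(y)  # same object
address(x) == address(z)  # different objects

z[, a := 99]  # modifying the copy...
x$a           # ...leaves x unchanged
```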
9.1.4 Subsetting
With dplyr, we filtered like this:
filter(murders, rate <= 0.7)
With data.table, we again use an approach similar to subsetting matrices, except data.table knows that rate refers to a column name and not an object in the R environment:

murders_dt[rate <= 0.7]
state abb region population total rate rank
1: Hawaii HI West 1360301 7 0.5145920 12
2: Iowa IA North Central 3046355 21 0.6893484 22
3: New Hampshire NH Northeast 1316470 5 0.3798036 10
4: North Dakota ND North Central 672591 4 0.5947151 4
5: Vermont VT Northeast 625741 2 0.3196211 3
Notice that we can combine the filter and select into one succinct command. Here are the state names and rates for those with rates below 0.7:

murders_dt[rate <= 0.7, .(state, rate)]
state rate
1: Hawaii 0.5145920
2: Iowa 0.6893484
3: New Hampshire 0.3798036
4: North Dakota 0.5947151
5: Vermont 0.3196211
Compare to the dplyr approach:
murders |> filter(rate <= 0.7) |> select(state, rate)
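The same filter-and-select pattern works on any data.table; a sketch with made-up numbers:

```r
library(data.table)

# Illustrative data, not the murders dataset
dt <- data.table(state = c("Hawaii", "Iowa", "Texas"),
                 rate = c(0.51, 0.69, 3.20))

# Row condition first, column selection second, in one call
small <- dt[rate <= 0.7, .(state, rate)]
small
```

Only the two rows with rate at or below 0.7 survive.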
9.2 Summarizing data
As an example, we will use the heights dataset:
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:data.table':
between, first, last
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(dslabs)
heights_dt <- as.data.table(heights)
In data.table, we can call functions inside .() and they will be applied to rows. So the equivalent of:

s <- heights |>
  summarize(average = mean(height), standard_deviation = sd(height))

in dplyr is the following in data.table:

s <- heights_dt[, .(average = mean(height), standard_deviation = sd(height))]
Note that this permits a compact way of subsetting and then summarizing. Instead of:
s <- heights |>
  filter(sex == "Female") |>
  summarize(average = mean(height), standard_deviation = sd(height))

we can write:

s <- heights_dt[sex == "Female", .(average = mean(height), standard_deviation = sd(height))]
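Here is the same subset-then-summarize pattern on a small made-up table, so the arithmetic can be checked by hand:

```r
library(data.table)

# Illustrative heights, not the dslabs dataset
dt <- data.table(sex = c("Female", "Female", "Male"),
                 height = c(64, 66, 70))

# Subset rows, then compute summaries, in one call
s <- dt[sex == "Female", .(average = mean(height),
                           standard_deviation = sd(height))]
s
```

The two female heights are 64 and 66, so the average is 65.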
9.2.1 Multiple summaries
We previously defined the function:
median_min_max <- function(x){
  qs <- quantile(x, c(0.5, 0, 1))
  data.frame(median = qs[1], minimum = qs[2], maximum = qs[3])
}
Similar to dplyr, we can call this function within .() to obtain the three number summary:

heights_dt[, .(median_min_max(height))]
median minimum maximum
1: 68.5 50 82.67717
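A side note: inside a data.table call, a summary function that returns a named list is the idiomatic choice (a data.frame also works, as above). A sketch with made-up values that are easy to verify:

```r
library(data.table)

# Returning a named list is idiomatic inside a data.table call
median_min_max <- function(x){
  qs <- quantile(x, c(0.5, 0, 1))
  list(median = qs[1], minimum = qs[2], maximum = qs[3])
}

# Illustrative heights
dt <- data.table(height = c(50, 60, 70, 80))
dt[, median_min_max(height)]
```

For the four values above, the median is 65, the minimum 50, and the maximum 80.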
9.2.2 Group then summarize
The group_by followed by summarize in dplyr is performed in one line in data.table. We simply add the by argument to split the data into groups based on the values in a categorical variable:

heights_dt[, .(average = mean(height), standard_deviation = sd(height)), by = sex]
sex average standard_deviation
1: Male 69.31475 3.611024
2: Female 64.93942 3.760656
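The by argument can be tried on a small made-up table where the group means are easy to verify:

```r
library(data.table)

# Illustrative data with two groups
dt <- data.table(sex = c("Male", "Male", "Female", "Female"),
                 height = c(68, 72, 63, 65))

# Split by sex, then compute the mean of each group, in one call
dt[, .(average = mean(height)), by = sex]
```

The male heights average 70 and the female heights average 64.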
9.3 Sorting data frames
We can order rows using the same approach we use for filter. Here are the states ordered by population size:

murders_dt[order(population)] |> head()
state abb region population total rate rank
1: Wyoming WY West 563626 5 0.8871131 1
2: District of Columbia DC South 601723 99 16.4527532 2
3: Vermont VT Northeast 625741 2 0.3196211 3
4: North Dakota ND North Central 672591 4 0.5947151 4
5: Alaska AK West 710231 19 2.6751860 5
6: South Dakota SD North Central 814180 8 0.9825837 6
To sort the table in descending order, we can order by the negative of population or use the decreasing argument:

murders_dt[order(population, decreasing = TRUE)]
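data.table also offers setorder(), which, in the spirit of :=, sorts the table by reference instead of returning a reordered copy; a sketch with made-up data:

```r
library(data.table)

dt <- data.table(state = c("B", "A", "C"), population = c(2, 3, 1))

# order() inside [ ] returns a reordered result, leaving dt as is
dt[order(population, decreasing = TRUE)]

# setorder() reorders dt itself; a minus sign means descending
setorder(dt, -population)
dt
```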
9.3.1 Nested sorting
Similarly, we can perform nested ordering by including more than one variable in order:

murders_dt[order(region, rate)]
9.4 Optional exercises (will not be included in the midterms)
- Load the data.table package and the murders dataset and convert it to a data.table object:
library(data.table)
library(dslabs)
murders_dt <- as.data.table(murders)
Remember you can add columns like this:
murders_dt[, population_in_millions := population / 10^6]
Add a column named rate to murders_dt with the per 100,000 murder rate as in the example code above.
Add a column rank containing the rank, from highest to lowest murder rate.
- If we want to only show the states and population sizes, we can use:
murders_dt[, .(state, population)]
Show the state names and abbreviations in murders.
- You can show just the New York row like this:
murders_dt[state == "New York"]
You can use other logical vectors to filter rows.
Show the top 5 states with the highest murder rates. After we add murder rate and rank, do not change the murders dataset, just show the result. Remember that you can filter based on the rank column.
- We can remove rows using the != operator. For example, to remove Florida, we would do this:

no_florida <- murders_dt[state != "Florida"]
Create a new data frame called no_south that removes states from the South region. How many states are in this category? You can use the function nrow for this.
- We can also use %in% to filter. You can therefore see the data from New York and Texas as follows:

murders_dt[state %in% c("New York", "Texas")]
Create a new data frame called murders_nw with only the states from the Northeast and the West. How many states are in this category?
- Suppose you want to live in the Northeast or West and want the murder rate to be less than 1. We want to see the data for the states satisfying these conditions. Note that you can use logical operators when filtering. Here is an example in which we filter to keep only small states in the Northeast region:

murders_dt[population < 5000000 & region == "Northeast"]
Make sure murders has been defined with rate and rank and still has all states. Create a table called my_states that contains rows for states satisfying both conditions: they are in the Northeast or West and the murder rate is less than 1. Show only the state name, the rate, and the rank.