8 Importing data
8.1 R base functions
We include example data files for practice in the dslabs package. They are stored here:
dir <- system.file("extdata", package = "dslabs")
Take a look:
list.files(dir)
[1] "2010_bigfive_regents.xls"
[2] "calificaciones.csv"
[3] "carbon_emissions.csv"
[4] "fertility-two-countries-example.csv"
[5] "HRlist2.txt"
[6] "life-expectancy-and-fertility-two-countries-example.csv"
[7] "murders.csv"
[8] "olive.csv"
[9] "RD-Mortality-Report_2015-18-180531.pdf"
[10] "ssa-death-probability.csv"
Copy one of them to your working directory:
file_path <- file.path(dir, "murders.csv")
file.copy(file_path, "murders.csv")
[1] TRUE
The file.path function combines characters to form a complete path, ensuring compatibility with the respective operating system: Linux and Mac use forward slashes / to separate directories, while Windows uses backslashes \. This function is useful because you often want to define paths using a variable.
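For instance, the components below (directory names made up for illustration) are joined into a single path:

```r
# file.path joins path components with the file separator, so you can
# build paths from variables instead of pasting strings together.
# The directory names here are hypothetical.
file.path("project", "data", "murders.csv")
# [1] "project/data/murders.csv"
```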
The file.copy function copies a file and returns TRUE if successful. If the destination file already exists, it will not be copied over and the function returns FALSE, unless you set overwrite = TRUE.
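A small sketch of this behavior, using temporary files so nothing in your working directory is touched:

```r
# file.copy returns TRUE on success and FALSE if the destination
# already exists; overwrite = TRUE forces the copy anyway.
src <- tempfile(fileext = ".csv")
dst <- tempfile(fileext = ".csv")
writeLines("state,abb", src)
file.copy(src, dst)                    # TRUE: first copy succeeds
file.copy(src, dst)                    # FALSE: destination exists
file.copy(src, dst, overwrite = TRUE)  # TRUE: forced to overwrite
```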
What kind of file is it? Although the suffix usually tells us what type of file it is, there is no guarantee that these always match.
readLines("murders.csv", n = 3)
[1] "state,abb,region,population,total" "Alabama,AL,South,4779736,135"
[3] "Alaska,AK,West,710231,19"
It is comma delimited and has a header. You can import it like this:
dat <- read.csv("murders.csv")
There are other importing functions in base R: read.table, read.csv, and read.delim, for example.
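These are all read.table with different defaults: read.csv sets sep = "," and header = TRUE, and read.delim sets sep = "\t". A minimal sketch, parsing an inline string via the text argument instead of a file:

```r
# read.csv is read.table with sep = "," and header = TRUE;
# the text argument lets us parse a literal string for illustration.
txt <- "state,abb\nAlabama,AL\nAlaska,AK"
dat <- read.csv(text = txt)
dat$abb  # "AL" "AK"
```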
8.2 The readr and readxl packages
The tidyverse provides improved versions of the base functions for importing data.
8.2.1 readr
The readr package includes functions for reading data stored in text-file spreadsheets into R. readr is part of the tidyverse, but you can also load it directly using:
library(readr)
The following functions are available to read-in spreadsheets:
Function | Format | Typical suffix |
---|---|---|
read_table | white space separated values | txt |
read_csv | comma separated values | csv |
read_csv2 | semicolon separated values | csv |
read_tsv | tab delimited separated values | tsv |
read_delim | general text file format, must define delimiter | txt |
The readr equivalent of readLines is read_lines:
read_lines("murders.csv", n_max = 3)
[1] "state,abb,region,population,total" "Alabama,AL,South,4779736,135"
[3] "Alaska,AK,West,710231,19"
From the .csv suffix and the peek at the file, we know to use read_csv:
dat <- read_csv("murders.csv")
Rows: 51 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): state, abb, region
dbl (2): population, total
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note that we receive a message letting us know what data types were used for each column. Also note that dat is a tibble, not just a data frame, because read_csv is a tidyverse parser.
A powerful added feature of read_csv is the col_types argument, which lets you specify the data type of each column before reading. This can help with parsing dates, or keep an error, such as a stray letter in a column of numbers, from turning the whole column into characters.
8.2.2 readxl
Many spreadsheets are saved in Microsoft Excel format. For this we use parsers in the readxl package:
library(readxl)
The package provides functions to read-in Microsoft Excel formats:
Function | Format | Typical suffix |
---|---|---|
read_excel | auto detect the format | xls, xlsx |
read_xls | original format | xls |
read_xlsx | new format | xlsx |
The Microsoft Excel formats permit you to have more than one spreadsheet in one file. These are referred to as sheets. The functions listed above read the first sheet by default, but we can also read the others. The excel_sheets function gives us the names of all the sheets in an Excel file. These names can then be passed to the sheet argument in the three functions above to read sheets other than the first.
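The pattern looks like this; reports.xlsx is a hypothetical file name used only to illustrate:

```r
library(readxl)

# "reports.xlsx" is hypothetical; substitute the path to your own file.
sheets <- excel_sheets("reports.xlsx")                # names of all sheets
dat <- read_excel("reports.xlsx", sheet = sheets[2])  # read the second sheet
```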
8.3 Downloading files
A common place for data to reside is on the internet. When these data are in files, we can download them and then import them or even read them directly from the web.
url <- "https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/murders.csv"
The read_csv function can read these files directly:
dat <- read_csv(url)
You can also download the file first using download.file, or the Unix commands curl or wget.
tmp_filename <- tempfile()
download.file(url, tmp_filename)
dat <- read_csv(tmp_filename)
Rows: 51 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): state, abb, region
dbl (2): population, total
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
file.remove(tmp_filename)
[1] TRUE
8.4 Encoding
RStudio assumes the Unicode encoding. A common pitfall in data analysis is assuming a file is Unicode when, in fact, it is something else.
To understand encoding, remember that everything on a computer needs to eventually be converted to 0s and 1s. ASCII is an encoding that maps characters to numbers. ASCII uses 7 bits (0s and 1s), which results in \(2^7 = 128\) unique items, enough to encode all the characters on an English language keyboard. However, other languages use characters not included in this encoding. For example, the é in México is not encoded by ASCII. For this reason, a new encoding, using more than 7 bits, was defined: Unicode. When using Unicode, one can choose between 8, 16, and 32 bits, abbreviated UTF-8, UTF-16, and UTF-32, respectively. RStudio defaults to UTF-8 encoding. ASCII is a subset of UTF-8.
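Base R's iconv function converts between encodings, which makes the México example concrete: in Latin-1 the é is the single byte 0xE9, which is not valid UTF-8 on its own:

```r
# "M\xe9xico" is "México" as stored in Latin-1 (byte 0xE9 encodes é);
# iconv re-encodes the bytes as UTF-8.
x <- "M\xe9xico"
iconv(x, from = "latin1", to = "UTF-8")
# [1] "México"
```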
Try reading in this file:
url <- "https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/calificaciones.csv"
readLines(url, n = 2)
[1] "\"nombre\",\"f.n.\",\"estampa\",\"puntuaci\xf3n\""
[2] "\"Beyonc\xe9\",\"04 de septiembre de 1981\",2023-09-22 02:11:02,\"87,5\""
When you see these weird characters, the problem is almost always that you are assuming the wrong encoding. You do not need to be a hacker to figure out the right one: readr has a function that tries to guess it:
guess_encoding(url)
# A tibble: 3 × 2
encoding confidence
<chr> <dbl>
1 ISO-8859-1 0.92
2 ISO-8859-2 0.72
3 ISO-8859-9 0.53
The first guess makes sense, as Spanish is often saved using the Latin-1 encoding, also known as ISO-8859-1, because it was among the first encodings to include the accents and other characters used in Spanish. Once we figure this out, we can read in the file correctly:
read_csv(url, locale = locale(encoding = "ISO-8859-1", decimal_mark = ","))
Rows: 7 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): nombre, f.n.
dbl (1): puntuación
dttm (1): estampa
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 7 × 4
nombre f.n. estampa puntuación
<chr> <chr> <dttm> <dbl>
1 Beyoncé 04 de septiembre de 1981 2023-09-22 02:11:02 87.5
2 Blümchen 20 de abril de 1980 2023-09-22 03:23:05 99
3 João 10 de junio de 1931 2023-09-21 22:43:28 98.9
4 López 24 de julio de 1969 2023-09-22 01:06:59 88.7
5 Ñengo 15 de diciembre de 1981 2023-09-21 23:35:37 93.1
6 Plácido 24 de enero de 1941 2023-09-21 23:17:21 88.7
7 Thalía 26 de agosto de 1971 2023-09-21 23:08:02 83
8.5 Exercises
- Use the read_csv function to read each of the csv files that the following code saves in the files object. Hint: use the pattern argument in list.files to keep only the csv files.
library(readr)
path <- system.file("extdata", package = "dslabs")
files <- list.files(path, pattern = ".csv")
res <- lapply(files, function(fn)
  read_csv(file.path(path, fn), show_col_types = FALSE))
New names:
• `` -> `...1`
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
- Note that you get a warning. To see which file causes it, run read_csv on each file one-by-one in a loop:
for (i in seq_along(files)) {
print(files[i])
read_csv(file.path(path, files[i]), show_col_types = FALSE)
}
[1] "calificaciones.csv"
[1] "carbon_emissions.csv"
[1] "fertility-two-countries-example.csv"
[1] "life-expectancy-and-fertility-two-countries-example.csv"
[1] "murders.csv"
[1] "olive.csv"
New names:
• `` -> `...1`
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
[1] "ssa-death-probability.csv"
olive.csv gives us a New names warning. This is because the first line of the file is missing the header for the first column.
read_lines(file.path(path, "olive.csv"), n_max = 2)
[1] ",Region,Area,palmitic,palmitoleic,stearic,oleic,linoleic,linolenic,arachidic,eicosenoic"
[2] "1,North-Apulia,1,1,1075,75,226,7823,672,36,60,29"
Read the help file for read_csv to figure out how to read in the file without reading this header. If you skip the header, you should not get this warning. Save the result to an object called dat.
dat <- read_csv(file.path(path, "olive.csv"), col_names = FALSE, skip = 1)
dat
Rows: 572 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): X2
dbl (11): X1, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 572 × 12
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 North-Apul… 1 1 1075 75 226 7823 672 36 60 29
2 2 North-Apul… 1 1 1088 73 224 7709 781 31 61 29
3 3 North-Apul… 1 1 911 54 246 8113 549 31 63 29
4 4 North-Apul… 1 1 966 57 240 7952 619 50 78 35
5 5 North-Apul… 1 1 1051 67 259 7771 672 50 80 46
6 6 North-Apul… 1 1 911 49 268 7924 678 51 70 44
7 7 North-Apul… 1 1 922 66 264 7990 618 49 56 29
8 8 North-Apul… 1 1 1100 61 235 7728 734 39 64 35
9 9 North-Apul… 1 1 1082 60 239 7745 709 46 83 33
10 10 North-Apul… 1 1 1037 55 213 7944 633 26 52 30
# ℹ 562 more rows
- A problem with the previous approach is that we don't know what the columns represent. Type names(dat) to see that the names are not informative. Use read_lines with the argument n_max = 1 to read just the first line.
read_lines(file.path(path, "olive.csv"), n_max = 1)
[1] ",Region,Area,palmitic,palmitoleic,stearic,oleic,linoleic,linolenic,arachidic,eicosenoic"
Notice that you can use this first line to assign names to the data frame:
colnames <- read_lines(file.path(path, "olive.csv"), n_max = 1)
colnames <- strsplit(colnames, ",") |> unlist()
colnames[1] <- "row_number"
names(dat) <- colnames
Warning: The `value` argument of `names<-` must have the same length as `x` as of tibble
3.0.0.