8 Importing data
8.1 R base functions
We include example data files for practice in the dslabs package. They are stored here:
dir <- system.file("extdata", package = "dslabs")
Take a look:
list.files(dir)
[1] "2010_bigfive_regents.xls"
[2] "calificaciones.csv"
[3] "carbon_emissions.csv"
[4] "fertility-two-countries-example.csv"
[5] "HRlist2.txt"
[6] "life-expectancy-and-fertility-two-countries-example.csv"
[7] "murders.csv"
[8] "olive.csv"
[9] "RD-Mortality-Report_2015-18-180531.pdf"
[10] "ssa-death-probability.csv"
Copy one of them to your working directory:
file_path <- file.path(dir, "murders.csv")
file.copy(file_path, "murders.csv")
[1] TRUE
The file.path function combines characters to form a complete path, ensuring compatibility with the respective operating system: Linux and Mac use forward slashes /, while Windows uses backslashes \, to separate directories. This function is useful because you often want to define paths using a variable.
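As a small sketch of how this works (the directory and file names here are made up for illustration):

```r
# file.path joins its arguments with the separator for the current OS,
# stored in .Platform$file.sep
path <- file.path("project", "data", "murders.csv")
path
```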
The file.copy function copies a file and returns TRUE if successful. If the destination file already exists, it will not be copied unless you set overwrite = TRUE.
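A minimal sketch of this behavior, using temporary files so it runs anywhere:

```r
from <- tempfile(fileext = ".csv")
to   <- tempfile(fileext = ".csv")
writeLines("a,b", from)                  # create a small file to copy

file.copy(from, to)                      # TRUE: the copy succeeded
file.copy(from, to)                      # FALSE: 'to' already exists
file.copy(from, to, overwrite = TRUE)    # TRUE: overwrite the existing copy
```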
What kind of file is it? Although the suffix usually tells us what type of file it is, there is no guarantee that these always match.
readLines("murders.csv", n = 3)
[1] "state,abb,region,population,total" "Alabama,AL,South,4779736,135"
[3] "Alaska,AK,West,710231,19"
It is comma delimited and has a header. You can import it like this:
dat <- read.csv("murders.csv")
There are other importing functions in base R: read.table, read.csv, and read.delim, for example.
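These base functions are wrappers around read.table with different defaults. A quick sketch, using the text argument (an inline string rather than a file) so the example is self-contained:

```r
csv_text <- "state,abb\nAlabama,AL\n"
dat1 <- read.csv(text = csv_text)
# read.csv is read.table with sep = "," and header = TRUE
dat2 <- read.table(text = csv_text, sep = ",", header = TRUE)
all.equal(dat1, dat2)   # the two calls parse the data the same way
```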
8.2 The readr and readxl packages
Tidyverse has improved versions of functions for importing data.
8.2.1 readr
The readr package includes functions for reading data stored in text file spreadsheets into R. readr is part of the tidyverse package, but you can load it directly using:
library(readr)
The following functions are available to read in spreadsheets:
| Function | Format | Typical suffix |
|---|---|---|
| read_table | white space separated values | txt |
| read_csv | comma separated values | csv |
| read_csv2 | semicolon separated values | csv |
| read_tsv | tab separated values | tsv |
| read_delim | general text file format, must define delimiter | txt |
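As a sketch of the most general of these, read_delim with an explicit delimiter; here I() wraps a literal string so no file is needed (this use of I() requires readr 2.0 or later):

```r
library(readr)

# read a pipe-delimited table from an inline string
dat <- read_delim(I("state|abb\nAlabama|AL\n"), delim = "|",
                  show_col_types = FALSE)
dat$abb
```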
The readr equivalent of readLines is read_lines:
read_lines("murders.csv", n_max = 3)
[1] "state,abb,region,population,total" "Alabama,AL,South,4779736,135"
[3] "Alaska,AK,West,710231,19"
From the .csv suffix and the peek at the file, we know to use read_csv:
dat <- read_csv("murders.csv")
Rows: 51 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): state, abb, region
dbl (2): population, total
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note that we receive a message letting us know what data types were used for each column. Also note that dat is a tibble, not just a data frame. This is because read_csv is a tidyverse parser.
A powerful added feature of read_csv is the col_types argument, which lets you specify the data type of each column before reading. This can help with parsing dates, or keep an error, such as a letter in a column of numbers, from turning everything into characters.
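A sketch of col_types in action, again using an inline string via I() so it is self-contained; the column names here are illustrative:

```r
library(readr)

# declare each column's type up front rather than letting readr guess
dat <- read_csv(I("state,population\nAlabama,4779736\n"),
                col_types = cols(state = col_character(),
                                 population = col_double()))
```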
8.2.2 readxl
Many spreadsheets are saved in Microsoft Excel format. For this we use parsers in the readxl package:
library(readxl)
The package provides functions to read in Microsoft Excel formats:
| Function | Format | Typical suffix |
|---|---|---|
| read_excel | auto detect the format | xls, xlsx |
| read_xls | original format | xls |
| read_xlsx | new format | xlsx |
The Microsoft Excel formats permit you to have more than one spreadsheet in one file. These are referred to as sheets. The functions listed above read the first sheet by default, but we can also read the others. The excel_sheets function gives us the names of all the sheets in an Excel file. These names can then be passed to the sheet argument in the three functions above to read sheets other than the first.
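As a sketch, using the example workbook that ships with readxl (the sheet names are whatever that particular file contains):

```r
library(readxl)

path <- readxl_example("datasets.xlsx")   # an example file included with readxl
excel_sheets(path)                        # names of all sheets in the file
dat <- read_excel(path, sheet = excel_sheets(path)[2])  # read the second sheet
```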
8.3 Downloading files
A common place for data to reside is on the internet. When these data are in files, we can download them and then import them or even read them directly from the web.
url <-
"https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/murders.csv"
The read_csv function can read these files directly:
dat <- read_csv(url)
You can also download the file first using download.file or the Unix commands curl or wget:
tmp_filename <- tempfile()
download.file(url, tmp_filename)
dat <- read_csv(tmp_filename)
Rows: 51 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): state, abb, region
dbl (2): population, total
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
file.remove(tmp_filename)
[1] TRUE
8.4 Encoding
RStudio assumes the Unicode encoding. A common pitfall in data analysis is assuming a file is Unicode when, in fact, it is something else.
To understand encoding, remember that everything on a computer needs to eventually be converted to 0s and 1s. ASCII is an encoding that maps characters to numbers. ASCII uses 7 bits (0s and 1s), which results in \(2^7 = 128\) unique items, enough to encode all the characters on an English language keyboard. However, other languages use characters not included in this encoding. For example, the é in México is not encoded by ASCII. For this reason, a new encoding that uses more than 7 bits was defined: Unicode. When using Unicode, one can choose between 8, 16, and 32 bits, abbreviated UTF-8, UTF-16, and UTF-32, respectively. RStudio defaults to UTF-8 encoding. ASCII is a subset of UTF-8.
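You can see these mappings directly in R: charToRaw shows the bytes and utf8ToInt the Unicode code point of a character.

```r
charToRaw("A")        # hex 41: an ASCII character fits in one byte
utf8ToInt("A")        # 65, within ASCII's 0-127 range
utf8ToInt("\u00e9")   # 233 (the é in México), beyond ASCII's 7 bits
```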
Try reading in this file:
url <- "https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/calificaciones.csv"
readLines(url, n = 2)
[1] "\"nombre\",\"f.n.\",\"estampa\",\"puntuaci\xf3n\""
[2] "\"Beyonc\xe9\",\"04 de septiembre de 1981\",2023-09-22 02:11:02,\"87,5\""
When you see these weird characters, the problem is almost always that you are assuming the wrong encoding. You don't need to be a hacker to figure out the correct one; readr has a function that tries to guess it:
guess_encoding(url)
# A tibble: 3 × 2
encoding confidence
<chr> <dbl>
1 ISO-8859-1 0.92
2 ISO-8859-2 0.72
3 ISO-8859-9 0.53
The first guess makes sense, as Spanish text is often saved using the Latin-1 encoding, also known as ISO-8859-1, which includes accents and other characters used in Spanish. Once we figure this out, we can read in the file correctly:
read_csv(url, locale = locale(encoding = "ISO-8859-1", decimal_mark = ","))
Rows: 7 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): nombre, f.n.
dbl (1): puntuación
dttm (1): estampa
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 7 × 4
nombre f.n. estampa puntuación
<chr> <chr> <dttm> <dbl>
1 Beyoncé 04 de septiembre de 1981 2023-09-22 02:11:02 87.5
2 Blümchen 20 de abril de 1980 2023-09-22 03:23:05 99
3 João 10 de junio de 1931 2023-09-21 22:43:28 98.9
4 López 24 de julio de 1969 2023-09-22 01:06:59 88.7
5 Ñengo 15 de diciembre de 1981 2023-09-21 23:35:37 93.1
6 Plácido 24 de enero de 1941 2023-09-21 23:17:21 88.7
7 Thalía 26 de agosto de 1971 2023-09-21 23:08:02 83
8.5 Exercises
- Use the read_csv function to read each of the csv files that the following code saves in the files object. Hint: use the pattern argument of list.files to keep only the csv files.
library(readr)
path <- system.file("extdata", package = "dslabs")
files <- list.files(path, pattern = ".csv")
res <- lapply(files, function(fn)
  read_csv(file.path(path, fn), show_col_types = FALSE))
New names:
• `` -> `...1`
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
- Note that you get a warning. To see which file produces it, read them one by one in a loop:
for (i in seq_along(files)) {
print(files[i])
read_csv(file.path(path, files[i]), show_col_types = FALSE)
}
[1] "calificaciones.csv"
[1] "carbon_emissions.csv"
[1] "fertility-two-countries-example.csv"
[1] "life-expectancy-and-fertility-two-countries-example.csv"
[1] "murders.csv"
[1] "olive.csv"
New names:
• `` -> `...1`
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
[1] "ssa-death-probability.csv"
olive.csv gives us a New names warning. This is because the first line of the file is missing the header for the first column.
read_lines(file.path(path, "olive.csv"), n_max = 2)
[1] ",Region,Area,palmitic,palmitoleic,stearic,oleic,linoleic,linolenic,arachidic,eicosenoic"
[2] "1,North-Apulia,1,1,1075,75,226,7823,672,36,60,29"
Read the help file for read_csv to figure out how to read in the file without reading this header. If you skip the header, you should not get this warning. Save the result to an object called dat.
dat <- read_csv(file.path(path, "olive.csv"), col_names = FALSE, skip = 1)
dat
Rows: 572 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): X2
dbl (11): X1, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 572 × 12
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 North-Apul… 1 1 1075 75 226 7823 672 36 60 29
2 2 North-Apul… 1 1 1088 73 224 7709 781 31 61 29
3 3 North-Apul… 1 1 911 54 246 8113 549 31 63 29
4 4 North-Apul… 1 1 966 57 240 7952 619 50 78 35
5 5 North-Apul… 1 1 1051 67 259 7771 672 50 80 46
6 6 North-Apul… 1 1 911 49 268 7924 678 51 70 44
7 7 North-Apul… 1 1 922 66 264 7990 618 49 56 29
8 8 North-Apul… 1 1 1100 61 235 7728 734 39 64 35
9 9 North-Apul… 1 1 1082 60 239 7745 709 46 83 33
10 10 North-Apul… 1 1 1037 55 213 7944 633 26 52 30
# ℹ 562 more rows
- A problem with the previous approach is that we don't know what the columns represent. Type names(dat) to see that the names are not informative. Use read_lines with the argument n_max = 1 to read just the first line.
read_lines(file.path(path, "olive.csv"), n_max = 1)
[1] ",Region,Area,palmitic,palmitoleic,stearic,oleic,linoleic,linolenic,arachidic,eicosenoic"
Notice that you can use this to assign names to the data frame.
colnames <- read_lines(file.path(path, "olive.csv"), n_max = 1)
colnames <- strsplit(colnames, ",") |> unlist()
colnames[1] <- "row_number"
names(dat) <- colnames
Warning: The `value` argument of `names<-` must have the same length as `x` as of tibble 3.0.0.
The warning arises because the header provides only 11 names while dat has 12 columns: the data rows contain one more field than the header line.