8 Importing data
8.1 R base functions
We include example data files for practice in the dslabs package. They are stored here:
dir <- system.file("extdata", package = "dslabs")
Take a look:
list.files(dir)
[1] "2010_bigfive_regents.xls"
[2] "calificaciones.csv"
[3] "carbon_emissions.csv"
[4] "fertility-two-countries-example.csv"
[5] "HRlist2.txt"
[6] "life-expectancy-and-fertility-two-countries-example.csv"
[7] "murders.csv"
[8] "olive.csv"
[9] "RD-Mortality-Report_2015-18-180531.pdf"
[10] "ssa-death-probability.csv"
Copy one of them to your working directory:
file_path <- file.path(dir, "murders.csv")
file.copy(file_path, "murders.csv")
[1] TRUE
The file.path function combines characters to form a complete path, ensuring compatibility with the respective operating system: Linux and Mac use forward slashes / to separate directories, while Windows uses backslashes \. This function is useful because you often want to define paths using a variable.
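For instance, the components below (directory names made up for illustration) are joined into a single path:

```r
# file.path joins path components with the file separator, so you can
# build paths from variables instead of pasting strings together.
# The directory names here are hypothetical.
file.path("project", "data", "murders.csv")
# [1] "project/data/murders.csv"
```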
The file.copy function copies a file and returns TRUE if successful. If the destination file already exists, it will not be copied over and the function returns FALSE, unless you set overwrite = TRUE.
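A small sketch of this behavior, using temporary files so nothing in your working directory is touched:

```r
# file.copy returns TRUE on success and FALSE if the destination
# already exists; overwrite = TRUE forces the copy anyway.
src <- tempfile(fileext = ".csv")
dst <- tempfile(fileext = ".csv")
writeLines("state,abb", src)
file.copy(src, dst)                    # TRUE: first copy succeeds
file.copy(src, dst)                    # FALSE: destination exists
file.copy(src, dst, overwrite = TRUE)  # TRUE: forced to overwrite
```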
What kind of file is it? Although the suffix usually tells us what type of file it is, there is no guarantee that these always match.
readLines("murders.csv", n = 3)
[1] "state,abb,region,population,total" "Alabama,AL,South,4779736,135"
[3] "Alaska,AK,West,710231,19"
It is comma delimited and has a header. You can import it like this:
dat <- read.csv("murders.csv")
There are other importing functions in base R: read.table, read.csv, and read.delim, for example.
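These are all read.table with different defaults: read.csv sets sep = "," and header = TRUE, and read.delim sets sep = "\t". A minimal sketch, parsing an inline string via the text argument instead of a file:

```r
# read.csv is read.table with sep = "," and header = TRUE;
# the text argument lets us parse a literal string for illustration.
txt <- "state,abb\nAlabama,AL\nAlaska,AK"
dat <- read.csv(text = txt)
dat$abb  # "AL" "AK"
```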
8.2 The readr and readxl packages
The tidyverse provides improved versions of the base functions for importing data.
8.2.1 readr
The readr package includes functions for reading data stored in text-file spreadsheets into R. readr is part of the tidyverse, but you can also load it directly using:
library(readr)
The following functions are available to read-in spreadsheets:
Function | Format | Typical suffix |
---|---|---|
read_table | white space separated values | txt |
read_csv | comma separated values | csv |
read_csv2 | semicolon separated values | csv |
read_tsv | tab delimited separated values | tsv |
read_delim | general text file format, must define delimiter | txt |
The readr equivalent of readLines is read_lines:
read_lines("murders.csv", n_max = 3)
[1] "state,abb,region,population,total" "Alabama,AL,South,4779736,135"
[3] "Alaska,AK,West,710231,19"
From the .csv suffix and the peek at the file, we know to use read_csv:
dat <- read_csv("murders.csv")
Rows: 51 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): state, abb, region
dbl (2): population, total
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note that we receive a message letting us know what data types were used for each column. Also note that dat is a tibble, not just a data frame, because read_csv is a tidyverse parser.
A powerful added feature of read_csv is the col_types argument, which lets you specify the data type of each column before reading. This can help with parsing dates, or keep an error, such as a stray letter in a column of numbers, from turning the whole column into characters.
8.2.2 readxl
Many spreadsheets are saved in Microsoft Excel format. For this we use parsers in the readxl package:
library(readxl)
The package provides functions to read-in Microsoft Excel formats:
Function | Format | Typical suffix |
---|---|---|
read_excel | auto detect the format | xls, xlsx |
read_xls | original format | xls |
read_xlsx | new format | xlsx |
The Microsoft Excel formats permit you to have more than one spreadsheet in one file. These are referred to as sheets. The functions listed above read the first sheet by default, but we can also read the others. The excel_sheets function gives us the names of all the sheets in an Excel file. These names can then be passed to the sheet argument in the three functions above to read sheets other than the first.
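The pattern looks like this; reports.xlsx is a hypothetical file name used only to illustrate:

```r
library(readxl)

# "reports.xlsx" is hypothetical; substitute the path to your own file.
sheets <- excel_sheets("reports.xlsx")                # names of all sheets
dat <- read_excel("reports.xlsx", sheet = sheets[2])  # read the second sheet
```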
8.3 Downloading files
A common place for data to reside is on the internet. When these data are in files, we can download them and then import them or even read them directly from the web.
url <- "https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/murders.csv"
The read_csv function can read these files directly:
dat <- read_csv(url)
You can also download the file first using download.file, or the Unix commands curl or wget.
tmp_filename <- tempfile()
download.file(url, tmp_filename)
dat <- read_csv(tmp_filename)
Rows: 51 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): state, abb, region
dbl (2): population, total
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
file.remove(tmp_filename)
[1] TRUE
8.4 Encoding
RStudio assumes the Unicode encoding. A common pitfall in data analysis is assuming a file is Unicode when, in fact, it is something else.
To understand encoding, remember that everything on a computer needs to eventually be converted to 0s and 1s. ASCII is an encoding that maps characters to numbers. ASCII uses 7 bits (0s and 1s), which results in \(2^7 = 128\) unique items, enough to encode all the characters on an English language keyboard. However, other languages use characters not included in this encoding. For example, the é in México is not encoded by ASCII. For this reason, a new encoding, using more than 7 bits, was defined: Unicode. When using Unicode, one can choose between 8, 16, and 32 bits, abbreviated UTF-8, UTF-16, and UTF-32, respectively. RStudio defaults to UTF-8 encoding. ASCII is a subset of UTF-8.
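Base R's iconv function converts between encodings, which makes the México example concrete: in Latin-1 the é is the single byte 0xE9, which is not valid UTF-8 on its own:

```r
# "M\xe9xico" is "México" as stored in Latin-1 (byte 0xE9 encodes é);
# iconv re-encodes the bytes as UTF-8.
x <- "M\xe9xico"
iconv(x, from = "latin1", to = "UTF-8")
# [1] "México"
```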
Try reading in this file:
url <- "https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/calificaciones.csv"
readLines(url, n = 2)
[1] "\"nombre\",\"f.n.\",\"estampa\",\"puntuaci\xf3n\""
[2] "\"Beyonc\xe9\",\"04 de septiembre de 1981\",2023-09-22 02:11:02,\"87,5\""
When you see these weird characters, the problem is almost always that you are assuming the wrong encoding. You do not need to be a hacker to figure out the right one: readr has a function that tries to guess it:
guess_encoding(url)
# A tibble: 3 × 2
encoding confidence
<chr> <dbl>
1 ISO-8859-1 0.92
2 ISO-8859-2 0.72
3 ISO-8859-9 0.53
The first guess makes sense, as Spanish is often saved using the Latin-1 encoding, also known as ISO-8859-1, because it was among the first encodings to include the accents and other characters used in Spanish. Once we figure this out, we can read in the file correctly:
read_csv(url, locale = locale(encoding = "ISO-8859-1", decimal_mark = ","))
Rows: 7 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): nombre, f.n.
dbl (1): puntuación
dttm (1): estampa
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 7 × 4
nombre f.n. estampa puntuación
<chr> <chr> <dttm> <dbl>
1 Beyoncé 04 de septiembre de 1981 2023-09-22 02:11:02 87.5
2 Blümchen 20 de abril de 1980 2023-09-22 03:23:05 99
3 João 10 de junio de 1931 2023-09-21 22:43:28 98.9
4 López 24 de julio de 1969 2023-09-22 01:06:59 88.7
5 Ñengo 15 de diciembre de 1981 2023-09-21 23:35:37 93.1
6 Plácido 24 de enero de 1941 2023-09-21 23:17:21 88.7
7 Thalía 26 de agosto de 1971 2023-09-21 23:08:02 83
8.5 Exercises
- Use the read_csv function to read each of the csv files that the following code saves in the files object. Hint: use the pattern argument in list.files to keep only the csv files.
library(readr)
path <- system.file("extdata", package = "dslabs")
files <- list.files(path, pattern = ".csv")
res <- lapply(files, function(fn)
  read_csv(file.path(path, fn), show_col_types = FALSE))
New names:
• `` -> `...1`
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
- Note that you get a warning. To see which file causes it, run read_csv on each file one-by-one in a loop:
for (i in seq_along(files)) {
print(files[i])
read_csv(file.path(path, files[i]), show_col_types = FALSE)
}
[1] "calificaciones.csv"
[1] "carbon_emissions.csv"
[1] "fertility-two-countries-example.csv"
[1] "life-expectancy-and-fertility-two-countries-example.csv"
[1] "murders.csv"
[1] "olive.csv"
New names:
• `` -> `...1`
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
[1] "ssa-death-probability.csv"
olive.csv gives us a New names warning. This is because the first line of the file is missing the header for the first column.
read_lines(file.path(path, "olive.csv"), n_max = 2)
[1] ",Region,Area,palmitic,palmitoleic,stearic,oleic,linoleic,linolenic,arachidic,eicosenoic"
[2] "1,North-Apulia,1,1,1075,75,226,7823,672,36,60,29"
Read the help file for read_csv to figure out how to read in the file without reading this header. If you skip the header, you should not get this warning. Save the result to an object called dat.
dat <- read_csv(file.path(path, "olive.csv"), col_names = FALSE, skip = 1)
dat
Rows: 572 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): X2
dbl (11): X1, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 572 × 12
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 North-Apul… 1 1 1075 75 226 7823 672 36 60 29
2 2 North-Apul… 1 1 1088 73 224 7709 781 31 61 29
3 3 North-Apul… 1 1 911 54 246 8113 549 31 63 29
4 4 North-Apul… 1 1 966 57 240 7952 619 50 78 35
5 5 North-Apul… 1 1 1051 67 259 7771 672 50 80 46
6 6 North-Apul… 1 1 911 49 268 7924 678 51 70 44
7 7 North-Apul… 1 1 922 66 264 7990 618 49 56 29
8 8 North-Apul… 1 1 1100 61 235 7728 734 39 64 35
9 9 North-Apul… 1 1 1082 60 239 7745 709 46 83 33
10 10 North-Apul… 1 1 1037 55 213 7944 633 26 52 30
# ℹ 562 more rows
- A problem with the previous approach is that we don't know what the columns represent. Type names(dat) to see that the names are not informative. Use read_lines with the argument n_max = 1 to read just the first line.
read_lines(file.path(path, "olive.csv"), n_max = 1)
[1] ",Region,Area,palmitic,palmitoleic,stearic,oleic,linoleic,linolenic,arachidic,eicosenoic"
Notice that you can use this first line to assign names to the data frame:
colnames <- read_lines(file.path(path, "olive.csv"), n_max = 1)
colnames <- strsplit(colnames, ",") |> unlist()
colnames[1] <- "row_number"
names(dat) <- colnames
Warning: The `value` argument of `names<-` must have the same length as `x` as of tibble
3.0.0.