dir <- system.file(package = "dslabs")
file_path <- file.path(dir, "extdata/murders.csv")
file.copy(file_path, "murders.csv")
[1] TRUE
2024-09-30
One of the most common ways of storing and sharing data is through electronic spreadsheets.
A spreadsheet is a file version of a data frame.
But there are many ways to store spreadsheets in files.
To import data we need to:
1. Identify the file’s location.
2. Know which function or parser to use.
For the second step it helps to know the file type and encoding.
Files can generally be classified into two categories: text and binary.
We describe the most widely used formats for storing data of both types and learn how to identify them.
You have already worked with text files: R scripts and Quarto files, for example.
dslabs offers examples:
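A minimal sketch of how those example files could be listed, assuming the dslabs package is installed (if it is not, system.file returns an empty string and the listing is simply empty). The names extdata_dir and example_files are illustrative only.

```r
# Locate the folder of example data files shipped with dslabs.
# system.file() returns "" when the package or folder is not found.
extdata_dir <- system.file("extdata", package = "dslabs")

# List the file names; the suffixes hint at each file's type.
example_files <- list.files(extdata_dir)
print(example_files)
```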
An advantage of text files is that we can easily “look” at them without having to purchase any kind of special software or follow complicated instructions.
Exercise: copy murders.csv into your working directory and examine it with less.
Line breaks are used to separate rows, and a delimiter is used to separate columns within a row.
The most common delimiters are comma (,), semicolon (;), space ( ), and tab (\t).
Different parsers are used to read these files, so we need to know what delimiter was used.
In some cases, the delimiter can be inferred from the file suffix: csv or tsv, for example.
But we recommend looking at the file rather than inferring from the suffix.
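As a rough sketch of that recommendation, the base R snippet below peeks at the first lines of a file and counts candidate delimiters in the header. The file is a small stand-in written to a temporary location, with a deliberately unhelpful txt suffix; the names path, candidates, and counts are illustrative.

```r
# Write a small semicolon-delimited file with an uninformative suffix.
path <- tempfile(fileext = ".txt")
writeLines(c("state;abb;total", "Alabama;AL;135"), path)

# Look at the first lines without parsing them.
first_lines <- readLines(path, n = 2)
print(first_lines)

# Count how often each candidate delimiter appears in the header line.
candidates <- c(comma = ",", semicolon = ";", tab = "\t", space = " ")
counts <- sapply(candidates, function(d) {
  lengths(regmatches(first_lines[1], gregexpr(d, first_lines[1], fixed = TRUE)))
})
print(counts)

# The most frequent candidate is a reasonable first guess, not a guarantee.
likely_delim <- names(which.max(counts))
```

Counting characters in one line is only a heuristic; a quoted field that contains commas, for example, would mislead it, which is why looking at the lines yourself remains the safer check.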
From within R, the readLines function shows the first lines of a file without importing it.
Opening image files such as jpg or png in a text editor or using readLines in R will not show comprehensible content: these are binary files.
Unlike text files, which are designed for human readability and have standardized conventions, binary files have many formats specific to their data type.
While R’s readBin function can process any binary file, interpreting the output requires a thorough understanding of the file’s structure.
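To illustrate why the file's structure matters, here is a small sketch that writes the eight-byte signature that begins every PNG file and reads it back with readBin. The file is not a real image, only the signature, and the variable names are illustrative.

```r
# The fixed 8-byte signature at the start of every PNG file.
png_signature <- as.raw(c(0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a))

# Write the bytes to a temporary file (a stand-in, not a valid image).
bin_path <- tempfile(fileext = ".png")
writeBin(png_signature, bin_path)

# readBin returns raw bytes; it does not know what they mean.
bytes <- readBin(bin_path, what = "raw", n = 8)
print(bytes)

# Only knowledge of the format tells us bytes 2 to 4 spell "PNG".
tag <- rawToChar(bytes[2:4])
print(tag)
```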
We focus on the most prevalent binary formats for spreadsheets: Microsoft Excel xls and xlsx.
Here are examples of useful base R parsers:
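A minimal sketch using three base R parsers on the same small file. The file is a stand-in written to a temporary location so the example is self-contained; with a real file you would pass its path instead.

```r
# Small comma-separated stand-in file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("state,abb,total", "Alabama,AL,135"), csv_path)

# read.csv assumes a header row and a comma delimiter.
dat_csv <- read.csv(csv_path)

# read.table is the general form: header and delimiter must be stated.
dat_table <- read.table(csv_path, header = TRUE, sep = ",")

# read.delim defaults to tabs, so the delimiter is overridden here.
dat_delim <- read.delim(csv_path, sep = ",")

str(dat_csv)
```

All three return a data frame; they differ mainly in their defaults.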
readr provides alternatives that produce tibbles:
Notice the messages produced.
Function | Format | Suffix |
---|---|---|
read_table | white space separated values | txt |
read_csv | comma separated values | csv |
read_csv2 | semicolon separated values | csv |
read_tsv | tab separated values | tsv |
read_delim | must define delimiter | txt |
read_lines | similar to readLines | any file |
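A short sketch of two of the readr functions from the table, assuming the readr package is installed (the code is skipped otherwise). The input is a small temporary stand-in file.

```r
# Small comma-separated stand-in file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("state,abb,total", "Alabama,AL,135"), csv_path)

if (requireNamespace("readr", quietly = TRUE)) {
  # read_csv reports the column types it guessed as a message.
  dat <- readr::read_csv(csv_path)
  print(class(dat))  # a tibble is also a data frame

  # read_delim needs the delimiter spelled out.
  dat_delim <- readr::read_delim(csv_path, delim = ",")
}
```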
For Excel files you can use the readxl package.
You can see which sheets a file contains using excel_sheets and then read a specific one.
Note that read_xls has a sheet argument.
Function | Format | Suffix |
---|---|---|
read_excel | auto detect the format | xls, xlsx |
read_xls | original format | xls |
read_xlsx | new format | xlsx |
excel_sheets | detects sheets | xls, xlsx |
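A sketch of listing sheets and reading one of them, assuming the readxl package is installed. It uses the example workbook datasets.xlsx that ships with readxl, located through readxl_example, rather than any file from this lecture.

```r
if (requireNamespace("readxl", quietly = TRUE)) {
  # Path to an example workbook bundled with readxl.
  xlsx_path <- readxl::readxl_example("datasets.xlsx")

  # List the sheet names in the workbook.
  sheets <- readxl::excel_sheets(xlsx_path)
  print(sheets)

  # Read one sheet by name; read_excel detects xls versus xlsx itself.
  first_sheet <- readxl::read_excel(xlsx_path, sheet = sheets[1])
  print(head(first_sheet, 3))
}
```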
The data.table package provides a very fast parser, fread.
Note: It returns an object in data.table format, which we have mentioned but not explained.
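A minimal sketch of fread, assuming the data.table package is installed. The input is a small temporary stand-in file, and the conversion to a plain data frame at the end is optional.

```r
# Small comma-separated stand-in file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("state,abb,total", "Alabama,AL,135"), csv_path)

if (requireNamespace("data.table", quietly = TRUE)) {
  # fread detects the delimiter and column types on its own.
  dt <- data.table::fread(csv_path)
  print(class(dt))

  # A data.table is also a data frame; convert if plain behavior is preferred.
  df <- as.data.frame(dt)
}
```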
The scan function is the most general parser.
It will read in any text file and return a vector, so you are on your own converting it to a data frame.
Because it returns a vector, you need to tell it in advance what data type to expect:
[1] "state" "abb" "region" "population" "total"
[6] "Alabama" "AL" "South" "4779736" "135"
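One possible way to turn such a vector back into a data frame is sketched below. It assumes you already know the file has five columns and a header row; the file is a small temporary stand-in whose second data row is illustrative.

```r
# Small stand-in file with the same five columns as the output above.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("state,abb,region,population,total",
             "Alabama,AL,South,4779736,135",
             "Alaska,AK,West,710231,19"), csv_path)

# scan returns one long character vector, read row by row.
x <- scan(csv_path, sep = ",", what = "character", quiet = TRUE)

# Reshape into rows of five values; the first row holds the column names.
m <- matrix(x, ncol = 5, byrow = TRUE)
dat <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(dat) <- m[1, ]

# Everything was read as character, so numeric columns need converting.
dat$population <- as.numeric(dat$population)
dat$total <- as.numeric(dat$total)
str(dat)
```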
Called with no arguments, scan() reads values typed at the console. Hit return on an empty line to stop.
Computers translate everything into 0s and 1s.
ASCII is an encoding system that assigns specific numbers to characters.
Using 7 bits, ASCII can represent \(2^7 = 128\) unique symbols, sufficient for all English keyboard characters.
However, many global languages contain characters outside ASCII’s range.
For instance, the é in “México” isn’t in ASCII’s catalog.
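These numbers can be checked from R with a few base functions. The sketch below writes é as the escape \u00e9 so that the script itself stays pure ASCII.

```r
# Seven bits give 2^7 distinct codes.
n_symbols <- 2^7
print(n_symbols)

# Each ASCII character maps to a number below 128, and back again.
code_A <- utf8ToInt("A")
word <- intToUtf8(c(72, 105))
print(code_A)
print(word)

# The e with acute accent has a code point above 127, so it is outside ASCII.
e_acute <- enc2utf8("\u00e9")
code_e <- utf8ToInt(e_acute)
print(code_e)

# In UTF-8 it is stored as two bytes rather than one.
utf8_bytes <- charToRaw(e_acute)
print(utf8_bytes)
```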
To address this, broader encodings emerged.
Unicode offers encodings built on 8-, 16-, or 32-bit units, known as UTF-8, UTF-16, and UTF-32.
RStudio typically uses UTF-8 as its default.
Notably, ASCII is a subset of UTF-8, meaning that if a file is ASCII-encoded, presuming it’s UTF-8 encoded won’t cause issues.
However, there are other encodings, such as ISO-8859-1 (also known as Latin-1), developed for western European languages, Big5 for Traditional Chinese, and ISO-8859-6 for Arabic.
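To see how the same text differs between encodings, the sketch below converts a string from UTF-8 to Latin-1 with base R's iconv and compares the bytes. The variable names are illustrative.

```r
# The word "México", written with an escape so the script stays ASCII.
x_utf8 <- enc2utf8("M\u00e9xico")

# Convert the same text to Latin-1.
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

# UTF-8 uses two bytes for the accented letter, Latin-1 uses one.
bytes_utf8 <- charToRaw(x_utf8)
bytes_latin1 <- charToRaw(x_latin1)
print(bytes_utf8)
print(bytes_latin1)

# The Latin-1 bytes are not valid UTF-8, which is why a wrong
# encoding assumption produces garbled characters or errors.
print(validUTF8(x_latin1))

# Converting back with the correct source encoding recovers the text.
x_back <- iconv(x_latin1, from = "latin1", to = "UTF-8")
```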
Take a look at this file:
The readr parsers permit us to specify an encoding.
It also includes a function that tries to guess the encoding:
Once the encoding is known, we can specify it through the locale argument:
# A tibble: 7 × 4
nombre f.n. estampa puntuación
<chr> <chr> <dttm> <dbl>
1 Beyoncé 04 de septiembre de 1981 2023-09-22 02:11:02 875
2 Blümchen 20 de abril de 1980 2023-09-22 03:23:05 990
3 João 10 de junio de 1931 2023-09-21 22:43:28 989
4 López 24 de julio de 1969 2023-09-22 01:06:59 887
5 Ñengo 15 de diciembre de 1981 2023-09-21 23:35:37 931
6 Plácido 24 de enero de 1941 2023-09-21 23:17:21 887
7 Thalía 26 de agosto de 1971 2023-09-21 23:08:02 830
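A sketch of the guess-then-specify workflow, assuming the readr package is installed. Rather than the lecture's file, it writes a tiny Latin-1 encoded stand-in containing one row from the table above; with so little text the guess may be uncertain.

```r
# Write a two-line CSV as raw Latin-1 bytes (a stand-in for the real file).
txt <- enc2utf8("nombre,puntuaci\u00f3n\nBeyonc\u00e9,875\n")
path <- tempfile(fileext = ".csv")
writeBin(charToRaw(iconv(txt, from = "UTF-8", to = "latin1")), path)

if (requireNamespace("readr", quietly = TRUE)) {
  # Ask readr for candidate encodings and their confidence.
  print(readr::guess_encoding(path))

  # Pass the encoding to the parser through the locale argument.
  dat <- readr::read_csv(path, locale = readr::locale(encoding = "ISO-8859-1"))
  print(dat)
}
```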
A common place for data to reside is on the internet.
We can download these files and then import them.
We can also read them directly from the web.
Calling download.file with a URL and a destination name such as murders.csv downloads the file and saves it on your system under that name.
Note: You can use any name here, not necessarily murders.csv.
Warning: The function download.file overwrites existing files without warning.
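The tempdir and tempfile functions provide a temporary directory and a unique file name within it, which avoids overwriting an existing file.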
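A sketch of downloading to a temporary file. So that it runs without a network connection, it uses a file:// URL pointing at a small local stand-in; this is an assumption for illustration, and an https URL would be passed to download.file in the same way.

```r
# Stand-in for a file on the web: a local CSV exposed through a file:// URL.
src <- tempfile(fileext = ".csv")
writeLines(c("state,abb,total", "Alabama,AL,135"), src)
url <- paste0("file:///", sub("^/", "", normalizePath(src, winslash = "/")))

# tempfile() returns a unique path inside tempdir(),
# so the download cannot overwrite an existing file.
dest <- tempfile(pattern = "murders_", fileext = ".csv")
print(dirname(dest))

# download.file returns 0 on success.
status <- download.file(url, destfile = dest, quiet = TRUE)

# Import the downloaded copy, then remove it.
dat <- read.csv(dest)
removed <- file.remove(dest)
```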