Importing files

2024-09-30

Importing data

  • One of the most common ways of storing and sharing data is through electronic spreadsheets.

  • A spreadsheet is a file version of a data frame.

  • But there are many ways to store spreadsheets in files.

Importing data

To import data we need to:

  • Identify the file’s location.

  • Know what function or parsers to use.

For the second step it helps to know the file type and encoding.

File types

  • Files can generally be classified into two categories: text and binary.

  • We describe the most widely used format for storing data for both these types and learn how to identify them.

Text files

  • You have already worked with text files: R scripts and Quarto files, for example.

  • dslabs offers examples:

dir <- system.file(package = "dslabs") 
file_path <- file.path(dir, "extdata/murders.csv") 
file.copy(file_path, "murders.csv") 
[1] TRUE

Text files

  • An advantage of text files is that we can easily “look” at them without having to purchase any kind of special software or follow complicated instructions.

  • Exercise:

    • copy murders.csv into your working directory and examine it with less.
    • Then try the Open file RStudio tool.

Text files

  • Line breaks are used to separate rows and a delimiter to separate columns within a row.

  • The most common delimiters are comma (,), semicolon (;), space (), and tab (\t).

  • Different parsers are used to read these files, so we need to know what delimiter was used.

  • In some cases, the delimiter can be inferred from file suffix: csv, tsv, for example.

  • But we recommend looking at the file rather than inferring from the suffix.

Text files

  • In R, you can look at any number of lines from within R using the readLines function:
readLines("murders.csv", n = 3) 
[1] "state,abb,region,population,total" "Alabama,AL,South,4779736,135"     
[3] "Alaska,AK,West,710231,19"         

Binary files

  • Opening image files such as jpg or png in a text editor or using readLines in R will not show comprehensible content: these are binary files.

  • Unlike text files, which are designed for human readability and have standardized conventions, binary files have many formats specific to their data type.

  • While R’s readBin function can process any binary file, interpreting the output necessitates a thorough understanding of the file’s structure.

  • We focus on the the most prevalent binary formats for spreadsheets: Microsoft Excel xls and xlsx.

R Base Parsers

Here example of useful R base parsers:

x <- read.table("murders.csv", sep = "\t")
x <- read.csv("murders.csv")

readr Parsers

readr provides alternatives that produce tibbles:

library(readr)
x <- read_csv("murders.csv")
x <- read_delim("murders.csv", delim = "\t")

Notice the messages produced.

readr Parsers

Function Format Suffix
read_table white space separated values txt
read_csv comma separated values csv
read_csv2 semicolon separated values csv
read_tsv tab separated values tsv
read_delim must define delimiter txt
read_lines similar to readLines any file

readxl Parsers

For Excel files you can use the readxl package.

library(readxl)
fn <- file.path(dir, "extdata/2010_bigfive_regents.xls") 
y <- read_xls(fn)

readxl Parsers

You can read specific sheets and see them using

excel_sheets(fn)
[1] "Sheet1" "Sheet2" "Sheet3"

Note that read_xls has a sheet argument.

readxl Parsers

Function Format Suffix
read_excel auto detect the format xls, xlsx
read_xls original format xls
read_xlsx new format xlsx
excel_sheets detects sheets xls, xlsx

data.table Parsers

The data.table package provide a very fast parser:

library(data.table)
x <- fread("murders.csv")

Note: It returns a file in data.table format which we have mentioned but not explained.

scan

  • The scan function is the most general parser.

  • It will read in any text file and return a vector so you are on your own coverting it to a data frame.

  • Because it returns a vector, you need to tell it in advance what data type to expect:

scan("murders.csv", what = "c", sep = ",", n = 10)
 [1] "state"      "abb"        "region"     "population" "total"     
 [6] "Alabama"    "AL"         "South"      "4779736"    "135"       
  • It can also be used to read from the console. Try typing scan(). Hit return to stop.

Encoding

  • Computer translates everything into 0s and 1s.

  • ASCII is an encoding system that assigns specific numbers to characters.

  • Using 7 bits, ASCII can represent \(2^7 = 128\) unique symbols, sufficient for all English keyboard characters.

  • However, many global languages contain characters outside ASCII’s range.

Encoding

  • For instance, the é in “México” isn’t in ASCII’s catalog.

  • To address this, broader encodings emerged.

  • Unicode offers variations using 8, 16, or 32 bits, known as UTF-8, UTF-16, and UTF-32.

  • RStudio typically uses UTF-8 as its default.

  • Notably, ASCII is a subset of UTF-8, meaning that if a file is ASCII-encoded, presuming it’s UTF-8 encoded won’t cause issues.

Encoding

  • However, there other encodings, such as ISO-8859-1 (also known as Latin-1) developed for the western European languages, Big5 for Traditional Chinese, and ISO-8859-6 for Arabic.

  • Take a look at this file:

fn <- "calificaciones.csv" 
file.copy(file.path(system.file("extdata", package = "dslabs"), fn), fn) 
[1] TRUE
readLines(fn, n = 1) 
[1] "\"nombre\",\"f.n.\",\"estampa\",\"puntuaci\xf3n\""

Encoding

  • The readr parsers permit us to specify an encoding.

  • It also includes a function that tries to guess the encoding:

guess_encoding("murders.csv") 
# A tibble: 1 × 2
  encoding confidence
  <chr>         <dbl>
1 ASCII             1
guess_encoding("calificaciones.csv") 
# A tibble: 3 × 2
  encoding   confidence
  <chr>           <dbl>
1 ISO-8859-1       0.92
2 ISO-8859-2       0.72
3 ISO-8859-9       0.53

Encoding

  • Once we know the encoding we can specify it through the locale argument:
dat <- read_csv("calificaciones.csv", show_col_types = FALSE, 
                locale = locale(encoding = "ISO-8859-1")) 
  • We learn about locales in later.

Encoding

  • We can now see that the characters in the header were read in correctly:
dat
# A tibble: 7 × 4
  nombre   f.n.                     estampa             puntuación
  <chr>    <chr>                    <dttm>                   <dbl>
1 Beyoncé  04 de septiembre de 1981 2023-09-22 02:11:02        875
2 Blümchen 20 de abril de 1980      2023-09-22 03:23:05        990
3 João     10 de junio de 1931      2023-09-21 22:43:28        989
4 López    24 de julio de 1969      2023-09-22 01:06:59        887
5 Ñengo    15 de diciembre de 1981  2023-09-21 23:35:37        931
6 Plácido  24 de enero de 1941      2023-09-21 23:17:21        887
7 Thalía   26 de agosto de 1971     2023-09-21 23:08:02        830

Downloading files

  • A common place for data to reside is on the internet.

  • We can download these files and then import them.

  • We can also read them directly from the web.

url <- paste0("https://raw.githubusercontent.com/", 
              "rafalab/dslabs/master/inst/extdata/murders.csv") 
x <- read.csv(url) 

Downloading files

  • If you want a local copy, you can use `download.file:
download.file(url, "murders.csv") 
  • This will download the file and save it on your system with the name murders.csv.

  • Note You can use any name here, not necessarily murders.csv.

Downloading files

Warning

The function download.file overwrites existing files without warning.

  • Two functions that are sometimes useful when downloading data from the internet are tempdir and tempfile.
tmp_filename <- tempfile() 
download.file(url, tmp_filename) 
dat <- read_csv(tmp_filename) 
file.remove(tmp_filename)