15  Locales

Notice the characters in this file.

fn <- file.path(system.file("extdata", package = "dslabs"), "calificaciones.csv")
readLines(fn)
[1] "\"nombre\",\"f.n.\",\"estampa\",\"puntuaci\xf3n\""                       
[2] "\"Beyonc\xe9\",\"04 de septiembre de 1981\",2023-09-22 02:11:02,\"87,5\""
[3] "\"Bl\xfcmchen\",\"20 de abril de 1980\",2023-09-22 03:23:05,\"99,0\""    
[4] "\"Jo\xe3o\",\"10 de junio de 1931\",2023-09-21 22:43:28,\"98,9\""        
[5] "\"L\xf3pez\",\"24 de julio de 1969\",2023-09-22 01:06:59,\"88,7\""       
[6] "\"\xd1engo\",\"15 de diciembre de 1981\",2023-09-21 23:35:37,\"93,1\""   
[7] "\"Pl\xe1cido\",\"24 de enero de 1941\",2023-09-21 23:17:21,\"88,7\""     
[8] "\"Thal\xeda\",\"26 de agosto de 1971\",2023-09-21 23:08:02,\"83,0\""     

The unrecognizable characters cause read.csv to fail:

try({x <- read.csv(fn)})
Error in make.names(col.names, unique = TRUE) : 
  invalid multibyte string 4

This is because the file is not encoded in UTF-8, which is the default encoding on this system:

Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

The locale is a collection of settings about your system, including the encoding, the language, and the time zone. These settings can affect how data is read into R. An encoding mismatch can create problems that are hard to diagnose, often without a warning or error.
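To see what an encoding mismatch looks like at the byte level, here is a minimal base R sketch. It writes the name João to a temporary file using its ISO-8859-1 bytes (the temporary file and the object names are made up for this illustration), reads it back without declaring an encoding, and then repairs it in two ways.

```r
# Write the five bytes of "João\n" as encoded in ISO-8859-1 to a temporary file.
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x4a, 0x6f, 0xe3, 0x6f, 0x0a)), tmp)

# Read without declaring an encoding: the byte 0xe3 alone is not valid UTF-8.
raw_line <- readLines(tmp)
validUTF8(raw_line)

# Option 1: declare the encoding while reading, then convert to UTF-8.
marked <- readLines(tmp, encoding = "latin1")
Encoding(marked)
fixed <- enc2utf8(marked)

# Option 2: convert the undeclared string directly with iconv.
fixed2 <- iconv(raw_line, from = "ISO-8859-1", to = "UTF-8")

unlink(tmp)
```

Both approaches should yield the same UTF-8 string. The key point is that the bytes in the file never change; only the encoding used to interpret them does.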

You can use the stri_enc_detect function in the stringi package to predict the encoding of a character string:

library(stringi)
x <- readLines(fn, n = 1)
stri_enc_detect(x)
[[1]]
    Encoding Language Confidence
1 ISO-8859-1       es       0.75
2 ISO-8859-2       cs       0.18
3   UTF-16BE                0.10
4   UTF-16LE                0.10
5  Shift_JIS       ja       0.10
6    GB18030       zh       0.10
7       Big5       zh       0.10

We can also use the readr function guess_encoding to detect the encoding of a file:

library(readr)
guess_encoding(fn)
# A tibble: 3 × 2
  encoding   confidence
  <chr>           <dbl>
1 ISO-8859-1       0.92
2 ISO-8859-2       0.72
3 ISO-8859-9       0.53

The read_csv function lets us specify the encoding through the locale argument. It switches the locale only temporarily, while the read_csv parser runs. The locale for R remains the same after the call.

x <- read_csv(fn, locale = locale(encoding = "ISO-8859-1"))
Rows: 7 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): nombre, f.n.
num  (1): puntuación
dttm (1): estampa

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
x
# A tibble: 7 × 4
  nombre   f.n.                     estampa             puntuación
  <chr>    <chr>                    <dttm>                   <dbl>
1 Beyoncé  04 de septiembre de 1981 2023-09-22 02:11:02        875
2 Blümchen 20 de abril de 1980      2023-09-22 03:23:05        990
3 João     10 de junio de 1931      2023-09-21 22:43:28        989
4 López    24 de julio de 1969      2023-09-22 01:06:59        887
5 Ñengo    15 de diciembre de 1981  2023-09-21 23:35:37        931
6 Plácido  24 de enero de 1941      2023-09-21 23:17:21        887
7 Thalía   26 de agosto de 1971     2023-09-21 23:08:02        830

Now notice the last column. Compare it to what we saw with readLines. The numbers used a comma as the decimal mark, as is common in Europe. This confuses read_csv, which by default treats the comma as a grouping mark. We can also change the locale so that the European decimal mark is used:

x <- read_csv(fn, locale = locale(encoding = "ISO-8859-1", decimal_mark = ","))
Rows: 7 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): nombre, f.n.
dbl  (1): puntuación
dttm (1): estampa

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
x
# A tibble: 7 × 4
  nombre   f.n.                     estampa             puntuación
  <chr>    <chr>                    <dttm>                   <dbl>
1 Beyoncé  04 de septiembre de 1981 2023-09-22 02:11:02       87.5
2 Blümchen 20 de abril de 1980      2023-09-22 03:23:05       99  
3 João     10 de junio de 1931      2023-09-21 22:43:28       98.9
4 López    24 de julio de 1969      2023-09-22 01:06:59       88.7
5 Ñengo    15 de diciembre de 1981  2023-09-21 23:35:37       93.1
6 Plácido  24 de enero de 1941      2023-09-21 23:17:21       88.7
7 Thalía   26 de agosto de 1971     2023-09-21 23:08:02       83  
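The effect of decimal_mark can be checked on a single value. The following sketch uses the readr functions default_locale and parse_number; the object names as_default and as_european are made up for this illustration.

```r
library(readr)

# Default readr locale: "." is the decimal mark and "," the grouping mark.
default_locale()$decimal_mark
default_locale()$grouping_mark

# With the defaults, the comma in "87,5" is dropped as a grouping mark.
as_default <- parse_number("87,5")

# Declaring "," as the decimal mark gives the intended value.
as_european <- parse_number("87,5", locale = locale(decimal_mark = ","))
```

This is consistent with the two tables above: the same text is read as 875 under the default locale and as 87.5 once the decimal mark is declared.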

Now let’s try to convert the dates to date format:

library(lubridate)

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
dmy(x$f.n.)
Warning: All formats failed to parse. No formats found.
[1] NA NA NA NA NA NA NA

Nothing gets converted correctly. This is because the dates are in Spanish. With the readr function parse_date, you can change the locale to use Spanish as the language:

parse_date(x$f.n., format = "%d de %B de %Y", locale = locale(date_names = "es"))
[1] "1981-09-04" "1980-04-20" "1931-06-10" "1969-07-24" "1981-12-15"
[6] "1941-01-24" "1971-08-26"
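To see where the Spanish month names come from, the sketch below inspects the names readr associates with the "es" language code using date_names_lang, and then parses a single date string from the file. The object name d is made up for this illustration.

```r
library(readr)

# Month names that readr associates with the "es" language code.
es_months <- date_names_lang("es")$mon
es_months

# Parse one Spanish date string with an explicit format and locale.
# %d is the day, %B the full month name, and %Y the four-digit year.
d <- parse_date("04 de septiembre de 1981",
                format = "%d de %B de %Y",
                locale = locale(date_names = "es"))
d
```

The literal text "de" in the format string must match the input exactly, so a column that mixes formats would need additional handling.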

Finally, notice that three students appear to have turned in the homework past the deadline of September 21:

x$estampa >= make_date(2023, 9, 22)
[1]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE

However, with times we have to be particularly careful, as some functions default to UTC:

tz(x$estampa)
[1] "UTC"

The times were read in as UTC, the default. If we convert them to our time zone:

with_tz(x$estampa, tz =  Sys.timezone()) >= make_date(2023, 9, 22)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

we see everybody turned it in on time.
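Because the result of Sys.timezone() depends on the computer running the code, it can help to make the time zone explicit. The sketch below takes two of the timestamps from the file and compares them to the deadline expressed as a date-time in a stated time zone. The time zone "America/New_York" and the object names are assumptions made for this illustration, not something stated in the data.

```r
library(lubridate)

# Two timestamps from the file, stored as UTC instants.
stamps <- ymd_hms(c("2023-09-22 02:11:02", "2023-09-21 22:43:28"), tz = "UTC")

# Hypothetical course time zone; an assumption for this sketch.
course_tz <- "America/New_York"

# The deadline passes at the start of September 22, which is a
# different instant in each time zone.
deadline_utc   <- ymd("2023-09-22", tz = "UTC")
deadline_local <- ymd("2023-09-22", tz = course_tz)

# Late according to a UTC clock versus the course's local clock.
late_utc   <- stamps >= deadline_utc
late_local <- with_tz(stamps, course_tz) >= deadline_local

# The same instants displayed on the course's clock.
with_tz(stamps, course_tz)
```

Under this assumption, the first timestamp is late by a UTC clock but on time by the local clock, since 02:11 UTC falls in the evening of September 21 in New York. Writing the deadline as a date-time with an explicit time zone avoids relying on how a date is converted during the comparison.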