Locales

2024-09-30

Locales

  • Computer settings change depending on language and location, and being unaware of this possibility can make certain data wrangling challenges difficult to overcome.

Locales

  • The purpose of locales is to group together common settings that can affect:

    1. Month and day names, which are necessary for interpreting dates.

    2. The standard date format.

    3. The default time zone.

    4. Character encoding, vital for reading non-ASCII characters.

    5. The symbols for decimals and number groupings, important for interpreting numerical values.

Locales

  • In R, a locale refers to a suite of settings that dictate how the system should behave with respect to cultural conventions.

  • These affect the way data is formatted and presented, including date formatting, currency symbols, decimal separators, and other related aspects.

  • Locales in R affect several areas, including how character vectors are sorted.

  • Additionally, errors, warnings, and other messages might be translated into languages other than English based on the locale.

Locales in R

  • To access the current locale settings in R, you can use the Sys.getlocale() function:
Sys.getlocale() 
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
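
The string above concatenates several categories. As a small sketch using only base R, you can also query one category at a time, or look at the numeric conventions the current locale reports. The output depends on your system, so none is shown here.

```r
# Query a single locale category instead of the full concatenated string
Sys.getlocale("LC_TIME")
Sys.getlocale("LC_COLLATE")

# Numeric conventions reported by the current locale
conv <- Sys.localeconv()
conv[c("decimal_point", "thousands_sep")]
```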

Locales in R

  • To set a specific locale, use the Sys.setlocale() function.

  • For example, to set the locale to US English:

Sys.setlocale("LC_ALL", "en_US.UTF-8") 
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
  • The exact string to use for setting the locale (like “en_US.UTF-8”) can depend on your operating system and its configuration.

Locales in R

  • LC_ALL refers to all locale categories.

  • R breaks down the locale into categories:

    • LC_COLLATE: for string collation.

    • LC_TIME: date and time formatting.

    • LC_MONETARY: currency formatting.

    • LC_MESSAGES: system message translations.

    • LC_NUMERIC: number formatting.

  • You can set the locale for each category individually if you don’t want to use LC_ALL.
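
As an illustration of setting a single category, the sketch below switches only LC_COLLATE to the "C" locale, which should be available on every platform, and then restores the previous value. The vector x is made up for this example.

```r
x <- c("b", "A", "a", "B")

# Remember the current collation setting so it can be restored afterwards
old_collate <- Sys.getlocale("LC_COLLATE")

# In the "C" locale, strings are compared byte by byte,
# so uppercase letters sort before lowercase ones
invisible(Sys.setlocale("LC_COLLATE", "C"))
sorted_c <- sort(x)
sorted_c
# [1] "A" "B" "a" "b"

# Restore the original collation
invisible(Sys.setlocale("LC_COLLATE", old_collate))
```

In most language-specific locales, such as en_US.UTF-8, the same call to sort typically interleaves upper and lower case instead.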

Locales in R

Warning

  • We have shown tools to control locales.

  • These settings are important because they affect how your data looks and behaves.

  • However, not all of these settings are available on every computer; their availability depends on what kind of computer you have and how it’s set up.

  • Changing these settings, especially LC_NUMERIC, can lead to unexpected problems when you’re working with numbers in R.

  • These locale settings only last as long as one R session.
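
Because a requested locale may not exist on a given machine, one defensive pattern is to change a category only temporarily and check whether the change succeeded. The helper format_date_in() below is a hypothetical function written for this illustration, not part of any package. It relies on Sys.setlocale() returning an empty string when the request cannot be honored.

```r
# Hypothetical helper: format a date under a temporary LC_TIME setting
format_date_in <- function(date, fmt, time_locale) {
  old <- Sys.getlocale("LC_TIME")
  # Restore the previous setting when the function exits, even on error
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  res <- suppressWarnings(Sys.setlocale("LC_TIME", time_locale))
  if (!nzchar(res)) {
    # The requested locale is not available on this system
    return(NA_character_)
  }
  format(date, fmt)
}

# The "C" locale is always available and uses English month names
format_date_in(as.Date("2023-09-22"), "%d %B %Y", "C")
# [1] "22 September 2023"

# The result depends on whether a Spanish locale is installed
format_date_in(as.Date("2023-09-22"), "%d de %B de %Y", "es_ES.UTF-8")
```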

The locale function

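  • The readr package includes a locale() function that creates a locale object describing the conventions readr's parsing functions should follow. It does not change the system locale. Called with no arguments, it shows readr's defaults: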
library(readr) 
locale() 
<locale>
Numbers:  123,456.78
Formats:  %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday
        (Thu), Friday (Fri), Saturday (Sat)
Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May),
        June (Jun), July (Jul), August (Aug), September (Sep), October
        (Oct), November (Nov), December (Dec)
AM/PM:  AM/PM

The locale function

  • On Unix-like systems, such as macOS and Linux, you can see all the locales available on your system by typing:
system("locale -a") 

The locale function

  • Here is what you obtain if you change the dates locale to Spanish:
locale(date_names = "es") 
<locale>
Numbers:  123,456.78
Formats:  %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days:   domingo (dom.), lunes (lun.), martes (mar.), miércoles (mié.), jueves
        (jue.), viernes (vie.), sábado (sáb.)
Months: enero (ene.), febrero (feb.), marzo (mar.), abril (abr.), mayo (may.),
        junio (jun.), julio (jul.), agosto (ago.), septiembre (sept.),
        octubre (oct.), noviembre (nov.), diciembre (dic.)
AM/PM:  a. m./p. m.

Example

  • Earlier we noted that the file:
fn <- file.path(system.file("extdata", package = "dslabs"), "calificaciones.csv") 
  • has an encoding other than UTF-8, the default.

Example

  • We used guess_encoding to determine the correct one:
guess_encoding(fn)$encoding[1] 
[1] "ISO-8859-1"
  • and used the locale function to change this and read in this encoding instead:
dat <- read_csv(fn, locale = locale(encoding = "ISO-8859-1")) 
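
To see what the encoding controls, here is a minimal base R sketch that is separate from the readr workflow above. The string latin1_bytes is built by hand for this illustration: the byte 0xF3 is "ó" in ISO-8859-1 but is not valid UTF-8 on its own, so it has to be converted before it can be used as UTF-8 text.

```r
# "puntuación" as it would be stored in an ISO-8859-1 file:
# the single byte 0xF3 stands for "ó"
latin1_bytes <- "puntuaci\xf3n"

# Convert the bytes from ISO-8859-1 to UTF-8
utf8_text <- iconv(latin1_bytes, from = "ISO-8859-1", to = "UTF-8")

# Compare against the same word written with a Unicode escape
utf8_text == "puntuaci\u00f3n"
# [1] TRUE
```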

Example

  • This file provides homework assignment scores for seven students. Columns represent the name, date of birth, the time they submitted their assignment, and their score:
read_lines(fn, locale = locale(encoding = "ISO-8859-1")) 
[1] "\"nombre\",\"f.n.\",\"estampa\",\"puntuación\""                       
[2] "\"Beyoncé\",\"04 de septiembre de 1981\",2023-09-22 02:11:02,\"87,5\""
[3] "\"Blümchen\",\"20 de abril de 1980\",2023-09-22 03:23:05,\"99,0\""    
[4] "\"João\",\"10 de junio de 1931\",2023-09-21 22:43:28,\"98,9\""        
[5] "\"López\",\"24 de julio de 1969\",2023-09-22 01:06:59,\"88,7\""       
[6] "\"Ñengo\",\"15 de diciembre de 1981\",2023-09-21 23:35:37,\"93,1\""   
[7] "\"Plácido\",\"24 de enero de 1941\",2023-09-21 23:17:21,\"88,7\""     
[8] "\"Thalía\",\"26 de agosto de 1971\",2023-09-21 23:08:02,\"83,0\""     

Example

  • As an illustrative example, we will write code to compute the students' ages and check whether they turned in their assignment by the deadline: before midnight on September 21, 2023.

  • We can read in the file with correct encoding like this:

dat <- read_csv(fn, locale = locale(encoding = "ISO-8859-1")) 

Example

  • However, notice that the last column, which is supposed to contain scores between 0 and 100, shows numbers larger than 800:
dat$puntuación 
[1] 875 990 989 887 931 887 830

Example

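  • This happens because the scores in the file use a comma as the decimal mark, as is common in many European countries, while read_csv by default treats the comma as a grouping mark and drops it.

  • To address this issue, we can set the decimal mark in the locale to a comma, in addition to the encoding, which fixes the problem: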

dat <- read_csv(fn, locale = locale(decimal_mark = ",", 
                                    encoding = "ISO-8859-1")) 
dat$puntuación 
[1] 87.5 99.0 98.9 88.7 93.1 88.7 83.0
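
The same behavior can be reproduced on single strings with readr's parse_number(), which makes it easier to see what the locale changes. The input strings below are made up for illustration.

```r
library(readr)

# Default locale: the comma is a grouping mark and is dropped
default_parse <- parse_number("87,5")
default_parse
# [1] 875

# Comma as decimal mark
comma_parse <- parse_number("87,5", locale = locale(decimal_mark = ","))
comma_parse
# [1] 87.5

# Comma as decimal mark and period as grouping mark
grouped_parse <- parse_number("1.234,5",
                              locale = locale(decimal_mark = ",",
                                              grouping_mark = "."))
grouped_parse
# [1] 1234.5
```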

Example

  • Now, to compute the students' ages, let’s try converting the dates of birth to date format:
library(lubridate) 
dmy(dat$f.n.) 
[1] NA NA NA NA NA NA NA
  • Nothing gets converted correctly.

  • This is because the dates are in Spanish.

Example

We can change the locale to use Spanish as the language for dates:

parse_date(dat$f.n., format = "%d de %B de %Y", locale = locale(date_names = "es")) 
[1] "1981-09-04" "1980-04-20" "1931-06-10" "1969-07-24" "1981-12-15"
[6] "1941-01-24" "1971-08-26"

Example

We can also reread the file using the correct locales:

dat <- read_csv(fn, locale = locale(date_names = "es", 
                                    date_format = "%d de %B de %Y", 
                                    decimal_mark = ",", 
                                    encoding = "ISO-8859-1")) 

Example

Computing the students’ ages is now straightforward:

time_length(today() - dat$f.n., unit = "years") |> floor() 
[1] 43 44 93 55 42 83 53
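
The result above depends on the day the code is run, since it uses today(). As a reproducible variant, the sketch below fixes a reference date and uses lubridate's interval(), which accounts for leap years when measuring in years. The two birth dates are copied from the file, and the reference date ref is an assumption made for this example.

```r
library(lubridate)

# Two of the birth dates from the file
dob <- as.Date(c("1981-09-04", "1980-04-20"))

# Fixed reference date, assumed here so the result is reproducible
ref <- as.Date("2023-09-22")

# Age in completed years on the reference date
ages <- floor(time_length(interval(dob, ref), unit = "years"))
ages
# [1] 42 43
```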

Example

  • Let’s check which students turned in their homework after the deadline, that is, on September 22 or later:
dat$estampa >= make_date(2023, 9, 22) 
[1]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE
  • We see that three students were late.

  • However, with times we have to be particularly careful as some functions default to the UTC timezone:

tz(dat$estampa) 
[1] "UTC"

Example

  • If we change the timezone to Eastern Standard Time (EST), we see no one was late:
with_tz(dat$estampa, tz =  "EST") >= make_date(2023, 9, 22) 
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
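
A more explicit way to express the same check is to write the deadline as a date-time with its own time zone and compare two date-times directly. This is only a sketch. It assumes the deadline refers to local time in New York ("America/New_York"), which observes daylight saving time in September and is therefore one hour ahead of the fixed "EST" zone used above. The two timestamps are copied from the file.

```r
library(lubridate)

# Two submission times from the file, recorded in UTC
stamps <- ymd_hms(c("2023-09-22 02:11:02", "2023-09-21 22:43:28"), tz = "UTC")

# Deadline: start of September 22 in the assumed local time zone
deadline <- ymd_hms("2023-09-22 00:00:00", tz = "America/New_York")

# Display the stamps in the same zone as the deadline, then compare;
# with_tz() changes how the times are shown, not the instants themselves
late <- with_tz(stamps, "America/New_York") >= deadline
late
# [1] FALSE FALSE
```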