Locales

2024-09-30

Locales

Computer settings change depending on language and location, and being unaware of this possibility can make certain data wrangling challenges difficult to overcome.

Locales

The purpose of locales is to group together common settings that can affect:
1. Month and day names, which are necessary for interpreting dates.
2. The standard date format.
3. The default time zone.
4. Character encoding, vital for reading non-ASCII characters.
5. The symbols for decimals and number groupings, important for interpreting numerical values.

Locales

In R, a locale refers to a suite of settings that dictate how the system should behave with respect to cultural conventions.
These affect the way data is formatted and presented, including date formatting, currency symbols, decimal separators, and other related aspects.
Locales in R affect several areas, including how character vectors are sorted.
Additionally, errors, warnings, and other messages might be translated into languages other than English based on the locale.
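To make the sorting point concrete, here is a minimal sketch in base R. The vector `x` is made up for illustration. With the default method, `sort()` follows the collation rules of the current locale, so its result can differ between machines; `method = "radix"` always sorts as in the C locale.

```r
x <- c("b", "A", "a", "B")   # made-up example vector

# Default method: uses the collation rules of the current locale,
# so the order can differ from one system to another.
by_locale <- sort(x)

# Radix method: always sorts as in the C locale (uppercase before lowercase).
by_bytes <- sort(x, method = "radix")
by_bytes
```

If the two results differ on your machine, the difference comes from the LC_COLLATE setting.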

Locales in R

To access the current locale settings in R, you can use the Sys.getlocale() function:

Sys.getlocale()

[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

Locales in R

To set a specific locale, use the Sys.setlocale() function.
For example, to set the locale to US English:

Sys.setlocale("LC_ALL", "en_US.UTF-8")

[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

The exact string to use for setting the locale (like “en_US.UTF-8”) can depend on your operating system and its configuration.
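Because locale names vary across systems, it can help to check whether a request succeeded. The sketch below assumes the made-up name "xx_XX.not-a-locale" does not exist on your system; when a locale cannot be set, Sys.setlocale() issues a warning and returns an empty string.

```r
old <- Sys.getlocale("LC_TIME")   # remember the current setting

# Ask for a locale name that should not exist (made up for this example).
res <- suppressWarnings(Sys.setlocale("LC_TIME", "xx_XX.not-a-locale"))

if (identical(res, "")) {
  message("Locale not available; the setting was left unchanged.")
}

Sys.setlocale("LC_TIME", old)     # harmless here, but a good habit
```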

Locales in R

LC_ALL refers to all locale categories.
R breaks down the locale into categories:
- LC_COLLATE: for string collation.
- LC_TIME: date and time formatting.
- LC_MONETARY: currency formatting.
- LC_MESSAGES: system message translations.
- LC_NUMERIC: number formatting.
You can set the locale for each category individually if you don’t want to use LC_ALL.
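As a small sketch of setting a single category, the code below changes only LC_TIME and then restores it. It uses the "C" locale because that one should be available on every system; the date is picked to match the homework deadline used later.

```r
old_time <- Sys.getlocale("LC_TIME")   # remember the current setting

Sys.setlocale("LC_TIME", "C")          # change date and time formatting only
in_c <- format(as.Date("2023-09-21"), "%A %d %B %Y")
in_c

Sys.setlocale("LC_TIME", old_time)     # put the original setting back
```

In the "C" locale the day and month names are in English, so `in_c` is "Thursday 21 September 2023" regardless of the language of your system.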

Locales in R

Warning

We have shown tools to control locales.
These settings are important because they affect how your data looks and behaves.
However, not all of these settings are available on every computer; their availability depends on what kind of computer you have and how it’s set up.
Changing these settings, especially LC_NUMERIC, can lead to unexpected problems when you’re working with numbers in R.
These locale settings only last as long as one R session.

The `locale` function

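The readr package includes a locale() function that creates a locale object describing the settings readr uses when parsing data. It does not change the system locale; you pass the object to readr functions through their locale argument. Calling it with no arguments shows the defaults: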

library(readr) 
locale()

<locale>
Numbers:  123,456.78
Formats:  %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday
        (Thu), Friday (Fri), Saturday (Sat)
Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May),
        June (Jun), July (Jul), August (Aug), September (Sep), October
        (Oct), November (Nov), December (Dec)
AM/PM:  AM/PM

The `locale` function

On macOS and Linux, you can see all the locales available on your system by typing:

system("locale -a")

The `locale` function

Here is what you obtain if you change the date names to Spanish:

locale(date_names = "es")

<locale>
Numbers:  123,456.78
Formats:  %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days:   domingo (dom.), lunes (lun.), martes (mar.), miércoles (mié.), jueves
        (jue.), viernes (vie.), sábado (sáb.)
Months: enero (ene.), febrero (feb.), marzo (mar.), abril (abr.), mayo (may.),
        junio (jun.), julio (jul.), agosto (ago.), septiembre (sept.),
        octubre (oct.), noviembre (nov.), diciembre (dic.)
AM/PM:  a. m./p. m.
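To see which languages can be passed as date_names, readr provides date_names_langs(), and date_names_lang() returns the names for one language. This is a short sketch; the variable names are chosen for illustration.

```r
library(readr)

# Language codes for which readr has date names
langs <- date_names_langs()
"es" %in% langs

# Full month names for Spanish, as used by locale(date_names = "es")
es_names <- date_names_lang("es")
es_names$mon
```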

Example

Earlier we noted that the file:

fn <- file.path(system.file("extdata", package = "dslabs"), "calificaciones.csv")

has an encoding different from UTF-8, the default.

Example

We used guess_encoding to determine the correct one:

guess_encoding(fn)$encoding[1]

[1] "ISO-8859-1"

and used the locale function to change this and read in this encoding instead:

dat <- read_csv(fn, locale = locale(encoding = "ISO-8859-1"))
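To show what the encoding argument does without relying on the dslabs file, the sketch below builds a tiny file by hand. The temporary file and its single line are made up for illustration: the word "puntuación" is written with "ó" stored as the single Latin-1 byte 0xF3.

```r
library(readr)

# Write "puntuación" with "ó" as the Latin-1 byte 0xF3
# (in UTF-8 the same character would take two bytes).
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("puntuaci"), as.raw(0xF3), charToRaw("n\n")), tmp)

# Declaring the encoding lets readr convert the bytes to UTF-8 correctly.
x <- read_lines(tmp, locale = locale(encoding = "ISO-8859-1"))
x

unlink(tmp)   # remove the temporary file
```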

Example

This file provides homework assignment scores for seven students. Columns represent the name, date of birth, the time they submitted their assignment, and their score:

read_lines(fn, locale = locale(encoding = "ISO-8859-1"))

[1] "\"nombre\",\"f.n.\",\"estampa\",\"puntuación\""                       
[2] "\"Beyoncé\",\"04 de septiembre de 1981\",2023-09-22 02:11:02,\"87,5\""
[3] "\"Blümchen\",\"20 de abril de 1980\",2023-09-22 03:23:05,\"99,0\""    
[4] "\"João\",\"10 de junio de 1931\",2023-09-21 22:43:28,\"98,9\""        
[5] "\"López\",\"24 de julio de 1969\",2023-09-22 01:06:59,\"88,7\""       
[6] "\"Ñengo\",\"15 de diciembre de 1981\",2023-09-21 23:35:37,\"93,1\""   
[7] "\"Plácido\",\"24 de enero de 1941\",2023-09-21 23:17:21,\"88,7\""     
[8] "\"Thalía\",\"26 de agosto de 1971\",2023-09-21 23:08:02,\"83,0\""

Example

As an illustrative example, we will write code to compute each student's age and check whether they turned in their assignment by the deadline of September 21, 2023, before midnight.
We can read in the file with correct encoding like this:

dat <- read_csv(fn, locale = locale(encoding = "ISO-8859-1"))

Example

However, notice that the last column, which is supposed to contain scores between 0 and 100, shows numbers larger than 800:

dat$puntuación

[1] 875 990 989 887 931 887 830

Example

This happens because the scores in the file use a comma as the decimal mark, as is common in Europe. By default, read_csv treats the comma as a grouping mark and drops it.
To address this issue, we can also set the decimal mark in the locale, which fixes the problem:

dat <- read_csv(fn, locale = locale(decimal_mark = ",", 
                                    encoding = "ISO-8859-1")) 
dat$puntuación

[1] 87.5 99.0 98.9 88.7 93.1 88.7 83.0
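The same behavior can be seen on a single value with parse_number(), which makes the role of decimal_mark easier to isolate. This is a minimal sketch; the variable names are chosen for illustration.

```r
library(readr)

score <- "87,5"   # one score as it appears in the file

# Default locale: "," is the grouping mark, so it is simply dropped.
default_parse <- parse_number(score)
default_parse

# Locale with "," as the decimal mark.
comma_parse <- parse_number(score, locale = locale(decimal_mark = ","))
comma_parse
```

The first call returns 875 and the second returns 87.5, matching what we saw for the whole column.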

Example

Now, to compute the students' ages, let's try converting the dates of birth to date format:

library(lubridate) 
dmy(dat$f.n.)

[1] NA NA NA NA NA NA NA

Nothing gets converted correctly.
This is because the dates are in Spanish.

Example

We can change the locale to use Spanish as the language for dates:

parse_date(dat$f.n., format = "%d de %B de %Y", locale = locale(date_names = "es"))

[1] "1981-09-04" "1980-04-20" "1931-06-10" "1969-07-24" "1981-12-15"
[6] "1941-01-24" "1971-08-26"
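To confirm that the language of the month names, and not the format string, is what matters, the sketch below parses one date of birth twice with the same format. The variable names are chosen for illustration.

```r
library(readr)

dob <- "04 de septiembre de 1981"   # first date of birth in the file
fmt <- "%d de %B de %Y"

# Default locale: month names are English, so "septiembre" is not recognized.
en <- suppressWarnings(parse_date(dob, format = fmt))
en

# Spanish date names make the same format string work.
es <- parse_date(dob, format = fmt, locale = locale(date_names = "es"))
es
```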

Example

We can also reread the file with all the correct locale settings, so that dates and scores are parsed as the file is read:

dat <- read_csv(fn, locale = locale(date_names = "es", 
                                    date_format = "%d de %B de %Y", 
                                    decimal_mark = ",", 
                                    encoding = "ISO-8859-1"))

Example

Computing the students’ ages is now straightforward:

time_length(today() - dat$f.n., unit = "years") |> floor()

[1] 43 44 93 55 42 83 53
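Because today() changes every day, the ages shown depend on when the code is run. A reproducible variant, sketched below, fixes the reference date and uses interval(), which measures whole calendar years. The reference date is an assumption, chosen to match the date at the top of this document.

```r
library(lubridate)

dob <- ymd("1981-09-04")   # first date of birth in the file
ref <- ymd("2024-09-30")   # fixed reference date (an assumption)

# Calendar-aware age in whole years
age <- floor(time_length(interval(dob, ref), unit = "years"))
age
```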

Example

Let’s check which students turned in their homework after the deadline, that is, on September 22 or later:

dat$estampa >= make_date(2023, 9, 22)

[1]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE

We see that three students were late.
However, with times we have to be particularly careful as some functions default to the UTC timezone:

tz(dat$estampa)

[1] "UTC"

Example

If we change the time zone to Eastern Standard Time (EST), we see no one was late:

with_tz(dat$estampa, tz =  "EST") >= make_date(2023, 9, 22)

[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
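An alternative that keeps the time zone explicit on both sides is to compare two date-times rather than a date-time and a date. The sketch below uses the first submission time and assumes the students are on US Eastern time, written as "America/New_York" so that daylight saving time is taken into account; that time zone name is an assumption, not something stated in the file.

```r
library(lubridate)

stamp <- ymd_hms("2023-09-22 02:11:02", tz = "UTC")   # first submission time

# Same instant, shown on a US Eastern clock
local <- with_tz(stamp, "America/New_York")
local_text <- format(local, "%Y-%m-%d %H:%M:%S")
local_text

# Deadline: the start of September 22 in that same time zone
deadline <- ymd_hms("2023-09-22 00:00:00", tz = "America/New_York")
late <- local >= deadline
late
```

Note that with_tz() changes only how the instant is displayed; the underlying point in time stays the same.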
