2024-09-30
The purpose of locales is to group together common settings that can affect:
Month and day names, which are necessary for interpreting dates.
The standard date format.
The default time zone.
Character encoding, vital for reading non-ASCII characters.
The symbols for decimals and number groupings, important for interpreting numerical values.
In R, a locale refers to a suite of settings that dictate how the system should behave with respect to cultural conventions.
These affect the way data is formatted and presented, including date formatting, currency symbols, decimal separators, and other related aspects.
Locales in R affect several areas, including how character vectors are sorted.
Additionally, errors, warnings, and other messages might be translated into languages other than English based on the locale.
To view the current locale settings, use the Sys.getlocale() function.
To set a specific locale, use the Sys.setlocale() function.
For example, to set the locale to US English:
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
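A minimal sketch of the two calls described above. The locale name "en_US.UTF-8" is taken from the output shown here; whether it is available depends on the operating system, so the sketch checks the return value rather than assuming the call succeeded.

```r
# Query the current locale; the exact format of this string is system-specific.
current <- Sys.getlocale()
print(current)

# Try to switch every category to US English. Sys.setlocale() returns an
# empty string (and warns) when the requested locale is not available.
result <- suppressWarnings(Sys.setlocale("LC_ALL", "en_US.UTF-8"))
if (!nzchar(result)) {
  message("Locale en_US.UTF-8 is not available on this system.")
}
```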
LC_ALL refers to all locale categories.
R breaks down the locale into categories:
LC_COLLATE: for string collation.
LC_TIME: date and time formatting.
LC_MONETARY: currency formatting.
LC_MESSAGES: system message translations.
LC_NUMERIC: number formatting.
You can set the locale for each category individually if you don’t want to use LC_ALL.
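As a small illustration of setting a single category, the sketch below changes only LC_COLLATE and then restores it. It uses the "C" locale, which should be available on any system; the example vector is made up for illustration.

```r
# Remember the current collation setting so it can be restored afterwards.
old_collate <- Sys.getlocale("LC_COLLATE")

# The "C" locale compares strings byte by byte, so uppercase ASCII letters
# sort before lowercase ones.
Sys.setlocale("LC_COLLATE", "C")
sorted_c <- sort(c("b", "A", "a", "B"))
print(sorted_c)
#> [1] "A" "B" "a" "b"

# Restore the previous setting; only LC_COLLATE was touched.
Sys.setlocale("LC_COLLATE", old_collate)
```

Saving and restoring the old value keeps the change from leaking into the rest of the session.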
Warning
We have shown tools to control locales.
These settings are important because they affect how your data looks and behaves.
However, not all of these settings are available on every computer; their availability depends on what kind of computer you have and how it’s set up.
Changing these settings, especially LC_NUMERIC, can lead to unexpected problems when you’re working with numbers in R.
These locale settings only last as long as one R session.
The locale function
The readr package includes a locale() function that can be used to learn or change the current locale from within R:
<locale>
Numbers: 123,456.78
Formats: %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days: Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday
(Thu), Friday (Fri), Saturday (Sat)
Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May),
June (Jun), July (Jul), August (Aug), September (Sep), October
(Oct), November (Nov), December (Dec)
AM/PM: AM/PM
Here is what the locale function returns when the language for date names is set to Spanish:
<locale>
Numbers: 123,456.78
Formats: %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days: domingo (dom.), lunes (lun.), martes (mar.), miércoles (mié.), jueves
(jue.), viernes (vie.), sábado (sáb.)
Months: enero (ene.), febrero (feb.), marzo (mar.), abril (abr.), mayo (may.),
junio (jun.), julio (jul.), agosto (ago.), septiembre (sept.),
octubre (oct.), noviembre (nov.), diciembre (dic.)
AM/PM: a. m./p. m.
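The two printouts above can be reproduced with a sketch along these lines, assuming the readr package is installed. The element names used to inspect the objects (date_names, mon, day_ab, decimal_mark) reflect how readr stores a locale.

```r
library(readr)

# Default locale: English date names and "." as the decimal mark.
default_loc <- locale()
print(default_loc)

# Same settings, but with Spanish month and day names.
es_loc <- locale(date_names = "es")
print(es_loc)

# Individual pieces can be inspected directly.
es_loc$date_names$mon      # full month names
es_loc$date_names$day_ab   # abbreviated day names
```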
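When a file is not encoded in UTF-8, the default, we can use guess_encoding to determine the correct encoding and then use the locale function to read the file in this encoding instead. The lines of the example file look like this:
[1] "\"nombre\",\"f.n.\",\"estampa\",\"puntuación\""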
[2] "\"Beyoncé\",\"04 de septiembre de 1981\",2023-09-22 02:11:02,\"87,5\""
[3] "\"Blümchen\",\"20 de abril de 1980\",2023-09-22 03:23:05,\"99,0\""
[4] "\"João\",\"10 de junio de 1931\",2023-09-21 22:43:28,\"98,9\""
[5] "\"López\",\"24 de julio de 1969\",2023-09-22 01:06:59,\"88,7\""
[6] "\"Ñengo\",\"15 de diciembre de 1981\",2023-09-21 23:35:37,\"93,1\""
[7] "\"Plácido\",\"24 de enero de 1941\",2023-09-21 23:17:21,\"88,7\""
[8] "\"Thalía\",\"26 de agosto de 1971\",2023-09-21 23:08:02,\"83,0\""
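The encoding step can be sketched as follows. Because the location of the original file is not given here, the sketch writes two of the lines shown above to a temporary file. The ISO-8859-1 encoding is an assumption made for illustration, as is the file name.

```r
library(readr)

# Two lines copied from the listing above; \u escapes keep this script ASCII.
lines_utf8 <- c(
  '"nombre","f.n.","estampa","puntuaci\u00f3n"',
  '"Beyonc\u00e9","04 de septiembre de 1981",2023-09-22 02:11:02,"87,5"'
)

# Write them to a temporary file as ISO-8859-1 bytes (assumed encoding).
fn <- tempfile(fileext = ".csv")
writeLines(iconv(lines_utf8, from = "UTF-8", to = "ISO-8859-1"), fn,
           useBytes = TRUE)

# guess_encoding() returns candidate encodings with confidence scores.
# It is built on stringi::stri_enc_detect(), so skip it if stringi is missing.
if (requireNamespace("stringi", quietly = TRUE)) {
  print(guess_encoding(fn))
}

# Declare the encoding through locale() so the accents are read correctly.
raw_lines <- read_lines(fn, locale = locale(encoding = "ISO-8859-1"))
print(raw_lines)
```

With so little text the guess can be unreliable, so it is worth checking the result by eye.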
As an illustrative example, we will write code to compute the students' ages and check whether they turned in their assignment before the deadline of midnight at the end of September 21, 2023.
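We can read in the file with the correct encoding by passing it to read_csv through the locale function.
However, the scores in the last column are then misread: a value such as "87,5" comes out as 875 rather than 87.5.
This happens because the scores in the file use a comma as the decimal mark, as is common in Europe, which confuses read_csv.
To address this issue, we can also set the decimal mark in the locale to a comma, which fixes the problem.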
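The effect of the decimal mark can be seen on a single value. This sketch uses readr's parse_number on one of the scores from the file rather than rereading the whole file.

```r
library(readr)

# With the default locale, "," is the grouping mark, so it is simply dropped.
default_score <- parse_number("87,5")
print(default_score)
#> [1] 875

# Declaring "," as the decimal mark gives the intended value.
es_score <- parse_number("87,5", locale = locale(decimal_mark = ","))
print(es_score)
#> [1] 87.5
```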
If we try to convert the dates of birth using the default locale, nothing gets converted correctly.
This is because the dates are in Spanish.
We can change the locale to use Spanish as the language for dates.
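A minimal sketch of parsing Spanish dates with readr's parse_date, using two dates of birth from the file. The format string is an assumption based on how the dates are written ("04 de septiembre de 1981").

```r
library(readr)

fechas <- c("04 de septiembre de 1981", "24 de julio de 1969")
fmt <- "%d de %B de %Y"  # day, literal " de ", full month name, " de ", year

# With the default (English) month names the parse fails and returns NA.
default_try <- suppressWarnings(parse_date(fechas, format = fmt))
print(default_try)

# With Spanish date names the same format string works.
es_dates <- parse_date(fechas, format = fmt,
                       locale = locale(date_names = "es"))
print(es_dates)
```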
We can also reread the file using the correct locales:
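Putting the pieces together, one way to reread such a file is sketched below. As before, the temporary file, the ISO-8859-1 encoding, and the date format are assumptions for illustration. The column types are given explicitly so the example does not depend on type guessing.

```r
library(readr)

# Recreate a small ISO-8859-1 file from lines shown earlier.
lines_utf8 <- c(
  '"nombre","f.n.","estampa","puntuaci\u00f3n"',
  '"Beyonc\u00e9","04 de septiembre de 1981",2023-09-22 02:11:02,"87,5"',
  '"L\u00f3pez","24 de julio de 1969",2023-09-22 01:06:59,"88,7"'
)
fn <- tempfile(fileext = ".csv")
writeLines(iconv(lines_utf8, from = "UTF-8", to = "ISO-8859-1"), fn,
           useBytes = TRUE)

# One locale object carries all four settings.
es_locale <- locale(
  date_names   = "es",
  date_format  = "%d de %B de %Y",
  decimal_mark = ",",
  encoding     = "ISO-8859-1"
)

# Column types: character, date, date-time, double.
dat <- read_csv(fn, locale = es_locale, col_types = "cDTd")
print(dat)
```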
Computing the students’ ages is now straightforward:
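One way to compute ages once the dates of birth are proper Date objects is sketched below in base R. The helper age_on is hypothetical, and the reference date is fixed at the deadline so that the result is reproducible.

```r
# Hypothetical helper: completed years between a birth date and a reference date.
age_on <- function(dob, ref) {
  d <- as.POSIXlt(dob)
  r <- as.POSIXlt(ref)
  years <- r$year - d$year
  # Subtract one year when the birthday has not yet occurred in the reference year.
  not_yet <- (r$mon < d$mon) | (r$mon == d$mon & r$mday < d$mday)
  years - as.integer(not_yet)
}

# Three dates of birth taken from the file.
dob <- as.Date(c("1981-09-04", "1969-07-24", "1980-04-20"))
ages <- age_on(dob, as.Date("2023-09-21"))
print(ages)
#> [1] 42 54 43
```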
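Checking the submission times against the deadline, we see that two students were late.
However, with times we have to be particularly careful as some functions default to the UTC timezone.
Once the time zone is taken into account, no student was late: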
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
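The time zone effect can be illustrated in base R with two timestamps from the file. The zone "America/New_York" is an assumption made for this sketch; the text does not say which zone the students were in.

```r
# Two submission times from the file, read as UTC.
stamps <- as.POSIXct(c("2023-09-22 02:11:02", "2023-09-21 22:43:28"),
                     tz = "UTC")

# Deadline interpreted as midnight UTC: the first submission looks late.
deadline_utc <- as.POSIXct("2023-09-22 00:00:00", tz = "UTC")
late_utc <- stamps >= deadline_utc
print(late_utc)
#> [1]  TRUE FALSE

# Deadline interpreted as midnight in the assumed local zone (04:00 UTC).
deadline_local <- as.POSIXct("2023-09-22 00:00:00", tz = "America/New_York")
late_local <- stamps >= deadline_local
print(late_local)
#> [1] FALSE FALSE

# The same instants displayed in the assumed local zone.
format(stamps, tz = "America/New_York", usetz = TRUE)
```

The stored instants never change. What changes is the zone in which the deadline and the display are interpreted.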