[1] "\"nombre\",\"f.n.\",\"estampa\",\"puntuaci\xf3n\""
[2] "\"Beyonc\xe9\",\"04 de septiembre de 1981\",2023-09-22 02:11:02,\"87,5\""
[3] "\"Bl\xfcmchen\",\"20 de abril de 1980\",2023-09-22 03:23:05,\"99,0\""
[4] "\"Jo\xe3o\",\"10 de junio de 1931\",2023-09-21 22:43:28,\"98,9\""
[5] "\"L\xf3pez\",\"24 de julio de 1969\",2023-09-22 01:06:59,\"88,7\""
[6] "\"\xd1engo\",\"15 de diciembre de 1981\",2023-09-21 23:35:37,\"93,1\""
[7] "\"Pl\xe1cido\",\"24 de enero de 1941\",2023-09-21 23:17:21,\"88,7\""
[8] "\"Thal\xeda\",\"26 de agosto de 1971\",2023-09-21 23:08:02,\"83,0\""
These unrecognized characters actually cause read.csv to fail.
The locale is a collection of settings that describe your system, including the character encoding, the language, and the time zone. These settings can affect how data is read into R. A mismatch between the encoding of a file and the encoding R assumes often produces garbled text without any warning or error.
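To make an encoding mismatch concrete, here is a small base R sketch. It rebuilds the bytes of the first name shown by readLines above (Beyonc\xe9), checks whether they form valid UTF-8, and re-encodes them with iconv. The variable names are illustrative, and the source encoding is assumed to be ISO-8859-1.

```r
# Bytes of "Beyoncé" as stored in an ISO-8859-1 (Latin-1) file:
# the final byte 0xe9 is "é" in that encoding.
latin1_bytes <- as.raw(c(0x42, 0x65, 0x79, 0x6f, 0x6e, 0x63, 0xe9))
raw_name <- rawToChar(latin1_bytes)

# Interpreted as UTF-8, a trailing 0xe9 byte is not a valid character.
is_utf8_before <- validUTF8(raw_name)

# Declare the assumed source encoding and convert to UTF-8.
fixed_name <- iconv(raw_name, from = "ISO-8859-1", to = "UTF-8")
is_utf8_after <- validUTF8(fixed_name)

# In UTF-8, "é" takes two bytes (0xc3 0xa9), so the string grows by one byte.
utf8_bytes <- charToRaw(fixed_name)

# The locale R is currently running under.
current_locale <- Sys.getlocale()
```

The same seven bytes are meaningful under one encoding and invalid under another, which is why the encoding has to be stated, or guessed, when the file is read.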
You can use the stri_enc_detect function in the stringi package to predict the encoding of a character string:
library(stringi)
x <- readLines(fn, n = 1)
stri_enc_detect(x)
Encoding Language Confidence
1 ISO-8859-1 es 0.75
2 ISO-8859-2 cs 0.18
3 UTF-16BE 0.10
4 UTF-16LE 0.10
5 Shift_JIS ja 0.10
6 GB18030 zh 0.10
7 Big5 zh 0.10
The readr function guess_encoding can also be used to detect the encoding of a file.
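As a hedged illustration of file-level detection, the sketch below writes a few of the names from the example to a temporary ISO-8859-1 file and passes it to readr's guess_encoding. The temporary file and its contents are stand-ins for fn, not the original data, and the example assumes that readr and stringi (which guess_encoding relies on) are installed. Detection is heuristic, so the reported encodings and confidences can vary, especially for short files.

```r
library(readr)

# Illustrative content only: a header plus names containing accented letters.
lines_utf8 <- c("nombre", "Beyonc\u00e9", "Bl\u00fcmchen", "Jo\u00e3o",
                "L\u00f3pez", "\u00d1engo", "Pl\u00e1cido", "Thal\u00eda")
text_utf8 <- paste0(paste(lines_utf8, collapse = "\n"), "\n")

# Convert to ISO-8859-1 bytes and write them unchanged to a temporary file.
tmp <- tempfile(fileext = ".csv")
latin1_raw <- iconv(text_utf8, from = "UTF-8", to = "ISO-8859-1", toRaw = TRUE)[[1]]
writeBin(latin1_raw, tmp)

# Returns a tibble of candidate encodings with a confidence for each.
guess <- guess_encoding(tmp)
guess
```

The top candidate can then be passed to the encoding argument of locale when reading the file.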
The read_csv function lets us specify elements of the locale, including the encoding, through the locale argument. It changes the locale only temporarily, while the read_csv parser runs. The locale of the R session is unchanged after the call.
x <- read_csv(fn, locale = locale(encoding = "ISO-8859-1"))
Rows: 7 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): nombre, f.n.
num (1): puntuación
dttm (1): estampa
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 7 × 4
nombre f.n. estampa puntuación
<chr> <chr> <dttm> <dbl>
1 Beyoncé 04 de septiembre de 1981 2023-09-22 02:11:02 875
2 Blümchen 20 de abril de 1980 2023-09-22 03:23:05 990
3 João 10 de junio de 1931 2023-09-21 22:43:28 989
4 López 24 de julio de 1969 2023-09-22 01:06:59 887
5 Ñengo 15 de diciembre de 1981 2023-09-21 23:35:37 931
6 Plácido 24 de enero de 1941 2023-09-21 23:17:21 887
7 Thalía 26 de agosto de 1971 2023-09-21 23:08:02 830
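Now notice the last column and compare it to what we saw with readLines. The values are numbers that use a comma as the decimal mark, as is common in Europe. This confuses read_csv, which by default treats the comma as a grouping mark and reads 87,5 as 875. We can also change the locale so that the comma is used as the decimal mark.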
x <- read_csv(fn, locale = locale(encoding = "ISO-8859-1", decimal_mark = ","))
Rows: 7 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): nombre, f.n.
dbl (1): puntuación
dttm (1): estampa
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 7 × 4
nombre f.n. estampa puntuación
<chr> <chr> <dttm> <dbl>
1 Beyoncé 04 de septiembre de 1981 2023-09-22 02:11:02 87.5
2 Blümchen 20 de abril de 1980 2023-09-22 03:23:05 99
3 João 10 de junio de 1931 2023-09-21 22:43:28 98.9
4 López 24 de julio de 1969 2023-09-22 01:06:59 88.7
5 Ñengo 15 de diciembre de 1981 2023-09-21 23:35:37 93.1
6 Plácido 24 de enero de 1941 2023-09-21 23:17:21 88.7
7 Thalía 26 de agosto de 1971 2023-09-21 23:08:02 83
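The effect of decimal_mark can be seen in isolation with readr's parse_number, without reading a file. This is a minimal sketch using three of the values from the last column; the vector name scores is illustrative.

```r
library(readr)

# Values written with a comma as the decimal mark.
scores <- c("87,5", "99,0", "98,9")

# Default locale: "," is treated as a grouping mark and dropped.
default_parse <- parse_number(scores)
default_parse
# 875 990 989

# Declaring "," as the decimal mark gives the intended values.
comma_parse <- parse_number(scores, locale = locale(decimal_mark = ","))
comma_parse
# 87.5 99.0 98.9
```

No warning is raised in the default case, so this kind of error is easy to miss unless the parsed values are compared with the raw text.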
Now let’s try to change the dates to date format:
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
Warning: All formats failed to parse. No formats found.
Nothing is converted correctly because the month names are in Spanish. You can set the language used for date names to Spanish through the locale:
parse_date(x$f.n., format = "%d de %B de %Y", locale = locale(date_names = "es"))
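For a version that runs without the file, the following sketch applies the same format string to two of the dates from the table, once with the default locale and once with Spanish date names. The vector names are illustrative. It also uses date_names_lang to show the month names readr matches against.

```r
library(readr)

# Two dates copied from the f.n. column above.
dates_es <- c("04 de septiembre de 1981", "24 de enero de 1941")
fmt <- "%d de %B de %Y"

# Full month names readr uses when date_names = "es".
month_names <- date_names_lang("es")$mon

# Default (English) date names: the Spanish months are not recognized,
# so parsing fails and the result is NA (the warning is suppressed here).
en_parse <- suppressWarnings(parse_date(dates_es, format = fmt))

# Spanish date names: the same format string now parses.
es_parse <- parse_date(dates_es, format = fmt, locale = locale(date_names = "es"))
es_parse
# "1981-09-04" "1941-01-24"
```

Only the date names change here. The format string is the same in both calls, and the literal "de" between the fields is matched as plain text.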