[1] "\"nombre\",\"f.n.\",\"estampa\",\"puntuaci\xf3n\""
[2] "\"Beyonc\xe9\",\"04 de septiembre de 1981\",2023-09-22 02:11:02,\"87,5\""
[3] "\"Bl\xfcmchen\",\"20 de abril de 1980\",2023-09-22 03:23:05,\"99,0\""
[4] "\"Jo\xe3o\",\"10 de junio de 1931\",2023-09-21 22:43:28,\"98,9\""
[5] "\"L\xf3pez\",\"24 de julio de 1969\",2023-09-22 01:06:59,\"88,7\""
[6] "\"\xd1engo\",\"15 de diciembre de 1981\",2023-09-21 23:35:37,\"93,1\""
[7] "\"Pl\xe1cido\",\"24 de enero de 1941\",2023-09-21 23:17:21,\"88,7\""
[8] "\"Thal\xeda\",\"26 de agosto de 1971\",2023-09-21 23:08:02,\"83,0\""
These unrecognized characters cause read.csv to fail when the file is read with the default encoding.
The locale is a collection of settings that describe your system, including the encoding, the language, and the time zone. These settings can affect how data is read into R. A mismatch of encodings often produces strange results without any warning or error.
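To make the idea of an encoding mismatch concrete, here is a minimal sketch in base R. It assumes nothing about the file used in this section: the bytes for "Beyoncé" are typed in by hand as they would appear in an ISO-8859-1 file, and the variable names are only illustrative.

```r
# Bytes for "Beyoncé" as stored in an ISO-8859-1 (Latin-1) file:
# the final byte 0xe9 is "é" in Latin-1 but is not valid UTF-8 on its own.
raw_bytes <- as.raw(c(0x42, 0x65, 0x79, 0x6f, 0x6e, 0x63, 0xe9))
latin1_string <- rawToChar(raw_bytes)

# The same bytes are not a valid UTF-8 string
validUTF8(latin1_string)

# Declaring the source encoding lets R re-encode the text as UTF-8
utf8_string <- iconv(latin1_string, from = "ISO-8859-1", to = "UTF-8")
utf8_string

# The current session's locale settings can be inspected with
Sys.getlocale()
```

The same byte sequence is read differently depending on which encoding is assumed, which is why the accented characters show up as escapes such as \xe9 when the assumption is wrong.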
You can use the stri_enc_detect function in the stringi package to predict the encoding of a character string:
library(stringi)
x <- readLines(fn, n = 1)
stri_enc_detect(x)
[[1]]
Encoding Language Confidence
1 ISO-8859-1 es 0.75
2 ISO-8859-2 cs 0.18
3 UTF-16BE 0.10
4 UTF-16LE 0.10
5 Shift_JIS ja 0.10
6 GB18030 zh 0.10
7 Big5 zh 0.10
The readr package offers a similar function, guess_encoding, which detects the encoding of a file directly from its path.
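As a hedged illustration, the sketch below calls readr's guess_encoding on a small temporary file. The file and its contents are made up for this example (they are not the file fn used in this section), and the bytes are written by hand so that the file really is ISO-8859-1 encoded.

```r
library(readr)

# Build a small ISO-8859-1 file by hand: 0xf3 is "ó" and 0xed is "í" in Latin-1
tmp <- tempfile(fileext = ".csv")
latin1_bytes <- c(
  charToRaw("nombre,ciudad\nL"), as.raw(0xf3), charToRaw("pez,Madrid\n"),
  charToRaw("Thal"), as.raw(0xed), charToRaw("a,Ciudad de M"), as.raw(0xe9),
  charToRaw("xico\n")
)
writeBin(latin1_bytes, tmp)

# guess_encoding returns a table of candidate encodings with a confidence score
guesses <- guess_encoding(tmp)
guesses
```

With so little text the guesses can be unreliable, so treat the result as a hint and confirm it by reading the file and checking the accented characters.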
The read_csv function lets us define elements of the encoding through the locale argument. It switches the locale only temporarily, while the read_csv parser is running. The locale of the R session remains the same after the call.
x <- read_csv(fn, locale = locale(encoding = "ISO-8859-1"))
Rows: 7 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): nombre, f.n.
num (1): puntuación
dttm (1): estampa
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
x
# A tibble: 7 × 4
nombre f.n. estampa puntuación
<chr> <chr> <dttm> <dbl>
1 Beyoncé 04 de septiembre de 1981 2023-09-22 02:11:02 875
2 Blümchen 20 de abril de 1980 2023-09-22 03:23:05 990
3 João 10 de junio de 1931 2023-09-21 22:43:28 989
4 López 24 de julio de 1969 2023-09-22 01:06:59 887
5 Ñengo 15 de diciembre de 1981 2023-09-21 23:35:37 931
6 Plácido 24 de enero de 1941 2023-09-21 23:17:21 887
7 Thalía 26 de agosto de 1971 2023-09-21 23:08:02 830
Now notice the last column and compare it to what we saw with readLines. The values use a comma as the decimal mark, as is common in Europe. This confuses read_csv, which drops the comma and reads 87,5 as 875. We can also set the decimal mark through the locale argument:
x <- read_csv(fn, locale = locale(encoding = "ISO-8859-1", decimal_mark = ","))
Rows: 7 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): nombre, f.n.
dbl (1): puntuación
dttm (1): estampa
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
x
# A tibble: 7 × 4
nombre f.n. estampa puntuación
<chr> <chr> <dttm> <dbl>
1 Beyoncé 04 de septiembre de 1981 2023-09-22 02:11:02 87.5
2 Blümchen 20 de abril de 1980 2023-09-22 03:23:05 99
3 João 10 de junio de 1931 2023-09-21 22:43:28 98.9
4 López 24 de julio de 1969 2023-09-22 01:06:59 88.7
5 Ñengo 15 de diciembre de 1981 2023-09-21 23:35:37 93.1
6 Plácido 24 de enero de 1941 2023-09-21 23:17:21 88.7
7 Thalía 26 de agosto de 1971 2023-09-21 23:08:02 83
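The effect of decimal_mark can be seen in isolation with readr's parse_number, without reading a file. This is a small sketch using two of the values shown above; the variable names are only illustrative.

```r
library(readr)

puntuaciones <- c("87,5", "99,0")

# Default locale: "," is treated as a grouping mark and dropped
default_parse <- parse_number(puntuaciones)
default_parse

# Declaring "," as the decimal mark gives the intended values
comma_parse <- parse_number(puntuaciones, locale = locale(decimal_mark = ","))
comma_parse
```

The first call reproduces the 875 and 990 seen in the earlier tibble, and the second reproduces 87.5 and 99.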
Now let’s try to change the dates to date format:
library(lubridate)
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
dmy(x$f.n.)
Warning: All formats failed to parse. No formats found.
[1] NA NA NA NA NA NA NA
None of the dates are converted because they are written in Spanish. The readr function parse_date accepts a locale, so you can tell it to use Spanish date names:
parse_date(x$f.n., format = "%d de %B de %Y", locale = locale(date_names = "es"))
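The same call can be tried on a standalone character vector. The sketch below copies two of the dates from the table above into a hypothetical vector named fechas, so it runs without the file.

```r
library(readr)

fechas <- c("04 de septiembre de 1981", "20 de abril de 1980")

# %B matches the full month name in the language given by date_names
fechas_parsed <- parse_date(
  fechas,
  format = "%d de %B de %Y",
  locale = locale(date_names = "es")
)
fechas_parsed
```

The literal text "de" in the format string must match the input exactly, so dates written in a different layout would need a different format.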