strings are identical (using `base::identical`) and yet behave differently with `grepl` / `gsub`


To overcome that problem, you need to make sure your pattern is Unicode-aware, so that \w can match all Unicode letters and digits and \b can match at Unicode word boundaries. That is possible by using the PCRE verb (*UCP) at the start of the pattern:

gsub("(*UCP)\\b([A-Z])(\\w+)\\b", "\\1\\L\\2", x, perl = TRUE)

To make it fully Unicode-aware, use \p{Lu} (any uppercase letter) instead of [A-Z]:

gsub("(*UCP)\\b(\\p{Lu})(\\w+)\\b", "\\1\\L\\2", x, perl = TRUE)

Also, if you do not want to match digits and _, you may replace \w with \p{L} (any letter):

gsub("(*UCP)\\b(\\p{Lu})(\\p{L}+)\\b", "\\1\\L\\2", x, perl = TRUE)
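
As a rough illustration of how these patterns differ, here is a small sketch on a made-up input (the string "ÉCOLE" and the variable names are assumptions, not taken from the question). The input is explicitly marked as UTF-8. The result of the first call, without (*UCP), is deliberately left unannotated since it is the behaviour under discussion.

```r
# Hypothetical input, built with a \u escape and marked as UTF-8.
x <- enc2utf8("\u00c9COLE")   # "ÉCOLE"

# Without (*UCP): \w and \b only know ASCII word characters,
# so the accented capital is not treated as part of the word.
no_ucp <- gsub("\\b([A-Z])(\\w+)\\b", "\\1\\L\\2", x, perl = TRUE)

# With (*UCP) but [A-Z]: the accented capital is now a word character,
# but [A-Z] cannot match it and there is no word boundary inside the word,
# so nothing is replaced.
ucp_az <- gsub("(*UCP)\\b([A-Z])(\\w+)\\b", "\\1\\L\\2", x, perl = TRUE)

# With (*UCP) and \p{Lu} / \p{L}: the accented capital is kept as the
# leading letter and the rest of the word is lowercased.
ucp_lu <- gsub("(*UCP)\\b(\\p{Lu})(\\p{L}+)\\b", "\\1\\L\\2", x, perl = TRUE)

print(c(no_ucp, ucp_az, ucp_lu))
```

In this sketch, ucp_az should come back unchanged and ucp_lu should be "École", which is why \p{Lu} matters as soon as a word can start with a non-ASCII capital.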


If you check out the source of the identical() function, you can see that when it's passed a character vector (a STRSXP, whose elements are CHARSXP values), it compares the elements with the internal helper function Seql(). That function converts the string values to UTF-8 prior to doing the comparison. Thus identical() isn't checking that the encoding is necessarily the same, just that the value embedded in the encoding is the same.
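
A minimal sketch of that behaviour, using a made-up string (the value and the variable names are assumptions): the two strings below hold the same text in different encodings, so identical() reports TRUE even though the underlying bytes differ.

```r
# Same text in two encodings (hypothetical example value).
x <- enc2utf8("B\u00e9b\u00e9")                 # "Bébé", marked as UTF-8
y <- iconv(x, from = "UTF-8", to = "latin1")    # same text, marked as latin1

Encoding(x)                              # "UTF-8"
Encoding(y)                              # "latin1"
identical(x, y)                          # TRUE: compared after translation
identical(charToRaw(x), charToRaw(y))    # FALSE: the stored bytes differ
```

Comparing charToRaw() output, or checking Encoding() alongside identical(), is one way to detect the mismatch that identical() alone hides.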

In a perfect world, the identical() function would have an ignore.encoding= option in addition to all the other properties you can ignore when doing a comparison.

But in theory the strings should really behave in the same way. So I guess you could blame the "perl" version of the regex engine here for not properly dealing with encoding. The base regex engine doesn't seem to have this problem:

grepl("B\\w+", x)
# [1] TRUE
grepl("B\\w+", y)
# [1] TRUE


@MrFlick explained very well the reasons behind the issue and @Wiktor-Stribiżew gave a great solution for using the perl regex engine with mixed encodings, which preserves the original encoding.

Looking at the workflow, I believe that in practice it is good to know which encoding you are working with at all times and, whenever it's acceptable, to harmonize everything at the import/fetching step or right after.

In the above case there's no reason not to harmonize the encoding right after the external data is retrieved to avoid such bad surprises.

This can be done by running as a second step:

x <- iconv(x, from="UTF-8", to="latin1")
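
If UTF-8 is an acceptable target instead of latin1, the same harmonization step can be sketched with enc2utf8() and checked with Encoding(). This swaps in a different function from the iconv() call above, and the vector below is a made-up stand-in for imported data, not the data from the question.

```r
# Hypothetical mixed-encoding input: same text, one UTF-8 and one latin1 element.
a <- enc2utf8("B\u00e9b\u00e9")
b <- iconv(a, from = "UTF-8", to = "latin1")
v <- c(a, b)
Encoding(v)        # "UTF-8"  "latin1"

# Harmonize: convert every element to UTF-8 and mark it as such.
v_utf8 <- enc2utf8(v)
Encoding(v_utf8)   # "UTF-8" "UTF-8"

# After harmonizing, the elements agree byte for byte.
identical(charToRaw(v_utf8[1]), charToRaw(v_utf8[2]))   # TRUE
```

Running a check like Encoding() right after import makes it obvious when a source delivers strings in an unexpected encoding.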