strings are identical (using `base::identical`) and yet behave differently with `grepl` / `gsub`

r regex encoding

To overcome that problem, you need to make your pattern Unicode-aware, so that \w matches all Unicode letters and digits and \b matches at Unicode word boundaries. You can do that with the PCRE verb (*UCP) at the start of the pattern:

gsub("(*UCP)\\b([A-Z])(\\w+)\\b", "\\1\\L\\2", x, perl = TRUE)

To make the pattern fully Unicode-aware, use \p{Lu} (any uppercase letter) instead of [A-Z]:

gsub("(*UCP)\\b(\\p{Lu})(\\w+)\\b", "\\1\\L\\2", x, perl = TRUE)

Also, if you do not want to match digits and _, you may replace \w with \p{L} (any letter):

gsub("(*UCP)\\b(\\p{Lu})(\\p{L}+)\\b", "\\1\\L\\2", x, perl = TRUE)
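As a rough illustration of the difference between \w and \p{L} under (*UCP), the sketch below extracts matches from a made-up string. The sample value of x and the helper name extract_all are assumptions for illustration, not part of the original question. The \u escape makes R mark the string as UTF-8.

```r
# Hypothetical sample input (not from the original question).
x <- "\u00c9MILE ZOLA_2"

# Illustrative helper: return all matches of a PCRE pattern in one string.
extract_all <- function(pattern, s) {
  regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
}

# \w under (*UCP): Unicode letters, digits and "_"
words <- extract_all("(*UCP)\\w+", x)

# \p{L}: Unicode letters only, so "_2" is left out
letters_only <- extract_all("(*UCP)\\p{L}+", x)

print(words)         # "ÉMILE" "ZOLA_2"
print(letters_only)  # "ÉMILE" "ZOLA"
```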


If you check out the source of the identical() function, you can see that when it's passed a CHARSXP value (a character vector), it calls the internal helper function Seql(). That function converts string values to UTF-8 prior to doing the comparison. Thus identical() isn't checking that the encoding is necessarily the same, just that the value embedded in the encoding is the same.

In a perfect world, the identical() function should have an ignore.encoding= option in addition to all the other properties you can ignore when doing a comparison.
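The point about identical() can be checked with a small sketch. The values of x and y here are assumptions that stand in for the strings from the question: the same text, once marked as UTF-8 and once converted to latin1.

```r
# Hypothetical stand-ins for the two strings in the question.
x <- "B\u00e9n\u00e9dicte"                     # marked as UTF-8 by the \u escape
y <- iconv(x, from = "UTF-8", to = "latin1")   # same text, latin1 bytes

Encoding(x)   # "UTF-8"
Encoding(y)   # "latin1"

# identical() compares the text, not the underlying bytes
same_text  <- identical(x, y)                         # TRUE
same_bytes <- identical(charToRaw(x), charToRaw(y))   # FALSE
```

Comparing Encoding() or charToRaw() alongside identical() is one way to spot this kind of mismatch before it reaches a regex call.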

But in theory the strings should really behave in the same way. So I guess you could blame the "perl" version of the regex engine here for not properly dealing with encoding. The base regex engine doesn't seem to have this problem:

grepl("B\\w+", x)
# [1] TRUE
grepl("B\\w+", y)
# [1] TRUE


@MrFlick explained the reasons behind the issue very well, and @Wiktor-Stribiżew gave a great solution for using the perl regex engine with mixed encodings, which preserves the original encoding.

Looking at the workflow, I believe that in practice it is good to know at all times which encoding you are working with and, whenever acceptable, to harmonize everything at the import/fetching step or right after.

In the above case there's no reason not to harmonize the encoding right after the external data is retrieved to avoid such bad surprises.

This can be done as a second step by running:

x <- iconv(x, from="UTF-8", to="latin1")
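The same idea also works in the other direction. The sketch below harmonizes toward UTF-8 with enc2utf8() instead of converting to latin1; that direction, and the sample values of x and y, are assumptions for illustration rather than what the question used.

```r
# Hypothetical stand-ins: same text in two different encodings.
x <- "B\u00e9n\u00e9dicte"                     # UTF-8
y <- iconv(x, from = "UTF-8", to = "latin1")   # latin1

# Harmonize right after retrieval: convert every element to UTF-8.
harmonized <- enc2utf8(c(x, y))
Encoding(harmonized)   # "UTF-8" "UTF-8"

# Both elements now hold the same bytes...
same_bytes <- identical(charToRaw(harmonized[1]), charToRaw(harmonized[2]))

# ...and a Unicode-aware PCRE pattern treats them the same way.
matches <- grepl("(*UCP)B\\w+", harmonized, perl = TRUE)
```

Which target encoding to pick depends on the rest of the pipeline; the key point is that all strings share one encoding before any regex is applied.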
