R - c() unexpectedly converts names of named vectors into UTF-8. Is this a bug?


You should still see names(c(x)) == names(x) on your system. The encoding change made by c() may be unintentional, but it shouldn't affect your code in most scenarios.
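A quick way to check this yourself is sketched below. The vector `x` and its single non-ASCII name are made up for illustration, and the encodings reported will depend on your platform and locale, so no particular output is assumed here.

```r
# Hypothetical named vector with one non-ASCII name ("\u00e4" is a-umlaut),
# stored in whatever the native encoding of this session is.
x <- c(1)
names(x) <- enc2native("\u00e4")

# The names still compare equal after c(), even if the declared
# encoding of the name strings differs.
same_names <- names(c(x)) == names(x)

# Inspect the declared encodings before and after c().
enc_before <- Encoding(names(x))
enc_after <- Encoding(names(c(x)))

print(same_names)
print(enc_before)
print(enc_after)
```

If `enc_before` and `enc_after` differ on your machine, that is the encoding change discussed in the question; `same_names` should be TRUE either way.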

On Windows, which doesn't have a UTF-8 locale, your safest bet is to convert all strings to UTF-8 first via enc2utf8(), and then stay in UTF-8. This will also make lookups by name safe.
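As a minimal sketch of that approach, assuming a small made-up vector `x` and lookup key `key` (neither is from the question), you could normalize both the names and the key with enc2utf8() before looking anything up:

```r
# Hypothetical data: a named vector where one name is non-ASCII.
x <- c(10, 20)
names(x) <- c("a", "\u00e4")

# Normalize the names to UTF-8 once, up front.
names(x) <- enc2utf8(names(x))

# Normalize the key the same way before using it for a lookup.
key <- enc2utf8("\u00e4")
value <- x[[key]]

print(value)
```

The point is only that both sides of the lookup go through the same conversion, so they end up in the same encoding.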

Language symbols (as used in dplyr's group_by()) are an entirely different issue. For some reason they are always interpreted in the native encoding. (Try as.name(names(c(x))).) However, it's still best to keep the strings in UTF-8 and convert them to the native encoding just before calling as.name(). This is what dplyr should be doing; we're just not quite there yet.
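The "keep UTF-8, convert late" pattern could look roughly like this. The column name `col` is a made-up example, and this is only a sketch of the idea, not what dplyr does internally:

```r
# Hypothetical column name, kept in UTF-8 for as long as possible.
col <- enc2utf8("\u00e4")

# Convert to the native encoding only at the last moment,
# right before turning the string into a symbol.
sym <- as.name(enc2native(col))

print(is.name(sym))
```

How the resulting symbol prints depends on whether your native encoding can represent the character at all.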

My recommendation is to use ASCII-only characters for column names when using dplyr on Windows. This requires some discipline if you're relying on tidyr::spread() for non-ASCII column contents. You could also consider switching to a system (OS X or Linux) that works with UTF-8 natively.
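If you want to enforce the ASCII-only rule, a small check along these lines may help. `has_non_ascii()` is a hypothetical helper written for this answer, not a function from dplyr or tidyr, and the data frame is made up:

```r
# Hypothetical helper: TRUE for each name that contains at least one
# byte above 127, i.e. at least one non-ASCII character.
has_non_ascii <- function(nms) {
  vapply(
    nms,
    function(s) any(as.integer(charToRaw(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Made-up data frame with one non-ASCII column name.
df <- data.frame(a = 1, b = 2)
names(df)[2] <- "\u00e4"

# Column names that would need renaming before using dplyr on Windows.
flagged <- names(df)[has_non_ascii(names(df))]
print(length(flagged))
```

Running this on the output of tidyr::spread() would show which of the generated column names need attention.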