How to preserve multi-byte characters after parse()
I'd be glad to see something simpler surface, but here's a start.
    eval.utags <- function(x) {
      # options() returns the old value as a list, so restore it with options(op)
      op <- options(useFancyQuotes=FALSE) # so dQuote/sQuote use ascii quotes
      on.exit(options(op))
      # replace u-tag with u-escape, e.g., <U+12FF> --> \\u12FF
      with.uescapes <- gsub('<U\\+([[:xdigit:]]+)>', '\\\\u\\1', x)
      # find first quote char ('"' or "'"), if any
      # pick appropriate quote fun, dQuote or sQuote
      first.quote <- regmatches(with.uescapes, regexpr("\'|\"", with.uescapes))
      quote.fun <- if (identical(first.quote, "'")) dQuote else sQuote
      # parse/eval quoted characters
      eval(parse(text=quote.fun(with.uescapes)))
    }

    x <- '<U+011f><U+4f60><U+597d>abc'
    y <- eval.utags(x)
    y
    # [1] "ğ你好abc"
    Encoding(y)
    # "UTF-8"
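If you don't need to go through parse()/eval() at all, something along these lines might be simpler. This is only a sketch and not the approach used above: decode.utags is a name I made up, and it swaps the u-escape trick for a direct conversion of each code point with intToUtf8(). It assumes every tag has the form <U+ followed by hex digits and >.

```r
# Hypothetical helper (not from the answer above): decode <U+XXXX> tags
# directly instead of building a quoted string for parse()/eval().
decode.utags <- function(x) {
  pattern <- "<U\\+([[:xdigit:]]+)>"
  m <- gregexpr(pattern, x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tags) {
    # strip the "<U+" prefix and ">" suffix, leaving the hex code point
    hex <- gsub("^<U\\+|>$", "", tags)
    # turn each code point into one character
    intToUtf8(strtoi(hex, base = 16L), multiple = TRUE)
  })
  x
}

x <- '<U+011f><U+4f60><U+597d>abc'
y <- decode.utags(x)
```

Since nothing is evaluated, quote characters in the input don't need any special handling. How y prints still depends on what your console and locale can display.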
EDIT:
If your original string may have literal unicode tag substrings that you want to preserve as is, before passing it to parse, gsub all instances of "<U+" with the equivalent unicode tags, "<U+003c><U+0055><U+002b>".
    x <- "'Щ<U+1234>'"
    y <- eval(parse(text=gsub('<U\\+', '<U+003c><U+0055><U+002b>', x)))
    # [1] "<U+0429><U+003c><U+0055><U+002b>1234>"
    z <- eval.utags(y)
    # [1] "Щ<U+1234>"

This, of course, isn't foolproof, though.
It's really a shame this has to be so hackish.
The root of the problem is that (quoting the R Installation and Administration manual): "R supports all the character sets that the underlying OS can handle. These are interpreted according to the current locale". And unfortunately Windows has no locale supporting UTF-8.
Now, the good thing is that Rgui apparently supports UTF-8 (scroll down to 2.7.0 > Internationalization). The R parser, though, works only with the characters supported in the locale. So a solution that worked for me is to temporarily change the R locale with Sys.setlocale() just to do the parsing, and later, when deparsing, use iconv() to convert to UTF-8:
    > Sys.getlocale()
    [1] "LC_COLLATE=Greek_Greece.1253;LC_CTYPE=Greek_Greece.1253;LC_MONETARY=Greek_Greece.1253;LC_NUMERIC=C;LC_TIME=Greek_Greece.1253"
    > orig.locale <- Sys.getlocale("LC_CTYPE")
    > parse(text="'你好'")
    expression('<U+4F60><U+597D>')
    > Sys.setlocale(locale="Chinese")
    [1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"
    > a <- parse(text="'你好'")
    > a
    expression('你好')
    > Sys.setlocale(locale="Turkish")
    [1] "LC_COLLATE=Turkish_Turkey.1254;LC_CTYPE=Turkish_Turkey.1254;LC_MONETARY=Turkish_Turkey.1254;LC_NUMERIC=C;LC_TIME=Turkish_Turkey.1254"
    > b <- parse(text="'ğ'")
    > b
    expression('ğ')
    > Sys.setlocale(locale=orig.locale)
    [1] "LC_COLLATE=Greek_Greece.1253;LC_CTYPE=Greek_Greece.1253;LC_MONETARY=Greek_Greece.1253;LC_NUMERIC=C;LC_TIME=Greek_Greece.1253"
    > a
    [1] expression('ΔγΊΓ')
    > b
    [1] expression('π')
    > ai <- iconv(a, from="CP936", to="UTF-8")
    > ai
    [1] "你好"
    > bi <- iconv(b, from="CP1254", to="UTF-8")
    > bi
    [1] "ğ"
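The switch, parse, restore, convert sequence could be wrapped up roughly like this. Treat it as a sketch: parse.in.locale and its arguments are names I made up, it assumes that changing only LC_CTYPE is enough for the parser (the session above changes all categories), and I have only reasoned through the save/restore logic, not tried it against every Windows locale.

```r
# Hypothetical wrapper around the steps shown in the session above.
# `locale` is a locale name your OS accepts (e.g. "Chinese" on Windows);
# `codepage` is the matching encoding name for iconv() (e.g. "CP936").
parse.in.locale <- function(text, locale, codepage) {
  orig <- Sys.getlocale("LC_CTYPE")
  # always put the original locale back, even if parsing fails
  on.exit(Sys.setlocale("LC_CTYPE", orig), add = TRUE)
  # Sys.setlocale() returns "" when the requested locale is unavailable
  if (!nzchar(Sys.setlocale("LC_CTYPE", locale))) {
    stop("locale not available: ", locale)
  }
  parsed <- parse(text = text)
  # deparse to character, then re-encode from the locale's code page
  iconv(as.character(parsed), from = codepage, to = "UTF-8")
}
```

With the locales from the session above, the calls would presumably be parse.in.locale("'你好'", "Chinese", "CP936") and parse.in.locale("'ğ'", "Turkish", "CP1254").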
Hope this helps!