
how to preserve multi-byte characters after parse()


I'd be glad to see something simpler surface, but here's a start.

    eval.utags <- function(x) {
        # make dQuote/sQuote use ascii quotes; restore the old setting on exit
        op <- options(useFancyQuotes=FALSE)
        on.exit(options(op))
        # replace u-tag with u-escape, e.g., <U+12FF> --> \u12FF
        # (\u takes at most four hex digits, so this covers U+0000..U+FFFF)
        with.uescapes <- gsub('<U\\+([[:xdigit:]]+)>', '\\\\u\\1', x)
        # find first quote char ('"' or "'"), if any
        # pick appropriate quote fun, dQuote or sQuote
        first.quote <- regmatches(with.uescapes, regexpr("\'|\"", with.uescapes))
        quote.fun <- if (identical(first.quote, "'")) dQuote else sQuote
        # parse/eval quoted characters
        eval(parse(text=quote.fun(with.uescapes)))
    }

    x <- '<U+011f><U+4f60><U+597d>abc'
    y <- eval.utags(x)
    y
    # [1] "ğ你好abc"
    Encoding(y)
    # "UTF-8"

EDIT:

If your original string may contain literal <U+XXXX> substrings that you want to preserve as is, then before passing it to parse, gsub every instance of "<U+" with the equivalent unicode tags, "<U+003c><U+0055><U+002b>". That way only the tags produced by parse get decoded afterwards.

    x <- "'Щ<U+1234>'"
    y <- eval(parse(text=gsub('<U\\+', '<U+003c><U+0055><U+002b>', x)))
    # [1] "<U+0429><U+003c><U+0055><U+002b>1234>"
    z <- eval.utags(y)
    # [1] "Щ<U+1234>"

This, of course, isn't foolproof, though.

It's really a shame this has to be so hackish.
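
For what it's worth, the u-tags can also be decoded without a second round of parse/eval, which sidesteps the quoting logic altogether. The sketch below is a different technique from eval.utags, not a drop-in from any package: decode.utags is a hypothetical name, and it assumes every <U+XXXX> in the input is a tag to be decoded (literal tags would still need the escaping trick above). It only uses base R (gregexpr, regmatches, strtoi, intToUtf8).

```r
# Hypothetical helper: decode <U+XXXX> tags without parse()/eval().
decode.utags <- function(x) {
    # locate every <U+XXXX> tag in each element of x
    m <- gregexpr("<U\\+[[:xdigit:]]+>", x)
    regmatches(x, m) <- lapply(regmatches(x, m), function(tags) {
        # drop the leading "<U+" and the trailing ">", keeping the hex digits
        hex <- substring(tags, 4L, nchar(tags) - 1L)
        # turn each code point into a one-character UTF-8 string
        vapply(strtoi(hex, base = 16L), intToUtf8, character(1))
    })
    x
}

x <- '<U+011f><U+4f60><U+597d>abc'
y <- decode.utags(x)
Encoding(y)
```

Because the hex digits are converted with strtoi rather than pasted into a \u escape, quotes and backslashes in the input are left alone, and code points with more than four hex digits should work too.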


The root of the problem is that (quoting the R Installation and Administration manual): "R supports all the character sets that the underlying OS can handle. These are interpreted according to the current locale". And unfortunately Windows has no locale supporting UTF-8.
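
To see which situation a given session is in, base R's l10n_info() and Sys.getlocale() report the current locale. A minimal sketch (locale.is.utf8 is a hypothetical helper name, not part of R):

```r
# Hypothetical helper: TRUE if the current locale is a UTF-8 one.
locale.is.utf8 <- function() {
    # l10n_info() returns a list that includes a logical `UTF-8` component
    isTRUE(l10n_info()[["UTF-8"]])
}

locale.is.utf8()
Sys.getlocale("LC_CTYPE")
```

If this returns FALSE, characters outside the locale's code page are the ones that come back from parse() as <U+XXXX> tags.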

Now, the good thing is that Rgui apparently supports UTF-8 (see 2.7.0 > Internationalization in the release notes). The R parser, though, works only with the characters supported in the current locale. So a solution that worked for me is to temporarily change the R locale with Sys.setlocale() just for the parsing, and later, when deparsing, use iconv() to convert to UTF-8:

    > Sys.getlocale()
    [1] "LC_COLLATE=Greek_Greece.1253;LC_CTYPE=Greek_Greece.1253;LC_MONETARY=Greek_Greece.1253;LC_NUMERIC=C;LC_TIME=Greek_Greece.1253"
    > orig.locale <- Sys.getlocale("LC_CTYPE")
    > parse(text="'你好'")
    expression('<U+4F60><U+597D>')
    > Sys.setlocale(locale="Chinese")
    [1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"
    > a <- parse(text="'你好'")
    > a
    expression('你好')
    > Sys.setlocale(locale="Turkish")
    [1] "LC_COLLATE=Turkish_Turkey.1254;LC_CTYPE=Turkish_Turkey.1254;LC_MONETARY=Turkish_Turkey.1254;LC_NUMERIC=C;LC_TIME=Turkish_Turkey.1254"
    > b <- parse(text="'ğ'")
    > b
    expression('ğ')
    > Sys.setlocale(locale=orig.locale)
    [1] "LC_COLLATE=Greek_Greece.1253;LC_CTYPE=Greek_Greece.1253;LC_MONETARY=Greek_Greece.1253;LC_NUMERIC=C;LC_TIME=Greek_Greece.1253"
    > a
    [1] expression('ΔγΊΓ')
    > b
    [1] expression('π')
    > ai <- iconv(a, from="CP936", to="UTF-8")
    > ai
    [1] "你好"
    > bi <- iconv(b, from="CP1254", to="UTF-8")
    > bi
    [1] "ğ"
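
The switch-parse-restore steps can be wrapped in a small function so the original locale is put back even if parsing fails. This is only a sketch: parse.in.locale is a hypothetical name, and it assumes that changing LC_CTYPE alone is enough for the parser (the session above switched every category). Locale names such as "Chinese" and code pages such as CP936 are the Windows-specific ones from that session.

```r
# Hypothetical helper: parse `text` with LC_CTYPE temporarily set to `locale`,
# restoring the previous LC_CTYPE when the function exits.
parse.in.locale <- function(text, locale) {
    old <- Sys.getlocale("LC_CTYPE")
    on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
    # Sys.setlocale() returns "" (with a warning) if the locale is unavailable
    if (identical(suppressWarnings(Sys.setlocale("LC_CTYPE", locale)), "")) {
        stop("locale not available: ", locale)
    }
    parse(text = text)
}

# Usage on Windows, mirroring the session above (not run here):
# a  <- parse.in.locale("'\u4f60\u597d'", "Chinese")
# ai <- iconv(a, from = "CP936", to = "UTF-8")
```

The iconv() step stays outside the helper on purpose, since the right `from` code page depends on which locale was used for parsing.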

Hope this helps!