How to write a Unicode string to a text file in R on Windows?

r unicode encoding utf-8

I think setting the Encoding of (a copy of) str to "unknown" before using cat() is less magic and works just as well. I think that should avoid any unwanted character set conversions in cat().

Here is an expanded example to demonstrate what I think happens in the original example:

    print_info <- function(x) {
        print(x)
        print(Encoding(x))
        str(x)
        print(charToRaw(x))
    }

    cat("(1) Original string (UTF-8)\n")
    str <- "\xe1\xbb\x8f"
    Encoding(str) <- "UTF-8"
    print_info(str)
    cat(str, file="no-iconv")

    cat("\n(2) Conversion to UTF-8, wrong input encoding (latin1)\n")
    ## from = "" is conversion from current locale, forcing "latin1" here
    str2 <- iconv(str, from="latin1", to="UTF-8")
    print_info(str2)
    cat(str2, file="yes-iconv")

    cat("\n(3) Converting (2) explicitly to latin1\n")
    str3 <- iconv(str2, from="UTF-8", to="latin1")
    print_info(str3)
    cat(str3, file="latin")

    cat("\n(4) Setting encoding of (1) to \"unknown\"\n")
    str4 <- str
    Encoding(str4) <- "unknown"
    print_info(str4)
    cat(str4, file="unknown")

In a "Latin-1" locale (see ?l10n_info), as used by R on Windows, the output files "yes-iconv", "latin" and "unknown" should be correct: they contain the byte sequence 0xe1, 0xbb, 0x8f, which is the UTF-8 encoding of "ỏ".

In a "UTF-8" locale, files "no-iconv" and "unknown" should be correct.

The output of the example code is as follows, using the 64-bit Windows version of R 3.3.2 running on Wine:

    (1) Original string (UTF-8)
    [1] "ỏ"
    [1] "UTF-8"
     chr "<U+1ECF>""| __truncated__
    [1] e1 bb 8f

    (2) Conversion to UTF-8, wrong input encoding (latin1)
    [1] "á»\u008f"
    [1] "UTF-8"
     chr "á»\u008f"
    [1] c3 a1 c2 bb c2 8f

    (3) Converting (2) explicitly to latin1
    [1] "á»"
    [1] "latin1"
     chr "á»"
    [1] e1 bb 8f

    (4) Setting encoding of (1) to "unknown"
    [1] "á»"
    [1] "unknown"
     chr "á»"
    [1] e1 bb 8f
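The charToRaw() lines above show the bytes held in memory, not what cat() wrote to disk. One way to check a file itself, independent of how the console renders the text, is to read its raw bytes back. The following is a minimal sketch, not part of the original answer: `file_bytes` is a helper name made up for illustration, and the sketch repeats step (4) into a temporary file so that it runs on its own.

```r
# Hypothetical helper (not from the original answer): return a file's raw bytes.
file_bytes <- function(path) {
    readBin(path, what = "raw", n = file.size(path))
}

# The UTF-8 byte sequence for the character used in the example.
expected <- as.raw(c(0xe1, 0xbb, 0x8f))

# Repeat step (4): mark the string as "unknown" so cat() passes the bytes through.
s <- "\xe1\xbb\x8f"
Encoding(s) <- "unknown"
tmp <- tempfile(fileext = ".txt")
cat(s, file = tmp)

# TRUE means the file holds exactly the three expected bytes.
identical(file_bytes(tmp), expected)
```

The same helper could be pointed at "no-iconv", "yes-iconv", "latin" and "unknown" to see which of them match in a given locale.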

In the original example, iconv() uses the default from = "" argument which means conversion from the current locale, which is effectively "latin1". Because the encoding of str is actually "UTF-8", the byte representation of the string is distorted in step (2), but then implicitly restored by cat() when it (presumably) converts the string back to the current locale, as demonstrated by the equivalent conversion in step (3).
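The round trip described here can be reproduced without writing any files. This is a small sketch under the assumption that the platform's iconv() supports both "latin1" and "UTF-8" by those names; the variable names `distorted` and `restored` are introduced here for illustration only.

```r
s <- "\xe1\xbb\x8f"       # the three UTF-8 bytes from the example
Encoding(s) <- "UTF-8"

# Step (2): treat each byte as a latin1 character and re-encode it as UTF-8.
distorted <- iconv(s, from = "latin1", to = "UTF-8")
print(charToRaw(distorted))

# Step (3): converting back to latin1 maps each character to its original byte.
restored <- iconv(distorted, from = "UTF-8", to = "latin1")
print(charToRaw(restored))

# TRUE means the distortion was undone and the original bytes are back.
identical(charToRaw(restored), charToRaw(s))
```

The two conversions cancel each other out, which is why the double conversion in the original example appears to work even though the first step uses the wrong input encoding.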

Somehow I didn't get anything to work with the above suggestions. I'm working in Windows, and that may have something to do with it. Windows apparently has different encodings for different locales. But I did find this excellent post by Kevin Ushey:

https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/

He suggests the following technique, which worked for me:

    # Create temp file name
    f <- tempfile(tmpdir = tempdir(), fileext = ".txt")

    # Vector of crazy stuff
    v <- c("Crazy stuff: Ω µ ", "β ¥ ∑ ", "≠ ≤ £ ∞ ؈ ლ ")

    # Ensure strings are encoded as UTF-8
    utf8 <- enc2utf8(v)

    # Use native encoding on file connection
    con <- file(f, open = "w", encoding = "native.enc")

    # Use useBytes = TRUE
    writeLines(utf8, con = con, useBytes = TRUE)

    # Close connection
    close(con)

    # View results
    x <- readLines(f, encoding = "UTF-8")
    cat(x, sep = "\n")
    # Crazy stuff: Ω µ
    # ß ¥ ∑
    # ≠ = £ 8 ؈ ლ

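You can see that almost everything came out correctly. The exception I noticed is the infinity symbol, which is turned 90 degrees and shows up as an 8; in the output above, β and ≤ also appear as ß and =. If anyone can figure that out, please leave a comment.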
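One possible explanation, which the original post does not confirm, is that the file is written correctly and only the console display substitutes look-alike characters. A way to test that guess is to compare the bytes on disk with the bytes of the UTF-8 string, bypassing cat() entirely. The sketch below uses a single infinity symbol written as a \u escape; the names `payload` and `written` are made up for illustration.

```r
# Infinity symbol as an escape, so the script does not depend on the
# encoding of the source file itself.
v <- "\u221e"
utf8 <- enc2utf8(v)

# Same connection settings as in the technique above.
f <- tempfile(fileext = ".txt")
con <- file(f, open = "w", encoding = "native.enc")
writeLines(utf8, con = con, useBytes = TRUE)
close(con)

# Read the file back as raw bytes and compare with the string's UTF-8 bytes.
payload <- charToRaw(utf8)
written <- readBin(f, what = "raw", n = file.size(f))
print(written)

# TRUE means the file starts with the unmodified UTF-8 bytes; any bytes
# after that are the line terminator added by writeLines().
identical(written[seq_along(payload)], payload)
```

If this returns TRUE on the affected Windows machine, the file is fine and the substitution happens only when the text is printed to the console.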
