Read a file in R with mixed character encodings

html r character-encoding

R does have library functions for guessing character encodings, such as stringi::stri_enc_detect, but when possible it's probably better to use the simpler, deterministic method of trying a fixed set of encodings in order. The best way to do this seems to be to take advantage of the fact that when iconv fails to convert a string, it returns NA:

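    linewise.decode = function(path)
        sapply(readLines(path), USE.NAMES = FALSE, function(line) {
            # Lines that are already valid UTF-8 (including pure ASCII) pass through.
            if (validUTF8(line))
                return(line)
            # Otherwise try each candidate encoding in order; iconv returns NA on failure.
            l2 = iconv(line, "Windows-1252", "UTF-8")
            if (!is.na(l2))
                return(l2)
            l2 = iconv(line, "Shift-JIS", "UTF-8")
            if (!is.na(l2))
                return(l2)
            stop("Encoding not detected")
        })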
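To see the failure signal that this function relies on in isolation, here is a minimal sketch. The byte string is made up for illustration (it is "Müller" encoded as Windows-1252), and the exact set of inputs that iconv rejects can vary with the platform's iconv implementation.

```r
# "Müller" as Windows-1252 bytes: 0xFC is "ü" there, but a lone 0xFC
# followed by an ASCII letter is not a valid UTF-8 sequence.
x <- rawToChar(as.raw(c(0x4d, 0xfc, 0x6c, 0x6c, 0x65, 0x72)))

is_utf8 <- validUTF8(x)                       # byte-level check, no conversion
as_utf8 <- iconv(x, "UTF-8", "UTF-8")         # try to decode the bytes as UTF-8
as_1252 <- iconv(x, "Windows-1252", "UTF-8")  # try to decode them as Windows-1252

print(is_utf8)
print(is.na(as_utf8))
print(as_1252)
```

On a typical setup the UTF-8 check and the UTF-8 conversion should both reject the string, while the Windows-1252 conversion should succeed. That is why the order of the candidate encodings matters: the first one that accepts the bytes wins.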

If you create a test file with

$ python3 -c 'with open("inptest", "wb") as o: o.write(b"This line is ASCII\n" + "This line is UTF-8: I like π\n".encode("UTF-8") + "This line is Windows-1252: Müller\n".encode("Windows-1252") + "This line is Shift-JIS: ハローワールド\n".encode("Shift-JIS"))'

then linewise.decode("inptest") indeed returns

    [1] "This line is ASCII"
    [2] "This line is UTF-8: I like π"
    [3] "This line is Windows-1252: Müller"
    [4] "This line is Shift-JIS: ハローワールド"
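If the list of candidate encodings is likely to change, the same idea can be written with the encodings as a parameter. This is only a sketch: the name decode_with_fallbacks and its arguments are hypothetical, not part of any package, and it takes the lines themselves rather than a path so that it can be tried without a file.

```r
# Hypothetical helper (not from any package): try each encoding in
# `encodings`, in order, and keep the first conversion that succeeds.
decode_with_fallbacks <- function(lines,
                                  encodings = c("Windows-1252", "Shift-JIS")) {
    vapply(lines, USE.NAMES = FALSE, FUN.VALUE = character(1),
           FUN = function(line) {
        # Valid UTF-8 (including pure ASCII) is returned unchanged.
        if (validUTF8(line))
            return(line)
        for (enc in encodings) {
            converted <- iconv(line, enc, "UTF-8")
            if (!is.na(converted))
                return(converted)
        }
        stop("Encoding not detected")
    })
}

# Made-up input: one ASCII line and "Müller" as Windows-1252 bytes.
mixed <- c(
    "plain ASCII",
    rawToChar(as.raw(c(0x4d, 0xfc, 0x6c, 0x6c, 0x65, 0x72)))
)
decoded <- decode_with_fallbacks(mixed)
print(decoded)
```

With the test file from above, this would be called as decode_with_fallbacks(readLines("inptest")). Using vapply instead of sapply is a design choice here: it always returns a character vector, even for empty input.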

To use linewise.decode with XML::readHTMLTable, just say something like XML::readHTMLTable(linewise.decode("http://example.com")).

