R: extracting "clean" UTF-8 text from a web page scraped with RCurl R: extracting "clean" UTF-8 text from a web page scraped with RCurl r r

R: extracting "clean" UTF-8 text from a web page scraped with RCurl


I seem to have found an answer and nobody else has yet posted one, so here goes.

Earlier @kohske commented that the code worked for him once the Encoding() call was removed. That got me thinking that he probably has a Japanese locale, which in turn suggested that there was a locale issue on my machine affecting R - even though Perl avoids the problem. I recalibrated my search and found this question on sourcing a UTF-8 file, in which the original poster had run into a similar problem. The answer involved switching the locale. I experimented and found that switching my locale to Japanese seems to solve the problem, as this screenshot shows:

[Screenshot: output from the updated R code]

Updated R code follows.

require(RCurl)
require(XML)

links <- list()
links[1] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203"
links[2] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201"

# Record the current locale so it can be restored afterwards,
# then switch to a Japanese locale for the download
print(Sys.getlocale(category = "LC_CTYPE"))
original_ctype <- Sys.getlocale(category = "LC_CTYPE")
Sys.setlocale("LC_CTYPE", "japanese")

# Fetch the pages as UTF-8 and write the raw HTML to disk
txt <- getURL(links, .encoding = "UTF-8")
write.table(txt, "c:/geturl_r.txt", quote = FALSE, row.names = FALSE, sep = "\t", fileEncoding = "UTF-8")

# Restore the original locale
Sys.setlocale("LC_CTYPE", original_ctype)
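If you later want to read the saved file back into R, declaring the encoding explicitly should keep the Japanese text intact. A minimal sketch, using base R only; the path simply matches the one used above:

# Read the saved file back in, telling R the bytes are UTF-8
raw_lines <- readLines("c:/geturl_r.txt", encoding = "UTF-8")
head(raw_lines)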

So we have to programmatically mess around with the locale. Frankly I'm a bit embarrassed that we apparently need such a kludge for R on Windows in the year 2012. As I note above, Perl on the same version of Windows, in the same locale, gets round the issue somehow without requiring me to change my system settings.

The output of the updated R code above is HTML, of course. For those interested, the following code succeeds fairly well in stripping out the HTML and saving raw text, although the result needs quite a lot of tidying up.

require(RCurl)
require(XML)

links <- list()
links[1] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203"
links[2] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201"

# Switch to a Japanese locale while fetching, as before
print(Sys.getlocale(category = "LC_CTYPE"))
original_ctype <- Sys.getlocale(category = "LC_CTYPE")
Sys.setlocale("LC_CTYPE", "japanese")

txt <- getURL(links, .encoding = "UTF-8")

# Parse the HTML and keep only the visible text nodes,
# skipping anything inside script, style or noscript elements
myhtml <- htmlTreeParse(txt, useInternal = TRUE)
cleantxt <- xpathApply(myhtml, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)

write.table(cleantxt, "c:/geturl_r.txt", col.names = FALSE, quote = FALSE, row.names = FALSE, sep = "\t", fileEncoding = "UTF-8")
Sys.setlocale("LC_CTYPE", original_ctype)
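Since the result still needs tidying, a rough post-processing pass along these lines can help. This is only a sketch using base R; the exact cleaning needed will depend on the pages:

# cleantxt is the list of strings returned by xpathApply() above
tidy <- unlist(cleantxt)
tidy <- gsub("^\\s+|\\s+$", "", tidy)   # trim leading/trailing whitespace
tidy <- gsub("\\s+", " ", tidy)         # collapse runs of internal whitespace
tidy <- tidy[nchar(tidy) > 0]           # drop empty strings
head(tidy)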


Hi, I have written a scraping engine that allows scraping of data from web pages that are deeply embedded within the main listing page. I wonder if it might be helpful to use it as an aggregator for your web data prior to importing into R?

The engine is located here: http://ec2-204-236-207-28.compute-1.amazonaws.com/scrap-gm

The sample parameter I created to scrape the page you had in mind is shown below.

{
  origin_url: 'http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203',
  columns: [
    {
      col_name: 'links_name',
      dom_query: 'a'
    }, {
      col_name: 'links',
      dom_query: 'a',
      required_attribute: 'href'
    }
  ]
};
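If the engine can return its results as JSON, they could be pulled straight into R with something like the sketch below. This is only an illustration, not the engine's documented API: the query URL is a placeholder, the shape of the response is an assumption, and RJSONIO is just one possible JSON parser.

require(RCurl)
require(RJSONIO)

# Placeholder: assume the engine exposes the scraped results as JSON at some URL
results_url <- "http://ec2-204-236-207-28.compute-1.amazonaws.com/scrap-gm"

json <- getURL(results_url, .encoding = "UTF-8")
parsed <- fromJSON(json)

# If the response is a list of rows with links_name / links fields,
# inspect it and flatten it into a data frame for further work in R
str(parsed)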