R: extracting "clean" UTF-8 text from a web page scraped with RCurl R: extracting "clean" UTF-8 text from a web page scraped with RCurl r r

R: extracting "clean" UTF-8 text from a web page scraped with RCurl


I seem to have found an answer and nobody else has yet posted one, so here goes.

Earlier @kohske commented that the code worked for him once the Encoding() call was removed. That got me thinking that he probably has a Japanese locale, which in turn suggested that there was a locale issue on my machine affecting R - even though Perl avoids the problem. I recalibrated my search and found this question on sourcing a UTF-8 file, in which the original poster had run into a similar problem. The answer involved switching the locale. I experimented and found that switching my locale to Japanese seems to solve the problem, as this screenshot shows:

[Screenshot: output from the updated R code]

Updated R code follows.

require(RCurl)
require(XML)

links <- list()
links[1] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203"
links[2] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201"

# Record the current locale so it can be restored afterwards,
# then switch to a Japanese locale for the download
print(Sys.getlocale(category = "LC_CTYPE"))
original_ctype <- Sys.getlocale(category = "LC_CTYPE")
Sys.setlocale("LC_CTYPE", "japanese")

# Fetch the pages as UTF-8 and write the raw HTML to disk
txt <- getURL(links, .encoding = "UTF-8")
write.table(txt, "c:/geturl_r.txt", quote = FALSE, row.names = FALSE, sep = "\t", fileEncoding = "UTF-8")

# Restore the original locale
Sys.setlocale("LC_CTYPE", original_ctype)
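If you later want to read the saved file back into R, declaring the encoding explicitly should keep the Japanese text intact. A minimal sketch, using base R only; the path simply matches the one used above:

# Read the saved file back in, telling R the bytes are UTF-8
raw_lines <- readLines("c:/geturl_r.txt", encoding = "UTF-8")
head(raw_lines)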

So we have to programmatically mess around with the locale. Frankly I'm a bit embarrassed that we apparently need such a kludge for R on Windows in the year 2012. As I note above, Perl on the same version of Windows, in the same locale, gets round the issue somehow without requiring me to change my system settings.

The output of the updated R code above is HTML, of course. For those interested, the following code succeeds fairly well in stripping out the HTML and saving raw text, although the result needs quite a lot of tidying up.

require(RCurl)
require(XML)

links <- list()
links[1] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203"
links[2] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201"

# Switch to a Japanese locale while fetching, as before
print(Sys.getlocale(category = "LC_CTYPE"))
original_ctype <- Sys.getlocale(category = "LC_CTYPE")
Sys.setlocale("LC_CTYPE", "japanese")

txt <- getURL(links, .encoding = "UTF-8")

# Parse the HTML and keep only the visible text nodes,
# skipping anything inside script, style or noscript elements
myhtml <- htmlTreeParse(txt, useInternal = TRUE)
cleantxt <- xpathApply(myhtml, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)

write.table(cleantxt, "c:/geturl_r.txt", col.names = FALSE, quote = FALSE, row.names = FALSE, sep = "\t", fileEncoding = "UTF-8")
Sys.setlocale("LC_CTYPE", original_ctype)
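Since the result still needs tidying, a rough post-processing pass along these lines can help. This is only a sketch using base R; the exact cleaning needed will depend on the pages:

# cleantxt is the list of strings returned by xpathApply() above
tidy <- unlist(cleantxt)
tidy <- gsub("^\\s+|\\s+$", "", tidy)   # trim leading/trailing whitespace
tidy <- gsub("\\s+", " ", tidy)         # collapse runs of internal whitespace
tidy <- tidy[nchar(tidy) > 0]           # drop empty strings
head(tidy)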


Hi, I have written a scraping engine that allows scraping of data from web pages that are deeply embedded within the main listing page. I wonder if it might be helpful to use it as an aggregator for your web data prior to importing into R?

The engine is located here: http://ec2-204-236-207-28.compute-1.amazonaws.com/scrap-gm

The sample parameter I created to scrape the page you had in mind is shown below.

{
  origin_url: 'http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203',
  columns: [
    {
      col_name: 'links_name',
      dom_query: 'a'
    }, {
      col_name: 'links',
      dom_query: 'a',
      required_attribute: 'href'
    }
  ]
};
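If the engine can return its results as JSON, they could be pulled straight into R with something like the sketch below. This is only an illustration, not the engine's documented API: the query URL is a placeholder, the shape of the response is an assumption, and RJSONIO is just one possible JSON parser.

require(RCurl)
require(RJSONIO)

# Placeholder: assume the engine exposes the scraped results as JSON at some URL
results_url <- "http://ec2-204-236-207-28.compute-1.amazonaws.com/scrap-gm"

json <- getURL(results_url, .encoding = "UTF-8")
parsed <- fromJSON(json)

# If the response is a list of rows with links_name / links fields,
# inspect it and flatten it into a data frame for further work in R
str(parsed)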