
How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?


The new package httr provides a wrapper around RCurl to make it easier to scrape all kinds of pages.

Still, this page gave me a fair amount of trouble. The following works, but no doubt there are easier ways of doing it.

library("httr")library("XML")# Define certicificate filecafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")# Read pagepage <- GET(  "https://ned.nih.gov/",   path="search/ViewDetails.aspx",   query="NIHID=0010121048",  config(cainfo = cafile))# Use regex to extract the desired tablex <- text_content(page)tab <- sub('.*(<table class="grid".*?>.*</table>).*', '\\1', x)# Parse the tablereadHTMLTable(tab)

The results:

$ctl00_ContentPlaceHolder_dvPerson
                V1                                       V2
1      Legal Name:                     Dr Francis S Collins
2  Preferred Name:                       Dr Francis Collins
3          E-mail:                  francis.collins@nih.gov
4        Location: BG 1 RM 1261 CENTER DR BETHESDA MD 20814
5       Mail Stop:
6           Phone:                             301-496-2433
7             Fax:
8              IC:              OD (Office of the Director)
9    Organization:             Office of the Director (HNA)
10 Classification:                                 Employee
11            TTY:

Get httr here: http://cran.r-project.org/web/packages/httr/index.html


EDIT: Useful page with FAQ about the RCurl package: http://www.omegahat.org/RCurl/FAQ.html
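For reference, the same page can also be fetched with RCurl directly. A minimal sketch using getURL() with the certificate bundle shipped in the package (not verified against the live page):

library("RCurl")
library("XML")

# Point curl at the CA bundle that ships with RCurl
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

# Fetch the raw HTML over https
x <- getURL(
  "https://ned.nih.gov/search/ViewDetails.aspx?NIHID=0010121048",
  cainfo = cafile
)

# Parse all tables in the page
readHTMLTable(x)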


Building on Andrie's great way to get past the https, here is a way to get at the data without readHTMLTable.

A table in HTML may have an ID. In this case the table has a nice one, and the XPath expression passed to getNodeSet picks it out cleanly.

# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

# Read page
page <- GET(
  "https://ned.nih.gov/",
  path = "search/ViewDetails.aspx",
  query = "NIHID=0010121048",
  config(cainfo = cafile, ssl.verifypeer = FALSE)
)

h <- htmlParse(page)
ns <- getNodeSet(h, "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']")
ns
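If you do end up wanting a data frame after all, readHTMLTable also accepts the table node directly (the helper function at the bottom of this page relies on the same behaviour):

readHTMLTable(ns[[1]])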

I still need to extract the IDs behind the hyperlinks.

For example, instead of Colleen Barros as manager, I need to get to the ID 0010080638.

Manager: Colleen Barros
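One way to get at those IDs is to pull the href attributes from the anchors inside the table and strip out the query parameter. A minimal sketch, assuming the links carry the ID in a NIHID= parameter as in the URL above (the XPath and the regex are guesses at the page structure, not verified against the live page):

# Grab the href of every link inside the person table
links <- xpathSApply(
  h,
  "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']//a",
  xmlGetAttr, "href"
)

# Keep only links that carry an NIHID and extract the numeric ID
ids <- sub(".*NIHID=([0-9]+).*", "\\1", links[grepl("NIHID=", links)])
ids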


This is the function I use to deal with this problem. It detects whether the URL uses https and switches to httr if it does.

readHTMLTable2 <- function(url, which = NULL, ...) {
  require(httr)
  require(XML)
  require(stringr)  # str_detect lives in stringr, so it must be loaded too
  if (str_detect(url, "https")) {
    page <- GET(url, user_agent("httr-soccer-ranking"))
    doc <- htmlParse(text_content(page))
    if (is.null(which)) {
      tmp <- readHTMLTable(doc, ...)
    } else {
      tableNodes <- getNodeSet(doc, "//table")
      tab <- tableNodes[[which]]
      tmp <- readHTMLTable(tab, ...)
    }
  } else {
    tmp <- readHTMLTable(url, which = which, ...)
  }
  return(tmp)
}
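For example, using the NED page from the question (which = 1 assumes the person table is the first table on the page; adjust as needed):

tab <- readHTMLTable2(
  "https://ned.nih.gov/search/ViewDetails.aspx?NIHID=0010121048",
  which = 1
)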