R: Using rvest package instead of XML package to get links from URL


Despite my comment, here's how you can do it with rvest. Note that we need to read the page in with htmlParse() from the XML package first, because the site serves that file with the content-type set to text/plain, and that tosses rvest into a tizzy.

library(rvest)
library(XML)

pg <- htmlParse("http://www.bvl.com.pe/includes/empresas_todas.dat")
pg %>% html_nodes("a") %>% html_attr("href")

##   [1] "/inf_corporativa71050_JAIME1CP1A.html" "/inf_corporativa10400_INTEGRC1.html"
##   [3] "/inf_corporativa66100_ACESEGC1.html"   "/inf_corporativa71300_ADCOMEC1.html"
## ...
## [273] "/inf_corporativa64801_VOLCAAC1.html"   "/inf_corporativa58501_YURABC11.html"
## [275] "/inf_corporativa98959_ZNC.html"

That further illustrates rvest's XML package underpinnings.

UPDATE

rvest::read_html() can handle this directly now:

pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")
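If the site is unreachable, the same pipeline can be tried offline. The following is a minimal sketch, assuming a small inline HTML fragment (the `snippet` string is hypothetical, not the real file) and relying on read_html() accepting literal markup as well as a URL. It also drops anchors that have no href, which html_attr() reports as NA.

```r
library(rvest)

# Hypothetical fragment standing in for the downloaded file (an assumption,
# not the site's actual content)
snippet <- '<html><body>
  <a href="/inf_corporativa71050_JAIME1CP1A.html">A</a>
  <a href="/inf_corporativa10400_INTEGRC1.html">B</a>
  <a name="no-href">C</a>
</body></html>'

# read_html() parses literal HTML as well as URLs and file paths
pg <- read_html(snippet)

# Same pipeline as above: select every <a> node, then pull its href
hrefs <- pg %>% html_nodes("a") %>% html_attr("href")

# html_attr() yields NA for nodes without the attribute, so filter those out
hrefs <- hrefs[!is.na(hrefs)]
hrefs
```

Swapping `snippet` for the URL should give the full list of links, provided the site still serves the file.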


I know you're looking for an rvest answer, but here's another way using the XML package that might be more efficient than what you're doing.

Have you seen the getLinks() function in example(htmlParse)? I use this modified version from the examples to get href links. It's a handler function so we can collect the values as they are read, saving on memory and increasing efficiency.

library(XML)

links <- function(URL) {
    # Closure that accumulates href values while the document is parsed
    getLinks <- function() {
        links <- character()
        list(a = function(node, ...) {
                 # Called for every <a> node as it is read
                 links <<- c(links, xmlGetAttr(node, "href"))
                 node
             },
             links = function() links)
    }
    h1 <- getLinks()
    htmlTreeParse(URL, handlers = h1)
    h1$links()
}

links("http://www.bvl.com.pe/includes/empresas_todas.dat")
#  [1] "/inf_corporativa71050_JAIME1CP1A.html"
#  [2] "/inf_corporativa10400_INTEGRC1.html"
#  [3] "/inf_corporativa66100_ACESEGC1.html"
#  [4] "/inf_corporativa71300_ADCOMEC1.html"
#  [5] "/inf_corporativa10250_HABITAC1.html"
#  [6] "/inf_corporativa77900_PARAMOC1.html"
#  [7] "/inf_corporativa77935_PUCALAC1.html"
#  [8] "/inf_corporativa77600_LAREDOC1.html"
#  [9] "/inf_corporativa21000_AIBC1.html"
#  ...
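The handler idea is not specific to links. Here is a rough sketch of the same closure pattern generalized to any tag/attribute pair, run against an inline fragment so it needs no network access. The helper name `collect_attr` and the `snippet` string are made up for illustration; only htmlTreeParse(), its `handlers` and `asText` arguments, and xmlGetAttr() come from the XML package.

```r
library(XML)

# Hypothetical helper (not part of XML): builds a handler list for `tag`
# whose handler appends the value of `attr` each time such a node is read
collect_attr <- function(tag, attr) {
    values <- character()
    handler <- function(node, ...) {
        # xmlGetAttr() returns NULL when the attribute is absent,
        # so c() simply leaves `values` unchanged in that case
        values <<- c(values, xmlGetAttr(node, attr))
        node
    }
    list(handlers = setNames(list(handler), tag),
         values = function() values)
}

# Hypothetical fragment used in place of the real URL
snippet <- '<html><body><a href="/a.html">A</a><img src="/logo.png"/><a href="/b.html">B</a></body></html>'

acc <- collect_attr("a", "href")
# asText = TRUE tells the parser that `snippet` is markup, not a file name
invisible(htmlTreeParse(snippet, asText = TRUE, handlers = acc$handlers))
acc$values()
```

Keeping the accessor outside the handler list means the list holds only tag handlers, so `collect_attr("img", "src")` would work the same way for image sources.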


# Option 1
library(XML)  # getHTMLLinks() is exported by XML, not RCurl
getHTMLLinks('http://www.bvl.com.pe/includes/empresas_todas.dat')

# Option 2
library(rvest)
library(pipeR) # %>>% will be faster than %>%
# read_html() replaces html() from older rvest versions
read_html("http://www.bvl.com.pe/includes/empresas_todas.dat") %>>%
    html_nodes("a") %>>%
    html_attr("href")