R: Using rvest package instead of XML package to get links from URL
Despite my comment, here's how you can do it with rvest. Note that we need to read in the page with htmlParse first, since the site has the content-type set to text/plain for that file and that tosses rvest into a tizzy.

    library(rvest)
    library(XML)

    pg <- htmlParse("http://www.bvl.com.pe/includes/empresas_todas.dat")
    pg %>% html_nodes("a") %>% html_attr("href")

    ##   [1] "/inf_corporativa71050_JAIME1CP1A.html" "/inf_corporativa10400_INTEGRC1.html"
    ##   [3] "/inf_corporativa66100_ACESEGC1.html"   "/inf_corporativa71300_ADCOMEC1.html"
    ## ...
    ## [273] "/inf_corporativa64801_VOLCAAC1.html"   "/inf_corporativa58501_YURABC11.html"
    ## [275] "/inf_corporativa98959_ZNC.html"

That further illustrates rvest's XML package underpinnings.
UPDATE
rvest::read_html() can handle this directly now:

    pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")

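If you want to try the same pipeline without hitting the site, here is a minimal sketch that runs it against an inline HTML string. The `snippet` contents are made up for illustration (only the two hrefs are borrowed from the output above), and dropping the `NA` entries is my own addition for anchors that have no href.

```r
library(rvest)

# Hypothetical HTML standing in for the downloaded file (not the real page)
snippet <- '<html><body>
<a href="/inf_corporativa71050_JAIME1CP1A.html">A</a>
<a href="/inf_corporativa10400_INTEGRC1.html">B</a>
<a>anchor without an href</a>
</body></html>'

# read_html() treats a string containing markup as literal HTML
pg <- read_html(snippet)

# Same pipeline as above: select every <a> node, then pull its href attribute
hrefs <- pg %>% html_nodes("a") %>% html_attr("href")

# html_attr() returns NA where the attribute is missing, so drop those
hrefs <- hrefs[!is.na(hrefs)]
hrefs
```
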
I know you're looking for an rvest answer, but here's another way using the XML package that might be more efficient than what you're doing.

Have you seen the getLinks() function in example(htmlParse)? I use this modified version from the examples to get href links. It's a handler function, so we can collect the values as they are read, saving on memory and increasing efficiency.

    library(XML)

    links <- function(URL) {
        # Closure that accumulates href values as the parser visits nodes
        getLinks <- function() {
            links <- character()
            list(
                # Called for every <a> node while the document is parsed
                a = function(node, ...) {
                    links <<- c(links, xmlGetAttr(node, "href"))
                    node
                },
                # Accessor for the collected values
                links = function() links
            )
        }
        h1 <- getLinks()
        htmlTreeParse(URL, handlers = h1)
        h1$links()
    }

    links("http://www.bvl.com.pe/includes/empresas_todas.dat")
    # [1] "/inf_corporativa71050_JAIME1CP1A.html"
    # [2] "/inf_corporativa10400_INTEGRC1.html"
    # [3] "/inf_corporativa66100_ACESEGC1.html"
    # [4] "/inf_corporativa71300_ADCOMEC1.html"
    # [5] "/inf_corporativa10250_HABITAC1.html"
    # [6] "/inf_corporativa77900_PARAMOC1.html"
    # [7] "/inf_corporativa77935_PUCALAC1.html"
    # [8] "/inf_corporativa77600_LAREDOC1.html"
    # [9] "/inf_corporativa21000_AIBC1.html"
    # ...
    # ...

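The handler idea can also be tried offline. This is only a sketch: the `snippet` string and the names `collect_hrefs`, `found` and `result` are placeholders I made up, and it assumes htmlTreeParse() is given literal markup via asText = TRUE rather than a URL.

```r
library(XML)

# Hypothetical HTML standing in for the remote file
snippet <- '<html><body>
<a href="/inf_corporativa71050_JAIME1CP1A.html">A</a>
<p>some text</p>
<a href="/inf_corporativa10400_INTEGRC1.html">B</a>
</body></html>'

# Build a handler list whose functions share one accumulator via a closure
collect_hrefs <- function() {
  found <- character()
  list(
    # The parser calls `a` for each <a> node; <<- updates the shared vector
    a = function(node, ...) {
      found <<- c(found, xmlGetAttr(node, "href"))
      node
    },
    # Accessor that returns whatever has been collected so far
    result = function() found
  )
}

h <- collect_hrefs()
# asText = TRUE tells the parser the first argument is markup, not a file or URL
invisible(htmlTreeParse(snippet, handlers = h, asText = TRUE))
hrefs <- h$result()
hrefs
```

The point of the closure is that the hrefs are gathered while parsing, so you never have to query the finished tree afterwards.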
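
    # Option 1
    library(XML)  # getHTMLLinks() is exported by XML, not RCurl
    getHTMLLinks('http://www.bvl.com.pe/includes/empresas_todas.dat')

    # Option 2
    library(rvest)
    library(pipeR)  # %>>% will be faster than %>%
    # read_html() replaces html(), which newer rvest versions no longer provide
    read_html("http://www.bvl.com.pe/includes/empresas_todas.dat") %>>%
        html_nodes("a") %>>%
        html_attr("href")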