R: Using rvest package instead of XML package to get links from URL

xml r web-scraping rvest

Despite my comment, here's how you can do it with rvest. Note that we need to read in the page with htmlParse() first: the site sets the content-type to text/plain for that file, and that tosses rvest into a tizzy.

    library(rvest)
    library(XML)

    pg <- htmlParse("http://www.bvl.com.pe/includes/empresas_todas.dat")
    pg %>% html_nodes("a") %>% html_attr("href")

    ##   [1] "/inf_corporativa71050_JAIME1CP1A.html" "/inf_corporativa10400_INTEGRC1.html"
    ##   [3] "/inf_corporativa66100_ACESEGC1.html"   "/inf_corporativa71300_ADCOMEC1.html"
    ## ...
    ## [273] "/inf_corporativa64801_VOLCAAC1.html"   "/inf_corporativa58501_YURABC11.html"
    ## [275] "/inf_corporativa98959_ZNC.html"

That further illustrates rvest's XML package underpinnings.

UPDATE

rvest::read_html() can handle this directly now:

pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")
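To try the same pipeline without depending on the remote site, here is a minimal sketch that parses an inline HTML string instead. The `sample_html` string is a made-up stand-in for the downloaded page (it reuses two hrefs from the output above), and dropping the NA values is an assumption about what you want for anchors that have no href.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page, so the example runs offline.
sample_html <- '<html><body>
  <a href="/inf_corporativa71050_JAIME1CP1A.html">JAIME1CP1A</a>
  <a href="/inf_corporativa10400_INTEGRC1.html">INTEGRC1</a>
  <a name="no-href">anchor without a link</a>
</body></html>'

# read_html() accepts a literal HTML string as well as a path or URL.
pg <- read_html(sample_html)

# html_attr() returns NA for nodes that lack the attribute, so drop those.
hrefs <- pg %>% html_nodes("a") %>% html_attr("href")
hrefs <- hrefs[!is.na(hrefs)]
hrefs
```

Swapping `sample_html` for the URL gives the one-liner from the update.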


I know you're looking for an rvest answer, but here's another way using the XML package that might be more efficient than what you're doing.

Have you seen the getLinks() function in example(htmlParse)? I use this modified version from the examples to get href links. It's a handler function so we can collect the values as they are read, saving on memory and increasing efficiency.

    links <- function(URL) {
        getLinks <- function() {
            links <- character()
            list(a = function(node, ...) {
                     links <<- c(links, xmlGetAttr(node, "href"))
                     node
                 },
                 links = function() links)
        }
        h1 <- getLinks()
        htmlTreeParse(URL, handlers = h1)
        h1$links()
    }

    links("http://www.bvl.com.pe/includes/empresas_todas.dat")

    #  [1] "/inf_corporativa71050_JAIME1CP1A.html"
    #  [2] "/inf_corporativa10400_INTEGRC1.html"
    #  [3] "/inf_corporativa66100_ACESEGC1.html"
    #  [4] "/inf_corporativa71300_ADCOMEC1.html"
    #  [5] "/inf_corporativa10250_HABITAC1.html"
    #  [6] "/inf_corporativa77900_PARAMOC1.html"
    #  [7] "/inf_corporativa77935_PUCALAC1.html"
    #  [8] "/inf_corporativa77600_LAREDOC1.html"
    #  [9] "/inf_corporativa21000_AIBC1.html"
    #  ...
    #  ...
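If you want to see the handler mechanism in isolation, the sketch below runs the same idea against an inline HTML string, so it works offline. The names `sample_html` and `collect_hrefs` are made up for illustration, and `asText = TRUE` is passed so that the string is treated as content rather than as a file name or URL.

```r
library(XML)

# Hypothetical stand-in for the remote page.
sample_html <- '<html><body>
  <a href="/inf_corporativa71050_JAIME1CP1A.html">JAIME1CP1A</a>
  <a href="/inf_corporativa10400_INTEGRC1.html">INTEGRC1</a>
  <p>No link here</p>
</body></html>'

# Returns a handler list: `a` is called for every <a> node as the document
# is parsed, and `hrefs` is an accessor for what has been collected so far.
collect_hrefs <- function() {
  found <- character()
  list(
    a = function(node, ...) {
      # xmlGetAttr() returns NULL when href is missing, which c() ignores.
      found <<- c(found, xmlGetAttr(node, "href"))
      node
    },
    hrefs = function() found
  )
}

h <- collect_hrefs()
# The parsed tree itself is not needed; the handlers do the collecting.
invisible(htmlTreeParse(sample_html, asText = TRUE, handlers = h))
h$hrefs()
```

The closure keeps `found` private to each call of `collect_hrefs()`, so you get a fresh collector for every document you parse.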


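    # Option 1
    library(XML) # getHTMLLinks() comes from the XML package, not RCurl
    getHTMLLinks('http://www.bvl.com.pe/includes/empresas_todas.dat')

    # Option 2
    library(rvest)
    library(pipeR) # %>>% will be faster than %>%
    # read_html() replaces html(), which was deprecated in later rvest releases
    read_html("http://www.bvl.com.pe/includes/empresas_todas.dat") %>>%
        html_nodes("a") %>>%
        html_attr("href")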
