Extract Links from Webpage using R
Even easier with rvest:
    library(xml2)
    library(rvest)

    URL <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"

    pg <- read_html(URL)

    head(html_attr(html_nodes(pg, "a"), "href"))
    ## [1] "//stackoverflow.com"
    ## [2] "http://chat.stackoverflow.com"
    ## [3] "//stackoverflow.com"
    ## [4] "http://meta.stackoverflow.com"
    ## [5] "//careers.stackoverflow.com?utm_source=stackoverflow.com&utm_medium=site-ui&utm_campaign=multicollider"
    ## [6] "https://stackoverflow.com/users/signup?ssrc=site_switcher&returnurl=http%3a%2f%2fstackoverflow.com%2fquestions%2f3746256%2fextract-links-from-webpage-using-r"
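Some of the links in that output are protocol-relative (they start with //), and pages usually contain site-relative links as well. If absolute URLs are needed, one option is url_absolute() from xml2. The following is a minimal sketch; the inline HTML snippet and the names html, base and abs_links are assumptions made for illustration so the example runs without network access.

```r
library(xml2)
library(rvest)

# Hypothetical inline page, used instead of the live site
html <- '<html><body>
  <a href="//stackoverflow.com">Home</a>
  <a href="/about">About</a>
  <a href="http://meta.stackoverflow.com">Meta</a>
</body></html>'

# The URL the page would have been fetched from
base <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"

pg <- read_html(html)
hrefs <- html_attr(html_nodes(pg, "a"), "href")

# Resolve relative and protocol-relative links against the page URL
abs_links <- url_absolute(hrefs, base)
```

Links that are already absolute should pass through unchanged, so the same call can be applied to the whole vector.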
The documentation for htmlTreeParse (in the XML package) shows one method. Here's another:
    > library(XML)
    > url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
    > doc <- htmlParse(url)
    > links <- xpathSApply(doc, "//a/@href")
    > free(doc)
(The returned links form a named character vector in which every element is named "href". You can drop those names by passing links through as.vector.)
My previous reply:
One approach is to use Hadley Wickham's stringr package, which you can install with install.packages("stringr", dep=TRUE).
    > url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
    > html <- paste(readLines(url), collapse="\n")
    > library(stringr)
    > matched <- str_match_all(html, "<a href=\"(.*?)\"")
(I guess some people might not approve of using regular expressions to parse HTML here.)
matched is a list of matrices, one per input string in the vector html. Since html has length one here, matched has just one element. The matches for the first capture group are in column 2 of this matrix (and in general, the ith group appears in column i + 1).
    > links <- matched[[1]][, 2]
    > head(links)
    [1] "/users/login?returnurl=%2fquestions%2f3746256%2fextract-links-from-webpage-using-r"
    [2] "http://careers.stackoverflow.com"
    [3] "http://meta.stackoverflow.com"
    [4] "/about"
    [5] "/faq"
    [6] "/"
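To make the shape of the str_match_all result easier to see, here is a small sketch that runs the same pattern on a short inline string. The two-link snippet and the variable m are assumptions for illustration, not part of the original page.

```r
library(stringr)

# Hypothetical snippet standing in for the downloaded page
html <- '<a href="/about">About</a> <a href="/faq" class="nav">FAQ</a>'

matched <- str_match_all(html, "<a href=\"(.*?)\"")

# One matrix per input string: column 1 holds the full match,
# column 2 holds the first capture group (the href value)
m <- matched[[1]]
links <- m[, 2]
```

Note that this pattern only matches anchors whose first attribute is href, written with double quotes, which is one reason a parser-based approach tends to be more robust.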
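You might try the following, replacing "URL" with the page address and "SEARCHTERM" with the text the links should contain:

    library(rvest)

    htmlcode <- read_html("URL")
    nodes <- html_nodes(htmlcode, xpath = '//*[contains(@href, "SEARCHTERM")]') %>%
      html_attr("href")
    df <- as.data.frame(as.character(nodes))
    names(df) <- "link"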
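As a rough illustration of how the contains(@href, ...) filter behaves, the sketch below applies the same XPath to a small inline page. The HTML snippet and the search term "questions" are assumptions chosen for the example.

```r
library(rvest)

# Hypothetical inline page with three links, two of which contain "questions"
html <- '<html><body>
  <a href="/questions/1">Q1</a>
  <a href="/users/2">U2</a>
  <a href="/questions/3">Q3</a>
</body></html>'

page <- read_html(html)

# Keep only elements whose href attribute contains the search term
nodes <- html_nodes(page, xpath = '//*[contains(@href, "questions")]')

# Collect the matching href values in a one-column data frame
df <- data.frame(link = html_attr(nodes, "href"), stringsAsFactors = FALSE)
```

Because the XPath uses //* rather than //a, it would also pick up other elements that carry an href attribute, such as link tags in the page head.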