Extract Links from Webpage using R


Even easier with rvest:

library(xml2)
library(rvest)

URL <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
pg <- read_html(URL)
head(html_attr(html_nodes(pg, "a"), "href"))
## [1] "//stackoverflow.com"
## [2] "http://chat.stackoverflow.com"
## [3] "//stackoverflow.com"
## [4] "http://meta.stackoverflow.com"
## [5] "//careers.stackoverflow.com?utm_source=stackoverflow.com&utm_medium=site-ui&utm_campaign=multicollider"
## [6] "https://stackoverflow.com/users/signup?ssrc=site_switcher&returnurl=http%3a%2f%2fstackoverflow.com%2fquestions%2f3746256%2fextract-links-from-webpage-using-r"
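Note that several of the hrefs above are relative or protocol-relative. If you want absolute URLs, one possible follow-up (not part of the original answer) is xml2's url_absolute(), which resolves links against a base URL:

links <- html_attr(html_nodes(pg, "a"), "href")
# resolve relative and protocol-relative hrefs against the page URL
head(url_absolute(links, URL))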


The documentation for htmlTreeParse (in the XML package) shows one method. Here's another:

> library(XML)
> url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
> doc <- htmlParse(url)
> links <- xpathSApply(doc, "//a/@href")
> free(doc)

(You can drop the "href" names from the returned links by passing links through as.vector().)
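A minimal sketch of that step, continuing from the snippet above:

> links <- as.vector(links)   # strips the "href" names from each element
> head(links)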

My previous reply:

One approach is to use Hadley Wickham's stringr package, which you can install with install.packages("stringr", dep=TRUE).

> url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
> html <- paste(readLines(url), collapse="\n")
> library(stringr)
> matched <- str_match_all(html, "<a href=\"(.*?)\"")

(I guess some people might not approve of using regexps here.)

matched is a list of matrices, one per input string in the vector html; since html has length one here, matched has just one element. The matches for the first capture group are in column 2 of this matrix (and in general, the ith group appears in column i + 1).
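To make the indexing concrete, here is a tiny made-up example (not from the page above): column 1 holds the full match, column 2 the captured group.

> m <- str_match_all('<a href="/a"> <a href="/b">', "<a href=\"(.*?)\"")
> m[[1]][, 2]
[1] "/a" "/b"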

> links <- matched[[1]][, 2]
> head(links)
[1] "/users/login?returnurl=%2fquestions%2f3746256%2fextract-links-from-webpage-using-r"
[2] "http://careers.stackoverflow.com"
[3] "http://meta.stackoverflow.com"
[4] "/about"
[5] "/faq"
[6] "/"


You might try

library(rvest)

htmlcode <- read_html("URL")  # replace "URL" with the page address
# keep only the links whose href contains the search term
nodes <- html_nodes(htmlcode, xpath = '//*[contains(@href, "SEARCHTERM")]') %>% html_attr("href")
df <- data.frame(link = as.character(nodes))
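For example, filled in with the question's own URL (using "stackoverflow" as the search term here is just an illustration):

library(rvest)

htmlcode <- read_html("http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r")
nodes <- html_nodes(htmlcode, xpath = '//*[contains(@href, "stackoverflow")]') %>% html_attr("href")
df <- data.frame(link = as.character(nodes))
head(df$link)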