Scrape password-protected website in R
You can use RSelenium. I used the development version because it lets you run phantomjs
without a Selenium Server.
# Install RSelenium if required. You will need phantomjs in your path or follow instructions
# in package vignettes
# devtools::install_github("ropensci/RSelenium")
# login first
appURL <- 'http://subscribers.footballguys.com/amember/login.php'
library(RSelenium)
library(XML) # provides readHTMLTable
pJS <- phantom() # start phantomjs
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate(appURL)
remDr$findElement("id", "login")$sendKeysToElement(list("myusername"))
remDr$findElement("id", "pass")$sendKeysToElement(list("mypass"))
remDr$findElement("css", ".am-login-form input[type='submit']")$clickElement()

# then navigate to the page holding the table
appURL <- 'http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2'
remDr$navigate(appURL)
tableElem <- remDr$findElement("css", "table.datamedium")
res <- readHTMLTable(header = TRUE, tableElem$getElementAttribute("outerHTML")[[1]])

> res[[1]][1:5, ]
  Rank             Name Tm/Bye Age Exp Cmp Att  Cm%  PYd Y/Att PTD Int Rsh  Yd TD FantPt
1    1   Peyton Manning  DEN/4  38  17 415 620 66.9 4929  7.95  43  12  24   7  0 407.15
2    2       Drew Brees   NO/6  35  14 404 615 65.7 4859  7.90  37  16  22  44  1 385.35
3    3    Aaron Rodgers   GB/9  31  10 364 560 65.0 4446  7.94  33  13  52 224  3 381.70
4    4      Andrew Luck IND/10  25   3 366 610 60.0 4423  7.25  27  13  62 338  2 361.95
5    5 Matthew Stafford  DET/9  26   6 377 643 58.6 4668  7.26  32  19  34 102  1 358.60
Finally, when you are finished, close phantomjs:
pJS$stop()
If you want to use a traditional browser such as Firefox (for example, if you want to stick to the version on CRAN), you would use:
RSelenium::startServer()
remDr <- remoteDriver()
................
remDr$closeServer()
in place of the related phantomjs calls.
I don't have an account to test with, but maybe this will work:
library(httr)
library(XML)

handle <- handle("http://subscribers.footballguys.com")
path <- "amember/login.php"

# fields found in the login form.
login <- list(
  amember_login = "username",
  amember_pass = "password",
  amember_redirect_url = "http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2"
)

response <- POST(handle = handle, path = path, body = login)
Now the response object might hold what you need (or maybe you can query the page of interest directly after the login request; I am not sure the redirect will work, but it is a field in the web form), and handle can be re-used for subsequent requests, since it keeps the session cookies. I can't test it, but this approach works for me in many situations.
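If the redirect field does not take you to the projections page, re-using the handle for a second request might look roughly like the sketch below. It is untested against this site. `fetch_projections()` is a hypothetical helper name, and the default path and query are simply taken from the URL in the question. It assumes the login POST succeeded and that the handle carries the session cookie.

```r
# Sketch only: fetch_projections() is a hypothetical helper, not part of httr.
# It assumes `handle` was already used for a successful login POST, so the
# session cookie stored on the handle is sent along with this request.
fetch_projections <- function(handle,
                              path = "myfbg/myviewprojections.php",
                              query = list(projector = 2)) {
  # Same handle as the login request, different path.
  page <- httr::GET(handle = handle, path = path, query = query)

  # Turn HTTP errors (4xx/5xx) into R errors instead of parsing an error page.
  httr::stop_for_status(page)

  # Parse the raw HTML text explicitly, then pull out every <table>.
  doc <- XML::htmlParse(httr::content(page, as = "text"), asText = TRUE)
  XML::readHTMLTable(doc, header = TRUE)
}

# Usage (not run here, needs a valid account):
# tables <- fetch_projections(handle)
# tables[[1]][1:5, ]
```

Note that a 200 status does not prove the login worked; many sites simply serve the login form again, so it is worth checking that the returned list actually contains the expected table.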
You can output the table using XML:
> readHTMLTable(content(response))[[1]][1:5,]
  Rank             Name Tm/Bye Age Exp Cmp Att  Cm%  PYd Y/Att PTD Int Rsh  Yd TD FantPt
1    1   Peyton Manning  DEN/4  38  17 415 620 66.9 4929  7.95  43  12  24   7  0 407.15
2    2       Drew Brees   NO/6  35  14 404 615 65.7 4859  7.90  37  16  22  44  1 385.35
3    3    Aaron Rodgers   GB/9  31  10 364 560 65.0 4446  7.94  33  13  52 224  3 381.70
4    4      Andrew Luck IND/10  25   3 366 610 60.0 4423  7.25  27  13  62 338  2 361.95
5    5 Matthew Stafford  DET/9  26   6 377 643 58.6 4668  7.26  32  19  34 102  1 358.60
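Depending on the arguments you pass, readHTMLTable tends to return every column as character or factor, so the numbers above may not be usable in arithmetic straight away. A minimal sketch of one way to convert them with base R follows. `convert_columns()` is a hypothetical helper, and the small `raw` data frame is a hand-typed stand-in for two rows of the table, not scraped output.

```r
# Sketch only: convert_columns() is a hypothetical helper built on base R.
# Each column is passed through type.convert(), which picks integer, double
# or character depending on the values; as.is = TRUE keeps text as character
# rather than turning it into a factor.
convert_columns <- function(df) {
  df[] <- lapply(df, function(col) type.convert(as.character(col), as.is = TRUE))
  df
}

# Hand-typed stand-in for a scraped table whose columns are all character.
raw <- data.frame(
  Rank   = c("1", "2"),
  Name   = c("Peyton Manning", "Drew Brees"),
  FantPt = c("407.15", "385.35"),
  stringsAsFactors = FALSE
)

clean <- convert_columns(raw)
str(clean)
```

Columns such as Tm/Bye mix text and digits, so they stay character.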