
Using R to download data automatically


First, check whether the website has a robots.txt. Then read the terms and conditions, if there are any. It is also always important to throttle your requests, as the code below does with Sys.sleep().
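If you want to automate that first check, the robotstxt package (not used in the code below, just an assumption on my part) can do it in one call:

library(robotstxt)

# TRUE means crawling the path is not disallowed by the site's robots.txt
paths_allowed("https://aps.dac.gov.in/LUS/Public/Reports.aspx")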

Once you have checked all the terms and conditions, the code below should get you started:

library(httr)
library(xml2)

link <- "https://aps.dac.gov.in/LUS/Public/Reports.aspx"

r <- GET(link)
doc <- read_html(content(r, "text"))
#write_html(doc, "temp.html")

# Options of the three dropdowns (state, year, report format)
states <- sapply(xml_find_all(doc, ".//select[@name='DdlState']/option"), function(x)
    setNames(xml_attr(x, "value"), xml_text(x)))
states <- states[!grepl("^Select", names(states))]

years <- sapply(xml_find_all(doc, ".//select[@name='DdlYear']/option"), function(x)
    setNames(xml_attr(x, "value"), xml_text(x)))
years <- years[!grepl("^Select", names(years))]

rptfmt <- sapply(xml_find_all(doc, ".//select[@name='DdlFormat']/option"), function(x)
    setNames(xml_attr(x, "value"), xml_text(x)))

# Report links rendered in the TreeView ("Standard Reports" nodes)
stdrpts <- unlist(lapply(xml_find_all(doc, ".//td/a"), function(x) {
    id <- xml_attr(x, "id")
    if (grepl("^TreeView1t", id)) return(setNames(id, xml_text(x)))
}))

# Hidden ASP.NET form fields (__VIEWSTATE, __EVENTVALIDATION, ...)
get_vs <- function(doc) sapply(xml_find_all(doc, ".//input[@type='hidden']"), function(x)
    setNames(xml_attr(x, "value"), xml_attr(x, "name")))

fmt <- rptfmt[2]  # Excel format

for (sn in names(states)) {
    for (yn in names(years)) {
        for (srn in seq_along(stdrpts)) {
            s <- states[sn]
            y <- years[yn]
            sr <- stdrpts[srn]
            # First POST: simulate selecting the state (ASP.NET postback)
            r <- POST(link,
                body=as.list(c("__EVENTTARGET"="DdlState",
                    "__EVENTARGUMENT"="",
                    "__LASTFOCUS"="",
                    "TreeView1_ExpandState"="ennnn",
                    "TreeView1_SelectedNode"="",
                    "TreeView1_PopulateLog"="",
                    get_vs(doc),
                    DdlState=unname(s),
                    DdlYear=0,
                    DdlFormat=1)),
                encode="form")
            doc <- read_html(content(r, "text"))
            # Second POST: click the report node with state, year and format set
            treeview <- c("__EVENTTARGET"="TreeView1",
                "__EVENTARGUMENT"=paste0("sStandard Reports\\", srn),
                "__LASTFOCUS"="",
                "TreeView1_ExpandState"="ennnn",
                "TreeView1_SelectedNode"=unname(stdrpts[srn]),
                "TreeView1_PopulateLog"="")
            vs <- get_vs(doc)
            ddl <- c(DdlState=unname(s), DdlYear=unname(y), DdlFormat=unname(fmt))
            r <- POST(link, body=as.list(c(treeview, vs, ddl)), encode="form")
            # Save the response only if the server returned an Excel file
            if (r$headers$`content-type`=="application/vnd.ms-excel")
                writeBin(content(r, "raw"), paste0(sn, "_", yn, "_", names(stdrpts)[srn], ".xls"))
            Sys.sleep(5)  # throttle the requests
        }
    }
}


Here is my best attempt:

If you look at the network activity in your browser's developer tools, you will see that a POST request is sent:

[Screenshot of the browser's network tab showing the POST request]

Request body data:

If you scroll down, you will see the form data that is used:

body <- list(
  `__EVENTTARGET`        = "TreeView1",
  # note: this is the URL-encoded value copied from the browser
  # (it decodes to "sStandard Reports\4")
  `__EVENTARGUMENT`      = "sStandard+Reports%5C4",
  `__LASTFOCUS`          = "",
  TreeView1_ExpandState  = "ennnn",
  TreeView1_SelectedNode = "TreeView1t4",
  TreeView1_PopulateLog  = "",
  `__VIEWSTATE`          = "",
  `__VIEWSTATEGENERATOR` = "",
  `__VIEWSTATEENCRYPTED` = "",
  `__EVENTVALIDATION`    = "",
  DdlState               = "35",
  DdlYear                = "2001",
  DdlFormat              = "1"
)

There are certain session-related values:

attr_names <- c("__EVENTVALIDATION", "__VIEWSTATEGENERATOR", "__VIEWSTATE", "__VIEWSTATEENCRYPTED")

You could add them like this:

setAttrNames <- function(attr_name) {
  name <- doc %>%
    html_nodes(xpath = glue("//*[@id = '{attr_name}']")) %>%
    html_attr(name = "value")
  body[[attr_name]] <<- name
}

Then you can add these session-specific values:

library(httr)
library(rvest)
library(glue)

url <- "https://aps.dac.gov.in/LUS/Public/Reports.aspx"
doc <- url %>% GET %>% content("text") %>% read_html

sapply(attr_names, setAttrNames)
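As a side note, the same fill-in can be done without the <<- side effect; this is just a sketch using the same attr_names, doc and body as above:

library(rvest)
library(glue)

# Collect the hidden-field values in one named vector instead of
# assigning into `body` from inside the function with <<-
session_values <- sapply(attr_names, function(attr_name) {
  doc %>%
    html_nodes(xpath = glue("//*[@id = '{attr_name}']")) %>%
    html_attr("value")
})

body[attr_names] <- as.list(session_values)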

Sending the request:

response <- POST(
  url = url,
  encode = "form",
  body = body
)

response$status_code # still indicates that we have an error in the request
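To see what the server actually complains about, it helps to look at more than the bare status code; a small sketch using httr's helpers on the response from above:

library(httr)

http_status(response)             # human-readable status
headers(response)$`content-type`  # what the server sent back
# the body often contains the ASP.NET error page
substr(content(response, as = "text"), 1, 500)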

Follow-up ideas:

  1. I checked for cookies. There is a session cookie, but it does not seem to be necessary for the request; a sketch for inspecting it follows this list.

  2. Adding headers.
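For reference, a minimal sketch for looking at that session cookie with httr (httr generally carries cookies over between requests to the same host within a session, so nothing has to be set by hand):

library(httr)

# Inspect the cookies the server sets on the initial GET
r <- GET("https://aps.dac.gov.in/LUS/Public/Reports.aspx")
cookies(r)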

Trying to set the request headers

header <- c(
  Host                        = "aps.dac.gov.in",
  Connection                  = "keep-alive",
  `Content-Length`            = "3437",
  `Cache-Control`             = "max-age=0",
  Origin                      = "https://aps.dac.gov.in",
  `Upgrade-Insecure-Requests` = "1",
  `Content-Type`              = "application/x-www-form-urlencoded",
  `User-Agent`                = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
  `Sec-Fetch-User`            = "?1",
  Accept                      = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
  `Sec-Fetch-Site`            = "same-origin",
  `Sec-Fetch-Mode`            = "navigate",
  Referer                     = "https://aps.dac.gov.in/LUS/Public/Reports.aspx",
  `Accept-Encoding`           = "gzip, deflate, br",
  `Accept-Language`           = "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"
)

hdrs <- header %>% add_headers()

response <- POST(
  url = url,
  encode = "form",
  body = body,
  hdrs
)

But I get a timeout for this request.
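One thing that might be worth trying (this is only a guess, I have not verified it against this site): the hard-coded Content-Length of 3437 will rarely match the body that is actually sent, and a mismatch can make the server wait for bytes that never arrive. Dropping that header so curl computes the length itself, together with a longer timeout(), is a cheap experiment:

library(httr)

# Same `header` and `body` as above; drop the hard-coded Content-Length
hdrs2 <- add_headers(header[names(header) != "Content-Length"])

response <- POST(
  url = url,
  encode = "form",
  body = body,
  hdrs2,
  timeout(60)  # give the server up to 60 seconds
)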

Note: The site does not seem to have a robots.txt. But check the Terms and Conditions of the site.


I tried running these 2 lines myself at work and got a somewhat more explicit error message than you did:

Could not open chrome browser.
Client error message:
     Summary: UnknownError
     Detail: An unknown server-side error occurred while processing the command.
     Further Details: run errorDetails method
Check server log for further details.


It might be that, if you are at work without admin privileges, R can't create a child process (see the quick probe below).
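A quick way to test whether R is allowed to spawn a child process at all (a trivial probe, nothing RSelenium-specific):

# If even this fails, R cannot start child processes in this
# environment, and rsDriver() will not be able to launch a browser either.
system2("whoami", stdout = TRUE)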

As a matter of fact, I used to run into awful problems myself trying to build a bot with RSelenium. rsDriver() was not consistent at all and kept crashing; I had to wrap it in a loop with error catching to keep it running (see the sketch below), and then track down and delete gigabytes of temp files manually. I tried installing Docker and spent a lot of time on the setup, only to find it wasn't supported on my non-professional Windows edition.
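The retry wrapper I mean looked roughly like this; treat it as a sketch from memory (browser, port and wait time are arbitrary choices, not recommendations):

library(RSelenium)

driver <- NULL

# Keep retrying rsDriver() until a session actually comes up;
# this was the workaround for its intermittent crashes.
while (is.null(driver)) {
  driver <- tryCatch(
    rsDriver(browser = "chrome", port = 4567L, verbose = FALSE),
    error = function(e) {
      message("rsDriver failed, retrying: ", conditionMessage(e))
      Sys.sleep(5)
      NULL
    }
  )
}

remDr <- driver$client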

Solution: Selenium from Python is very well documented, never crashes, works like a charm. Coding in the interactive Spyder editor from Anaconda feels almost like R.

And of course you can use something like system("python myscript.py") from R to start the process and get the resulting files back into R if you wish.
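A minimal sketch of that round trip; myscript.py and results.csv are hypothetical names, and I am assuming the Python script writes its output to a CSV:

# Run the Python scraper and wait for it to finish
exit_code <- system("python myscript.py")

# Read whatever the script produced back into R
if (exit_code == 0) {
  results <- read.csv("results.csv")
}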

EDIT: No admin privileges are required at all for Anaconda or Selenium; I run it myself from work without any problem. If, like me, you have trouble with pip install commands being SSL-blocked, you can get around that with the --trusted-host argument.