how to scrape all files in a catalog series from the national archives (archives.gov) with R


There's no need to load and parse the page here since the records are loaded via a simple Ajax request.

To see the requests, simply monitor them via devtools and select the first one returning some JSON. Then use the jsonlite library to request the same URL with R. It will automatically parse the result.

To list all the files (description + URL) for the 153 entries:

    library(jsonlite)

    # Raise the download timeout; the value is in seconds (R's default is 60).
    options(timeout = 600)

    # List every file unit in the series (parent NAID 2456161).
    json <- fromJSON("https://catalog.archives.gov/OpaAPI/iapi/v1?action=search&f.level=fileUnit&f.parentNaId=2456161&q=*:*&offset=0&rows=10000&tabType=all")
    ids <- json$opaResponse$results$result$naId

    for (id in ids) {  # each file unit
        json <- fromJSON(sprintf("https://catalog.archives.gov/OpaAPI/iapi/v1/id/%s", id))
        records <- json$opaResponse$content$objects$objects$object

        for (r in seq_len(nrow(records))) {  # each record
            # print the file description and URL
            print(records[r, 'description'])
            print(records[r, '@renditionBaseUrl'])
        }
    }


If you are familiar with httr, you might consider using the National Archives Catalog API to interact with their server. As I read their website, it lets you query and request data directly, so you would not have to scrape the web page.

I played around in the sandbox without an api key and got this far translating your webpage query to the api query:

https://catalog.archives.gov/api/v1?&q=*:*&resultTypes=fileUnit&parentNaId=2456161

Unfortunately, that doesn't recognize the parentNaId field name...perhaps that's a result of not having permission without an api key. In any case, I don't know R myself, so you'll have to work out how to use all of this in httr.
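
For readers who do want to try that query from R, here is a minimal httr sketch. It assumes the endpoint quoted above is still reachable and returns JSON; fetch_catalog() is a hypothetical helper name introduced for illustration, and the parameters are copied unchanged from the URL above, including the parentNaId one that the sandbox did not recognize.

```r
# Sketch only: sends the sandbox query quoted above through httr.
# fetch_catalog() is a hypothetical helper, not part of httr or the NARA API.
base_url <- "https://catalog.archives.gov/api/v1"

# Same parameters as the URL above, expressed as a named list.
query_params <- list(
  q = "*:*",
  resultTypes = "fileUnit",
  parentNaId = "2456161"
)

fetch_catalog <- function(params, url = base_url) {
  # httr URL-encodes the query values before sending the request
  resp <- httr::GET(url, query = params, httr::timeout(60))
  # turn HTTP errors (4xx/5xx) into R errors instead of parsing an error page
  httr::stop_for_status(resp)
  # parse the JSON body into nested lists / data frames
  jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Usage (needs network access and the httr and jsonlite packages):
# result <- fetch_catalog(query_params)
# str(result, max.level = 2)
```

Passing the parameters as a list rather than pasting them into the URL makes it easy to swap a field name in or out while experimenting.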

I hope this helps a bit.


From the folks who wrote the API over at the National Archives and Records Administration:

Hi Anthony,

There's no need to scrape; NARA's catalog has an open API. If I understand right, you want to download all of the media files (what our catalog calls "objects") in all the file units in the series "Home Mortgage Disclosure Data Files" (NAID 2456161).

The API allows fielded search on any field in the data, so rather than using a search parameter like "parentNaId", the best way to do that query is to search on that specific field, i.e., bring back all records where the parent series NAID is 2456161. If you open up one of those file units by its identifier (e.g. https://catalog.archives.gov/api/v1?naIds=2580657), you can see that the field containing the parent series is called "description.fileUnit.parentSeries". So all of your file unit records and their objects will be returned by https://catalog.archives.gov/api/v1?description.fileUnit.parentSeries=2456161. If you want back just the objects without the file unit records, you can add the "&type=object" parameter. Or, if you want the file unit metadata, you can restrict the results with "type=description" instead, since every file unit record also contains all the data for its child objects. It looks like there are over 1000 results, so you will also need to use the "rows" parameter to ask for all the results in one query, or paginate with the "offset" parameter and smaller "rows" values, since the default response is only the first 10 results.
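
As a rough illustration of the pagination described above, the R sketch below builds one request URL per page from the "rows" and "offset" parameters. The helper names (page_offsets, page_url, fetch_all_pages) and the page size of 100 are assumptions made for this example, and the total number of results has to be supplied by the caller, because the response field that reports it is not named here.

```r
# Sketch only: page through the fielded search with "rows" and "offset".
# All helper names below are hypothetical, chosen for this example.
base_url <- "https://catalog.archives.gov/api/v1"

# Offsets for each page: 0, rows, 2 * rows, ... up to the last partial page.
page_offsets <- function(total, rows = 100) {
  if (total <= 0) return(integer(0))
  seq(0, total - 1, by = rows)
}

# Build the request URL for one page (vectorized over offset).
page_url <- function(offset, rows = 100, type = "object") {
  sprintf(
    "%s?description.fileUnit.parentSeries=2456161&type=%s&rows=%d&offset=%d",
    base_url, type, as.integer(rows), as.integer(offset)
  )
}

# Fetch every page and return a list with one parsed response per page.
# `total` is an assumption supplied by the caller, not read from the API.
fetch_all_pages <- function(total, rows = 100, type = "object") {
  lapply(page_url(page_offsets(total, rows), rows, type), jsonlite::fromJSON)
}

# Usage (needs network access and the jsonlite package):
# pages <- fetch_all_pages(total = 1100)
```

Smaller pages mean more requests but keep each response short; passing type = "description" instead would return the file unit records rather than the bare objects.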

Within the object metadata, you will find the fields with the URLs you can use to download the media, as well as other metadata that may be of interest. For example, note that some of these objects are considered electronic records, as in the original archival records from agencies, while others are NARA-created technical documentation. This is noted in the "designator" field.

Let me know if you still have any questions.

Thanks! Dominic