
Scraping leaderboard table on golf website in R


As already mentioned, this page is generated dynamically by JavaScript.
Even the address of the JSON file seems to be dynamic, and the address you're trying to open is no longer valid:

https://lbdata.pgatour.com/2021/r/003/leaderboard.json?userTrackingId=exp=1612495792~acl=*~hmac=722f704283f795e8121198427386ee075ce41e93d90f8979fd772b223ea11ab9

An error occurred while processing your request. Reference #199.cf05d517.1613439313.4ed8cf21
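Since such a signed address can expire at any time, it may help to guard the JSON request so that a failure is reported instead of stopping the script. The following is only a minimal sketch: `safe_from_json` is a hypothetical helper name, not part of jsonlite, and it simply wraps `jsonlite::fromJSON()` in `tryCatch()`.

```r
library(jsonlite)

# Hypothetical helper (not part of jsonlite): returns the parsed JSON,
# or NULL if the download or the parsing fails for any reason.
safe_from_json <- function(url) {
  tryCatch(
    fromJSON(url),
    error = function(e) {
      message("Could not read JSON: ", conditionMessage(e))
      NULL
    }
  )
}

# Usage (commented out because it needs network access):
# lb <- safe_from_json("https://lbdata.pgatour.com/2021/r/003/leaderboard.json")
# if (is.null(lb)) message("The address is probably no longer valid.")
```

A `NULL` result then signals that the address has to be looked up again, or that another approach is needed.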

To get the data, you could use RSelenium after installing a Docker Selenium server.
The installation is straightforward, and Docker is designed to make images work out of the box.

After Docker installation, running the Selenium server is as simple as:

docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.0

Note that the whole setup requires over 2 GB of disk space.

Selenium emulates a web browser and, among other things, lets you retrieve the final HTML content of the page after the JavaScript has been rendered:

library(RSelenium)
library(rvest)

remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L,
  browserName = "firefox"
)

# Open connexion to Selenium server
remDr$open()
remDr$getStatus()
remDr$navigate("https://www.pgatour.com/leaderboard.html")

players <- xml2::read_html(remDr$getPageSource()[[1]]) %>%
  html_nodes(".player-name-col") %>%
  html_text()

total <- xml2::read_html(remDr$getPageSource()[[1]]) %>%
  html_nodes(".total") %>%
  html_text()

data.frame(players = players, total = total[-1])

                     players total
1        Daniel Berger  (PB)   -18
2     Maverick McNealy  (PB)   -16
3      Patrick Cantlay  (PB)   -15
4        Jordan Spieth  (PB)   -15
5           Paul Casey  (PB)   -14
6         Nate Lashley  (PB)   -14
7      Charley Hoffman  (PB)   -13
8     Cameron Tringale  (PB)   -13
...

As the table doesn't use the table tag, html_table doesn't work and columns need to be extracted individually.
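The column-by-column extraction can be tried without a Selenium server on a small piece of inline HTML. This is only a sketch: the markup and the player names below are hypothetical placeholders that reuse the `.player-name-col` and `.total` class names from above, and the real page structure may differ.

```r
library(rvest)

# Hypothetical markup: a "table" built from div/span elements rather than
# a <table> tag. The first ".total" element plays the role of the header.
html <- '
<div class="leaderboard">
  <div class="header"><span class="total">TOTAL</span></div>
  <div class="row">
    <span class="player-name-col"> Player A </span><span class="total">-18</span>
  </div>
  <div class="row">
    <span class="player-name-col"> Player B </span><span class="total">-16</span>
  </div>
</div>'

page <- read_html(html)

# Extract each column separately with its CSS selector
players <- page %>% html_nodes(".player-name-col") %>% html_text(trim = TRUE)
total   <- page %>% html_nodes(".total") %>% html_text(trim = TRUE)

# Drop the header match from the totals, then combine the columns
leaderboard <- data.frame(
  players = players,
  total = as.integer(total[-1]),
  stringsAsFactors = FALSE
)
print(leaderboard)
```

Parsing the page source once and reusing the parsed document for every column also avoids reading the same HTML several times.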


This page uses JavaScript to render its content, and the data you seek is stored in a JSON file. Using your browser's developer tools and looking at the Network tab, you should be able to find the link to the "leaderboard.json" file.

You can access this file directly:

jsonlite::fromJSON("https://lbdata.pgatour.com/2021/r/003/leaderboard.json?userTrackingId=exp=1612495792~acl=*~hmac=722f704283f795e8121198427386ee075ce41e93d90f8979fd772b223ea11ab9")

It is a pretty complicated list but you should be able to navigate through the different elements to find the desired information.
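As a rough illustration of how such a nested list can be explored, here is a sketch on a small inline JSON string. The field names (`leaderboard`, `rows`, `playerName`, `total`) and the values are hypothetical; the real file is much larger and its structure has to be checked with `str()`.

```r
library(jsonlite)

# Hypothetical, heavily simplified payload; the real leaderboard.json
# will have different element names and far more nesting.
txt <- '{
  "leaderboard": {
    "tournament": "Example Open",
    "rows": [
      {"playerName": "Player A", "total": -18},
      {"playerName": "Player B", "total": -16}
    ]
  }
}'

lb <- fromJSON(txt)

# Look at the first levels of the list before digging deeper
str(lb, max.level = 2)

# fromJSON() simplifies an array of records into a data frame by default
rows <- lb$leaderboard$rows
print(rows[, c("playerName", "total")])
```

Calling `str()` with a small `max.level` and increasing it step by step is usually enough to locate the element that holds the player rows.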