Scraping data from a site where URL doesn't change on clicking 'Show More'

Here is a mimic of the POST request, using the API info as seen in the network tab. I have stripped it back to the headers that seem to be required.

import requests

url = 'https://samizdat-graphql.nytimes.com/graphql/v2'

# Only the headers that seem to be required, as captured from the network tab
headers = {
    'nyt-app-type': 'project-vi',
    'nyt-app-version': '0.0.3',
    'nyt-token': 'MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAlYOpRoYg5X01qAqNyBDM32EI/E77nkFzd2rrVjhdi/VAZfBIrPayyYykIIN+d5GMImm3wg6CmTTkBo7ixmwd7Xv24QSDpjuX0gQ1eqxOEWZ0FHWZWkh4jfLcwqkgKmfHJuvOctEiE/Wic5Qrle323SMDKF8sAqClv8VKA8hyrXHbPDAlAaxq3EPOGjJqpHEdWNVg2S0pN62NSmSudT/ap/BqZf7FqsI2cUxv2mUKzmyy+rYwbhd8TRgj1kFprNOaldrluO4dXjubJIY4qEyJY5Dc/F03sGED4AiGBPVYtPh8zscG64yJJ9Njs1ReyUCSX4jYmxoZOnO+6GfXE0s2xQIDAQAB'
}

# Raw JSON payload copied verbatim from the network tab
data = '''{"operationName":"SearchRootQuery","variables":{"first":10,"sort":"best","beginDate":"20180101","text":"trump","cursor":"YXJyYXljb25uZWN0aW9uOjk="},"extensions":{"persistedQuery":{"version":1,"sha256Hash":"d2895d5a5d686528b9b548f018d7d0c64351ad644fa838384d94c35c585db813"}}}'''

with requests.Session() as s:
    resp = s.post(url, headers=headers, data=data)
    print(resp.json())
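Incidentally, the cursor value in that payload looks like a standard Relay-style offset cursor: base64-decoding YXJyYXljb25uZWN0aW9uOjk= gives arrayconnection:9. If that assumption holds, you can page through the results by building the cursor yourself. A minimal sketch (make_cursor and fetch_page are hypothetical helpers; whether the endpoint honours arbitrary offsets is untested):

import base64
import requests

url = 'https://samizdat-graphql.nytimes.com/graphql/v2'

def make_cursor(offset):
    # Assumption: Relay-style cursor, i.e. base64 of "arrayconnection:<offset>",
    # based on decoding the cursor captured in the network tab
    return base64.b64encode(f'arrayconnection:{offset}'.encode()).decode()

def fetch_page(session, headers, offset, page_size=10):
    payload = {
        "operationName": "SearchRootQuery",
        "variables": {
            "first": page_size,
            "sort": "best",
            "beginDate": "20180101",
            "text": "trump",
            "cursor": make_cursor(offset),
        },
        "extensions": {
            "persistedQuery": {
                "version": 1,
                "sha256Hash": "d2895d5a5d686528b9b548f018d7d0c64351ad644fa838384d94c35c585db813",
            },
        },
    }
    # json= serialises the dict and sets Content-Type: application/json for us
    return session.post(url, headers=headers, json=payload).json()

Pass in the same headers dict as above. Note that the persisted-query hash may change if the site's query changes, so re-capture it from the network tab if requests start failing.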


To scrape all the article links, i.e. the href attributes, from the URL by clicking on the button with text SHOW MORE, you can use the following solution:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://www.nytimes.com/search?%20endDate=20181231&query=trump&sort=best&startDate=20180101")

# Count the article links currently rendered on the page
myLength = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]"))))

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        # Click Show More, then wait until more links than before are present
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Show More']"))).click()
        WebDriverWait(driver, 20).until(lambda driver: len(driver.find_elements_by_xpath("//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]")) > myLength)
        titles = driver.find_elements_by_xpath("//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]")
        myLength = len(titles)
    except TimeoutException:
        # No clickable Show More button / no new results: we have everything
        break

for title in titles:
    print(title.get_attribute("href"))
driver.quit()
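A side note if you are on a current Selenium release: the chrome_options and executable_path keyword arguments and the find_elements_by_xpath helper were all removed in Selenium 4, so the setup above needs minor translation. A rough equivalent (assuming Selenium Manager can resolve chromedriver for you):

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
# Selenium 4: options= replaces chrome_options=, and Selenium Manager
# (or an explicit Service object) replaces executable_path=
driver = webdriver.Chrome(options=options)
driver.get("https://www.nytimes.com/search?%20endDate=20181231&query=trump&sort=best&startDate=20180101")

# find_elements(By.XPATH, ...) replaces find_elements_by_xpath(...)
links = driver.find_elements(By.XPATH, "//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]")
print(len(links))
driver.quit()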


It seems like your target resource gives us a nice API for its articles.

It will be much easier to use it instead of Selenium.

You can open that page in Chrome, then open Dev Tools -> Network. Click on "Show more" and you will see an API request named v2 (it looks like a GraphQL gateway).

Something like

{    "operationName":"SearchRootQuery",    "variables":{        "first":10,        "sort":"best",        "beginDate":"20180101",        "endDate":"20181231",        "text":"trump" ...}}

You can mimic that request but ask for as many "first" articles as you want.
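For example, a single request asking for 100 results at once might look like the sketch below, reusing url, headers and data from the requests snippet earlier in this thread (whether the server caps "first" is an assumption you would need to verify):

import json
import requests

payload = json.loads(data)                 # reuse the raw payload string from the first snippet
payload["variables"]["first"] = 100        # ask for 100 articles in one go
payload["variables"].pop("cursor", None)   # drop the cursor to start from the first page

resp = requests.post(url, headers=headers, json=payload)
print(resp.json())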

EDIT:

You can right-click the request in DevTools and select "Copy as cURL", then paste it into your terminal. That way you can see how the request works.

After that you can use a library like requests to do it from your code.
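As a minimal sketch of that translation: each -H 'name: value' flag in the copied command becomes a headers entry, and the --data (or --data-raw) argument becomes the request body. The URL and header names below are the ones from the v2 request above; the placeholder values are whatever your copied command contains:

import requests

url = 'https://samizdat-graphql.nytimes.com/graphql/v2'
headers = {
    'nyt-app-type': 'project-vi',
    'nyt-app-version': '0.0.3',
    'nyt-token': '<token copied from the cURL command>',   # placeholder
}
body = '<JSON payload copied from the cURL command>'       # placeholder

resp = requests.post(url, headers=headers, data=body)
print(resp.status_code)
print(resp.json())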