Scraping data from a site where URL doesn't change on clicking 'Show More'

Here is a mimic of the POST request, using the API info as seen in the network tab. I have stripped it back to the headers that seem to be required.

import requests

url = 'https://samizdat-graphql.nytimes.com/graphql/v2'

# Only the headers that seem to be required, as captured from the network tab
headers = {
    'nyt-app-type': 'project-vi',
    'nyt-app-version': '0.0.3',
    'nyt-token': 'MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAlYOpRoYg5X01qAqNyBDM32EI/E77nkFzd2rrVjhdi/VAZfBIrPayyYykIIN+d5GMImm3wg6CmTTkBo7ixmwd7Xv24QSDpjuX0gQ1eqxOEWZ0FHWZWkh4jfLcwqkgKmfHJuvOctEiE/Wic5Qrle323SMDKF8sAqClv8VKA8hyrXHbPDAlAaxq3EPOGjJqpHEdWNVg2S0pN62NSmSudT/ap/BqZf7FqsI2cUxv2mUKzmyy+rYwbhd8TRgj1kFprNOaldrluO4dXjubJIY4qEyJY5Dc/F03sGED4AiGBPVYtPh8zscG64yJJ9Njs1ReyUCSX4jYmxoZOnO+6GfXE0s2xQIDAQAB'
}

# Raw JSON payload copied verbatim from the network tab
data = '''{"operationName":"SearchRootQuery","variables":{"first":10,"sort":"best","beginDate":"20180101","text":"trump","cursor":"YXJyYXljb25uZWN0aW9uOjk="},"extensions":{"persistedQuery":{"version":1,"sha256Hash":"d2895d5a5d686528b9b548f018d7d0c64351ad644fa838384d94c35c585db813"}}}'''

with requests.Session() as s:
    resp = s.post(url, headers=headers, data=data)
    print(resp.json())
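Incidentally, the cursor value in that payload looks like a standard Relay-style offset cursor: base64-decoding YXJyYXljb25uZWN0aW9uOjk= gives arrayconnection:9. If that assumption holds, you can page through the results by building the cursor yourself. A minimal sketch (make_cursor and fetch_page are hypothetical helpers; whether the endpoint honours arbitrary offsets is untested):

import base64
import requests

url = 'https://samizdat-graphql.nytimes.com/graphql/v2'

def make_cursor(offset):
    # Assumption: Relay-style cursor, i.e. base64 of "arrayconnection:<offset>",
    # based on decoding the cursor captured in the network tab
    return base64.b64encode(f'arrayconnection:{offset}'.encode()).decode()

def fetch_page(session, headers, offset, page_size=10):
    payload = {
        "operationName": "SearchRootQuery",
        "variables": {
            "first": page_size,
            "sort": "best",
            "beginDate": "20180101",
            "text": "trump",
            "cursor": make_cursor(offset),
        },
        "extensions": {
            "persistedQuery": {
                "version": 1,
                "sha256Hash": "d2895d5a5d686528b9b548f018d7d0c64351ad644fa838384d94c35c585db813",
            },
        },
    }
    # json= serialises the dict and sets Content-Type: application/json for us
    return session.post(url, headers=headers, json=payload).json()

Pass in the same headers dict as above. Note that the persisted-query hash may change if the site's query changes, so re-capture it from the network tab if requests start failing.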


To scrape all the article links, i.e. the href attributes, from the URL by clicking on the button with text SHOW MORE, you can use the following solution:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://www.nytimes.com/search?%20endDate=20181231&query=trump&sort=best&startDate=20180101")

# Count the article links currently rendered on the page
myLength = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]"))))

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        # Click Show More, then wait until more links than before are present
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Show More']"))).click()
        WebDriverWait(driver, 20).until(lambda driver: len(driver.find_elements_by_xpath("//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]")) > myLength)
        titles = driver.find_elements_by_xpath("//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]")
        myLength = len(titles)
    except TimeoutException:
        # No clickable Show More button / no new results: we have everything
        break

for title in titles:
    print(title.get_attribute("href"))
driver.quit()
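A side note if you are on a current Selenium release: the chrome_options and executable_path keyword arguments and the find_elements_by_xpath helper were all removed in Selenium 4, so the setup above needs minor translation. A rough equivalent (assuming Selenium Manager can resolve chromedriver for you):

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
# Selenium 4: options= replaces chrome_options=, and Selenium Manager
# (or an explicit Service object) replaces executable_path=
driver = webdriver.Chrome(options=options)
driver.get("https://www.nytimes.com/search?%20endDate=20181231&query=trump&sort=best&startDate=20180101")

# find_elements(By.XPATH, ...) replaces find_elements_by_xpath(...)
links = driver.find_elements(By.XPATH, "//main[@id='site-content']//figure[@class='css-rninck toneNews']//following::a[1]")
print(len(links))
driver.quit()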


It seems like your target resource gives us a nice API for its articles.

It will be much easier to use it instead of Selenium.

You can open that page in Chrome, then open Dev Tools -> Network. Click on "Show more" and you will see an API request named v2 (it looks like a GraphQL gateway).

Something like

{    "operationName":"SearchRootQuery",    "variables":{        "first":10,        "sort":"best",        "beginDate":"20180101",        "endDate":"20181231",        "text":"trump" ...}}

You can mimic that request but ask for as many "first" articles as you want.
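For example, a single request asking for 100 results at once might look like the sketch below, reusing url, headers and data from the requests snippet earlier in this thread (whether the server caps "first" is an assumption you would need to verify):

import json
import requests

payload = json.loads(data)                 # reuse the raw payload string from the first snippet
payload["variables"]["first"] = 100        # ask for 100 articles in one go
payload["variables"].pop("cursor", None)   # drop the cursor to start from the first page

resp = requests.post(url, headers=headers, json=payload)
print(resp.json())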

EDIT:

You can right-click the request in DevTools and select "Copy as cURL", then paste it into your terminal. That way you can see how the request works.

After that you can use a library like requests to do it from your code.
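As a minimal sketch of that translation: each -H 'name: value' flag in the copied command becomes a headers entry, and the --data (or --data-raw) argument becomes the request body. The URL and header names below are the ones from the v2 request above; the placeholder values are whatever your copied command contains:

import requests

url = 'https://samizdat-graphql.nytimes.com/graphql/v2'
headers = {
    'nyt-app-type': 'project-vi',
    'nyt-app-version': '0.0.3',
    'nyt-token': '<token copied from the cURL command>',   # placeholder
}
body = '<JSON payload copied from the cURL command>'       # placeholder

resp = requests.post(url, headers=headers, data=body)
print(resp.status_code)
print(resp.json())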