
Trouble scraping all the books from a section without hardcoding payload


Everything you need in order to get the carousel data is in the initial response when you request the product URL.

You need to get the full product HTML, scoop out the carousel data, and reuse parts of it to construct a valid payload for the follow-up POST requests.
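If you want to see what the page embeds before wiring up the payload, here's a minimal sketch. It assumes the full product page HTML is already in a page_html string, and it uses the same <!--CardsClient--> marker and div index as the full script further down; it just locates the carousel block and prints the top-level keys of its options JSON:

import json
import re

from bs4 import BeautifulSoup


def inspect_carousel(page_html: str) -> None:
    # Pull out the chunk of markup between the CardsClient marker and the
    # next <input>, which is where the carousel's data attributes live.
    block = re.search(r"<!--CardsClient-->(.*)<input", page_html).group(1)
    divs = BeautifulSoup(block, "lxml").find_all("div")
    options = json.loads(divs[3]["data-a-carousel-options"])
    # The top-level keys show which pieces feed the POST payload,
    # e.g. the "ajax" entry holds the id_list of carousel items.
    print(sorted(options))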

However, getting the product HTML is the hardest part, at least on my end, as Amazon will either block you or throw a CAPTCHA if you request the page too often.

Using a proxy or a VPN helps, and swapping the product URL sometimes helps too.
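For example, a minimal sketch of routing the request through a proxy with requests; the proxy address below is a placeholder, so substitute your own or rotate through a pool:

import requests

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/90.0.4430.93 Safari/537.36",
}

# Hypothetical proxy; replace with a real HTTP(S) proxy or a rotating pool.
proxy = "http://user:password@proxy.example.com:8080"

with requests.Session() as session:
    session.proxies.update({"http": proxy, "https": proxy})
    page = session.get(
        "https://www.amazon.de/Rust-Programming-Language-Covers-2018/dp/1718500440/",
        headers=headers,
        timeout=30,
    )
    # 200 means you got a page back; still check the body for a CAPTCHA before parsing.
    print(page.status_code)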

Summing up, the key is getting the product HTML; the subsequent requests are easy to make and, as far as I can tell, are not throttled.
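Because of that, it's worth saving the HTML to disk the first time a request succeeds and reusing the saved copy on later runs, so you don't keep hitting Amazon while tweaking the rest of the script. A minimal sketch, with an arbitrary cache file name:

import pathlib

import requests

product_url = "https://www.amazon.de/Rust-Programming-Language-Covers-2018/dp/1718500440/"
cache_file = pathlib.Path("product_page.html")  # arbitrary local cache file


def get_product_html(session: requests.Session, headers: dict) -> str:
    # Reuse the copy saved from an earlier successful request, if there is one.
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    page = session.get(product_url, headers=headers)
    page.raise_for_status()
    cache_file.write_text(page.text, encoding="utf-8")
    return page.text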

Here's how to get the data for and from the carousel:

import json
import re

import requests
from bs4 import BeautifulSoup


# The chunk is how many carousel items are requested at a time;
# this can vary from 4 - 10 items, as on the web-page.
# The second list that is yielded is used as the "indexes" key in the payload.
def get_idx_and_indexes(carousel_ids: list, chunk: int = 5) -> iter:
    for index in range(0, len(carousel_ids), chunk):
        tmp = carousel_ids[index:index + chunk]
        yield tmp, [carousel_ids.index(item) for item in tmp]


headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/90.0.4430.93 Safari/537.36",
}

product_url = "https://www.amazon.de/Rust-Programming-Language-Covers-2018/dp/1718500440/"

# Getting the product HTML, as it carries all the carousel data
with requests.Session() as session:
    session.get("https://www.amazon.com", headers=headers)  # warm-up request to pick up cookies
    page = session.get(product_url, headers=headers)

    # This is where the carousel data sits, along with everything needed to make
    # the follow-up requests, e.g. items, acp-params, linkparameters, marketplaceid etc.
    initial_soup = BeautifulSoup(
        re.search(r"<!--CardsClient-->(.*)<input", page.text).group(1),
        "lxml",
    ).find_all("div")

    # Preparing all the details for the subsequent requests to carousel_endpoint
    item_ids = json.loads(initial_soup[3]["data-a-carousel-options"])["ajax"]["id_list"]

    payload = {
        "aAjaxStrategy": "promise",
        "aCarouselOptions": initial_soup[3]["data-a-carousel-options"],
        "aDisplayStrategy": "swap",
        "aTransitionStrategy": "swap",
        "faceoutkataname": "GeneralFaceout",
        "faceoutspecs": "{}",
        "individuals": "0",
        "language": "en-US",
        "linkparameters": initial_soup[0]["data-acp-tracking"],
        "marketplaceid": initial_soup[3]["data-marketplaceid"],
        "name": "p13n-sc-shoveler_hgm4oj1hneo",  # this changes but can be ignored
        "offset": "6",
        "reftagprefix": "pd_sim",
    }

    headers.update(
        {
            "x-amz-acp-params": initial_soup[0]["data-acp-params"],
            "x-requested-with": "XMLHttpRequest",
        }
    )

    carousel_endpoint = "https://www.amazon.com/acp/p13n-desktop-carousel/funjjvdbohwkuezi/getCarouselItems"

    # Looping through the carousel ids and requesting the actual carousel data
    for ids, indexes in get_idx_and_indexes(item_ids):
        payload["ids"] = ids
        payload["indexes"] = indexes
        response = session.post(carousel_endpoint, json=payload, headers=headers)
        carousel = BeautifulSoup(response.text, "lxml").find_all("a")
        print("\n".join(a.getText() for a in carousel))

This should output:

Cracking the Coding Interview: 189 Programming Questions and Solutions
Gayle Laakmann McDowell
4.7 out of 5 stars 4,864
#1 Best Seller in Computer Hacking
$24.00
Container Security: Fundamental Technology Concepts that Protect Containerized Applications
Liz Rice
4.7 out of 5 stars 102
$35.42
Linux Bible
Christopher Negus
4.8 out of 5 stars 245
#1 Best Seller in Linux Servers
$31.99
System Design Interview – An insider's guide, Second Edition
Alex Xu
4.5 out of 5 stars 568
#1 Best Seller in Bioinformatics
$24.99
Ansible for DevOps: Server and configuration management for humans
Jeff Geerling
4.6 out of 5 stars 127
$17.35
Effective C: An Introduction to Professional C Programming
Robert C. Seacord
4.5 out of 5 stars 94
$32.99
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
Aurélien Géron
4.8 out of 5 stars 1,954
#1 Best Seller in Computer Neural Networks
$32.93
Head First Design Patterns: Building Extensible and Maintainable Object-Oriented Software
Eric Freeman
4.7 out of 5 stars 67
$41.45
Fluent Python: Clear, Concise, and Effective Programming
Luciano Ramalho
4.6 out of 5 stars 523
54 offers from $32.24
TCP/IP Illustrated, Volume 1: The Protocols (Addison-Wesley Professional Computing Series)
4.6 out of 5 stars 199
$63.26
Operating Systems: Three Easy Pieces
4.7 out of 5 stars 224
#1 Best Seller in Computer Operating Systems Theory
$24.61
Software Engineering at Google: Lessons Learned from Programming Over Time
Titus Winters
4.6 out of 5 stars 243
$44.52
and so on ...
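One caveat about the get_idx_and_indexes helper: list.index() always returns the first match, so if id_list ever contained duplicate ids, the indexes list would point at the wrong positions. A small alternative sketch that tracks the absolute positions explicitly and can stand in for the helper above:

# Pairs each carousel id with its absolute position up front, so duplicate ids
# in id_list can't be mapped to the wrong index.
def chunked_ids_with_indexes(carousel_ids: list, chunk: int = 5):
    indexed = list(enumerate(carousel_ids))
    for start in range(0, len(indexed), chunk):
        window = indexed[start:start + chunk]
        yield [item for _, item in window], [idx for idx, _ in window]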