Can't scrape product title from a webpage

python python-3.x selenium web-scraping python-requests

This is only a placeholder for research that might be useful to others looking at this Cloudflare bypass issue.

Use Case

Scraping information from a website that uses either a Cloudflare CAPTCHA or a JavaScript challenge for enhanced protection.

Python Requests

With a standard Python requests.get call, the Cloudflare service returns a 403 Forbidden status code.

import requests

URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
      '-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}

response = requests.get(URL, headers=headers)
print(f'Status Code: {response.status_code}')
print(f'Status Code Reason: {response.reason}')

# output
Status Code: 403
Status Code Reason: Forbidden

If we look at the response.headers we can see that a Cloudflare server is proxying our request to the target URL.

...continued from the code above
for key, value in response.headers.items():
    print(f'KEY NAME: {key}')
    print(f'KEY VALUE: {value}')
    print('-----------------------')

# output
KEY NAME: Date
KEY VALUE: Sun, 13 Jun 2021 16:39:03 GMT
-----------------------
KEY NAME: Content-Type
KEY VALUE: text/html; charset=UTF-8
-----------------------
KEY NAME: Transfer-Encoding
KEY VALUE: chunked
-----------------------
KEY NAME: Connection
KEY VALUE: close
-----------------------
KEY NAME: Permissions-Policy
KEY VALUE: accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()
-----------------------
KEY NAME: Cache-Control
KEY VALUE: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
-----------------------
KEY NAME: Expires
KEY VALUE: Thu, 01 Jan 1970 00:00:01 GMT
-----------------------
KEY NAME: X-Frame-Options
KEY VALUE: SAMEORIGIN
-----------------------
KEY NAME: cf-request-id
KEY VALUE: 0aa7d6c7c4000007ff7201b000000001
-----------------------
KEY NAME: Expect-CT
KEY VALUE: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
-----------------------
KEY NAME: Set-Cookie
KEY VALUE: __cf_bm=72427e2af66c7177feeb88a847fae9c26b66c681-1623602343-1800-AZAmqDfaHZU8IXOH/i3BBVf8pGcws0Gc1Tln5yKUepe3utWlCpagxvALDW6wiHd2pli9Zl45Mg8gC/QSoUFhoes=; path=/; expires=Sun, 13-Jun-21 17:09:03 GMT; domain=.cclonline.com; HttpOnly; Secure; SameSite=None
-----------------------
KEY NAME: Vary
KEY VALUE: Accept-Encoding
-----------------------
KEY NAME: Server
KEY VALUE: cloudflare
-----------------------
KEY NAME: CF-RAY
KEY VALUE: 65ecc0b9383b07ff-ATL
-----------------------
KEY NAME: Content-Encoding
KEY VALUE: gzip
-----------------------
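The header inspection above can be narrowed to just the entries that point at Cloudflare. The following is a minimal sketch, not part of the original test code: the function name cloudflare_indicators is hypothetical, and the set of header names is an assumption based only on the headers visible in the output above.

```python
# Header names seen in the output above that identify Cloudflare.
# This set is an assumption drawn from that output, not an official list.
CLOUDFLARE_HEADER_NAMES = {"cf-ray", "cf-request-id"}


def cloudflare_indicators(headers):
    """Return the subset of response headers that suggest a Cloudflare proxy."""
    found = {}
    for name, value in headers.items():
        lowered = name.lower()
        if lowered in CLOUDFLARE_HEADER_NAMES:
            found[name] = value
        elif lowered == "server" and value.lower().startswith("cloudflare"):
            found[name] = value
    return found


# Usage with a plain dict; response.headers from requests works the same way.
sample_headers = {
    "Content-Type": "text/html; charset=UTF-8",
    "Server": "cloudflare",
    "CF-RAY": "65ecc0b9383b07ff-ATL",
}
print(cloudflare_indicators(sample_headers))
```

An empty result does not prove that Cloudflare is absent, because a site can strip or rename these headers.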

If we look at the response.text associated with the Python Requests we can see other evidence related to the Cloudflare protection.

...continued from the code above
print(response.text)

# output
truncated...
<title>Please Wait... | Cloudflare</title>
<meta name="captcha-bypass" id="captcha-bypass" />
truncated...
<form class="challenge-form managed-form" id="challenge-form" action="/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/?__cf_chl_managed_tk__=7d4597196bb14948881846ca16631b64c55f06d3-1623602854-0-AcX2yHJM2sCalL03Opq9RiFjASeYE0Xs0KG4XeG1lezzhzEyu-bL8xsdHuEjNIIKaJkWEmha4DhViRlqWEP_HREOdA8YAY7nnNkBAHbNMs6p_AWgYNLPnSNM13PO2I96hdABtoaaKjOzV4AyJQJ8f08XEW2flN97rPxIMeiR0tI1a3PiON2dN9E_YCyneAuCUfaYWUNGL0Bqd_rkYp3Ljb2zk_kGWizckr1fvhodSEjEB-ByYVK8ODNox2oZ4XPcmCYJ6UNDmbNc406BjMeTf3e72Z7vgdnt3V714VrGN4w_Y4VQ2X1V0OVKUKEH9B5Rxa_4fEZiMAAdxZ6idg69JYMKftuuLemr53n5WAwTwyX2G7N9jmjtarxEQcCqoj9oY7oSFwQTb3ZVb9i5EeavKaE1_67wxpyPybNidBDxhLazDEMefPZGDsV9mSziuIQ90nS5vn-7sUvC8BJATNWPbh6OduchXy-QcMeYhurtukUCm3oDQMP7r4g4qvDCWI3_-ku7u-B4G2XI2kwM_tLVEZiH5uHPjWpHE6eFWohiCTxd4p7vHg7z5ug9feRalYqu3GfInd82GZ-j-7nCqLDmPh2Sjlu6sJGfopqM3XlBrd1kgRZU3Z4uw6JIIqfH0M6K3_weTtem0-Z1zhDUBbVDvgJVeHNNh_bTxHGWbFB0f80tALBMbt67RftO5u1XBUZ-TRftteXBwJ8gmYzOZTo4lQOGQ_771urYXsTuW_sp8PwxvQpEyCnY8zD8dmVz0-waZhOet8MQMwduN2nfGUOrCMwUYO9McsBqzfsT5PJZVkDm-rYBBwqw0PIwvm1-N8ymAjrpSN6ps4FerqK1uQOo77FLiOq8JCOVqdETIZ9NO07A" method="POST" enctype="application/x-www-form-urlencoded">
truncated...
<input type="hidden" name="r" value="d5db3eb87c9b42ec7f076916611c296abfd2c842-1623602854-0-AXz7+uyFGbpY1aOLgfZMm0oIiiepEo5I5QmdTnvMmL9fDUc4OMEa2CNYXsbHVjOzdYO+PqegjpNL8R3D9LhDc+Xo0y0ira1zO7foozPj0qdcUpNNr2ZOHqgUyKws6dVgeBNUdF+v9+eNFxSHxOhc4DWDLIw9guBqJg1GaBjG3QCQdZmyFbPxXUQtXTFmtVVuqch9qBFLa/u9deMBCxCWi5fyKoOINtyBtyT4p79ITb9T+6T7fl2epMXNHO6xBW2dPnDP1FmjUQ04CG3ydOaDS5qoSFMPr4InVbMcI2NbQYJYPfWjmncMaga6K+NMNvv8wtiyXpEeWsUgFFeQoDJEuvLI+wkI8mT+vXAnXd8LWy9TpEDVK6uxtLF2C75aU7qJxI9RKANGluWYUXeqE1tXgppgZraIGfRWNPVsQZzqd6SK+Zsg8x8UH7oRRD9blMMPMaekcFQ3zT8QQ5BzEc8wEQ68OhmKbFuAeV/YhhWshpm808gcVHIFH17I+0MEidfV/ny5wBSRZJyQUfOSU9iAv/minNWF6ZA21E/+Zebda2lVF6gyEHgrjecxuOxzY2I2qMm0RCEHO4oSk/X8EtMYirGCQ3FD8PzSvZYx+34QZutXFLVvqT3CR/UcsXybG6wllvIGvZ6j/gdoAwfcS27MyO4mXDMk6TfDqdi+NqlItwgWNdp461RQmPdChRp9kKEy3sTsIAGW9Ky1k/xYYcTvLDpCGFICBEm2JhDyp/FEF9UBYia7XJ4aUEncSUeViqaQ8bXpPk6kEPH5RYEcfaX3he0W5aZHHIGcjgOFZsuu45MWREvbHjO+RcPMib4L+lU1cKQoYx+w5b9e4AJiRnGog3a6E3i/L75bSnk7L3qA+DofeeccI/RPitqDb/lX31fkhwHfdRWoLt+OILsUfHNni/olGABEUDruwDVpR32xlieS7vekdmQL3oOu5BkAOXoObbb+2nzo6Dvgw7M7rb4muC7US4yCTK0BeGSfu2XvFta228IoGIGa8BjUcb09K6nRdWUwrCXLYS+vIJTegKMeyxlMKNXw7vIaPh9vht4zblhN0bqkN/m/opyXEtzLfhsLuEkHdQ0GhTUk2nYgHeKX0j6eW0uQhAD/9TLf6UgILCk0+nQvXfEffQCCe/hEfBfkAgiPhr1E3uyPB4vp6Fpy2nnkkzmGv/3P5wg6afKDmU2Ic32u3U47hOlghnc7NlbzFb5R8Tx6vWrkXMDYHdOaaudLtPp5N9y1ceXXaMNAFMVmoqaiHWuV4KN+2rLolSOGUEFNEoRN6Jw9mlq/zniK23gQ2lSy+wIHPRGvRCxhRr5DeskvLgyviAk7IhLH3zMpqxd7i05BIPV3sB8orBzVE4Rqmam3evpTVEMMFRDt/Ol6XUJi66QrLgJyusuv5xL4pKPWZrw/hn3a5j0zrrChUbvM3S94BeWiJS48hA35S9mXLfaKMAZTYZTMqhbW77qwUuquwW2lPEAgSPY7WvvnNRUPXsS1KCPpiuE0TuDFaZQi9UTqlzkQIq84wqVRjQZ0Y0m3PQeI2BbJZ8woKIKiABWbSOuV/kyy5H4L+RVL7Jmc2ndl3HaQ4XlnwDmTuK/gMbRvZe1taVHOyYsXmfEY4XkiaDUneGjBEGnWyiv49DtiG2TLmmIpP1UITmO677eDSoNLHpxp1guMjwL5m3XHKOFNtpLzuiVH4UJdgTjtnmbGHmKGtyy0k3GPZrwyVkZRyS+FZZ5WhTs05rhS+1sg3oDCyTbWeYX9T4VVswRjxq1HsyH8NdZTN4f9BTn9VU0+9JnVAkgLM4JCkV6wqwQf+QMK/MaYWvBwSjYgFUxdEdT7Rls85/M+4GxcaGsiNmsA5Q==">  <input type="hidden" name="cf_captcha_kind" 
value="h">
<input type="hidden" name="vc" value="4845a44c225a1fa6a61708e11b613971">
truncated...
<script type="text/javascript">
    //<![CDATA[
    (function(){
        var isIE = /(MSIE|Trident\/|Edge\/)/i.test(window.navigator.userAgent);
        var trkjs = isIE ? new Image() : document.createElement('img');
        trkjs.setAttribute("src", "/cdn-cgi/images/trace/managed/js/transparent.gif?ray=65eccd326d61f331");
        trkjs.id = "trk_managed_js";
        trkjs.setAttribute("alt", "");
        document.body.appendChild(trkjs);
        var cpo=document.createElement('script');
        cpo.type='text/javascript';
        cpo.src="/cdn-cgi/challenge-platform/h/g/orchestrate/managed/v1?ray=65eccd326d61f331";
        document.getElementsByTagName('head')[0].appendChild(cpo);
    }());
    //]]>
</script>

The information above shows that the request sent to the target URL was intercepted by a Cloudflare server, which is challenging it. This challenge has to be passed before the original request is allowed to continue.
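The evidence gathered so far (403 status, a Server header of cloudflare, and challenge markup in the body) can be folded into one check. The following is a minimal sketch under stated assumptions: the function name looks_like_cloudflare_challenge and the marker list are hypothetical, the markers are copied from the response body shown above, and Cloudflare may change them at any time.

```python
# Byte markers copied from the challenge page shown above.
# They are an assumption about that one page, not a stable contract.
CHALLENGE_MARKERS = (
    b'id="challenge-form"',
    b"/cdn-cgi/challenge-platform/",
    b"__cf_chl_managed_tk__",
)


def looks_like_cloudflare_challenge(status_code, headers, content):
    """Return True if a response resembles the Cloudflare challenge page."""
    server = headers.get("Server", "").lower()
    if status_code != 403 or not server.startswith("cloudflare"):
        return False
    return any(marker in content for marker in CHALLENGE_MARKERS)


# Usage with hand-built values; with requests you would pass
# response.status_code, response.headers and response.content.
body = b'<form class="challenge-form managed-form" id="challenge-form">'
print(looks_like_cloudflare_challenge(403, {"Server": "cloudflare"}, body))
```

A check like this only reports that a challenge was served; it does nothing to solve it.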

cfscrape Package

The OP stated that they attempted to use the cfscrape Python Package to obtain token information from the Cloudflare server.

A standard cfscrape request returns the same response as Python Requests.

import cfscrape

URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
      '-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}

scraper = cfscrape.create_scraper(delay=10)
response = scraper.get(URL, headers=headers)
print(f'Status Code: {response.status_code}')
print(f'Status Code Reason: {response.reason}')

# output
Status Code: 403
Status Code Reason: Forbidden

The cfscrape package also supports the functions get_tokens and get_cookie_string, but both of these produce the 403 Forbidden error code.

From cfscrape source code:

def is_cloudflare_captcha_challenge(resp):
    return (
        resp.status_code == 403
        and resp.headers.get("Server", "").startswith("cloudflare")
        and b"/cdn-cgi/l/chk_captcha" in resp.content
    )

# the function above is called from this
def request(self, method, url, *args, **kwargs):
    resp = super(CloudflareScraper, self).request(method, url, *args, **kwargs)

    # Check if Cloudflare captcha challenge is presented
    if self.is_cloudflare_captcha_challenge(resp):
        self.handle_captcha_challenge(resp, url)

    # Check if Cloudflare anti-bot "I'm Under Attack Mode" is enabled
    if self.is_cloudflare_iuam_challenge(resp):
        resp = self.solve_cf_challenge(resp, **kwargs)

    return resp

The handle_captcha_challenge function is what tries to solve the Cloudflare JavaScript challenge. This is the section of the code that is failing. It's unclear which part of that section fails, so additional research and testing are required.

PLEASE NOTE: According to the package's developer the module is no longer supported.

cloudscraper Package

The OP also stated that they attempted to use the cloudscraper Python Package to obtain token information from the Cloudflare server. It is worth noting that cloudscraper was forked from cfscrape, so the syntax is similar.

cloudscraper gets the same 403 Forbidden error code as cfscrape.

import cloudscraper

URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
      '-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}

scraper = cloudscraper.create_scraper()
response = scraper.get(URL)
print(f'Status Code: {response.status_code}')
print(f'Status Code Reason: {response.reason}')

# output
Status Code: 403
Status Code Reason: Forbidden

The cloudscraper package also supports the functions get_tokens and get_cookie_string, but both of these produce the 403 Forbidden error code.

selenium Package

The OP also stated that they attempted to use the selenium Python package.

SPECIAL NOTE: During my testing I used selenium with webdrivers for Google Chrome, Mozilla Firefox and Microsoft Edge.

Within the last 12 months, these Options could be used in selenium to bypass Cloudflare protection. Unfortunately, these Options do not work today.

chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)

# additional disable-blink-features are available in Chromium source code on Github
chrome_options.add_argument("--disable-blink-features=AutomationControlled")

Below is a selenium code example using the Chrome webdriver with the switches above.

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36")
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver', options=chrome_options)

URL = "https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934"
driver.get(URL)

The code above opens a browser session, which is confronted with a Cloudflare JavaScript challenge. During testing with the switches mentioned above, this challenge never completed. The Cloudflare Ray ID, which is unique per request, rotated many times before I manually terminated the session.

seleniumwire is required to obtain the status code, because selenium itself does not expose HTTP response codes.

Below is a headless mode Chrome webdriver session, which also shows the 403 Forbidden error code for the target URL. The session also shows that hcaptcha.com anti-bot technology is now in the mix.

from seleniumwire import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("--headless")
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
chrome_options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver', options=chrome_options)

URL = "https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934"
driver.get(URL)

for request in driver.requests:
    # request.response is None when no response was captured for a request
    if request.response:
        print(f'Status Code: {request.response.status_code}')
        print(f'Host Name: {request.host}')
        print('-----------------------')

# output
Status Code: 403
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 302
Host Name: hcaptcha.com
-----------------------
Status Code: 200
Host Name: newassets.hcaptcha.com
-----------------------

driver.quit()

A standard Chrome webdriver session using the UI shows an iFrame with an "I am human" checkbox.

If I click the checkbox manually or within a selenium session, I'm prompted with a picture CAPTCHA, which increases the complexity of bypassing the Cloudflare protection.

cf_clearance cookie

When a Cloudflare CAPTCHA or JavaScript challenge is solved, a cf_clearance cookie is set in the client browser. The cf_clearance cookie has a default lifetime of 30 minutes, but the lifetime is configurable by the Cloudflare customer.

If you open the OP's target URL manually in a Google Chrome browser, you can see the cf_clearance cookie using Developer Tools.

It seems that the cf_clearance cookie lifetime is set to 60 minutes, based on the UTC time this session started and the expiration date set for the cookie.
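The lifetime estimate is simple date arithmetic: subtract the session start time from the cookie's expiration time. The sketch below illustrates that calculation only; the function name cookie_lifetime_minutes is hypothetical and both timestamps are made-up values, not the ones observed in Developer Tools.

```python
from datetime import datetime, timezone


def cookie_lifetime_minutes(session_start, cookie_expires):
    """Return the whole minutes between session start and cookie expiry."""
    delta = cookie_expires - session_start
    return int(delta.total_seconds() // 60)


# Hypothetical timestamps for illustration only (both in UTC).
start = datetime(2021, 6, 13, 16, 39, 3, tzinfo=timezone.utc)
expires = datetime(2021, 6, 13, 17, 39, 3, tzinfo=timezone.utc)
print(cookie_lifetime_minutes(start, expires))  # 60
```

Both values must be in the same time zone; Developer Tools shows the expiration in UTC, so the session start needs to be converted to UTC as well.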

So far I haven't found a way to extract this cookie using Python.


Something you need in header for request!

Cookie "cf_clearance"
User-agent

Sample

Steps to get cookies

Open chrome devtools
Switch to tab "Network"
Copy request header

import requests
from bs4 import BeautifulSoup

link = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

h = '''cookie: cf_clearance=718abb68f064be7612ee987ab9d8bc755016f3c2-1623437208-0-150
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4539.2 Safari/537.36'''
h = dict(l.split(': ') for l in h.split('\n') if ': ' in l)

res = requests.get(link, headers=h)
soup = BeautifulSoup(res.text, "lxml")

try:
    product_title = soup.select_one("h1 > span").get_text(strip=True)
except AttributeError:
    product_title = ""

print(product_title)
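The one-line dict(...) conversion in the sample raises a ValueError if a copied header value itself contains ': ', because split then yields more than two parts. A slightly more tolerant parser might look like the sketch below; the function name parse_header_block is hypothetical and the cookie value is a placeholder, not a working token.

```python
def parse_header_block(raw):
    """Turn a header block copied from devtools into a dict for requests."""
    headers = {}
    for line in raw.splitlines():
        # Split on the first colon only, so colons inside the value survive.
        name, sep, value = line.partition(":")
        # Skip blank lines and HTTP/2 pseudo-headers such as ":authority".
        if not sep or not name.strip():
            continue
        headers[name.strip()] = value.strip()
    return headers


# Placeholder values for illustration only.
raw_headers = """
:authority: www.cclonline.com
cookie: cf_clearance=PLACEHOLDER
referer: https://www.cclonline.com/
"""
print(parse_header_block(raw_headers))
```

The resulting dict can be passed as the headers argument of requests.get in the same way as h in the sample above.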

CodeHunter
