Can't scrape product title from a webpage


This is only a placeholder for research that might be useful to others looking at this Cloudflare bypass issue.

Use Case


Scraping information from a website that uses either a Cloudflare CAPTCHA or a JavaScript challenge for enhanced protection.

Python Requests


With a standard Python requests.get call, the Cloudflare service returns a 403 Forbidden status code.

    import requests

    URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
          '-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}

    response = requests.get(URL, headers=headers)
    print(f'Status Code: {response.status_code}')
    print(f'Status Code Reason: {response.reason}')

    # output
    Status Code: 403
    Status Code Reason: Forbidden

If we look at the response.headers we can see that a Cloudflare server is proxying our request to the target URL.

    # ...continued from the code above
    for key, value in response.headers.items():
        print(f'KEY NAME: {key}')
        print(f'KEY VALUE: {value}')
        print('-----------------------')

    # output
    KEY NAME: Date
    KEY VALUE: Sun, 13 Jun 2021 16:39:03 GMT
    -----------------------
    KEY NAME: Content-Type
    KEY VALUE: text/html; charset=UTF-8
    -----------------------
    KEY NAME: Transfer-Encoding
    KEY VALUE: chunked
    -----------------------
    KEY NAME: Connection
    KEY VALUE: close
    -----------------------
    KEY NAME: Permissions-Policy
    KEY VALUE: accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()
    -----------------------
    KEY NAME: Cache-Control
    KEY VALUE: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
    -----------------------
    KEY NAME: Expires
    KEY VALUE: Thu, 01 Jan 1970 00:00:01 GMT
    -----------------------
    KEY NAME: X-Frame-Options
    KEY VALUE: SAMEORIGIN
    -----------------------
    KEY NAME: cf-request-id
    KEY VALUE: 0aa7d6c7c4000007ff7201b000000001
    -----------------------
    KEY NAME: Expect-CT
    KEY VALUE: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
    -----------------------
    KEY NAME: Set-Cookie
    KEY VALUE: __cf_bm=72427e2af66c7177feeb88a847fae9c26b66c681-1623602343-1800-AZAmqDfaHZU8IXOH/i3BBVf8pGcws0Gc1Tln5yKUepe3utWlCpagxvALDW6wiHd2pli9Zl45Mg8gC/QSoUFhoes=; path=/; expires=Sun, 13-Jun-21 17:09:03 GMT; domain=.cclonline.com; HttpOnly; Secure; SameSite=None
    -----------------------
    KEY NAME: Vary
    KEY VALUE: Accept-Encoding
    -----------------------
    KEY NAME: Server
    KEY VALUE: cloudflare
    -----------------------
    KEY NAME: CF-RAY
    KEY VALUE: 65ecc0b9383b07ff-ATL
    -----------------------
    KEY NAME: Content-Encoding
    KEY VALUE: gzip
    -----------------------
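
The two headers in the output above that give Cloudflare away are Server and CF-RAY. A minimal sketch of that check follows, assuming the headers are available as a dict-like object. The helper name is_behind_cloudflare is made up for illustration and is not part of Requests.

```python
def is_behind_cloudflare(headers):
    """Return True if the response headers suggest a Cloudflare proxy."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {key.lower(): value for key, value in headers.items()}
    server = lowered.get("server", "").lower()
    return server.startswith("cloudflare") or "cf-ray" in lowered


# Two of the headers from the output above.
sample_headers = {"Server": "cloudflare", "CF-RAY": "65ecc0b9383b07ff-ATL"}
print(is_behind_cloudflare(sample_headers))
```

With the earlier code it could be called as is_behind_cloudflare(response.headers).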

If we look at the response.text associated with the Python Requests we can see other evidence related to the Cloudflare protection.

    # ...continued from the code above
    print(response.text)

    # output
    truncated...
    <title>Please Wait... | Cloudflare</title>
    <meta name="captcha-bypass" id="captcha-bypass" />
    truncated...
    <form class="challenge-form managed-form" id="challenge-form" action="/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/?__cf_chl_managed_tk__=7d4597196bb14948881846ca16631b64c55f06d3-1623602854-0-AcX2yHJM2sCalL03Opq9RiFjASeYE0Xs0KG4XeG1lezzhzEyu-bL8xsdHuEjNIIKaJkWEmha4DhViRlqWEP_HREOdA8YAY7nnNkBAHbNMs6p_AWgYNLPnSNM13PO2I96hdABtoaaKjOzV4AyJQJ8f08XEW2flN97rPxIMeiR0tI1a3PiON2dN9E_YCyneAuCUfaYWUNGL0Bqd_rkYp3Ljb2zk_kGWizckr1fvhodSEjEB-ByYVK8ODNox2oZ4XPcmCYJ6UNDmbNc406BjMeTf3e72Z7vgdnt3V714VrGN4w_Y4VQ2X1V0OVKUKEH9B5Rxa_4fEZiMAAdxZ6idg69JYMKftuuLemr53n5WAwTwyX2G7N9jmjtarxEQcCqoj9oY7oSFwQTb3ZVb9i5EeavKaE1_67wxpyPybNidBDxhLazDEMefPZGDsV9mSziuIQ90nS5vn-7sUvC8BJATNWPbh6OduchXy-QcMeYhurtukUCm3oDQMP7r4g4qvDCWI3_-ku7u-B4G2XI2kwM_tLVEZiH5uHPjWpHE6eFWohiCTxd4p7vHg7z5ug9feRalYqu3GfInd82GZ-j-7nCqLDmPh2Sjlu6sJGfopqM3XlBrd1kgRZU3Z4uw6JIIqfH0M6K3_weTtem0-Z1zhDUBbVDvgJVeHNNh_bTxHGWbFB0f80tALBMbt67RftO5u1XBUZ-TRftteXBwJ8gmYzOZTo4lQOGQ_771urYXsTuW_sp8PwxvQpEyCnY8zD8dmVz0-waZhOet8MQMwduN2nfGUOrCMwUYO9McsBqzfsT5PJZVkDm-rYBBwqw0PIwvm1-N8ymAjrpSN6ps4FerqK1uQOo77FLiOq8JCOVqdETIZ9NO07A" method="POST" enctype="application/x-www-form-urlencoded">
    truncated...
    <input type="hidden" name="r" value="d5db3eb87c9b42ec7f076916611c296abfd2c842-1623602854-0-AXz7+uyFGbpY1aOLgfZMm0oIiiepEo5I5QmdTnvMmL9fDUc4OMEa2CNYXsbHVjOzdYO+PqegjpNL8R3D9LhDc+Xo0y0ira1zO7foozPj0qdcUpNNr2ZOHqgUyKws6dVgeBNUdF+v9+eNFxSHxOhc4DWDLIw9guBqJg1GaBjG3QCQdZmyFbPxXUQtXTFmtVVuqch9qBFLa/u9deMBCxCWi5fyKoOINtyBtyT4p79ITb9T+6T7fl2epMXNHO6xBW2dPnDP1FmjUQ04CG3ydOaDS5qoSFMPr4InVbMcI2NbQYJYPfWjmncMaga6K+NMNvv8wtiyXpEeWsUgFFeQoDJEuvLI+wkI8mT+vXAnXd8LWy9TpEDVK6uxtLF2C75aU7qJxI9RKANGluWYUXeqE1tXgppgZraIGfRWNPVsQZzqd6SK+Zsg8x8UH7oRRD9blMMPMaekcFQ3zT8QQ5BzEc8wEQ68OhmKbFuAeV/YhhWshpm808gcVHIFH17I+0MEidfV/ny5wBSRZJyQUfOSU9iAv/minNWF6ZA21E/+Zebda2lVF6gyEHgrjecxuOxzY2I2qMm0RCEHO4oSk/X8EtMYirGCQ3FD8PzSvZYx+34QZutXFLVvqT3CR/UcsXybG6wllvIGvZ6j/gdoAwfcS27MyO4mXDMk6TfDqdi+NqlItwgWNdp461RQmPdChRp9kKEy3sTsIAGW9Ky1k/xYYcTvLDpCGFICBEm2JhDyp/FEF9UBYia7XJ4aUEncSUeViqaQ8bXpPk6kEPH5RYEcfaX3he0W5aZHHIGcjgOFZsuu45MWREvbHjO+RcPMib4L+lU1cKQoYx+w5b9e4AJiRnGog3a6E3i/L75bSnk7L3qA+DofeeccI/RPitqDb/lX31fkhwHfdRWoLt+OILsUfHNni/olGABEUDruwDVpR32xlieS7vekdmQL3oOu5BkAOXoObbb+2nzo6Dvgw7M7rb4muC7US4yCTK0BeGSfu2XvFta228IoGIGa8BjUcb09K6nRdWUwrCXLYS+vIJTegKMeyxlMKNXw7vIaPh9vht4zblhN0bqkN/m/opyXEtzLfhsLuEkHdQ0GhTUk2nYgHeKX0j6eW0uQhAD/9TLf6UgILCk0+nQvXfEffQCCe/hEfBfkAgiPhr1E3uyPB4vp6Fpy2nnkkzmGv/3P5wg6afKDmU2Ic32u3U47hOlghnc7NlbzFb5R8Tx6vWrkXMDYHdOaaudLtPp5N9y1ceXXaMNAFMVmoqaiHWuV4KN+2rLolSOGUEFNEoRN6Jw9mlq/zniK23gQ2lSy+wIHPRGvRCxhRr5DeskvLgyviAk7IhLH3zMpqxd7i05BIPV3sB8orBzVE4Rqmam3evpTVEMMFRDt/Ol6XUJi66QrLgJyusuv5xL4pKPWZrw/hn3a5j0zrrChUbvM3S94BeWiJS48hA35S9mXLfaKMAZTYZTMqhbW77qwUuquwW2lPEAgSPY7WvvnNRUPXsS1KCPpiuE0TuDFaZQi9UTqlzkQIq84wqVRjQZ0Y0m3PQeI2BbJZ8woKIKiABWbSOuV/kyy5H4L+RVL7Jmc2ndl3HaQ4XlnwDmTuK/gMbRvZe1taVHOyYsXmfEY4XkiaDUneGjBEGnWyiv49DtiG2TLmmIpP1UITmO677eDSoNLHpxp1guMjwL5m3XHKOFNtpLzuiVH4UJdgTjtnmbGHmKGtyy0k3GPZrwyVkZRyS+FZZ5WhTs05rhS+1sg3oDCyTbWeYX9T4VVswRjxq1HsyH8NdZTN4f9BTn9VU0+9JnVAkgLM4JCkV6wqwQf+QMK/MaYWvBwSjYgFUxdEdT7Rls85/M+4GxcaGsiNmsA5Q==">
    <input type="hidden" name="cf_captcha_kind" value="h">
    <input type="hidden" name="vc" value="4845a44c225a1fa6a61708e11b613971">
    truncated...
    <script type="text/javascript">
    //<![CDATA[
    (function(){
        var isIE = /(MSIE|Trident\/|Edge\/)/i.test(window.navigator.userAgent);
        var trkjs = isIE ? new Image() : document.createElement('img');
        trkjs.setAttribute("src", "/cdn-cgi/images/trace/managed/js/transparent.gif?ray=65eccd326d61f331");
        trkjs.id = "trk_managed_js";
        trkjs.setAttribute("alt", "");
        document.body.appendChild(trkjs);
        var cpo=document.createElement('script');
        cpo.type='text/javascript';
        cpo.src="/cdn-cgi/challenge-platform/h/g/orchestrate/managed/v1?ray=65eccd326d61f331";
        document.getElementsByTagName('head')[0].appendChild(cpo);
    }());
    //]]>
    </script>

The information above shows that the request sent to the target URL was intercepted by a Cloudflare server, which is challenging it. This challenge has to be passed before the original request is allowed to continue.
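
The evidence collected so far (a 403 status, a cloudflare Server header, and challenge markup in the body) can be combined into one check. The sketch below is only an illustration. The function name is hypothetical, and the two markers are taken from the page captured above, so they may not match other sites or later versions of the challenge page.

```python
# Strings that appear in the challenge page captured above (an assumption
# about this particular page, not a documented Cloudflare contract).
CHALLENGE_MARKERS = ('id="challenge-form"', "/cdn-cgi/challenge-platform/")


def is_cloudflare_challenge(status_code, headers, body):
    """Return True if a response looks like the challenge page seen above."""
    served_by_cloudflare = headers.get("Server", "").lower().startswith("cloudflare")
    has_marker = any(marker in body for marker in CHALLENGE_MARKERS)
    return status_code == 403 and served_by_cloudflare and has_marker


sample_body = (
    "<title>Please Wait... | Cloudflare</title>"
    '<form class="challenge-form managed-form" id="challenge-form">'
)
print(is_cloudflare_challenge(403, {"Server": "cloudflare"}, sample_body))
```

With a Requests response this would be called as is_cloudflare_challenge(response.status_code, response.headers, response.text).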

cfscrape Package


The OP stated that they attempted to use the cfscrape Python Package to obtain token information from the Cloudflare server.

A standard cfscrape request produces the same response as Python Requests.

    import cfscrape

    URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
          '-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}

    scraper = cfscrape.create_scraper(delay=10)
    response = scraper.get(URL, headers=headers)
    print(f'Status Code: {response.status_code}')
    print(f'Status Code Reason: {response.reason}')

    # output
    Status Code: 403
    Status Code Reason: Forbidden

The cfscrape package also supports the functions get_tokens and get_cookie_string, but both of these produce the 403 Forbidden error code.

From cfscrape source code:

    def is_cloudflare_captcha_challenge(resp):
        return (
            resp.status_code == 403
            and resp.headers.get("Server", "").startswith("cloudflare")
            and b"/cdn-cgi/l/chk_captcha" in resp.content
        )

    # the function above is called from this
    def request(self, method, url, *args, **kwargs):
        resp = super(CloudflareScraper, self).request(method, url, *args, **kwargs)

        # Check if Cloudflare captcha challenge is presented
        if self.is_cloudflare_captcha_challenge(resp):
            self.handle_captcha_challenge(resp, url)

        # Check if Cloudflare anti-bot "I'm Under Attack Mode" is enabled
        if self.is_cloudflare_iuam_challenge(resp):
            resp = self.solve_cf_challenge(resp, **kwargs)

        return resp

In this code, handle_captcha_challenge is called when a CAPTCHA challenge is detected, and solve_cf_challenge is called when the "I'm Under Attack Mode" JavaScript challenge is detected. This section of the code is what is failing. It's unclear which part of it fails, so additional research and testing is required.
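
One possible explanation, judging only from the source shown above and the response body captured earlier, is that the CAPTCHA check never matches. It looks for /cdn-cgi/l/chk_captcha in the body, while the challenge page above references /cdn-cgi/challenge-platform/ instead. The sketch below copies the three conditions into a standalone function (matches_captcha_check is a made-up name) and applies them to a fragment of that page. This has not been confirmed as the actual cause.

```python
CAPTCHA_MARKER = b"/cdn-cgi/l/chk_captcha"


def matches_captcha_check(status_code, server_header, content):
    """Standalone copy of the three conditions in cfscrape's CAPTCHA check."""
    return (
        status_code == 403
        and server_header.startswith("cloudflare")
        and CAPTCHA_MARKER in content
    )


# Fragment of the challenge page shown earlier.
body = b'cpo.src="/cdn-cgi/challenge-platform/h/g/orchestrate/managed/v1?ray=65eccd326d61f331";'
print(matches_captcha_check(403, "cloudflare", body))  # False
```

If the check returns False for the real response as well, the CAPTCHA branch in request() is skipped for that response.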

PLEASE NOTE: According to the package's developer the module is no longer supported.

cloudscraper Package


The OP also stated that they attempted to use the cloudscraper Python Package to obtain token information from the Cloudflare server. It is worth noting that cloudscraper was forked from cfscrape, so the syntax is similar.

cloudscraper gets the same 403 Forbidden error code as cfscrape.

    import cloudscraper

    URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
          '-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}

    scraper = cloudscraper.create_scraper()
    response = scraper.get(URL)
    print(f'Status Code: {response.status_code}')
    print(f'Status Code Reason: {response.reason}')

    # output
    Status Code: 403
    Status Code Reason: Forbidden

The cloudscraper package also supports the functions get_tokens and get_cookie_string, but both of these produce the 403 Forbidden error code.

selenium Package


The OP also stated that they attempted to use the selenium Python package.

SPECIAL NOTE: During my testing I used selenium with webdrivers for Google Chrome, Mozilla Firefox and Microsoft Edge.

Within the last 12 months, the Options below could be used in selenium to bypass Cloudflare protection. Unfortunately, these Options do not work today.

    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)

    # additional disable-blink-features are available in Chromium source code on Github
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")

Below is a selenium code example using the Chrome webdriver with the switches above.

    from selenium import webdriver

    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36")
    chrome_options.add_argument("start-maximized")
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)

    driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver', options=chrome_options)

    URL = "https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934"
    driver.get(URL)

The code above opens a browser session, which is confronted with a Cloudflare JavaScript challenge. During testing with the switches mentioned above, this challenge never completed. The Cloudflare Ray ID, which is unique per request, rotated many times before I manually terminated the session.
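
To watch the Ray ID change without reading the page by eye, the ID can be pulled out of the challenge markup. This is a rough sketch that assumes the page still embeds the ID as a ray= query parameter, as in the response body captured earlier. The pattern and the helper name extract_ray_ids are illustrative and are not taken from any Cloudflare documentation.

```python
import re

# Matches the "ray=<hex>" query parameter seen in the challenge page.
RAY_PATTERN = re.compile(r"[?&]ray=([0-9a-f]+)")


def extract_ray_ids(html):
    """Return the distinct Ray IDs referenced in a page, in order of appearance."""
    seen = []
    for ray_id in RAY_PATTERN.findall(html):
        if ray_id not in seen:
            seen.append(ray_id)
    return seen


# Two script lines from the challenge page shown earlier.
sample = (
    'trkjs.setAttribute("src", "/cdn-cgi/images/trace/managed/js/transparent.gif?ray=65eccd326d61f331");'
    'cpo.src="/cdn-cgi/challenge-platform/h/g/orchestrate/managed/v1?ray=65eccd326d61f331";'
)
print(extract_ray_ids(sample))  # ['65eccd326d61f331']
```

In a selenium session, calling extract_ray_ids(driver.page_source) every few seconds should show a new ID each time the challenge reloads.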


seleniumwire is required to obtain the status code, because selenium itself does not expose HTTP response codes.

Below is a headless mode Chrome webdriver session, which also shows the 403 Forbidden error code for the target URL. The session also shows that hcaptcha.com anti-bot technology is now in the mix.

    from seleniumwire import webdriver

    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("start-maximized")
    chrome_options.add_argument("--headless")
    chrome_options.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")

    driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver', options=chrome_options)

    URL = "https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934"
    driver.get(URL)

    for request in driver.requests:
        # request.response is None when no response was captured for a request
        if request.response:
            print(f'Status Code: {request.response.status_code}')
            print(f'Host Name: {request.host}')
            print('-----------------------')

    driver.quit()

    # output
    Status Code: 403
    Host Name: www.cclonline.com
    -----------------------
    Status Code: 200
    Host Name: www.cclonline.com
    -----------------------
    Status Code: 200
    Host Name: www.cclonline.com
    -----------------------
    Status Code: 200
    Host Name: www.cclonline.com
    -----------------------
    Status Code: 200
    Host Name: www.cclonline.com
    -----------------------
    Status Code: 200
    Host Name: www.cclonline.com
    -----------------------
    Status Code: 200
    Host Name: www.cclonline.com
    -----------------------
    Status Code: 200
    Host Name: www.cclonline.com
    -----------------------
    Status Code: 302
    Host Name: hcaptcha.com
    -----------------------
    Status Code: 200
    Host Name: newassets.hcaptcha.com
    -----------------------

A standard Chrome webdriver session using the UI shows an iFrame with an "I am human" checkbox.


If I click the checkbox manually or from a selenium session, I'm prompted with a picture captcha, which increases the complexity of bypassing the Cloudflare protection.


cf_clearance cookie


When a Cloudflare CAPTCHA or JavaScript challenge is solved, a cf_clearance cookie is set in the client browser. The cf_clearance cookie has a default lifetime of 30 minutes, but this is configurable by the Cloudflare customer.

If you open the OP's target URL manually in a Google Chrome browser, you can see the cf_clearance cookie using Developer Tools.

It seems that the cf_clearance cookie lifetime for this site is set to 60 minutes, based on the UTC time this session started and the expiration date set for the cookie.
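
The lifetime estimate is simple date arithmetic: the cookie's expiration time minus the time the session started. A small sketch follows. The two timestamps are hypothetical placeholders rather than the values from my session, and cookie_lifetime_minutes is a made-up helper name.

```python
from datetime import datetime, timezone


def cookie_lifetime_minutes(session_start, cookie_expiry):
    """Return the cookie lifetime in minutes, given two timezone-aware datetimes."""
    return (cookie_expiry - session_start).total_seconds() / 60


# Hypothetical values: replace them with the session start time and the
# expiration shown for cf_clearance in Developer Tools.
start = datetime(2021, 6, 13, 16, 39, 3, tzinfo=timezone.utc)
expiry = datetime(2021, 6, 13, 17, 39, 3, tzinfo=timezone.utc)
print(cookie_lifetime_minutes(start, expiry))  # 60.0
```

Keeping both values in UTC avoids an off-by-hours result when the browser displays local time.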

So far I haven't found a way to extract this cookie using Python.
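
One avenue that may be worth trying is selenium's get_cookie method, which returns a cookie as a dict, or None when it is not set. The sketch below is untested against the target site and only helps if the browser session already holds a cf_clearance cookie, for example after the challenge was solved by hand, which my automated tests above never achieved. The build_headers helper and the FakeDriver stand-in are made up for illustration.

```python
def build_headers(driver, cookie_name="cf_clearance"):
    """Build request headers from a browser session that already passed the challenge."""
    cookie = driver.get_cookie(cookie_name)  # selenium returns a dict or None
    if cookie is None:
        return None
    # Reuse the browser's own User-Agent so both headers come from one session.
    user_agent = driver.execute_script("return navigator.userAgent;")
    return {
        "cookie": f"{cookie_name}={cookie['value']}",
        "user-agent": user_agent,
    }


class FakeDriver:
    """Stand-in for a selenium WebDriver, used only to demonstrate the helper."""

    def get_cookie(self, name):
        return {"name": name, "value": "example-token"}

    def execute_script(self, script):
        return "Mozilla/5.0 (example)"


print(build_headers(FakeDriver()))
```

With a real session, build_headers(driver) would replace the FakeDriver call, and the resulting dict could be passed to requests.get as headers.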



The request needs two things in its headers:

  • Cookie "cf_clearance"
  • User-agent

Sample

Steps to get cookies

  1. Open Chrome DevTools
  2. Switch to the "Network" tab
  3. Copy the request headers


    import requests
    from bs4 import BeautifulSoup

    link = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

    # Request headers copied from Chrome DevTools, one "name: value" pair per line
    h = '''
    cookie: cf_clearance=718abb68f064be7612ee987ab9d8bc755016f3c2-1623437208-0-150
    user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4539.2 Safari/537.36
    '''

    # Turn the pasted text into a dict, splitting each line on the first ': ' only
    h = dict(l.strip().split(': ', 1) for l in h.split('\n') if ': ' in l)

    res = requests.get(link, headers=h)
    soup = BeautifulSoup(res.text, "lxml")

    try:
        product_title = soup.select_one("h1 > span").get_text(strip=True)
    except AttributeError:
        # select_one returned None, so the title element was not found
        product_title = ""

    print(product_title)