What is the difference in accessing Cloudflare website using ChromeDriver/Chrome in normal/headless mode through Selenium Python


It's the HTTP User-Agent header that Cloudflare doesn't like.

To get around this, override the user agent with a Chrome option (the code below is for Selenium in Python):

    option.add_argument('--headless')
    option.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36")
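A hard-coded version string goes stale as Chrome updates. The following is a minimal sketch of one alternative, assuming the only objectionable part of the header is the `HeadlessChrome/<version>` token that headless Chrome reports by default. The helper names `strip_headless_token`, `user_agent_argument` and `build_headless_options` are hypothetical, not part of Selenium.

```python
def strip_headless_token(user_agent: str) -> str:
    """Return the UA string with 'HeadlessChrome/' replaced by 'Chrome/'."""
    return user_agent.replace("HeadlessChrome/", "Chrome/")


def user_agent_argument(user_agent: str) -> str:
    """Build the Chrome argument that overrides the User-Agent header.

    Uses the same 'user-agent=...' form as the snippet above.
    """
    return f"user-agent={strip_headless_token(user_agent)}"


def build_headless_options(user_agent: str):
    """Return ChromeOptions for headless mode with an overridden User-Agent."""
    # Imported lazily so the pure helpers above work without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument(user_agent_argument(user_agent))
    return options
```

The headless UA string to feed in can be read from a throwaway headless session with `driver.execute_script("return navigator.userAgent")`.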


I tested using this server-side script:

    <?php
    echo "<pre><code>";
    var_dump($_SERVER);
    echo "</code></pre>";
    ?>
    <script>
        var el = document.getElementsByTagName('code')[0];
        for(var prop in window.navigator){
            var str = JSON.stringify(window.navigator[prop])
            el.innerHTML = el.innerHTML + "window.navigator." + prop + " = " + str + "\n";
        }
        var skip_props = ['parent', 'top', 'frames', 'self', 'window'];
        for(var prop in window){
            if (skip_props.indexOf(prop) > -1) { continue; }
            el.innerHTML = el.innerHTML + "window." + prop + " = ";
            var str = JSON.stringify(window[prop])
            el.innerHTML = el.innerHTML + str + "\n";
        }
    </script>

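I loaded this page using ChromeDriver, with and without --headless, and printed the output using print(driver.find_element_by_tag_name('code').text). (On Selenium 4, where the find_element_by_* helpers were removed, the equivalent is driver.find_element(By.TAG_NAME, 'code').text.) I then diffed the two outputs.
Here are the differences I found (normal vs. headless):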

  • HTTP Accept-Language header: en-US,en;q=0.9 vs en-US
  • HTTP User-Agent header: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36 vs Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/83.0.4103.61 Safari/537.36 (Note the HeadlessChrome mention in the second string.)
  • JavaScript window.navigator.plugins: {"0":{"0":{}},"1":{"0":{}},"2":{"0":{},"1":{}}} vs {}
  • JavaScript window.navigator.mimeTypes: {"0":{},"1":{},"2":{},"3":{}} vs {}
  • JavaScript window.outerWidth: 1367 vs 0
  • JavaScript window.outerHeight: 641 vs 0
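The diffing step can be reproduced with the standard library. This is a minimal sketch, assuming the two outputs were captured as plain strings; `changed_lines` is a hypothetical helper name and the sample strings are illustrative, not the full captured output.

```python
import difflib


def changed_lines(normal_output: str, headless_output: str) -> list:
    """Return only the lines that differ between the two captured outputs.

    Lines present only in the normal run are prefixed with '- ',
    lines present only in the headless run with '+ '.
    """
    diff = difflib.ndiff(
        normal_output.splitlines(),
        headless_output.splitlines(),
    )
    # ndiff also emits unchanged lines ('  ') and hint lines ('? '); drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    normal = 'window.outerWidth = 1367\nwindow.name = ""'
    headless = 'window.outerWidth = 0\nwindow.name = ""'
    for line in changed_lines(normal, headless):
        print(line)
```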

Of note: the Python script you posted is missing a few lines that hide the navigator.webdriver property (without them, it is trivial for the server to detect that you are using WebDriver) [ref]:

    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            })
        """
    })
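Where this call goes matters: the script is registered for new documents, so it has to run after the driver is created and before driver.get(). A minimal sketch of wrapping it and checking the result, assuming a Chrome driver that exposes `execute_cdp_cmd`; `hide_webdriver_flag` and `webdriver_flag_visible` are hypothetical helper names.

```python
# JavaScript evaluated at the start of every new document, before page scripts run.
HIDE_WEBDRIVER_JS = """
Object.defineProperty(navigator, 'webdriver', {
  get: () => undefined
})
"""


def hide_webdriver_flag(driver) -> None:
    """Register the override; call this before driver.get()."""
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": HIDE_WEBDRIVER_JS},
    )


def webdriver_flag_visible(driver) -> bool:
    """Return True if the loaded page still sees a truthy navigator.webdriver."""
    # An undefined JavaScript value comes back to Python as None.
    return bool(driver.execute_script("return navigator.webdriver"))
```

Calling `webdriver_flag_visible(driver)` after a page load is a quick way to confirm the override took effect. It says nothing about the other headless differences listed above.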


I took your code, removed the optional arguments and added a few arguments to execute the test as follows:

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    options.add_argument("--headless")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get("https://www.manta.com/c/mm2956g/mashuda-contractors")
    print(driver.page_source)
    driver.quit()
  • Console Output:

    <html class="js" lang="en-US" style="opacity: 1; visibility: visible;"><!--<![endif]--><head><title>Access denied | www.manta.com used Cloudflare to restrict access</title><meta charset="UTF-8"><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1"><meta name="robots" content="noindex, nofollow"><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1"><link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" type="text/css" media="screen,projection"><!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]--><style type="text/css">body{margin:0;padding:0}</style><!--[if gte IE 10]><!--><script type="text/javascript" src="/cdn-cgi/scripts/zepto.min.js"></script><!--<![endif]--><!--[if gte IE 10]><!--><script type="text/javascript" src="/cdn-cgi/scripts/cf.common.js"></script><!--<![endif]--></head><body>  <div id="cf-wrapper">    <div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div>    <div id="cf-error-details" class="cf-error-details-wrapper">      <div class="cf-wrapper cf-header cf-error-overview">    <h1>      <span class="cf-error-type" data-translate="error">Error</span>      <span class="cf-error-code">1020</span>      <small class="heading-ray-id">Ray ID: 53fd7c2fca12d5fc • 2019-12-04 11:36:52 UTC</small>    </h1>    <h2 class="cf-subheadline">Access denied</h2>      </div><!-- /.header -->      <section></section><!-- spacer -->      <div class="cf-section cf-wrapper">    <div class="cf-columns two">      <div class="cf-column">        <h2 data-translate="what_happened">What happened?</h2>        <p>This website is using a security service to protect itself from online attacks.</p>      </div>    </div>      </div><!-- /.section -->      <div 
class="cf-error-footer cf-wrapper">  <p>    <span class="cf-footer-item">Cloudflare Ray ID: <strong>53fd7c2fca12d5fc</strong></span>    <span class="cf-footer-separator">•</span>    <span class="cf-footer-item"><span>Your IP</span>: 123.201.54.43</span>    <span class="cf-footer-separator">•</span>    <span class="cf-footer-item"><span>Performance & security by</span> <a href="https://www.cloudflare.com/5xx-error-landing?utm_source=error_footer" id="brand_link" target="_blank">Cloudflare</a></span>  </p></div><!-- /.error-footer -->    </div><!-- /#cf-error-details -->  </div><!-- /#cf-wrapper -->  <script type="text/javascript">  window._cf_translation = {};</script></body></html>

Analysis

From the extracted page source it is clear that, with the --headless argument, you land on a block page with:

  • Title: Access denied | www.manta.com used Cloudflare to restrict access
  • Explanation under "What happened?": This website is using a security service to protect itself from online attacks.
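A script can check for the same markers instead of relying on a visual read of the page source. This is a minimal sketch, assuming the block page keeps the `<title>` text and the `cf-error-code` span seen in the console output above; other Cloudflare pages may look different, and `cloudflare_error_code` is a hypothetical helper name.

```python
import re

# Markers taken from the block page in the console output; assumed, not guaranteed stable.
_TITLE_RE = re.compile(r"<title>\s*Access denied\b", re.IGNORECASE)
_ERROR_CODE_RE = re.compile(r'class="cf-error-code">\s*(\d+)\s*<')


def cloudflare_error_code(page_source: str):
    """Return the error code (as a string) if the page looks like the block page, else None."""
    if not _TITLE_RE.search(page_source):
        return None
    match = _ERROR_CODE_RE.search(page_source)
    return match.group(1) if match else None
```

With a live session this would be called as `cloudflare_error_code(driver.page_source)` right after `driver.get(...)`.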

Conclusion

The browsing context, i.e. the Chrome browser session, is detected as a bot and the navigation is blocked.


Outro

You can find a couple of relevant discussions in: