
Python Requests getting ('Connection aborted.', BadStatusLine("''",)) error


The error means the host closed the connection without sending a valid HTTP status line (the empty string inside BadStatusLine is all that came back). In this case, it's because the site detects that you're trying to scrape it and deliberately disconnects you.

If you try your requests code with this URL from a test website: http://mirror.internode.on.net/pub/test/5meg.test1, you'll see that it downloads normally.

To get around this, fake your user agent. Your user agent identifies your web browser, and web hosts commonly check it to detect bots.

Use the headers parameter to set your user agent. Here's an example that tells the web host you're Firefox.

    import requests

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'}
    r = requests.get(url, headers=headers)
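If you'd like to see the mechanism without hitting a real site, here's a minimal local sketch. It uses only the standard library rather than requests: http.client raises the underlying BadStatusLine subclass directly, while requests wraps the same failure in its own ConnectionError. The toy server, the PickyHandler name, the "Mozilla" substring check and the python-requests version string are all made up for illustration; real sites apply their own rules.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

FIREFOX_UA = ("Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) "
              "Gecko/20100101 Firefox/24.0")


class PickyHandler(BaseHTTPRequestHandler):
    """Hypothetical server that hangs up on anything not claiming to be a browser."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if "Mozilla" not in agent:
            # Send nothing at all: the client sees an empty status line.
            self.close_connection = True
            return
        body = b"hello, browser"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def start_server():
    """Start the toy server on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), PickyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(port, headers):
    """Return the response body, or the exception if the server hung up."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/", headers=headers)
        return conn.getresponse().read()
    except (http.client.BadStatusLine, ConnectionError) as exc:
        return exc
    finally:
        conn.close()


if __name__ == "__main__":
    server = start_server()
    port = server.server_address[1]
    # A bot-like user agent gets disconnected; a browser-like one gets the page.
    print(repr(fetch(port, {"User-Agent": "python-requests/2.31.0"})))
    print(repr(fetch(port, {"User-Agent": FIREFOX_UA})))
    server.shutdown()
    server.server_close()
```

The only difference between the two calls is the User-Agent header, which is the same change the requests snippet above makes.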

There are lots of other discrepancies[1] between bots and human-operated browsers that web hosts can check for, but the user agent is one of the easiest and most common ones.

If you want your scraper to be harder to detect, you'll want to use a headless browser like headless Chrome[2] (or ghost.py if you want to stick with Python), which you can trust will behave like a real browser (because it is!).


Footnotes:

[1] Other possible checks include whether images are downloaded, whether page resources are fetched in the normal order, whether pages are downloaded faster than a human could read them, and whether cookies are set properly. Google also flags mouse movements it deems insufficiently human-like.

[2] Headless Chrome is the most competent headless browser as of 2018, but if its weight is a problem for you, its slightly outdated predecessors, PhantomJS and ghost.py, are lighter and still usable.
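Footnote 1 mentions pages being fetched faster than a human could read them. One rough way to avoid that is to pause for a random interval between requests. This is only a sketch: the paced name, the 2 to 6 second defaults and the injectable sleep and rng parameters are my own choices rather than anything a particular site requires, and pacing alone won't get you past the other checks.

```python
import random
import time


def paced(urls, fetch, min_delay=2.0, max_delay=6.0,
          sleep=time.sleep, rng=random.uniform):
    """Yield (url, fetch(url)) pairs, pausing a random interval between calls.

    fetch is any callable that takes a URL; sleep and rng are injectable
    so the pacing can be tested without actually waiting.
    """
    for index, url in enumerate(urls):
        if index:  # no need to wait before the very first request
            sleep(rng(min_delay, max_delay))
        yield url, fetch(url)
```

With requests you could pass something like lambda u: requests.get(u, headers=headers) as fetch and loop over the results.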


Try this:

    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0',
        'ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        # 'br' is only decoded automatically if the brotli package is installed; drop it otherwise
        'ACCEPT-ENCODING': 'gzip, deflate, br',
        'ACCEPT-LANGUAGE': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7',
        'REFERER': 'https://www.google.com/',
    }
    r = requests.get("http://yourdomain.com/", headers=headers)


In my case, I had to remove the User-Agent field from the headers:

    url = 'https://...'
    headers = {}
    requests.get(url, headers=headers)

Once I set 'User-Agent', I got ('Connection aborted.', BadStatusLine("''",)), and this error occurred only with that one site. This is my first post. I've had a lot of help from this site, and I hope this helps others who land here.
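Since some sites want a browser-like User-Agent and at least one apparently rejects it, one option is to try a few header sets in order and keep the first that works. A minimal sketch, assuming you pass in the function that does the request (for example requests.get); the get_with_header_fallback name and the errors parameter are hypothetical. The default catches OSError, which is broad: requests' exceptions derive from it, but so do timeouts and other I/O failures, so narrow it if you need to.

```python
def get_with_header_fallback(get, url, header_candidates, errors=(OSError,)):
    """Try each headers dict in turn; return (headers, response) for the first success.

    get is any callable accepting (url, headers=...), e.g. requests.get.
    If every candidate fails, the last error is re-raised.
    """
    last_error = None
    for headers in header_candidates:
        try:
            return headers, get(url, headers=headers)
        except errors as exc:
            last_error = exc  # remember why this candidate failed, then move on
    if last_error is None:
        raise ValueError("header_candidates must not be empty")
    raise last_error
```

A call might look like get_with_header_fallback(requests.get, url, [browser_headers, {}]), where browser_headers is whichever header dict you normally send.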