Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"

python screen-scraping beautifulsoup mechanize http-status-code-403

Oh, you need to tell mechanize to ignore robots.txt:

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)  # skip the robots.txt check

You can try lying about your user agent (e.g., by pretending to be a human being rather than a robot) if you want to risk legal trouble with Barnes & Noble. Why not instead get in touch with their business development department and convince them to authorize you specifically? They're no doubt just trying to avoid getting their site scraped by some classes of robots, such as price comparison engines, and if you can convince them that you're not one, sign a contract, etc., they may well be willing to make an exception for you.

A "technical" workaround that just breaks their policies as encoded in robots.txt is a high-legal-risk approach that I would never recommend. BTW, how does their robots.txt read?
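To see what a site's robots.txt actually permits before deciding anything, Python's standard library can parse the file and answer per-URL questions. The following is only a minimal sketch: the robots.txt body, the example.com URLs, and the "my-scraper" user-agent name are all made up for illustration, and Barnes & Noble's real file is not reproduced here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# A real site's file will differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-scraper" and the example.com URLs are placeholder names.
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/books/1"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/x"))
```

Against a live site, you would instead call set_url() with the address of its robots.txt, then read(), and then can_fetch() in the same way.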

Code that gets the request through by skipping robots.txt and sending a browser User-Agent:

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)  # do not fetch or obey robots.txt
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    resp = br.open(url)  # url: the page you want to fetch
    print(resp.info())  # headers
    print(resp.read())  # content
