Does anyone know of a good Python-based web crawler that I could use?

python web-crawler

Mechanize is my favorite; it has great high-level browsing capabilities (super-simple form filling and submission).
Twill is a simple scripting language built on top of Mechanize.
BeautifulSoup + urllib2 also works quite nicely.
Scrapy looks like an extremely promising project; it's new.
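
To illustrate the fetch-and-parse pattern behind the BeautifulSoup + urllib2 suggestion, here is a minimal sketch. It is not BeautifulSoup itself: it assumes Python 3 and uses the standard library's html.parser and urllib.request as stand-ins so that it runs without third-party packages. The names LinkParser, extract_links, and fetch are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in an HTML string, in document order."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


def fetch(url):
    """Download a page and decode it (hypothetical helper, needs network access)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling extract_links(fetch(url), url) would then give the outgoing links of one page. A dedicated parser such as BeautifulSoup is likely to cope better with badly broken markup.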

Scrapy is a Twisted-based web crawler framework. It is still under heavy development, but it already works. It has many goodies:

Built-in support for parsing HTML, XML, CSV, and JavaScript
A media pipeline for scraping items with images (or any other media) and downloading the image files as well
Support for extending Scrapy by plugging in your own functionality using middlewares, extensions, and pipelines
A wide range of built-in middlewares and extensions for handling compression, cache, cookies, authentication, user-agent spoofing, robots.txt, statistics, crawl depth restriction, etc.
Interactive scraping shell console, very useful for developing and debugging
Web management console for monitoring and controlling your bot
Telnet console for low-level access to the Scrapy process
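
To give an idea of what plugging in your own pipeline can look like, here is a rough sketch. It assumes the process_item(item, spider) convention used by current Scrapy releases (the early version described above may differ) and dict-like items. CleanTextPipeline is a made-up name, and the class deliberately avoids importing Scrapy so it can be tried on its own.

```python
class CleanTextPipeline:
    """Strip whitespace from every string field of a scraped item.

    Assumption: items are dict-like, and the class follows the
    process_item(item, spider) convention of current Scrapy releases.
    """

    def process_item(self, item, spider):
        for key, value in list(item.items()):
            if isinstance(value, str):
                item[key] = value.strip()
            elif isinstance(value, list):
                # Selector extraction usually yields lists of strings.
                item[key] = [v.strip() if isinstance(v, str) else v for v in value]
        return item
```

In a real project such a class would be enabled through the ITEM_PIPELINES setting, for example {"myproject.pipelines.CleanTextPipeline": 300}, where the module path is hypothetical.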

Example code to extract information about all torrent files added today on the mininova torrent site, using an XPath selector on the HTML returned:

    class Torrent(ScrapedItem):
        pass

    class MininovaSpider(CrawlSpider):
        domain_name = 'mininova.org'
        start_urls = ['http://www.mininova.org/today']
        rules = [Rule(RegexLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

        def parse_torrent(self, response):
            x = HtmlXPathSelector(response)
            torrent = Torrent()
            torrent.url = response.url
            torrent.name = x.x("//h1/text()").extract()
            torrent.description = x.x("//div[@id='description']").extract()
            torrent.size = x.x("//div[@id='info-left']/p[2]/text()[2]").extract()
            return [torrent]
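
If you want to see what those XPath expressions select without installing Scrapy, here is a rough stand-alone sketch. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup, so it is an approximation of the selectors above rather than the Scrapy API. The sample page is invented for the example and is not the real mininova markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a torrent page; the real
# mininova markup is not reproduced here.
PAGE = """
<html><body>
  <h1>Example torrent</h1>
  <div id="description">A sample description.</div>
  <div id="info-left">
    <p>Category: Books</p>
    <p><strong>Total size:</strong> 150 MB</p>
  </div>
</body></html>
"""


def parse_torrent(page, url):
    """Extract name, description, and size from one page into a dict."""
    root = ET.fromstring(page.strip())
    name = root.find(".//h1").text
    description = root.find(".//div[@id='description']").text
    # ElementTree has no text() node test, so the text that follows the
    # <strong> label is read from that element's tail instead.
    size_label = root.find(".//div[@id='info-left']/p[2]/strong")
    size = size_label.tail.strip()
    return {"url": url, "name": name, "description": description, "size": size}
```

Real pages are rarely well-formed XML, which is one reason to prefer a tolerant HTML selector like the one Scrapy provides.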

Check out HarvestMan, a multi-threaded web crawler written in Python. Also take a look at the spider.py module.

And here you can find code samples to build a simple web-crawler.
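
For a feel of how small a simple crawler can be, here is a minimal sketch of my own (not taken from the samples mentioned above). It assumes Python 3, stays on the starting host, and stops at a depth limit. The crawl function and its fetch parameter are made-up names; fetch is any callable that turns a URL into an HTML string, which keeps the logic testable without network access. The regex-based link extraction is crude and will miss unquoted or unusual attributes.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Crude href matcher; a real HTML parser is more robust.
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl of one host, returning visited URLs in order.

    `fetch` is any callable that maps a URL to an HTML string (or raises
    on failure); injecting it keeps the crawl logic testable offline.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for href in HREF_RE.findall(html):
            link = urljoin(url, href).split("#")[0]  # drop fragments
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

A polite crawler would also honor robots.txt and pause between requests; both are left out here to keep the sketch short.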
