Anyone know of a good Python based web crawler that I could use?


  • Mechanize is my favorite; great high-level browsing capabilities (super-simple form filling and submission).
  • Twill is a simple scripting language built on top of Mechanize.
  • BeautifulSoup + urllib2 also works quite nicely.
  • Scrapy looks like an extremely promising project; it's new.
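
For the BeautifulSoup + urllib2 route, the core step is fetching a page and pulling the links out of it. Here is a rough sketch of that step. Note the assumptions: urllib2 is Python 2 only, so this uses Python 3's urllib.request, and it swaps in the standard-library html.parser for BeautifulSoup so that it runs without installing anything. The names LinkParser, extract_links and fetch_links are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect absolute URLs from the href attribute of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the links found in an HTML string, in document order."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


def fetch_links(url):
    """Download a page and return its links (this performs a network request)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_links(html, url)
```

extract_links works on any HTML string, so you can try it without touching the network; fetch_links is the same thing with the download added.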


Use Scrapy.

It is a Twisted-based web crawler framework. It is still under heavy development, but it already works. It has many goodies:

  • Built-in support for parsing HTML, XML, CSV, and JavaScript
  • A media pipeline for scraping items with images (or any other media) and downloading the image files as well
  • Support for extending Scrapy by plugging your own functionality using middlewares, extensions, and pipelines
  • A wide range of built-in middlewares and extensions for handling compression, cache, cookies, authentication, user-agent spoofing, robots.txt, statistics, crawl depth restriction, etc.
  • Interactive scraping shell console, very useful for developing and debugging
  • Web management console for monitoring and controlling your bot
  • Telnet console for low-level access to the Scrapy process
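
To give an idea of what "robots.txt handling" in that list involves, here is a small standalone sketch using Python 3's standard urllib.robotparser. This is not Scrapy's middleware, just an illustration of the check a polite crawler makes before fetching a URL. The robots.txt content, the example.com URLs and the "example-bot" user agent are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots_checker(robots_txt):
    """Parse robots.txt text and return a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, url, user_agent="example-bot"):
    """Return True if the given user agent may fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

In a real crawler you would download each site's /robots.txt once, cache the parser per host, and call allowed() before every request.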

Example code to extract information about all torrent files added today on the mininova torrent site, using an XPath selector on the HTML returned:

    class Torrent(ScrapedItem):
        pass

    class MininovaSpider(CrawlSpider):
        domain_name = 'mininova.org'
        start_urls = ['http://www.mininova.org/today']
        rules = [Rule(RegexLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

        def parse_torrent(self, response):
            x = HtmlXPathSelector(response)
            torrent = Torrent()
            torrent.url = response.url
            torrent.name = x.x("//h1/text()").extract()
            torrent.description = x.x("//div[@id='description']").extract()
            torrent.size = x.x("//div[@id='info-left']/p[2]/text()[2]").extract()
            return [torrent]


Check out HarvestMan, a multi-threaded web crawler written in Python, and also take a look at the spider.py module.

And here you can find code samples to build a simple web-crawler.
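
For reference, the heart of a simple crawler is just a queue of URLs plus a set of pages already seen. The sketch below is not the linked code, only a minimal illustration of that loop in Python 3. The crawl function and its get_links parameter are names invented here: get_links is whatever function you use to download a page and return its links. FAKE_SITE is a made-up in-memory "site" so the example runs offline.

```python
from collections import deque


def crawl(start_url, get_links, max_depth=2, max_pages=100):
    """Breadth-first crawl; return URLs in the order they were visited.

    get_links(url) must return an iterable of absolute URLs found on that page.
    """
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical link graph standing in for real pages.
FAKE_SITE = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
}

if __name__ == "__main__":
    print(crawl("a", lambda url: FAKE_SITE.get(url, []), max_depth=2))
```

The seen set is what stops the crawler from looping forever on pages that link back to each other, and max_depth and max_pages keep it from wandering off across the whole web.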