Language/libraries for downloading & parsing web pages?
If you want to spend some time with Clojure (a very good idea IMO!), give Enlive a shot. The GitHub description reads:
"a selector-based (à la CSS) templating and transformation system for Clojure"
In addition to being useful for templating, it's a capable web-scraping library; see the initial part of this tutorial for some simple scraping examples. (The third one scrapes the New York Times homepage, so it's actually not as simple as all that.)
There are other tutorials available on the Web if you look for them; Enlive itself comes with some docs / examples. (Plus the code is < 1000 lines in total and very readable, though I suppose this might be less so for someone new to the language.)
A dump of Clojure links covering Enlive, TagSoup-based parsing, and agents for parallel downloads. (Link roundups aren't pretty, but I did spend some time searching for different libraries. Spidering/crawling can be very easy or pretty involved, depending on the structure of the sites crawled, the markup they use (HTML, XHTML), etc.)
http://blog.bestinclass.dk/index.php/2009/10/functional-social-webscraping/
http://nakkaya.com/2009/12/17/mashups-using-clojure/
http://freegeek.in/blog/2009/10/downloading-a-bunch-of-files-in-parallel-using-clojure-agents/
http://blog.maryrosecook.com/post/46601664/Writing-an-mp3-crawler-in-Clojure
http://gnuvince.wordpress.com/2008/11/18/fetching-web-comics-with-clojure-part-2/
http://htmlparser.sourceforge.net/
http://nakkaya.com/2009/11/23/converting-html-to-compojure-dsl/
http://www.bestinclass.dk/index.php/2009/10/functional-social-webscraping/
Apache HTTP client:
http://github.com/rnewman/clj-apache-http
http://github.com/heyZeus/clj-web-crawler
http://japhr.blogspot.com/2009/01/clojure-http-clientclj.html
Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) is a good Python library for this. It specializes in dealing with malformed markup.
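
To give a feel for it, here is a minimal sketch of pulling links out of sloppy markup. It assumes the current Beautiful Soup 4 package (imported as bs4) and Python's built-in "html.parser" backend; the extract_links helper and the SAMPLE page are made up for illustration. Downloading is left out, so in practice you would first fetch the page (e.g. with urllib.request.urlopen) and pass the HTML in.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def extract_links(html):
    """Return (href, link text) pairs for every <a> tag that has an href."""
    # "html.parser" is Python's built-in parser; lxml or html5lib can be
    # swapped in if they are installed.
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a["href"], a.get_text(strip=True))
        for a in soup.find_all("a", href=True)
    ]


# Made-up sample page with sloppy markup: one unquoted attribute value,
# and <h1>, <p>, <body> and <html> are never closed.
SAMPLE = (
    "<html><body><h1>News<p>See "
    '<a href=first.html>First</a> and <a href="second.html">Second</a>'
)

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE):
        print(href, text)
```

The parser doesn't complain about the missing closing tags; it builds the best tree it can and find_all still locates both links.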