Language/libraries for downloading & parsing web pages?


If you want to spend some time with Clojure (a very good idea IMO!), give Enlive a shot. The GitHub description reads:

a selector-based (à la CSS) templating and transformation system for Clojure

In addition to being useful for templating, it's a capable web scraping library; see the initial part of this tutorial for some simple scraping examples. (The third one is the New York Times homepage, so actually not as simple as all that.)

There are other tutorials available on the Web if you look for them; Enlive itself comes with some docs / examples. (Plus the code is < 1000 lines in total and very readable, though I suppose this might be less so for someone new to the language.)


Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) is a good Python library for this. It specializes in dealing with malformed markup.
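
To give a rough idea of what that looks like, here is a minimal sketch that pulls link targets out of deliberately broken HTML. It assumes the current `beautifulsoup4` package (imported as `bs4`) and the stdlib `html.parser` backend; the sample markup and the `extract_links` helper are made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: installed via `pip install beautifulsoup4`

# Deliberately malformed sample markup: the <p>, <li> and <a> tags are never closed.
BROKEN_HTML = """
<html><body>
<p>Links:
<ul>
  <li><a href="/first">First
  <li><a href="/second">Second
</ul>
</body></html>
"""


def extract_links(html):
    """Return the href of every <a> tag that has one, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors that carry no href attribute at all.
    return [a["href"] for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    print(extract_links(BROKEN_HTML))
```

For a real page you would first fetch the HTML (for example with `urllib.request.urlopen` from the standard library) and pass the response body to the same function.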