Win32.: How to scrape HTML without regular expressions? Win32.: How to scrape HTML without regular expressions? windows windows

Win32.: How to scrape HTML without regular expressions?


Python:

lxml - faster, perhaps better at parsing bad HTML

BeautifulSoup - if lxml fails on your input try this.

Ruby: (heard of the following libraries, but never tried them)

Nokogiri

hpricot

Though if your parsers choke, and you can roughly pinpoint what is causing the choking, I frankly think it's okay to use a regex hack to remove that portion before passing it to the parser.

If you do decide on using lxml, here are some XPath tutorials that you may find useful. The lxml tutorials kind of assume that you know what XPath is (which I didn't when I first read them.)

Edit: Your post has really grown since it first came out... I'll try to answer what I can.

i don't think XPath can select higher level nodes based on criteria of lower level nodes:

It can. Try //div[@class='vehicleInfo']/parent::div[@class='used_result_container']. Use ancestor if you need to go up more levels. lxml also provides a getparent() method on its search results, and you could use that too. Really, you should look at the XPath sites I linked; you can probably solve your problems from there.

how then do you access repeating structures of data?

It would seem that DOM queries are exactly suited to your needs. XPath queries return you a list of the elements found -- what more could you want? And despite its name, lxml does accept 'loose HTML'. Moreover, the parser recognizes the 'sign-posts' in the HTML and structures the whole document accordingly, so you don't have to do it yourself.

Yes, you are still have to do a search on the structure, but at a higher level of abstraction. If the site designers decide to do a page overhaul and completely change the names and structure of their divs, then that's too bad, you have to rewrite your queries, but it should take less time than rewriting your regex. Nothing will do it automatically for you, unless you want to write some AI capabilities into your page-scraper...

I apologize for not providing 'native Win32' libraries, I'd assumed at first that you simply meant 'runs on Windows'. But the others have answered that part.


Native Win32

You can always use IHtmlDocument2. This is built-in to Windows at this point. With this COM interface, you get native access to a powerful DOM parser (IE's DOM parser!).


Use Html Agility Pack for .NET

Update

Since you need something native/antique, and the markup is likely bad, I would recommend running the markup through Tidy and then parsing it with Xerces