Win32: How to scrape HTML without regular expressions?

html windows regex winapi screen-scraping

Python:

lxml - faster, perhaps better at parsing bad HTML

BeautifulSoup - if lxml fails on your input, try this.

Ruby: (heard of the following libraries, but never tried them)

Though if your parsers choke, and you can roughly pinpoint what is causing the choking, I frankly think it's okay to use a regex hack to remove that portion before passing it to the parser.
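To make the "regex hack" idea concrete, here is a rough sketch. It assumes the troublesome portion sits between two recognizable markers; the ad-start / ad-end comments and the sample markup are made up for illustration. It uses the standard library's html.parser only as a stand-in so the snippet runs without extra installs; the same pre-cleaning step would apply before handing the string to lxml or BeautifulSoup.

```python
import re
from html.parser import HTMLParser

# Hypothetical markers around the block that makes the parser choke.
BAD_BLOCK = re.compile(r"<!-- ad-start -->.*?<!-- ad-end -->", re.DOTALL)

# Made-up sample page with a mangled block in the middle.
RAW = "<p>before</p><!-- ad-start --><div <<span>junk</b><!-- ad-end --><p>after</p>"


class TextCollector(HTMLParser):
    """Collects the non-empty text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_bad_block(html):
    # The regex only cuts out the known-bad region; it does no parsing.
    return BAD_BLOCK.sub("", html)


def extract_text(html):
    # The real structural work is still left to a parser.
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


cleaned = strip_bad_block(RAW)
texts = extract_text(cleaned)
```

The point is that the regex is confined to one narrow job (deleting a region you have already identified), while everything else goes through the parser.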

If you do decide on using lxml, here are some XPath tutorials that you may find useful. The lxml tutorials kind of assume that you know what XPath is (which I didn't when I first read them.)

Edit: Your post has really grown since it first came out... I'll try to answer what I can.

I don't think XPath can select higher level nodes based on criteria of lower level nodes:

It can. Try //div[@class='vehicleInfo']/parent::div[@class='used_result_container']. Use ancestor if you need to go up more levels. lxml also provides a getparent() method on its search results, and you could use that too. Really, you should look at the XPath sites I linked; you can probably solve your problems from there.
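A small sketch of both approaches with lxml, in case it helps. The markup and the id attributes are invented for the example; only the two class names come from the XPath above.

```python
from lxml import html

# Made-up page: three containers, two of which hold a vehicleInfo div.
PAGE = """<html><body>
<div class="used_result_container" id="r1">
  <div class="vehicleInfo">Car A</div>
</div>
<div class="used_result_container" id="r2">
  <div class="otherInfo">No vehicle here</div>
</div>
<div class="new_result_container" id="r3">
  <div class="vehicleInfo">Car B</div>
</div>
</body></html>"""

doc = html.fromstring(PAGE)

# Parent axis: find the child first, then step up to a parent that
# also matches the class test.
via_axis = doc.xpath(
    "//div[@class='vehicleInfo']/parent::div[@class='used_result_container']"
)

# getparent(): step up from each match in Python instead. Note that this
# applies no class filter, so it also returns the new_result_container div.
via_getparent = [el.getparent() for el in doc.xpath("//div[@class='vehicleInfo']")]

axis_ids = [el.get("id") for el in via_axis]
parent_ids = [el.get("id") for el in via_getparent]
```

With this sample input, the parent-axis query keeps only the container whose class matches, while getparent() hands back whatever the parent happens to be, so you would filter those yourself if needed.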

How then do you access repeating structures of data?

It would seem that DOM queries are exactly suited to your needs. XPath queries return a list of the elements found -- what more could you want? And despite its name, lxml does accept 'loose HTML'. Moreover, the parser recognizes the 'sign-posts' in the HTML and structures the whole document accordingly, so you don't have to do it yourself.
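For repeating structures, one pattern that should work is to select every container with one query and then run relative queries (starting with a dot) inside each result. This is only a sketch: the span class names "name" and "price" and the sample data are assumptions, not taken from your page.

```python
from lxml import html

# Made-up page with two repeated result blocks.
PAGE = """<html><body>
<div class="used_result_container">
  <div class="vehicleInfo">
    <span class="name">Car A</span><span class="price">1000</span>
  </div>
</div>
<div class="used_result_container">
  <div class="vehicleInfo">
    <span class="name">Car B</span><span class="price">2500</span>
  </div>
</div>
</body></html>"""


def scrape(page):
    doc = html.fromstring(page)
    rows = []
    # One query returns a list with one element per repeated block.
    for container in doc.xpath("//div[@class='used_result_container']"):
        # './/' keeps each lookup inside the current container.
        name = container.xpath("string(.//span[@class='name'])")
        price = container.xpath("string(.//span[@class='price'])")
        rows.append({"name": name.strip(), "price": price.strip()})
    return rows


rows = scrape(PAGE)
```

Each dictionary in rows corresponds to one repeated block, so the "repetition" is handled by an ordinary loop over the query result.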

Yes, you still have to do a search on the structure, but at a higher level of abstraction. If the site designers decide to do a page overhaul and completely change the names and structure of their divs, then that's too bad, you have to rewrite your queries, but it should take less time than rewriting your regex. Nothing will do it automatically for you, unless you want to write some AI capabilities into your page-scraper...

I apologize for not providing 'native Win32' libraries; I'd assumed at first that you simply meant 'runs on Windows'. But the others have answered that part.

Native Win32

You can always use IHTMLDocument2. This is built into Windows at this point. With this COM interface, you get native access to a powerful DOM parser (IE's DOM parser!).

Use Html Agility Pack for .NET

Update

Since you need something native/antique, and the markup is likely bad, I would recommend running the markup through Tidy and then parsing it with Xerces.
