autogenerate HTTP screen scraping Java code autogenerate HTTP screen scraping Java code selenium selenium

autogenerate HTTP screen scraping Java code


I would also add +1 for HtmlUnit since its functionality is very powerful: if you are needing behaviour 'as though a real browser was scraping and using the page' that's definitely the best option available. HtmlUnit executes (if you want it to) the Javascript in the page.

It currently has full featured support for all the main Javascript libraries and will execute JS code using them. Corresponding with that you can get handles to the Javascript objects in page programmatically within your test.

If however the scope of what you are trying to do is less, more along the lines of reading some of the HTML elements and where you dont much care about Javascript, then using NekoHTML should suffice. Its similar to JDom giving programmatic - rather than XPath - access to the tree. You would probably need to use Apache's HttpClient to retrieve pages.


The manageability.org blog has an entry which lists a whole bunch of web page scraping tools for Java. However, I do not seem to be able to reach it right now, but I did find a text only representation in Google's cache here.


You should take a look at HtmlUnit - it was designed for testing websites but works great for screen scraping and navigating through multiple pages. It takes care of cookies and other session-related stuff.