Using xmllint and xpath with a less-than-perfect HTML document?


You can enable the HTML parser in xmllint with the --html command-line option. That parser tolerates markup the strict XML parser rejects, so you will be able to process HTML documents.


If it does not abort the parsing, you can just hide the errors by redirecting stderr:

2>/dev/null
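
Putting the two pieces together, here is a minimal sketch. The file name sample.html, its contents, and the XPath expression are made up for illustration; substitute your own document and query.

```shell
# Hypothetical sample file with deliberately sloppy HTML (unclosed <p>, no doctype).
cat > sample.html <<'EOF'
<html><body>
<p>First paragraph
<p>Second <a href="https://example.com/">link</a>
</body></html>
EOF

# --html selects the lenient HTML parser, --xpath evaluates the expression,
# and 2>/dev/null discards whatever the parser reports on stderr.
xmllint --html --xpath 'string(//a/@href)' sample.html 2>/dev/null
```

Keep in mind that 2>/dev/null also hides real failures, so drop it while debugging a query that returns nothing.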

Then there is Xidel, which I wrote specifically for picking data out of HTML pages. It is not perfect, though: I have been told about two malformed documents it could not handle.

xidel  html.out -e //yourquery...


You should pre-process the HTML with a lenient parser. (That's the main difference: HTML is allowed a much laxer syntax than XML.) That is, try HTML5-Tidy and let xmllint work on the result:

input HTML -> Tidy -> xmllint -> result
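
As a rough sketch of that pipeline, assuming the tidy binary from HTML Tidy is installed: the sample file and the query are hypothetical. Tidy's XHTML output puts every element in the XHTML namespace, so the XPath below matches on local-name() instead of a bare element name.

```shell
# Hypothetical sample file with sloppy HTML, used only for this illustration.
cat > sample.html <<'EOF'
<html><body>
<p>First paragraph
<p>Second <a href="https://example.com/">link</a>
</body></html>
EOF

# tidy: -q stays quiet, -asxml emits well-formed XHTML, and numeric entities
# avoid named entities such as &nbsp; that the XML parser cannot resolve
# without the DTD.
# xmllint: "-" reads the cleaned document from stdin, now as strict XML.
tidy -q -asxml --numeric-entities yes --show-warnings no sample.html 2>/dev/null \
  | xmllint --xpath 'string(//*[local-name()="a"]/@href)' -
```

Tidy exits with a non-zero status when it had to repair something, which is harmless in a plain pipeline but matters if your script runs with pipefail enabled.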