Why doesn't xpath work when processing an XHTML document with lxml (in python)? Why doesn't xpath work when processing an XHTML document with lxml (in python)? python python

Why doesn't xpath work when processing an XHTML document with lxml (in python)?


The problem is the namespaces. When parsed as XML, the img tag is in the http://www.w3.org/1999/xhtml namespace since that is the default namespace for the element. You are asking for the img tag in no namespace.

Try this:

>>> tree.getroot().xpath(...     "//xhtml:img", ...     namespaces={'xhtml':'http://www.w3.org/1999/xhtml'}...     )[<Element {http://www.w3.org/1999/xhtml}img at 11a29e0>]


XPath considers all unprefixed names to be in "no namespace".

In particular the spec says:

"A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). "

See those two detailed explanations of the problem and its solution: here and here. The solution is to associate a prefix (with the API that's being used) and to use it to prefix any unprefixed name in the XPath expression.

Hope this helped.

Cheers,

Dimitre Novatchev


If you are going to use tags from a single namespace only, as I see it the case above, you are much better off using lxml.objectify.

In your case it would be like

from lxml import objectifyroot = objectify.parse(url) #also available: fromstring

You can access the nodes as

root.htmlbody = root.html.bodyfor img in body.img: #Assuming all images are within the body tag

While it might not be of great help in html, it can be highly useful in well structured xml.

For more info, check out http://lxml.de/objectify.html