Why doesn't xpath work when processing an XHTML document with lxml (in python)?

python xml xhtml xpath lxml

The problem is the namespaces. When parsed as XML, the img tag is in the http://www.w3.org/1999/xhtml namespace since that is the default namespace for the element. You are asking for the img tag in no namespace.

Try this:

>>> tree.getroot().xpath(
...     "//xhtml:img",
...     namespaces={'xhtml':'http://www.w3.org/1999/xhtml'}
...     )
[<Element {http://www.w3.org/1999/xhtml}img at 11a29e0>]
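For anyone who wants to reproduce this without the original page, here is a minimal self-contained sketch. The tiny inline document and the variable names are assumptions made for illustration; only the namespace URI and the XPath expressions come from the answer above.

```python
from io import BytesIO
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical stand-in for the XHTML page from the question.
DOC = b"""<html xmlns="http://www.w3.org/1999/xhtml">
  <body><img src="logo.png" alt="logo"/></body>
</html>"""

tree = etree.parse(BytesIO(DOC))
root = tree.getroot()

# Unprefixed name: XPath looks for img in no namespace and finds nothing.
no_ns = root.xpath("//img")

# Prefixed name: the prefix is bound to the XHTML namespace URI.
with_ns = root.xpath("//xhtml:img", namespaces={"xhtml": XHTML_NS})

print(len(no_ns))      # number of matches without a prefix
print(with_ns[0].tag)  # fully qualified tag of the matched element
```

The first query comes back empty even though the document clearly contains an img element, which is exactly the symptom described in the question.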

XPath considers all unprefixed names to be in "no namespace".

In particular the spec says:

"A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded)."

See these two detailed explanations of the problem and its solution: here and here. The solution is to bind a prefix to the namespace URI (through the API that's being used) and to put that prefix on every otherwise unprefixed element name in the XPath expression.
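As a rough sketch of that prefix-binding approach in lxml: the prefix itself is arbitrary, since only the URI it maps to matters, and attribute names stay unprefixed because they are in no namespace, as the quoted passage notes. The sample document and the names NS, find_srcs and srcs are assumptions for illustration.

```python
from lxml import etree

# Hypothetical sample document using the XHTML default namespace.
DOC = (
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<body><img src="a.png"/><img src="b.png"/></body>'
    b'</html>'
)
root = etree.fromstring(DOC)

# Any prefix works as long as it is bound to the right URI.
NS = {"h": "http://www.w3.org/1999/xhtml"}

# Element names carry the prefix; the src attribute does not,
# because unprefixed attributes are in no namespace.
find_srcs = etree.XPath("//h:body/h:img/@src", namespaces=NS)
srcs = [str(s) for s in find_srcs(root)]

# A different prefix bound to the same URI selects the same elements.
same_imgs = root.xpath("//x:img", namespaces={"x": NS["h"]})

print(srcs)
print(len(same_imgs))
```

Compiling the expression once with etree.XPath is optional; passing the same namespaces mapping to the element's xpath() method works equally well.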

Hope this helped.

Cheers,

Dimitre Novatchev

If you are only going to use tags from a single namespace, as appears to be the case above, you are much better off using lxml.objectify.

In your case it would look like this:

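from lxml import objectify
root = objectify.parse(url).getroot()  # parse() returns a tree, so take its root; objectify.fromstring is also available

You can access the nodes as

body = root.body  # root is the <html> element itself
for img in body.img:  # assuming the images are direct children of the body tag
    print(img.get('src'))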
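To try the objectify approach without fetching a URL, here is a small sketch that parses an inline string instead. The sample markup and variable names are assumptions; the point it illustrates is that attribute-style child lookup uses the parent's namespace, so no prefixes are needed.

```python
from lxml import objectify

# Hypothetical XHTML snippet; the images sit directly under <body>.
DOC = (
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<body><img src="a.png"/><img src="b.png"/></body>'
    b'</html>'
)

root = objectify.fromstring(DOC)  # root is the <html> element

# root.body resolves to the body child in the same namespace as root.
body = root.body

# Iterating over body.img walks all img siblings under body.
srcs = [img.get("src") for img in body.img]
print(srcs)
```

Note that this style only reaches direct children, so images nested deeper in the tree would still call for an XPath query with a bound prefix.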

While it might not be of great help with HTML, it can be highly useful with well-structured XML.

For more info, check out http://lxml.de/objectify.html
