How do I use xml namespaces with find/findall in lxml?
If root.nsmap
contains the table
namespace prefix then you could:
root.xpath('.//table:table', namespaces=root.nsmap)
findall(path)
accepts {namespace}name
syntax instead of namespace:name
. Therefore path
should be preprocessed using namespace dictionary to the {namespace}name
form before passing it to findall()
.
Maybe the first thing to notice is that the namespacesare defined at Element level, not Document level.
Most often though, all namespaces are declared in the document'sroot element (office:document-content
here), which saves us parsing it all to collect inner xmlns
scopes.
Then an element nsmap includes :
- a default namespace, with
None
prefix (not always) - all ancestors namespaces, unless overridden.
If, as ChrisR mentionned, the default namespace is not supported,you can use a dict comprehension to filter it outin a more compact expression.
You have a slightly different syntax for xpath andElementPath.
So here's the code you could use to get all your first table's rows(tested with: lxml=3.4.2
) :
import zipfilefrom lxml import etree# Open and parse the documentzf = zipfile.ZipFile('spreadsheet.ods')tree = etree.parse(zf.open('content.xml'))# Get the root elementroot = tree.getroot()# get its namespace map, excluding default namespacensmap = {k:v for k,v in root.nsmap.iteritems() if k}# use defined prefixes to access elementstable = tree.find('.//table:table', nsmap)rows = table.findall('table:table-row', nsmap)# or, if xpath is needed:table = tree.xpath('//table:table', namespaces=nsmap)[0]rows = table.xpath('table:table-row', namespaces=nsmap)
Here's a way to get all the namespaces in the XML document (and supposing there's no prefix conflict).
I use this when parsing XML documents where I do know in advance what the namespace URLs are, and only the prefix.
doc = etree.XML(XML_string) # Getting all the name spaces. nsmap = {} for ns in doc.xpath('//namespace::*'): if ns[0]: # Removes the None namespace, neither needed nor supported. nsmap[ns[0]] = ns[1] doc.xpath('//prefix:element', namespaces=nsmap)