Python XML Parsing without root
ElementTree.fromstringlist accepts an iterable (that yields strings). Using it with itertools.chain:
    import itertools
    import xml.etree.ElementTree as ET
    # import xml.etree.cElementTree as ET

    with open('xml-like-file.xml') as f:
        it = itertools.chain('<root>', f, '</root>')
        root = ET.fromstringlist(it)

    # Do something with `root`
    root.find('.//tag3')
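For trying the idea without a file on disk, here is a minimal, self-contained sketch of the same wrapping technique. The sample data and the helper name `parse_rootless` are made up for illustration, and `io.StringIO` merely stands in for the open file; the wrapper tags are passed as one-element lists so that each is fed to the parser as a whole string rather than character by character.

```python
import io
import itertools
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for the contents of 'xml-like-file.xml'.
SAMPLE = """<tag1>
  <tag2>
  </tag2>
</tag1>
<tag1>
  <tag3/>
</tag1>
"""


def parse_rootless(lines, wrapper="root"):
    """Parse an iterable of strings that contains several top-level elements.

    `wrapper` is an arbitrary tag name used for the synthetic root element.
    """
    pieces = itertools.chain(["<%s>" % wrapper], lines, ["</%s>" % wrapper])
    return ET.fromstringlist(pieces)


# Iterating a StringIO yields lines, just like iterating an open text file.
root = parse_rootless(io.StringIO(SAMPLE))

top_level_tags = [child.tag for child in root]
has_tag3 = root.find(".//tag3") is not None
print(top_level_tags, has_tag3)
```

The original top-level elements end up as the children of the synthetic root, so iterating over `root` recovers them in document order.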
How about, instead of editing the file, doing something like this:
    import xml.etree.ElementTree as ET

    # open() replaces the Python 2-only file() builtin
    with open("xml-file.xml") as f:
        xml_object = ET.fromstringlist(["<root>", f.read(), "</root>"])
lxml.html can parse fragments:
    from lxml import html

    s = """<tag1>
     <tag2>
     </tag2>
    </tag1>
    <tag1>
     <tag3/>
    </tag1>"""

    doc = html.fromstring(s)
    for thing in doc:
        print(thing)
        for other in thing:
            print(other)

    """
    >>>
    <Element tag1 at 0x3411a80>
    <Element tag2 at 0x3428990>
    <Element tag1 at 0x3428930>
    <Element tag3 at 0x3411a80>
    >>>
    """
Courtesy of this SO answer.
And if there is more than one level of nesting:
    def flatten(nested):
        """recursively flatten nested elements

        yields individual elements
        """
        for thing in nested:
            yield thing
            for other in flatten(thing):
                yield other

    doc = html.fromstring(s)
    for thing in flatten(doc):
        print(thing)
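The same generator also works on standard-library ElementTree elements, since they are iterable over their children too. The sketch below uses a made-up sample string with a synthetic root, and compares the result with the built-in `Element.iter()`, which walks the tree in the same order but yields the starting element first.

```python
import xml.etree.ElementTree as ET


def flatten(nested):
    """Recursively yield every descendant of `nested` in document order."""
    for thing in nested:
        yield thing
        for other in flatten(thing):
            yield other


# Hypothetical sample with several levels of nesting under a synthetic root.
root = ET.fromstring("<root><a><b><c/></b></a><a><d/></a></root>")

flat_tags = [el.tag for el in flatten(root)]

# Element.iter() includes the starting element itself, so drop the first item.
iter_tags = [el.tag for el in root.iter()][1:]

print(flat_tags)
print(flat_tags == iter_tags)
```

If the built-in traversal is enough, `iter()` avoids the recursion; the hand-written generator is mainly useful when each level needs custom handling.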
Similarly, lxml.etree.HTML will parse this. It adds html and body tags:
    from lxml import etree

    d = etree.HTML(s)
    for thing in d.iter():
        print(thing)

    """
    <Element html at 0x3233198>
    <Element body at 0x322fcb0>
    <Element tag1 at 0x3233260>
    <Element tag2 at 0x32332b0>
    <Element tag1 at 0x322fcb0>
    <Element tag3 at 0x3233148>
    """