
Python: Unicode and ElementTree.parse


Your problem is that you are feeding ElementTree unicode, but it prefers to consume bytes. It will provide you with unicode in any case.

In Python 2.x, it can only consume bytes. You can tell it what encoding those bytes are in, but that's it. So, if you literally have to work with an object that represents a text file, like io.StringIO, first you will need to convert it into something else.

If you are literally starting with a 2.x-str (AKA bytes) in UTF-8 encoding, in memory, as in your example, use xml.etree.cElementTree.XML to parse it into an Element in one fell swoop and don't worry about any of this :-).
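Here is a minimal sketch of that in-memory case. It is written for Python 3 and uses xml.etree.ElementTree rather than cElementTree; the sample document and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# UTF-8 encoded bytes held in memory (hypothetical sample document).
source = "<page><title>caf\u00e9</title></page>".encode("utf-8")

# XML() takes the bytes as they are and returns the root Element.
# With no encoding declaration, the parser assumes UTF-8.
root = ET.XML(source)

# Text comes back out as unicode (str in Python 3).
title = root.find("title").text
print(repr(title))
```

The input is bytes, and the text you read from the tree is already decoded.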

If you want an interface that can deal with data that is incrementally read from a file, use xml.etree.cElementTree.parse with an io.BytesIO to convert it into an in-memory stream of bytes rather than an in-memory string of characters. If you want to use io.open, use it with the b flag, so that you get streams of bytes.
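As a rough sketch of the stream-based route (Python 3, xml.etree.ElementTree; the document content and names below are hypothetical), wrapping the bytes in io.BytesIO gives parse() a file-like object to read from:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical UTF-8 document, as bytes.
raw = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<api><categorymembers>"
    '<cm pageid="1"/><cm pageid="2"/>'
    "</categorymembers></api>"
).encode("utf-8")

# BytesIO presents the bytes as a binary stream; parse() reads from it
# just as it would from a file opened in binary mode ("rb").
stream = io.BytesIO(raw)
tree = ET.parse(stream)

members = tree.find("categorymembers")
page_ids = [cm.get("pageid") for cm in members]
print(page_ids)
```

For a file on disk, the same parse() call should work with the object returned by open(path, "rb") or io.open(path, "rb").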

In Python 3.x, you can pass unicode directly in to ElementTree, which is a bit more convenient, and arguably the newer version of ElementTree is more correct to allow this. However, you still might not want to, and Python 3's version does still accept bytes as input. You're always starting with bytes anyway: by passing them directly from your input source to ElementTree, you get to let it do its encoding or decoding intelligently inside the XML parsing engine, as well as do on-the-fly detection of encoding declarations within the input stream, which you can do with XML but you can't do with arbitrary textual data. So letting the XML parser do the work of decoding is the right place to put that responsibility.
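To illustrate the point about encoding declarations, here is a small sketch (Python 3; the document is invented for the example). The bytes are ISO-8859-1, not UTF-8, and the parser works that out from the XML declaration:

```python
import xml.etree.ElementTree as ET

# Hypothetical document that declares a non-UTF-8 encoding.
text = '<?xml version="1.0" encoding="ISO-8859-1"?><name>Jos\u00e9</name>'
raw = text.encode("iso-8859-1")

# Given bytes, the parser reads the declaration and decodes accordingly.
root = ET.fromstring(raw)
name = root.text

# Decoding the same bytes by hand with a guessed encoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(name), utf8_ok)
```

If you decode the stream yourself before parsing, you have to know the encoding up front; handing the parser raw bytes avoids that guess.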


I encountered the same problem as you in Python 2.6.

It seems that cElementTree.parse handles "utf-8" input differently in Python 2.x and 3.x. In Python 2.x, we can pass an XMLParser with an explicit encoding to the parse call. For example:

    import xml.etree.cElementTree as etree

    parser = etree.XMLParser(encoding="utf-8")
    targetTree = etree.parse("./targetPageID.xml", parser=parser)
    pageIds = targetTree.find("categorymembers")
    print "pageIds:", etree.tostring(pageIds)

You can refer to this page for the XMLParser method (Section "XMLParser"): http://effbot.org/zone/elementtree-13-intro.htm

While the following method works for Python 3.x version:

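    # cElementTree was removed in Python 3.9; plain ElementTree works on all 3.x
    import xml.etree.ElementTree as etree
    import codecs

    # Open the file as decoded text; Python 3's parse() accepts a text stream
    target_file = codecs.open("./targetPageID.xml", mode='r', encoding='utf-8')
    targetTree = etree.parse(target_file)
    target_file.close()
    pageIds = targetTree.find("categorymembers")
    # print is a function in Python 3
    print("pageIds:", etree.tostring(pageIds))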

Hope this can help you.


Can't you use

doc = ET.fromstring(source)

in your first example?