Python: Unicode and ElementTree.parse

python xml unicode python-3.x

Your problem is that you are feeding ElementTree unicode, but it prefers to consume bytes. It will provide you with unicode in any case.

In Python 2.x, it can only consume bytes. You can tell it what encoding those bytes are in, but that's it. So, if you literally have to work with an object that represents a text file, like io.StringIO, first you will need to convert it into something else.

If you are literally starting with a 2.x-str (AKA bytes) in UTF-8 encoding, in memory, as in your example, use xml.etree.cElementTree.XML to parse it into an Element in one fell swoop and don't worry about any of this :-).
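As a rough sketch of that one-step parse, written for Python 3 (where the parser lives in xml.etree.ElementTree); the sample document and the variable names are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory document, already encoded as UTF-8 bytes.
data = "<root><item>naïve</item></root>".encode("utf-8")

# ET.XML parses a complete document held in memory and returns the root Element.
root = ET.XML(data)

# Text comes back as a unicode string even though the input was bytes.
text = root.find("item").text
print(repr(text))
```

With no encoding declaration in the document, the parser assumes UTF-8, which matches the bytes here.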

If you want an interface that can deal with data that is incrementally read from a file, use xml.etree.cElementTree.parse with an io.BytesIO to convert it into an in-memory stream of bytes rather than an in-memory string of characters. If you want to use io.open, use it with the b flag, so that you get streams of bytes.
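A minimal sketch of both options, assuming Python 3 and xml.etree.ElementTree; the sample document, the temporary file name, and the variable names are illustrative only:

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical document; the encoding declaration travels with the bytes.
xml_bytes = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<root><name>café</name></root>"
).encode("utf-8")

# Option 1: wrap in-memory bytes in a byte stream and hand it to parse().
tree = ET.parse(io.BytesIO(xml_bytes))
name_from_memory = tree.getroot().find("name").text

# Option 2: open a file on disk in binary mode ("b" flag) so parse() sees bytes.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "wb") as f:
        f.write(xml_bytes)
    with io.open(path, "rb") as stream:
        name_from_file = ET.parse(stream).getroot().find("name").text

print(name_from_memory, name_from_file)
```

In both cases the stream yields bytes, so decoding is left to the XML parser.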

In Python 3.x, you can pass unicode directly in to ElementTree, which is a bit more convenient, and arguably the newer version of ElementTree is more correct to allow this. However, you still might not want to, and Python 3's version does still accept bytes as input. You're always starting with bytes anyway: by passing them directly from your input source to ElementTree, you get to let it do its encoding or decoding intelligently inside the XML parsing engine, as well as do on-the-fly detection of encoding declarations within the input stream, which you can do with XML but you can't do with arbitrary textual data. So letting the XML parser do the work of decoding is the right place to put that responsibility.
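To illustrate the point about encoding declarations, here is a small sketch, assuming Python 3; the ISO-8859-1 sample and the variable names are invented for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical document stored as ISO-8859-1 bytes, with a matching declaration.
latin1_bytes = (
    '<?xml version="1.0" encoding="iso-8859-1"?><root>é</root>'
).encode("iso-8859-1")

# Given raw bytes, the parser reads the declaration and decodes accordingly.
root_from_bytes = ET.fromstring(latin1_bytes)

# Python 3 also accepts text that has already been decoded.
root_from_text = ET.fromstring("<root>é</root>")

print(root_from_bytes.text == root_from_text.text)
```

Decoding those same bytes yourself as UTF-8 before parsing would fail, because a lone 0xE9 byte is not valid UTF-8; handing the bytes to the parser avoids having to know the encoding up front.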


I encountered the same problem as you in Python 2.6.

It seems that cElementTree.parse handles UTF-8 input differently in Python 2.x and 3.x. In Python 2.x, we can pass an XMLParser with an explicit encoding so that the parser decodes the bytes itself. For example:

    import xml.etree.cElementTree as etree
    parser = etree.XMLParser(encoding="utf-8")
    targetTree = etree.parse("./targetPageID.xml", parser=parser)
    pageIds = targetTree.find("categorymembers")
    print "pageIds:", etree.tostring(pageIds)

You can refer to this page for the XMLParser method (Section "XMLParser"): http://effbot.org/zone/elementtree-13-intro.htm

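The following method works in Python 3.x:

    # xml.etree.cElementTree was removed in Python 3.9; ElementTree replaces it
    import xml.etree.ElementTree as etree
    import codecs
    # Open the file as decoded text; Python 3's parser accepts unicode input
    target_file = codecs.open("./targetPageID.xml", mode='r', encoding='utf-8')
    targetTree = etree.parse(target_file)
    pageIds = targetTree.find("categorymembers")
    # print is a function in Python 3; encoding="unicode" returns str, not bytes
    print("pageIds:", etree.tostring(pageIds, encoding="unicode"))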

Hope this can help you.


Can't you use

doc = ET.fromstring(source)

in your first example?
