Unicode (UTF-8) reading and writing to files in Python Unicode (UTF-8) reading and writing to files in Python python python

Unicode (UTF-8) reading and writing to files in Python


Rather than mess with the encode and decode methods I find it easier to specify the encoding when opening the file. The io module (added in Python 2.6) provides an io.open function, which has an encoding parameter.

Use the open method from the io module.

>>>import io>>>f = io.open("test", mode="r", encoding="utf-8")

Then after calling f's read() function, an encoded Unicode object is returned.

>>>f.read()u'Capit\xe1l\n\n'

Note that in Python 3, the io.open function is an alias for the built-in open function. The built-in open function only supports the encoding argument in Python 3, not Python 2.

Edit: Previously this answer recommended the codecs module. The codecs module can cause problems when mixing read() and readline(), so this answer now recommends the io module instead.

Use the open method from the codecs module.

>>>import codecs>>>f = codecs.open("test", "r", "utf-8")

Then after calling f's read() function, an encoded Unicode object is returned.

>>>f.read()u'Capit\xe1l\n\n'

If you know the encoding of a file, using the codecs package is going to be much less confusing.

See http://docs.python.org/library/codecs.html#codecs.open


In the notation

u'Capit\xe1n\n'

the "\xe1" represents just one byte. "\x" tells you that "e1" is in hexadecimal.When you write

Capit\xc3\xa1n

into your file you have "\xc3" in it. Those are 4 bytes and in your code you read them all. You can see this when you display them:

>>> open('f2').read()'Capit\\xc3\\xa1n\n'

You can see that the backslash is escaped by a backslash. So you have four bytes in your string: "\", "x", "c" and "3".

Edit:

As others pointed out in their answers you should just enter the characters in the editor and your editor should then handle the conversion to UTF-8 and save it.

If you actually have a string in this format you can use the string_escape codec to decode it into a normal string:

In [15]: print 'Capit\\xc3\\xa1n\n'.decode('string_escape')Capitán

The result is a string that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. If you want to have a unicode string you have to decode again with UTF-8.

To your edit: you don't have UTF-8 in your file. To actually see how it would look like:

s = u'Capit\xe1n\n'sutf8 = s.encode('UTF-8')open('utf-8.out', 'w').write(sutf8)

Compare the content of the file utf-8.out to the content of the file you saved with your editor.


Now all you need in Python3 is open(Filename, 'r', encoding='utf-8')

[Edit on 2016-02-10 for requested clarification]

Python3 added the encoding parameter to its open function. The following information about the open function is gathered from here: https://docs.python.org/3/library/functions.html#open

open(file, mode='r', buffering=-1,       encoding=None, errors=None, newline=None,       closefd=True, opener=None)

Encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.

So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as utf8 (which is also now the default encoding of everything done in Python.)