Python3 UnicodeDecodeError with readlines() method


I think the best answer (in Python 3) is to use the errors= parameter:

with open('evil_unicode.txt', 'r', errors='replace') as f:
    lines = f.readlines()

Proof:

>>> s = b'\xe5abc\nline2\nline3'
>>> with open('evil_unicode.txt','wb') as f:
...     f.write(s)
...
16
>>> with open('evil_unicode.txt', 'r') as f:
...     lines = f.readlines()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 0: invalid continuation byte
>>> with open('evil_unicode.txt', 'r', errors='replace') as f:
...     lines = f.readlines()
...
>>> lines
['�abc\n', 'line2\n', 'line3']
>>>

Note that errors= accepts values such as 'replace' or 'ignore'. Here's what 'ignore' looks like:

>>> with open('evil_unicode.txt', 'r', errors='ignore') as f:
...     lines = f.readlines()
...
>>> lines
['abc\n', 'line2\n', 'line3']
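To compare the handlers side by side without touching the disk, here is a minimal sketch that decodes the same bytes directly with bytes.decode, which takes the same errors= names as open. The helper decode_with is a hypothetical name introduced for illustration, and 'backslashreplace' is an extra handler not mentioned above (for decoding it needs Python 3.5 or later).

```python
# Same bytes as in the session above: 0xe5 starts a multi-byte UTF-8 sequence,
# but the next byte ('a') is not a valid continuation byte.
raw = b'\xe5abc\nline2\nline3'


def decode_with(handler):
    """Decode raw as UTF-8 with the named error handler (hypothetical helper)."""
    return raw.decode('utf-8', errors=handler)


replaced = decode_with('replace')          # bad byte becomes U+FFFD
ignored = decode_with('ignore')            # bad byte is silently dropped
escaped = decode_with('backslashreplace')  # bad byte becomes the literal text \xe5

try:
    decode_with('strict')                  # 'strict' is the default and raises
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(replaced))
print(repr(ignored))
print(repr(escaped))
print('strict raised:', strict_failed)
```

Both 'replace' and 'ignore' lose information about the original byte, so they fit best when the damaged characters do not matter; if the real encoding is known, passing it is usually the safer choice.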


Your default encoding appears to be ASCII, whereas the input is more than likely UTF-8. When the decoder hits non-ASCII bytes in the input, it raises the exception. readlines itself isn't really responsible for the problem; it merely triggers the read and decode, and the decode is what fails.

It's an easy fix, though: the built-in open in Python 3 lets you provide the known encoding of the input, replacing the default (ASCII in your case) with any other recognized encoding. Providing it lets you keep reading as str (rather than as bytes objects, which hold raw binary data and behave quite differently), while Python does the work of converting raw disk bytes to true text data:

# Using with statement closes the file for us without needing to remember to close
# explicitly, and closes even when exceptions occur
with open(argfile, encoding='utf-8') as inf:
    f = inf.readlines()
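As a rough, self-contained illustration of the point above, the sketch below writes a small UTF-8 file to a temporary path (a stand-in for argfile), then reads it once with encoding='utf-8' and once with encoding='ascii'. The sample text and variable names are assumptions made for the example; locale.getpreferredencoding(False) is shown only to reveal what open would typically fall back to when encoding= is omitted.

```python
import locale
import os
import tempfile

# Sample text with non-ASCII characters (an assumption for this demo).
text = 'caf\u00e9\nna\u00efve\n'

# Write the sample as UTF-8 bytes to a temporary file standing in for argfile.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as out:
    out.write(text.encode('utf-8'))

try:
    # The encoding open() typically uses when encoding= is not given.
    default_encoding = locale.getpreferredencoding(False)
    print('default encoding:', default_encoding)

    # Explicit UTF-8: decoding succeeds and readlines() returns str objects.
    with open(path, encoding='utf-8') as inf:
        lines = inf.readlines()
    print(lines)

    # Forcing ASCII reproduces the failure described above.
    try:
        with open(path, encoding='ascii') as inf:
            inf.readlines()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True
    print('ascii failed:', ascii_failed)
finally:
    os.remove(path)
```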


Ended up finding a working answer for myself:

filename=open(argfile, 'rb')

This post helped me out a lot.
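Worth noting: opening with 'rb' avoids the exception because no decoding happens at all, so readlines() returns bytes objects rather than str. A minimal sketch of what that looks like, and of decoding each line afterwards if text is needed, follows; the temporary file stands in for argfile, and the choice of UTF-8 with errors='replace' is an assumption for the demo.

```python
import os
import tempfile

# Same problematic bytes as in the first answer.
raw = b'\xe5abc\nline2\nline3'

# Temporary file standing in for argfile.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as out:
    out.write(raw)

try:
    # Binary mode: bytes come back untouched, so nothing can fail to decode.
    with open(path, 'rb') as fh:
        byte_lines = fh.readlines()
finally:
    os.remove(path)

print(byte_lines)

# If str is needed later, decode explicitly and choose the error handling.
text_lines = [line.decode('utf-8', errors='replace') for line in byte_lines]
print(text_lines)
```

The trade-off is that every later string operation has to work with bytes (for example, comparing against b'abc' instead of 'abc') until the data is decoded.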