UnicodeDecodeError when performing os.walk

python unicode encoding utf-8 utf-16

Right I just spent some time sorting through this error, and wordier answers here aren't getting at the underlying issue:

The problem is that if you pass a unicode string into os.walk(), os.walk starts getting unicode back from os.listdir(), and any byte-string name that has to be combined with that unicode gets implicitly decoded as ASCII (hence the 'ascii' decode error). When it hits a filename with a byte the ASCII codec can't translate, it throws the exception.

The solution is to force the starting path you pass to os.walk to be a regular string - i.e. os.walk(str(somepath)). This means os.listdir returns regular byte-like strings and everything works the way it should.

You can reproduce this problem (and show that its solution works) trivially:

Go into bash in some directory and run touch $(echo -e "\x8b\x8bThis is a bad filename") which will make some test files.

Now run the following Python code (IPython Qt is handy for this) in the same directory:

    import os

    l = []
    for root, dirs, filenames in os.walk(unicode('.')):
        l.extend([os.path.join(root, f) for f in filenames])
    print l

And you'll get a UnicodeDecodeError.

Now try running:

    import os

    l = []
    for root, dirs, filenames in os.walk('.'):
        l.extend([os.path.join(root, f) for f in filenames])
    print l

No error and you get a print out!

Thus the safe way in Python 2.x is to make sure you only pass byte strings (plain str) to os.walk(). You absolutely should not pass unicode or things which might be unicode to it, because os.walk will then choke when an internal ascii conversion fails.
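A minimal sketch of the byte-string approach, written so that it should run on both Python 2.7 and Python 3. The helper name collect_paths is made up for illustration. Because the starting path is bytes, every root and filename comes back as bytes too, so os.path.join never has to decode anything:

```python
import os


def collect_paths(top=b'.'):
    """Return every file path under `top` as a byte string."""
    paths = []
    for root, dirs, filenames in os.walk(top):
        # root and filenames are bytes because top is bytes,
        # so joining them never triggers an implicit decode.
        paths.extend(os.path.join(root, f) for f in filenames)
    return paths
```

Calling collect_paths(b'.') in the test directory from above should list the odd filenames without raising. Decode the results only at the point where you need to display them.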

This problem stems from two underlying issues. The first is the fact that the Python 2.x default encoding is 'ascii', while the default Linux filesystem encoding is 'utf8'. You can verify these encodings via:

    import sys

    sys.getdefaultencoding()     # Python
    sys.getfilesystemencoding()  # OS

When os module functions that return directory contents, namely os.walk and os.listdir, are given a unicode path and the directory contains both ascii-only and non-ascii filenames, the ascii-encoded filenames are converted automatically to unicode. The others are not. Therefore, the result is a list containing a mix of unicode and str objects. It is the str objects that can cause problems down the line. Since they are not ascii, Python has no way of knowing what encoding to use, and therefore they can't be decoded automatically into unicode.

Therefore, when performing common operations such as os.path.join(dir, file), where dir is unicode and file is an encoded str, the call will fail if the file is not ascii-encoded (the default). The solution is to check each filename as soon as it is retrieved and decode the str (encoded) objects to unicode using the appropriate encoding.
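The failure can be seen in isolation. When Python 2 joins a unicode with a str, it decodes the str using the default 'ascii' codec. The sketch below performs that same decode explicitly, so it should behave the same on Python 2.7 and Python 3. The helper name ascii_decode_error is hypothetical:

```python
def ascii_decode_error(raw_name):
    """Return the message raised by an ASCII decode of raw_name, or None."""
    try:
        raw_name.decode('ascii')
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# A name made of ASCII bytes decodes cleanly; one with 0x8b does not.
print(ascii_decode_error(b'abc'))
print(ascii_decode_error(b'\x8b\x8bThis'))
```

The second call reports the same kind of 'ascii' codec error that os.path.join surfaces in Python 2 when the pieces are of mixed type.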

That's the first problem and its solution. The second is a bit trickier. Since the files originally came from a Windows system, their filenames probably use an encoding called windows-1252. An easy means of checking is to call:

filename.decode('windows-1252')

If a valid unicode version results, you may have the correct encoding. Because windows-1252 assigns a character to almost every byte value, a successful decode alone proves little, so verify by calling print on the unicode version and checking that the correct filename is rendered.
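To compare candidate encodings side by side, something like the following sketch may help. The function name try_decodings and its default list of encodings are assumptions for illustration, not part of any library:

```python
def try_decodings(raw_name, encodings=('utf-8', 'windows-1252')):
    """Map each candidate encoding to the decoded text, or None on failure."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw_name.decode(enc)
        except UnicodeDecodeError:
            # This encoding cannot represent the byte sequence.
            results[enc] = None
    return results
```

A name that fails as utf-8 but decodes as windows-1252 is a likely Windows leftover. A name that decodes under both still needs a visual check, since valid utf-8 bytes also decode as windows-1252, only into the wrong characters.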

One last wrinkle. In a Linux system with files of Windows origin, it is possible or even probable to have a mix of windows-1252 and utf8 encodings. There are two means of dealing with this mixture. The first and preferable is to run:

$ convmv -f windows-1252 -t utf8 -r DIRECTORY --notest

where DIRECTORY is the one containing the files needing conversion. This command will convert any windows-1252 encoded filenames to utf8. It does a smart conversion, in that if a filename is already utf8 (or ascii), it will do nothing.

The alternative (if one cannot do this conversion for some reason) is to do something similar on the fly in python. To wit:

    def decodeName(name):
        if type(name) == str:  # leave unicode ones alone
            try:
                name = name.decode('utf8')
            except UnicodeDecodeError:
                name = name.decode('windows-1252')
        return name

The function tries a utf8 decoding first. If it fails, then it falls back to the windows-1252 version. Use this function after an os call returning a list of files:

    import os

    for root, dirs, files in os.walk(path):
        files = [decodeName(f) for f in files]
        # do something with the unicode filenames now

I personally found the entire subject of unicode and encoding very confusing, until I read this wonderful and simple tutorial:

http://farmdev.com/talks/unicode/

I highly recommend it for anyone struggling with unicode issues.

I can reproduce the os.listdir() behavior: os.listdir(unicode_name) returns undecodable entries as bytes on Python 2.7:

    >>> import os
    >>> os.listdir(u'.')
    [u'abc', '<--\x8b-->']

Notice: the second name is a bytestring despite listdir()'s argument being a Unicode string.

A big question remains however - how can this be solved without resorting to this hack?

Python 3 handles bytes in filenames that cannot be decoded with the filesystem's character encoding via the surrogateescape error handler (os.fsencode/os.fsdecode). See PEP 383: Non-decodable Bytes in System Character Interfaces:

    >>> os.listdir(u'.')
    ['abc', '<--\udc8b-->']

Notice: both strings are Unicode (Python 3), and the surrogateescape error handler was used for the second name. To get the original bytes back:

    >>> os.fsencode('<--\udc8b-->')
    b'<--\x8b-->'
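The round trip can also be sketched without touching the filesystem. This is Python 3 only and assumes a UTF-8 filesystem encoding, which is why the codec is spelled out here. os.fsdecode and os.fsencode pick the actual filesystem encoding and error handler for you:

```python
raw = b'<--\x8b-->'

# The undecodable byte 0x8b becomes the lone surrogate U+DC8B.
text = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original byte exactly.
restored = text.encode('utf-8', 'surrogateescape')

print(ascii(text))
print(restored == raw)
```

Because the escaped name is not valid text, encoding it with the default strict handler raises an error, so keep the surrogateescape handler on any path that writes such names back to the OS.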

In Python 2, use Unicode strings for filenames on Windows (Unicode API), OS X (utf-8 is enforced) and use bytestrings on Linux and other systems.
