u'\ufeff' in Python string


I ran into this on Python 3 and found this question (and solution). When opening a file, Python 3 supports the encoding keyword, which handles the decoding for you.

Without it, the BOM is included in the read result:

    >>> f = open('file', mode='r')
    >>> f.read()
    '\ufefftest'

Giving the correct encoding, the BOM is omitted in the result:

    >>> f = open('file', mode='r', encoding='utf-8-sig')
    >>> f.read()
    'test'
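To try the two calls above without an existing file, here is a minimal sketch for Python 3. The temporary file and the variable names `with_bom` and `without_bom` are made up for illustration; the plain `utf-8` codec is passed explicitly so the result does not depend on the platform's default encoding.

```python
import os
import tempfile

# Write a small UTF-8 file that starts with a BOM (bytes EF BB BF).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"\xef\xbb\xbftest")
    path = tmp.name

try:
    # Plain 'utf-8' keeps the BOM as the first character of the text.
    with open(path, mode="r", encoding="utf-8") as f:
        with_bom = f.read()

    # 'utf-8-sig' drops a leading BOM if there is one.
    with open(path, mode="r", encoding="utf-8-sig") as f:
        without_bom = f.read()
finally:
    os.remove(path)

print(repr(with_bom))     # '\ufefftest'
print(repr(without_bom))  # 'test'
```

Reading a file that has no BOM with `utf-8-sig` also works, so it is a reasonable choice when you are not sure whether the BOM is there.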

Just my 2 cents.


The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Examples:

    #!python2
    #coding: utf8
    u = u'ABC'
    e8 = u.encode('utf-8')        # encode without BOM
    e8s = u.encode('utf-8-sig')   # encode with BOM
    e16 = u.encode('utf-16')      # encode with BOM
    e16le = u.encode('utf-16le')  # encode without BOM
    e16be = u.encode('utf-16be')  # encode without BOM
    print 'utf-8     %r' % e8
    print 'utf-8-sig %r' % e8s
    print 'utf-16    %r' % e16
    print 'utf-16le  %r' % e16le
    print 'utf-16be  %r' % e16be
    print
    print 'utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8')
    print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
    print 'utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16')
    print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')

Note that EF BB BF is a UTF-8-encoded BOM. It is not required for UTF-8 and serves only as a signature (usually on Windows).

Output:

    utf-8     'ABC'
    utf-8-sig '\xef\xbb\xbfABC'
    utf-16    '\xff\xfeA\x00B\x00C\x00'    # Adds BOM and encodes using native processor endian-ness.
    utf-16le  'A\x00B\x00C\x00'
    utf-16be  '\x00A\x00B\x00C'

    utf-8  w/ BOM decoded with utf-8     u'\ufeffABC'    # doesn't remove BOM if present.
    utf-8  w/ BOM decoded with utf-8-sig u'ABC'          # removes BOM if present.
    utf-16 w/ BOM decoded with utf-16    u'ABC'          # *requires* BOM to be present.
    utf-16 w/ BOM decoded with utf-16le  u'\ufeffABC'    # doesn't remove BOM if present.

Note that the utf-16 codec relies on the BOM to tell whether the data is big- or little-endian. If the data has no BOM and you know its byte order, decode with utf-16le or utf-16be instead.
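For Python 3 readers, the same checks might be sketched roughly as below. This is an adaptation, not the original script: the BOM is prepended explicitly with the `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF16_BE` constants so the results do not depend on the machine's native byte order, and the variable names are made up for illustration.

```python
import codecs

text = "ABC"

# Encoding: 'utf-8-sig' prepends the signature, plain 'utf-8' does not.
e8 = text.encode("utf-8")        # b'ABC'
e8s = text.encode("utf-8-sig")   # b'\xef\xbb\xbfABC'
e16le = text.encode("utf-16le")  # b'A\x00B\x00C\x00', no BOM
e16be = text.encode("utf-16be")  # b'\x00A\x00B\x00C', no BOM

# Build UTF-16 data with an explicit BOM for each byte order.
le_with_bom = codecs.BOM_UTF16_LE + e16le
be_with_bom = codecs.BOM_UTF16_BE + e16be

# Decoding: only the BOM-aware codecs consume the BOM.
print(repr(e8s.decode("utf-8")))              # '\ufeffABC'
print(repr(e8s.decode("utf-8-sig")))          # 'ABC'
print(repr(le_with_bom.decode("utf-16")))     # 'ABC'
print(repr(be_with_bom.decode("utf-16")))     # 'ABC'
print(repr(le_with_bom.decode("utf-16le")))   # '\ufeffABC'
```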


That character is the BOM, or "Byte Order Mark". It usually arrives as the first few bytes of a file and tells you how to interpret the encoding of the rest of the data. You can simply remove the character and continue. However, since the error says you were trying to convert to 'ascii', you should probably pick another encoding for whatever you were trying to do.
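If the text is already decoded and you only want to drop the character, one way to do it in Python 3 is sketched below. The helper name `strip_bom` is made up for illustration and is not part of any library.

```python
BOM = "\ufeff"


def strip_bom(text: str) -> str:
    """Return text without a single leading U+FEFF, if one is present."""
    if text.startswith(BOM):
        return text[len(BOM):]
    return text


cleaned = strip_bom("\ufefftest")
print(repr(cleaned))           # 'test'
print(cleaned.encode("ascii")) # b'test'
```

Only a leading BOM is removed here. A U+FEFF in the middle of the string is left alone, since it may be a real character in the data.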