u'\ufeff' in Python string

python unicode utf-8

I ran into this on Python 3 and found this question (and solution). When opening a file, Python 3 supports the encoding keyword to handle the decoding automatically.

Without it, the file is decoded with the platform default encoding (UTF-8 here), and the BOM is included in the read result:

    >>> f = open('file', mode='r')
    >>> f.read()
    '\ufefftest'

Giving the correct encoding, the BOM is omitted in the result:

    >>> f = open('file', mode='r', encoding='utf-8-sig')
    >>> f.read()
    'test'

Just my 2 cents.
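To reproduce the difference without relying on an existing file, here is a minimal sketch that writes a small BOM-prefixed file to a temporary directory and reads it back both ways. The file name `bom_demo.txt` and the helper `read_with` are made up for illustration, and `utf-8` is passed explicitly for the first read because the default encoding depends on the platform.

```python
import os
import tempfile


def read_with(path, encoding):
    """Read the whole file as text using the given encoding."""
    with open(path, mode='r', encoding=encoding) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, used only for this demo.
    path = os.path.join(tmp, 'bom_demo.txt')

    # Write the UTF-8 BOM (EF BB BF) followed by the text.
    with open(path, mode='wb') as f:
        f.write(b'\xef\xbb\xbftest')

    plain = read_with(path, 'utf-8')         # BOM kept as U+FEFF
    stripped = read_with(path, 'utf-8-sig')  # BOM dropped

print(repr(plain))     # '\ufefftest'
print(repr(stripped))  # 'test'
```

Using `utf-8-sig` for reading is also safe when the file has no BOM, since the codec only skips the signature if it is there.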


The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Examples:

    #!python2
    #coding: utf8
    u = u'ABC'
    e8 = u.encode('utf-8')        # encode without BOM
    e8s = u.encode('utf-8-sig')   # encode with BOM
    e16 = u.encode('utf-16')      # encode with BOM
    e16le = u.encode('utf-16le')  # encode without BOM
    e16be = u.encode('utf-16be')  # encode without BOM
    print 'utf-8     %r' % e8
    print 'utf-8-sig %r' % e8s
    print 'utf-16    %r' % e16
    print 'utf-16le  %r' % e16le
    print 'utf-16be  %r' % e16be
    print
    print 'utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8')
    print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
    print 'utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16')
    print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')

Note that EF BB BF is a UTF-8-encoded BOM. It is not required for UTF-8, but serves only as a signature (usually on Windows).

Output:

    utf-8     'ABC'
    utf-8-sig '\xef\xbb\xbfABC'
    utf-16    '\xff\xfeA\x00B\x00C\x00'    # Adds BOM and encodes using native processor endian-ness.
    utf-16le  'A\x00B\x00C\x00'
    utf-16be  '\x00A\x00B\x00C'

    utf-8  w/ BOM decoded with utf-8     u'\ufeffABC'    # doesn't remove BOM if present.
    utf-8  w/ BOM decoded with utf-8-sig u'ABC'          # removes BOM if present.
    utf-16 w/ BOM decoded with utf-16    u'ABC'          # *requires* BOM to be present.
    utf-16 w/ BOM decoded with utf-16le  u'\ufeffABC'    # doesn't remove BOM if present.

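Note that the utf-16 codec relies on the BOM to determine byte order. Without one, Python cannot know whether the data is big- or little-endian and falls back to the machine's native byte order, which may be wrong for data produced elsewhere.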
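For readers on Python 3, where `encode` returns `bytes` and `print` is a function, roughly the same experiment can be written as below. This is a sketch rather than a line-for-line port of the script above: the BOM-prefixed UTF-16 data is built explicitly from `codecs.BOM_UTF16_LE` so that the results do not depend on the machine's byte order.

```python
import codecs

u = 'ABC'

e8 = u.encode('utf-8')         # no BOM
e8s = u.encode('utf-8-sig')    # BOM (EF BB BF) prepended
e16le = u.encode('utf-16le')   # no BOM, little-endian
# Build little-endian UTF-16 with an explicit BOM so the bytes are the
# same on every platform (plain 'utf-16' would use native byte order).
e16 = codecs.BOM_UTF16_LE + e16le

print(e8)    # b'ABC'
print(e8s)   # b'\xef\xbb\xbfABC'
print(e16)   # b'\xff\xfeA\x00B\x00C\x00'

print(repr(e8s.decode('utf-8')))      # '\ufeffABC'  BOM kept
print(repr(e8s.decode('utf-8-sig')))  # 'ABC'        BOM removed
print(repr(e16.decode('utf-16')))     # 'ABC'        BOM consumed
print(repr(e16.decode('utf-16le')))   # '\ufeffABC'  BOM kept
```

The pattern is the same as in the Python 2 output: only the codecs that know about the signature (`utf-8-sig`, `utf-16`) remove it, while the byte-order-specific codecs treat it as an ordinary character.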


That character is the BOM or "Byte Order Mark". It is usually received as the first few bytes of a file, telling you how to interpret the encoding of the rest of the data. You can simply remove the character to continue. Although, since the error says you were trying to convert to 'ascii', you should probably pick another encoding for whatever you were trying to do.
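Removing the character by hand can be as small as the sketch below. The helper name `strip_bom` is hypothetical, and it only drops a single leading U+FEFF, leaving any later occurrences alone. The sample string `'\ufefftest'` is just a stand-in for whatever text triggered the error.

```python
BOM = '\ufeff'


def strip_bom(text):
    """Return text without a single leading U+FEFF, if one is present."""
    if text.startswith(BOM):
        return text[len(BOM):]
    return text


raw = '\ufefftest'

# Encoding the BOM to ASCII fails, since U+FEFF is outside the ASCII range.
try:
    raw.encode('ascii')
except UnicodeEncodeError as exc:
    print('ascii failed:', exc.reason)

clean = strip_bom(raw)
print(clean.encode('ascii'))  # b'test'
```

If the text comes from a file or from bytes, decoding with `utf-8-sig` in the first place is usually simpler than stripping afterwards.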
