BOM in server response screws up json parsing


You should probably yell at whoever's running this service, because a BOM on UTF-8 text makes no sense. The BOM exists to disambiguate byte order, and UTF-8 is a byte-oriented encoding, so it has no byte order to disambiguate.
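To make the byte-order point concrete, here is a small Python 3 sketch (an illustration added here, not code from the service in question). It encodes the BOM code point U+FEFF in both UTF-16 byte orders and in UTF-8. The two UTF-16 results differ, which is what lets a reader detect endianness; UTF-8 has only one possible byte sequence.

```python
import codecs

BOM = "\ufeff"  # the byte order mark code point

# UTF-16 code units are two bytes wide, so the order of the BOM's bytes
# tells the reader which endianness the rest of the stream uses.
utf16_le = BOM.encode("utf-16-le")  # b'\xff\xfe'
utf16_be = BOM.encode("utf-16-be")  # b'\xfe\xff'

# UTF-8 is a sequence of single bytes, so the BOM always encodes the same way
# and carries no byte-order information.
utf8 = BOM.encode("utf-8")  # b'\xef\xbb\xbf'

print(utf16_le == codecs.BOM_UTF16_LE)
print(utf16_be == codecs.BOM_UTF16_BE)
print(utf8 == codecs.BOM_UTF8)
```

The three bytes `\xef\xbb\xbf` are the ones that show up at the start of the problematic response.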

That said, ideally you should decode bytes before doing anything else with them. Luckily, Python has a codec that recognizes and removes the BOM: utf-8-sig.

>>> '\xef\xbb\xbffoo'.decode('utf-8-sig')
u'foo'

So you just need:

data = json.loads(response.decode('utf-8-sig'))
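For anyone on Python 3, the same idea can be sketched as follows. The helper name `loads_bom_tolerant` and the sample body are made up for illustration; the only assumption is that the response body is available as `bytes`. The sketch also shows why plain `utf-8` is not enough: the BOM survives decoding as U+FEFF, and `json.loads` rejects text that starts with it.

```python
import json


def loads_bom_tolerant(raw):
    """Decode UTF-8 bytes (with or without a leading BOM) and parse them as JSON."""
    return json.loads(raw.decode("utf-8-sig"))


# Made-up response body that starts with the UTF-8 BOM.
body = b'\xef\xbb\xbf{"status": "ok"}'

# Decoding with plain utf-8 keeps U+FEFF in the text, which json.loads rejects.
try:
    json.loads(body.decode("utf-8"))
    plain_utf8_failed = False
except ValueError:
    plain_utf8_failed = True

data = loads_bom_tolerant(body)
print(plain_utf8_failed, data)
```

Because `utf-8-sig` only strips a BOM when one is present, the helper is safe to use on responses that do not have one.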


In case I'm not the only one who has hit this problem while using the requests module instead of urllib2, here is a solution that works in Python 2.6 as well as 3.3:

import requests

# 'pass' is a reserved word in Python, so the credential is named 'password' here
r = requests.get(url, params=my_dict, auth=(user, password))
print(r.headers['content-type'])  # 'application/json; charset=utf8'
if r.text.startswith(u'\ufeff'):  # bytes \xef\xbb\xbf in utf-8 encoding
    r.encoding = 'utf-8-sig'
print(r.json())


Since I lack enough reputation for a comment, I'll write an answer instead.

I usually encounter that problem when I need to leave the underlying Stream of a StreamWriter open. The overload that has the option to leave the underlying Stream open also requires an encoding (which will be UTF-8 in most cases). Here's how to use it without emitting the BOM.

/* Since Encoding.UTF8 (the one you'd normally use in those cases) **emits**
 * the BOM, use what's below instead! */

// The UTF8Encoding constructor takes a flag that enables / disables the BOM in the output
UTF8Encoding encoding = new UTF8Encoding(false);

using (MemoryStream ms = new MemoryStream())
using (StreamWriter sw = new StreamWriter(ms, encoding, 4096, true))
using (JsonTextWriter jtw = new JsonTextWriter(sw))
{
    serializer.Serialize(jtw, myObject);
}
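If the producing side happens to be written in Python rather than C#, a rough analogue of the same distinction looks like this. It is only a sketch: `write_json` and `payload` are hypothetical names, and it assumes the JSON is written through an `io.TextIOWrapper` over a binary stream that must stay open afterwards. The plain `utf-8` codec writes no BOM, while `utf-8-sig` prepends one.

```python
import io
import json

payload = {"status": "ok"}  # made-up object to serialize


def write_json(stream, obj, encoding):
    """Write obj as JSON text to a binary stream, leaving the stream open."""
    wrapper = io.TextIOWrapper(stream, encoding=encoding)
    json.dump(obj, wrapper)
    wrapper.flush()
    wrapper.detach()  # release the underlying stream without closing it


plain = io.BytesIO()
write_json(plain, payload, "utf-8")  # no BOM in the output

signed = io.BytesIO()
write_json(signed, payload, "utf-8-sig")  # output starts with \xef\xbb\xbf

print(plain.getvalue())
print(signed.getvalue())
```

Calling `detach()` plays the role of the `leaveOpen` argument in the C# snippet: the wrapper gives up the stream instead of closing it.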