How to get string objects instead of Unicode from JSON?
While there are some good answers here, I ended up using PyYAML to parse my JSON files, since it gives the keys and values as str type strings instead of unicode type. Because JSON is a subset of YAML it works nicely:
    >>> import json
    >>> import yaml
    >>> list_org = ['a', 'b']
    >>> list_dump = json.dumps(list_org)
    >>> list_dump
    '["a", "b"]'
    >>> json.loads(list_dump)
    [u'a', u'b']
    >>> yaml.safe_load(list_dump)
    ['a', 'b']
Notes
Some things to note though:
- I get string objects because all my entries are ASCII encoded. If I used entries containing non-ASCII characters, I would get them back as unicode objects; there is no conversion!
- You should (probably always) use PyYAML's safe_load function; if you use it to load JSON files, you don't need the "additional power" of the load function anyway.
- If you want a YAML parser that has more support for the 1.2 version of the spec (and correctly parses very low numbers), try Ruamel YAML: pip install ruamel.yaml and import ruamel.yaml as yaml was all I needed in my tests.
Conversion
As stated, there is no conversion! If you can't be sure to only deal with ASCII values (and you can't be sure most of the time), better use a conversion function:
I have used the one from Mark Amery a couple of times now; it works great and is very easy to use. You can also use a similar function as an object_hook instead, as it might gain you a performance boost on big files. See the slightly more involved answer from Mirec Miskuf for that.
There's no built-in option to make the json module functions return byte strings instead of unicode strings. However, this short and simple recursive function will convert any decoded JSON object from using unicode strings to UTF-8-encoded byte strings:
    def byteify(input):
        if isinstance(input, dict):
            return {byteify(key): byteify(value) for key, value in input.iteritems()}
        elif isinstance(input, list):
            return [byteify(element) for element in input]
        elif isinstance(input, unicode):
            return input.encode('utf-8')
        else:
            return input
Just call this on the output you get from a json.load or json.loads call.
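The function above targets Python 2 (it relies on iteritems and the unicode type). As a rough illustration of the same call pattern, here is a minimal sketch adapted to Python 3, where str plays the role of unicode and bytes the role of the old byte string. The name byteify_py3 and the sample JSON are hypothetical and not part of the original answer.

```python
import json

def byteify_py3(value):
    """Recursively convert every str in a decoded JSON value to UTF-8 bytes.

    Hypothetical Python 3 adaptation of the byteify function above.
    """
    if isinstance(value, dict):
        return {byteify_py3(k): byteify_py3(v) for k, v in value.items()}
    if isinstance(value, list):
        return [byteify_py3(item) for item in value]
    if isinstance(value, str):
        return value.encode('utf-8')
    # Numbers, booleans and None pass through unchanged.
    return value

# Decode first, then convert the whole result in a second pass.
decoded = json.loads('{"name": "caf\\u00e9", "tags": ["a", "b"], "n": 1}')
converted = byteify_py3(decoded)
```

Note that non-ASCII text comes back as its UTF-8 byte sequence, so "café" becomes a five-byte value.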
A couple of notes:
- To support Python 2.6 or earlier, replace return {byteify(key): byteify(value) for key, value in input.iteritems()} with return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()]), since dictionary comprehensions weren't supported until Python 2.7.
- Since this answer recurses through the entire decoded object, it has a couple of undesirable performance characteristics that can be avoided with very careful use of the object_hook or object_pairs_hook parameters. Mirec Miskuf's answer is so far the only one that manages to pull this off correctly, although as a consequence, it's significantly more complicated than my approach.
A solution with object_hook
[edit]: Updated for Python 2.7 and 3.x compatibility.
    import json

    def json_load_byteified(file_handle):
        return _byteify(
            json.load(file_handle, object_hook=_byteify),
            ignore_dicts=True
        )

    def json_loads_byteified(json_text):
        return _byteify(
            json.loads(json_text, object_hook=_byteify),
            ignore_dicts=True
        )

    def _byteify(data, ignore_dicts = False):
        if isinstance(data, str):
            return data

        # if this is a list of values, return list of byteified values
        if isinstance(data, list):
            return [ _byteify(item, ignore_dicts=True) for item in data ]
        # if this is a dictionary, return dictionary of byteified keys and values
        # but only if we haven't already byteified it
        if isinstance(data, dict) and not ignore_dicts:
            return {
                _byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
                for key, value in data.items() # changed to .items() for python 2.7/3
            }

        # python 3 compatible duck-typing
        # if this is a unicode string, return its string representation
        if str(type(data)) == "<type 'unicode'>":
            return data.encode('utf-8')

        # if it's anything else, return it in its original form
        return data
Example usage:
    >>> json_loads_byteified('{"Hello": "World"}')
    {'Hello': 'World'}
    >>> json_loads_byteified('"I am a top-level string"')
    'I am a top-level string'
    >>> json_loads_byteified('7')
    7
    >>> json_loads_byteified('["I am inside a list"]')
    ['I am inside a list']
    >>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
    [[[[[[[['I am inside a big nest of lists']]]]]]]]
    >>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
    {'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
    >>> json_load_byteified(open('somefile.json'))
    {'more json': 'from a file'}
How does this work and why would I use it?
Mark Amery's function is shorter and clearer than these ones, so what's the point of them? Why would you want to use them?
Purely for performance. Mark's answer decodes the JSON text fully first with unicode strings, then recurses through the entire decoded value to convert all strings to byte strings. This has a couple of undesirable effects:
- A copy of the entire decoded structure gets created in memory
- If your JSON object is really deeply nested (500 levels or more) then you'll hit Python's maximum recursion depth
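To make the recursion-depth point concrete, here is a small Python 3 sketch. It assumes a hypothetical post-processing function, convert, shaped like the recursive approach described above, and builds the nested value by hand rather than parsing it, so that only the post-processing step is exercised. The nesting depth is chosen relative to the interpreter's current limit instead of a fixed number.

```python
import sys

def convert(value):
    """Hypothetical recursive post-processor: one call per nesting level."""
    if isinstance(value, dict):
        return {convert(k): convert(v) for k, v in value.items()}
    if isinstance(value, list):
        return [convert(item) for item in value]
    if isinstance(value, str):
        return value.encode('utf-8')
    return value

# Build a list nested deeper than the interpreter's recursion limit.
depth = sys.getrecursionlimit() * 2
nested = "leaf"
for _ in range(depth):
    nested = [nested]

try:
    convert(nested)
    hit_limit = False
except RecursionError:
    # The recursive walk runs out of stack before reaching the innermost value.
    hit_limit = True
```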
This answer mitigates both of those performance issues by using the object_hook parameter of json.load and json.loads. From the docs:
object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders
Since dictionaries nested many levels deep in other dictionaries get passed to object_hook as they're decoded, we can byteify any strings or lists inside them at that point and avoid the need for deep recursion later.
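The order in which object_hook sees nested dictionaries can be checked with a small sketch. The hook below, record_hook, is a hypothetical helper that only records the keys of each dict it receives and returns the dict unchanged; it shows that inner dictionaries are handed to the hook before the dictionaries that contain them.

```python
import json

seen = []

def record_hook(obj):
    # Called once for every JSON object, as soon as it has been fully decoded.
    seen.append(sorted(obj.keys()))
    return obj

result = json.loads('{"outer": {"inner": {"leaf": 1}}}', object_hook=record_hook)
# seen is now [['leaf'], ['inner'], ['outer']]: innermost dict first.
```

Because each dict arrives already containing the hook's return values for its children, a hook that converts its own keys and values never needs to descend into nested dicts again.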
Mark's answer isn't suitable for use as an object_hook as it stands, because it recurses into nested dictionaries. We prevent that recursion in this answer with the ignore_dicts parameter to _byteify, which gets passed to it at all times except when object_hook passes it a new dict to byteify. The ignore_dicts flag tells _byteify to ignore dicts since they have already been byteified.
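In Python 3 every decoded string is already a str, so the str check at the top of _byteify returns it untouched. To see the ignore_dicts mechanism actually doing work there, the following is a minimal sketch that converts str to UTF-8 bytes with the same structure. The names _to_bytes and loads_as_bytes are hypothetical and this is an adaptation for illustration, not the answer's own code.

```python
import json

def _to_bytes(data, ignore_dicts=False):
    if isinstance(data, str):
        return data.encode('utf-8')
    if isinstance(data, list):
        # Dicts inside the list were already converted by the hook.
        return [_to_bytes(item, ignore_dicts=True) for item in data]
    if isinstance(data, dict) and not ignore_dicts:
        # Only reached when called as object_hook on a freshly decoded dict.
        return {
            _to_bytes(key, ignore_dicts=True): _to_bytes(value, ignore_dicts=True)
            for key, value in data.items()
        }
    # Already-converted dicts, bytes, numbers, booleans and None pass through.
    return data

def loads_as_bytes(json_text):
    # The outer call handles top-level strings and lists.
    return _to_bytes(json.loads(json_text, object_hook=_to_bytes), ignore_dicts=True)

converted = loads_as_bytes('{"foo": "bar", "things": [7, {"qux": {"cow": ["milk"]}}]}')
```

Each dict is converted exactly once, at the moment the decoder hands it to the hook; later calls see it with ignore_dicts=True and leave it alone.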
Finally, our implementations of json_load_byteified and json_loads_byteified call _byteify (with ignore_dicts=True) on the result returned from json.load or json.loads to handle the case where the JSON text being decoded doesn't have a dict at the top level.
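The need for that final call follows from the fact that object_hook only fires for JSON objects. A short sketch, using a hypothetical counting hook named counting_hook, shows that a top-level string or a list of strings never reaches the hook at all:

```python
import json

calls = []

def counting_hook(obj):
    # Record every dict the decoder passes in, then return it unchanged.
    calls.append(obj)
    return obj

top_string = json.loads('"I am a top-level string"', object_hook=counting_hook)
top_list = json.loads('["a", "b"]', object_hook=counting_hook)
calls_without_dicts = len(calls)  # still 0: no object literal was decoded

json.loads('[{"a": 1}]', object_hook=counting_hook)
calls_with_dict = len(calls)  # 1: the single dict inside the list
```

Without the extra pass over the returned value, the strings in the first two inputs would come back unconverted.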