Facebook JSON badly encoded


I can indeed confirm that the Facebook download data is incorrectly encoded; it is Mojibake. The original data is UTF-8 encoded but was decoded as Latin-1 instead, and each resulting character was then written out as a \u00hh JSON escape. I’ll make sure to file a bug report.
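
To make the failure mode concrete, here is a minimal sketch that reproduces the damage on one name and then undoes it. The variable names are mine, and the use of `json.dumps` to produce the escapes is only an assumption about what the export effectively does, not a description of Facebook's actual code:

```python
import json

original = 'Radosław'

# Encode correctly as UTF-8, then wrongly decode those bytes as Latin-1.
utf8_bytes = original.encode('utf-8')        # 'ł' becomes the two bytes C5 82
mojibake = utf8_bytes.decode('latin-1')      # two characters: U+00C5, U+0082

# Serialising the damaged string as ASCII-only JSON yields \u00hh escapes.
as_json = json.dumps(mojibake)

# Reversing the wrong step restores the original text.
repaired = json.loads(as_json).encode('latin-1').decode('utf-8')
```

Every byte of the original UTF-8 ends up as its own \u00hh escape, which is why both repairs below work by turning those escapes back into bytes.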

In the meantime, you can repair the damage in two ways:

  1. Decode the data as JSON, then re-encode any strings as Latin-1 and decode them again as UTF-8:

    >>> import json
    >>> data = r'"Rados\u00c5\u0082aw"'
    >>> json.loads(data).encode('latin1').decode('utf8')
    'Radosław'
  2. Load the data as binary, replace all \u00hh sequences with the byte the last two hex digits represent, decode as UTF-8 and then decode as JSON:

    import json
    import os
    import re
    from functools import partial

    # Replace each \u00hh escape with the single byte that hh represents.
    fix_mojibake_escapes = partial(
        re.compile(rb'\\u00([\da-f]{2})').sub,
        lambda m: bytes.fromhex(m.group(1).decode()))

    # subdir and file are the variables from your own code.
    with open(os.path.join(subdir, file), 'rb') as binary_data:
        repaired = fix_mojibake_escapes(binary_data.read())
    data = json.loads(repaired.decode('utf8'))

    From your sample data this produces:

    {'content': 'No to trzeba ostatnie treningi zrobić xD', 'sender_name': 'Radosław', 'timestamp': 1524558089, 'type': 'Generic'}
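
Option 1 says to re-encode "any strings", which for a whole export means walking the decoded structure. A minimal sketch of such a walker follows; the function name `fix_mojibake` and the `reactions` field in the sample are hypothetical, chosen only to show a nested list:

```python
import json

def fix_mojibake(value):
    """Recursively repair every string in a decoded JSON value."""
    if isinstance(value, str):
        return value.encode('latin-1').decode('utf-8')
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        # Keys are strings too, so they get the same treatment.
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value  # numbers, booleans and None pass through unchanged

# Hypothetical sample; "reactions" is only there to exercise the list branch.
raw = (r'{"sender_name": "Rados\u00c5\u0082aw", "timestamp": 1524558089, '
       r'"reactions": ["\u00f0\u009f\u0098\u0080"]}')
message = fix_mojibake(json.loads(raw))
```

This assumes every string in the file is damaged in the same way. A string containing a character above U+00FF would raise `UnicodeEncodeError` in the `encode('latin-1')` step.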


Here is a command-line solution with jq and iconv, tested on Linux:

    cat message_1.json | jq . | iconv -f utf8 -t latin1 > m1.json
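
For anyone without jq, here is roughly what that pipeline does, sketched in Python. The function name `repair_export` is mine, the demo writes a throwaway file in a temporary directory, and the output formatting will not match jq's byte for byte:

```python
import json
import os
import tempfile

def repair_export(src_path, dst_path):
    # Like `jq .`: parse the JSON (turning \u00hh escapes into characters)
    # and pretty-print it without re-escaping non-ASCII characters.
    with open(src_path, encoding='utf-8') as src:
        text = json.dumps(json.load(src), ensure_ascii=False, indent=2)
    # Like `iconv -f utf8 -t latin1`: write each character as one byte,
    # which recreates the original UTF-8 byte sequence.
    with open(dst_path, 'wb') as dst:
        dst.write(text.encode('latin-1'))

# Demo on a temporary copy of a one-field export.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'message_1.json')
    dst_path = os.path.join(tmp, 'm1.json')
    with open(src_path, 'w', encoding='utf-8') as f:
        f.write(r'{"sender_name": "Rados\u00c5\u0082aw"}')
    repair_export(src_path, dst_path)
    with open(dst_path, encoding='utf-8') as f:
        repaired = json.load(f)
```

Like the iconv step, the `encode('latin-1')` call fails if the file contains any character outside Latin-1.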


My solution for parsing objects uses the object_hook callback of the load/loads functions:

    import json

    def parse_obj(dct):
        # Called for every decoded JSON object; repairs its string values.
        for key in dct:
            dct[key] = dct[key].encode('latin_1').decode('utf-8')
        return dct

    data = '{"msg": "Ahoj sv\u00c4\u009bte"}'

    # String
    json.loads(data)  # Out: {'msg': 'Ahoj svÄ\x9bte'}
    json.loads(data, object_hook=parse_obj)  # Out: {'msg': 'Ahoj světe'}

    # File
    with open('/path/to/file.json') as f:
        json.load(f, object_hook=parse_obj)  # Out: {'msg': 'Ahoj světe'}

Update:

The solution above does not work when a value is a list of strings, and it raises an error on non-string values such as numbers. Here is an updated solution:

    import json

    def parse_obj(obj):
        for key in obj:
            if isinstance(obj[key], str):
                obj[key] = obj[key].encode('latin_1').decode('utf-8')
            elif isinstance(obj[key], list):
                # Repair strings inside lists; leave other items untouched.
                obj[key] = [
                    x.encode('latin_1').decode('utf-8') if isinstance(x, str) else x
                    for x in obj[key]
                ]
        return obj