Python 3 UnicodeDecodeError - How do I debug UnicodeDecodeError?


You have a corrupted data file. If that character really is meant to be a U+00AD SOFT HYPHEN, then you are missing a 0xC2 byte:

    >>> '\u00ad'.encode('utf8')
    b'\xc2\xad'

Of all the possible UTF-8 encodings that end in 0xAD, a soft hyphen does make the most sense. However, it is indicative of a data set that may have other bytes missing. You just happened to have hit one that matters.

I'd go back to the source of this dataset and verify that the file was not corrupted when downloaded. Otherwise, using errors='replace' is a viable workaround, provided no delimiters (tabs, newlines, etc.) are missing.
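As a rough illustration of that workaround, the sketch below decodes a short byte string modelled on the failing record. The sample bytes are made up for the demonstration, not read from the real file.

```python
# Illustrative bytes only: 0xAD on its own is not a valid UTF-8 start byte.
raw = b'NON\xadCASH INVESTING\tnext field\n'

try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    # exc.start is the offset of the offending byte within raw.
    print(exc.reason, 'at offset', exc.start)

# errors='replace' substitutes U+FFFD for each undecodable byte.
text = raw.decode('utf-8', errors='replace')
print(repr(text))

# The tab and newline delimiters are untouched, so the record still splits.
fields = text.rstrip('\n').split('\t')
print(fields)
```

The same errors='replace' argument can be passed to open(). Note that the replacement character hides what the original byte was, so this only makes sense when the damaged text itself does not matter.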

Another possibility is that the SEC is really using a different encoding for the file; for example, in Windows Codepage 1252 and Latin-1, 0xAD is the correct encoding of a soft hyphen. And indeed, when I download the same dataset directly (warning, large ZIP file linked) and open tag.txt, I can't decode the data as UTF-8:

    >>> open('/tmp/2017q1/tag.txt', encoding='utf8').read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../lib/python3.6/codecs.py", line 321, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte
    >>> from pprint import pprint
    >>> f = open('/tmp/2017q1/tag.txt', 'rb')
    >>> f.seek(3583550)
    3583550
    >>> pprint(f.read(100))
    (b'1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING AND FINANCING A'
     b'CTIVITIES:\t\nProceedsFromSaleOfIn')
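The seek-and-read steps above can be wrapped in a small helper that takes the offset from the exception itself. This is a minimal sketch; the function name show_decode_error and the context parameter are my own inventions, and the sample bytes are copied from the output above rather than read from disk.

```python
def show_decode_error(data, encoding='utf-8', context=20):
    """Return (offset, surrounding bytes) for the first undecodable byte.

    Returns None when data decodes cleanly with the given encoding.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the offending bytes within data.
        start = max(exc.start - context, 0)
        return exc.start, data[start:exc.end + context]
    return None


sample = b'1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING'
result = show_decode_error(sample)
print(result)
```

For a real file you would pass in the bytes from open(path, 'rb').read(); that loads the whole file into memory, which is acceptable for a one-off debugging session.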

There are two such non-ASCII characters in the file:

    >>> f.seek(0)
    0
    >>> pprint([l for l in f if any(b > 127 for b in l)])
    [b'SupplementalDisclosureOfNoncashInvestingAndFinancingActivitiesAbstract\t0'
     b'001654954-17-000551\t1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH I'
     b'NVESTING AND FINANCING ACTIVITIES:\t\n',
     b'HotelKranichhheMember\t0001558370-17-001446\t1\t0\tmember\tD\t\tHotel Krani'
     b'chhhe [Member]\tRepresents information pertaining to Hotel Kranichh\xf6h'
     b'e.\n']

Hotel Kranichh\xf6he decoded as Latin-1 is Hotel Kranichhöhe.
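A quick way to check the single-byte-encoding theory is to decode both fragments with each candidate codec. The byte strings below are short excerpts typed in from the output above, not the full records.

```python
soft_hyphen = b'NON\xadCASH'
hotel = b'Hotel Kranichh\xf6he'

# Latin-1 and Windows-1252 agree on both 0xAD and 0xF6.
for codec in ('latin-1', 'cp1252'):
    print(codec, repr(soft_hyphen.decode(codec)), hotel.decode(codec))

# UTF-8 rejects the same bytes, which matches the traceback in the question.
try:
    hotel.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print('valid UTF-8:', utf8_ok)
```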

There are also several 0x1C / 0x1D pairs in the file:

    >>> f.seek(0)
    0
    >>> quotes = [l for l in f if any(b in {0x1C, 0x1D} for b in l)]
    >>> quotes[0].split(b'\t')[-1][50:130]
    b'Temporary Payroll Tax Cut Continuation Act of 2011 (\x1cTCCA\x1d) recognized during th'
    >>> quotes[1].split(b'\t')[-1][50:130]
    b'ributory defined benefit pension plan (the \x1cAetna Pension Plan\x1d) to allow certai'

I'm betting those are really U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK characters; note the 1C and 1D parts. It almost feels as if their encoder took UTF-16 and stripped out all the high bytes, rather than encode to UTF-8 properly!

There is no codec shipping with Python that would encode '\u201C\u201D' to b'\x1C\x1D', making it all the more likely that the SEC has botched their encoding process somewhere. In fact, there are also 0x13 and 0x14 characters that are probably en and em dashes (U+2013 and U+2014), as well as 0x19 bytes that are almost certainly single quotes (U+2019). All that is missing to complete the picture is a 0x18 byte to represent U+2018.
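To see why the "stripped high bytes" guess fits the evidence, here is a small sketch that simulates it. The helper strip_high_bytes is hypothetical: it models what I suspect happened, not anything the SEC has documented.

```python
def strip_high_bytes(text):
    """Simulate the suspected bug: keep only the low byte of each UTF-16 unit."""
    encoded = text.encode('utf-16-be')  # two bytes per BMP character: high, low
    return encoded[1::2]                # every second byte is the low byte


# Curly double quotes collapse to the 0x1C / 0x1D control characters.
print(strip_high_bytes('(\u201cTCCA\u201d)'))

# En dash, em dash and right single quote collapse to 0x13, 0x14, 0x19.
print(strip_high_bytes('\u2013\u2014\u2019'))

# Characters below U+0100 come out identical to their Latin-1 encoding.
print(strip_high_bytes('Kranichh\u00f6he') == 'Kranichh\u00f6he'.encode('latin-1'))
```

This reproduces every oddity seen so far, including the bare 0xAD and 0xF6 bytes, which is consistent with the hypothesis but does not prove it.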

If we assume that the encoding is broken, we can attempt a repair. The following code fixes the quote and dash bytes in a line, assuming that the rest of the data does not use characters outside of Latin-1:

    _map = {
        # dashes
        0x13: '\u2013', 0x14: '\u2014',
        # single quotes
        0x18: '\u2018', 0x19: '\u2019',
        # double quotes
        0x1c: '\u201c', 0x1d: '\u201d',
    }

    def repair(line, _map=_map):
        """Repair mis-encoded SEC data. Assumes line was decoded as Latin-1"""
        return line.translate(_map)

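Then apply that to the lines you read, inside your generator function:

    def tags(filename):
        """Yield Tag instances from tag.txt, repairing each line first."""
        with open(filename, 'r', encoding='latin-1') as f:
            repaired = map(repair, f)
            fields = next(repaired).strip().split('\t')
            for line in repaired:
                yield process_tag_record(fields, line)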
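To try the repair without the full dataset, you can run it on a hand-made line. The mapping and repair() are repeated here only so the snippet runs on its own; the sample bytes are stitched together from the fragments shown earlier and are not a real record.

```python
# Same mapping as in the answer, repeated so this snippet is self-contained.
_map = {
    0x13: '\u2013', 0x14: '\u2014',   # dashes
    0x18: '\u2018', 0x19: '\u2019',   # single quotes
    0x1c: '\u201c', 0x1d: '\u201d',   # double quotes
}


def repair(line, _map=_map):
    """Replace stray control characters with the punctuation they stand for."""
    return line.translate(_map)


# Illustrative bytes: control-character quotes plus a Latin-1 o-umlaut.
raw = b'Act of 2011 (\x1cTCCA\x1d) pertaining to Hotel Kranichh\xf6he.\n'

# Decode as Latin-1 first (this never fails), then translate the control codes.
fixed = repair(raw.decode('latin-1'))
print(fixed)
```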

Separately, addressing your posted code, you are making Python work harder than it needs to. Don't use codecs.open(); that's legacy code that has known issues and is slower than the newer Python 3 I/O layer. Just use open(). Do not use f.readlines(); you don't need to read the whole file into a list here. Just iterate over the file directly:

    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with open(filename, 'r', encoding='utf-8', errors='strict') as f:
            fields = next(f).strip().split('\t')
            for line in f:
                yield process_tag_record(fields, line)

If process_tag_record also splits on tabs, use a csv.reader() object and avoid splitting each row manually:

    import csv

    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with open(filename, 'r', encoding='utf-8', errors='strict') as f:
            reader = csv.reader(f, delimiter='\t')
            fields = next(reader)
            for row in reader:
                yield process_tag_record(fields, row)

If process_tag_record combines the fields list with the values in row to form a dictionary, just use csv.DictReader() instead:

    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with open(filename, 'r', encoding='utf-8', errors='strict') as f:
            reader = csv.DictReader(f, delimiter='\t')
            # first row is used as keys for the dictionary, no need to read fields manually.
            yield from reader
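For a feel of what csv.DictReader() yields, here is a minimal sketch run against an in-memory file. The column names and values are placeholders I made up for the demonstration; the real tag.txt header will differ.

```python
import csv
import io

# A two-line, tab-separated stand-in for the real file (hypothetical columns).
sample = io.StringIO(
    'tag\tversion\tcustom\n'
    'AccountsPayable\tus-gaap/2016\t0\n'
)

# The header row becomes the dictionary keys; every value is a string.
rows = list(csv.DictReader(sample, delimiter='\t'))
print(rows[0]['tag'], rows[0]['version'], rows[0]['custom'])
```

Every value arrives as a string, so any numeric or boolean conversion still has to happen in your own code.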