Reading utf-8 characters from a gzip file in python Reading utf-8 characters from a gzip file in python python python

Reading utf-8 characters from a gzip file in python


This is possible since Python 3.3:

import gzipgzip.open('file.gz', 'rt', encoding='utf-8')

Notice that gzip.open() requires you to explicitly specify text mode ('t').


I don't see why this should be so hard.

What are you doing exactly? Please explain "eventually it reads an invalid character".

It should be as simple as:

import gzipfp = gzip.open('foo.gz')contents = fp.read() # contents now has the uncompressed bytes of foo.gzfp.close()u_str = contents.decode('utf-8') # u_str is now a unicode string

EDITED

This answer works for Python2 in Python3, please see @SeppoEnarvi 's answer at https://stackoverflow.com/a/19794943/610569 (it uses the rt mode for gzip.open.


Maybe

import codecszf = gzip.open(fname, 'rb')reader = codecs.getreader("utf-8")contents = reader( zf )for line in contents:    pass