
Python: Creating a streaming gzip'd file-like?


It's quite kludgy (it is self-referencing, for one; I only spent a few minutes writing it, so it is nothing elegant), but it does what you want if you're still interested in using gzip instead of zlib directly.

Basically, GzipWrap is a (very limited) file-like object that produces a gzipped stream out of a given iterable (e.g., a file-like object, a list of strings, any generator).

Since the output is binary, there was no point in implementing "readline".

You should be able to extend it to cover other cases, or to be used as an iterable object itself.

    from gzip import GzipFile

    class GzipWrap(object):
        # input is an iterable of byte strings (e.g. a file opened in binary mode)
        def __init__(self, input, filename=None):
            self.input = input
            # Iterate only once, even if input is a list.
            self.iterator = iter(input)
            self.buffer = b''
            # GzipFile writes its compressed output back into this object.
            self.zipper = GzipFile(filename, mode='wb', fileobj=self)

        def read(self, size=-1):
            if size < 0 or len(self.buffer) < size:
                for s in self.iterator:
                    self.zipper.write(s)
                    if size > 0 and len(self.buffer) >= size:
                        self.zipper.flush()
                        break
                else:
                    # Input exhausted: finish the gzip stream.
                    self.zipper.close()
            # Hand out what was asked for, whichever branch ran above.
            if size < 0:
                ret, self.buffer = self.buffer, b''
            else:
                ret, self.buffer = self.buffer[:size], self.buffer[size:]
            return ret

        def flush(self):
            pass

        def write(self, data):
            # Called by GzipFile with compressed data.
            self.buffer += data

        def close(self):
            self.input.close()
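The answer above mentions using zlib directly and turning the compressor into an iterable. A minimal sketch of that idea, assuming Python 3 and an iterable of bytes objects, could look like this. The name `gzip_chunks` and its `level` parameter are made up for illustration and are not part of the original answer.

```python
import gzip
import zlib


def gzip_chunks(iterable, level=6):
    """Yield gzip-compressed chunks for an iterable of bytes objects."""
    # wbits=31 (16 + 15) asks zlib to emit a gzip header and trailer.
    compressor = zlib.compressobj(level, zlib.DEFLATED, 31)
    for piece in iterable:
        out = compressor.compress(piece)
        if out:  # zlib buffers internally, so the output may be empty
            yield out
    # Emit whatever is still buffered, plus the gzip trailer.
    yield compressor.flush()


lines = [b"first line\n", b"second line\n"]
compressed = b"".join(gzip_chunks(lines))
# Round trip: decompressing gives back the concatenated input.
assert gzip.decompress(compressed) == b"".join(lines)
```

Because this is a generator, nothing is compressed until a consumer asks for the next chunk, so memory use stays bounded by the size of one input piece plus zlib's internal buffers.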


Here is a cleaner, non-self-referencing version based on Ricardo Cárdenes' very helpful answer.

    from gzip import GzipFile
    from collections import deque

    CHUNK = 16 * 1024

    class Buffer(object):
        def __init__(self):
            self.__buf = deque()
            self.__size = 0

        def __len__(self):
            return self.__size

        def write(self, data):
            self.__buf.append(data)
            self.__size += len(data)

        def read(self, size=-1):
            if size < 0:
                size = self.__size
            ret_list = []
            while size > 0 and len(self.__buf):
                s = self.__buf.popleft()
                size -= len(s)
                ret_list.append(s)
            if size < 0:
                # Took too much: push the excess back onto the queue.
                ret_list[-1], remainder = ret_list[-1][:size], ret_list[-1][size:]
                self.__buf.appendleft(remainder)
            # b'' keeps this working with bytes on both Python 2 and 3.
            ret = b''.join(ret_list)
            self.__size -= len(ret)
            return ret

        def flush(self):
            pass

        def close(self):
            pass

    class GzipCompressReadStream(object):
        def __init__(self, fileobj):
            self.__input = fileobj
            self.__buf = Buffer()
            self.__gzip = GzipFile(None, mode='wb', fileobj=self.__buf)

        def read(self, size=-1):
            # Feed input to the compressor until enough output is buffered.
            while size < 0 or len(self.__buf) < size:
                s = self.__input.read(CHUNK)
                if not s:
                    self.__gzip.close()
                    break
                self.__gzip.write(s)
            return self.__buf.read(size)

Advantages:

  • Avoids repeated string concatenation, which would cause the entire string to be copied repeatedly.
  • Reads a fixed CHUNK size from the input stream, instead of reading whole lines at a time (which can be arbitrarily long).
  • Avoids circular references.
  • Avoids exposing a misleading public "write" method on the stream object (as GzipWrap does); that method is really only used internally by GzipFile.
  • Takes advantage of name mangling for internal member variables.
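For comparison with the advantages listed above, here is one possible Python 3 variation that has no `write` method at all. It is only a sketch under a few assumptions: it swaps `GzipFile` for `zlib.compressobj` with `wbits=31`, and the class name `GzipReadStream` and its `level` parameter are hypothetical. Subclassing `io.RawIOBase` means only `readinto` has to be written; `read(size)` comes from the base class.

```python
import gzip
import io
import zlib

CHUNK = 16 * 1024


class GzipReadStream(io.RawIOBase):
    """Hypothetical read-only stream that gzips `fileobj` on the fly."""

    def __init__(self, fileobj, level=6):
        super().__init__()
        self._input = fileobj
        # wbits=31 asks zlib to emit a gzip header and trailer.
        self._compressor = zlib.compressobj(level, zlib.DEFLATED, 31)
        self._pending = b""
        self._finished = False

    def readable(self):
        return True

    def readinto(self, target):
        # Compress input chunks until some output exists or input ends.
        while not self._pending and not self._finished:
            chunk = self._input.read(CHUNK)
            if chunk:
                self._pending = self._compressor.compress(chunk)
            else:
                self._pending = self._compressor.flush()
                self._finished = True
        n = min(len(target), len(self._pending))
        target[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n  # 0 signals end of stream


data = b"some text to compress\n" * 5000
stream = io.BufferedReader(GzipReadStream(io.BytesIO(data)))
assert gzip.decompress(stream.read()) == data
```

Wrapping the raw stream in `io.BufferedReader` is optional; it is used here so that `read()` with no argument collects everything in one call.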


The gzip module supports compressing to a file-like object: pass a fileobj parameter to GzipFile, as well as a filename. The filename you pass in doesn't need to exist on disk; it is only used to fill in the filename field of the gzip header.
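As a small illustration of the fileobj parameter, the following sketch (Python 3) compresses into an in-memory `io.BytesIO` object. The name "notes.txt" is a made-up example and no such file needs to exist.

```python
import gzip
import io

buf = io.BytesIO()  # stands in for any object with a write() method
# "notes.txt" is a hypothetical name; it is only stored in the gzip header.
with gzip.GzipFile(filename="notes.txt", mode="wb", fileobj=buf) as gz:
    gz.write(b"hello, gzip\n")

raw = buf.getvalue()
assert raw[:2] == b"\x1f\x8b"      # gzip magic number
assert b"notes.txt\x00" in raw     # the name recorded in the header
assert gzip.decompress(raw) == b"hello, gzip\n"
```

Closing the `GzipFile` finishes the gzip stream but leaves `buf` open, which is why `buf.getvalue()` still works afterwards.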

Update

This answer does not work when the file object is not seekable, such as a pipe (tested on Python 2.7). Example:

    # tmp/try-gzip.py

    import sys
    import gzip

    fd = gzip.GzipFile(fileobj=sys.stdin)
    sys.stdout.write(fd.read())

output:

    ===> cat .bash_history | python tmp/try-gzip.py > tmp/history.gzip
    Traceback (most recent call last):
      File "tmp/try-gzip.py", line 7, in <module>
        sys.stdout.write(fd.read())
      File "/usr/lib/python2.7/gzip.py", line 254, in read
        self._read(readsize)
      File "/usr/lib/python2.7/gzip.py", line 288, in _read
        pos = self.fileobj.tell()   # Save current position
    IOError: [Errno 29] Illegal seek
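The traceback above comes from GzipFile's read (decompress) path, which is what `GzipFile(fileobj=sys.stdin)` selects. To compress a pipe instead, one option is to wrap the output side in write mode. The sketch below assumes Python 3; `gzip_copy` is a hypothetical helper, and the `io.BytesIO` objects stand in for `sys.stdin.buffer` and `sys.stdout.buffer`. As far as I can tell, the write path only calls `write()` and `flush()` on the destination, so it should not need `tell()` or `seek()`.

```python
import gzip
import io
import shutil


def gzip_copy(src, dst, chunk_size=16 * 1024):
    """Compress everything readable from src into dst, chunk by chunk."""
    # mode="wb" selects the compressing path of GzipFile.
    with gzip.GzipFile(fileobj=dst, mode="wb") as gz:
        shutil.copyfileobj(src, gz, chunk_size)


# Stand-ins for sys.stdin.buffer and sys.stdout.buffer.
source = io.BytesIO(b"ls -l\ncd /tmp\n")
sink = io.BytesIO()
gzip_copy(source, sink)
assert gzip.decompress(sink.getvalue()) == source.getvalue()
```

In a script, the call would presumably be `gzip_copy(sys.stdin.buffer, sys.stdout.buffer)`, which reads and writes one chunk at a time rather than loading the whole input.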