How to read from a zip file within a zip file in Python?
When you use the .open() call on a ZipFile instance you do get an open file handle. However, to read a zip file, the ZipFile class needs a little more: it needs to be able to seek on that file, and the object returned by .open() is not seekable in your case. Only Python 3.7 and up produces a ZipExtFile object that supports seeking (provided the underlying file handle for the outer zip file is seekable, and nothing is trying to write to the ZipFile object).
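To see which case applies on a given interpreter, the handle can be asked directly via its seekable() method. The following is a minimal sketch, not a definitive implementation: build_nested_zip and list_inner_names are hypothetical helper names, and the archive is built in memory purely for illustration.

```python
import io
import zipfile


def build_nested_zip():
    """Return the bytes of an outer zip that contains inner.zip (with hello.txt)."""
    inner_buf = io.BytesIO()
    with zipfile.ZipFile(inner_buf, "w") as inner:
        inner.writestr("hello.txt", "hello world")
    outer_buf = io.BytesIO()
    with zipfile.ZipFile(outer_buf, "w") as outer:
        outer.writestr("inner.zip", inner_buf.getvalue())
    return outer_buf.getvalue()


def list_inner_names(outer_bytes, inner_name="inner.zip"):
    """List the members of a nested zip, seeking directly when possible."""
    with zipfile.ZipFile(io.BytesIO(outer_bytes)) as outer:
        with outer.open(inner_name) as handle:
            if handle.seekable():
                # The entry handle supports seeking, so ZipFile can use it as-is.
                with zipfile.ZipFile(handle) as inner:
                    return inner.namelist()
            # Not seekable: fall back to a seekable in-memory copy.
            with zipfile.ZipFile(io.BytesIO(handle.read())) as inner:
                return inner.namelist()


print(list_inner_names(build_nested_zip()))  # ['hello.txt']
```

Checking seekable() rather than the Python version keeps the fallback correct even when the outer file handle itself cannot seek.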
The workaround is to read the whole zip entry into memory using .read(), store it in a BytesIO object (an in-memory file that is seekable) and feed that to ZipFile:
    from io import BytesIO

    # ...
    zfiledata = BytesIO(zfile.read(name))
    with zipfile.ZipFile(zfiledata) as zfile2:
        ...  # work with the inner archive here
or, in the context of your example:
    import logging
    import re
    import zipfile
    from io import BytesIO

    with zipfile.ZipFile("parent.zip", "r") as zfile:
        for name in zfile.namelist():
            if re.search(r'\.zip$', name) is not None:
                # We have a zip within a zip: copy it into a seekable buffer
                zfiledata = BytesIO(zfile.read(name))
                with zipfile.ZipFile(zfiledata) as zfile2:
                    for name2 in zfile2.namelist():
                        # Now we can extract
                        logging.info("Found internal file: " + name2)
                        print("Processing code goes here")
To get this to work with Python 3.3 (under Windows, though that may be irrelevant), I had to do:
    import io
    import re
    import zipfile

    # `file` is the path to the outer zip archive
    with zipfile.ZipFile(file, 'r') as zfile:
        for name in zfile.namelist():
            if re.search(r'\.zip$', name) is not None:
                zfiledata = io.BytesIO(zfile.read(name))
                with zipfile.ZipFile(zfiledata) as zfile2:
                    for name2 in zfile2.namelist():
                        print(name2)
cStringIO does not exist in Python 3, so I used io.BytesIO.
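The same BytesIO pattern extends to any nesting depth by recursing whenever a member name ends in .zip. Here is a minimal sketch under that assumption; walk_zip and _make_zip are hypothetical names introduced only for this example, and every nested archive is held fully in memory.

```python
import io
import zipfile


def walk_zip(source, prefix=""):
    """Yield (path, data) for every regular file, descending into nested zips.

    `source` is a filename or a seekable binary file object.
    """
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            data = archive.read(info)
            path = prefix + info.filename
            if info.filename.lower().endswith(".zip"):
                # Nested archive: wrap its bytes in a seekable buffer and recurse.
                yield from walk_zip(io.BytesIO(data), prefix=path + "/")
            else:
                yield path, data


def _make_zip(members):
    """Build a zip in memory from a {name: bytes} mapping (demo fixture only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buf.getvalue()


inner = _make_zip({"hi.txt": b"hello world\n"})
outer = _make_zip({"readme.txt": b"top level\n", "wrap1.zip": inner})
found = dict(walk_zip(io.BytesIO(outer)))
print(sorted(found))  # ['readme.txt', 'wrap1.zip/hi.txt']
```

Because each inner archive is read in full before being opened, this trades memory for simplicity; very large nested archives may call for a different approach.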
Here's a function I came up with. (Copied from here.)
    import os
    import zipfile
    from io import BytesIO


    def extract_nested_zipfile(path, parent_zip=None):
        """Returns a ZipFile specified by path, even if the path contains
        intermediary ZipFiles.  For example, /root/gparent.zip/parent.zip/child.zip
        will return a ZipFile that represents child.zip
        """

        def extract_inner_zipfile(parent_zip, child_zip_path):
            """Returns a ZipFile specified by child_zip_path that exists inside
            parent_zip.
            """
            # Zip data is binary, so buffer it in BytesIO (StringIO fails on Python 3)
            memory_zip = BytesIO(parent_zip.read(child_zip_path))
            return zipfile.ZipFile(memory_zip)

        if ('.zip' + os.sep) in path:
            (parent_zip_path, child_zip_path) = os.path.relpath(path).split(
                '.zip' + os.sep, 1)
            parent_zip_path += '.zip'

            if not parent_zip:
                # This is the top-level, so read from disk
                parent_zip = zipfile.ZipFile(parent_zip_path)
            else:
                # We're already in a zip, so pull it out and recurse
                parent_zip = extract_inner_zipfile(parent_zip, parent_zip_path)

            return extract_nested_zipfile(child_zip_path, parent_zip)
        else:
            if parent_zip:
                return extract_inner_zipfile(parent_zip, path)
            else:
                # If there is no nesting, it's easy!
                return zipfile.ZipFile(path)
Here's how I tested it:
    echo hello world > hi.txt
    zip wrap1.zip hi.txt
    zip wrap2.zip wrap1.zip
    zip wrap3.zip wrap2.zip

    print(extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap1.zip').open('hi.txt').read())
    print(extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap2.zip/wrap1.zip').open('hi.txt').read())
    print(extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap3.zip/wrap2.zip/wrap1.zip').open('hi.txt').read())
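If the zip command-line tool is not available, a comparable fixture can be built with the zipfile module itself. This is only a sketch: make_fixtures and read_nested are hypothetical helpers, only wrap3.zip is written to disk (in a temporary directory), and the inner archive names are passed explicitly rather than parsed from a path.

```python
import io
import os
import tempfile
import zipfile


def make_fixtures(directory):
    """Wrap hi.txt in wrap1.zip, wrap2.zip and wrap3.zip; return wrap3.zip's path."""
    payload = b"hello world\n"
    name = "hi.txt"
    for level in (1, 2, 3):
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w") as archive:
            archive.writestr(name, payload)
        # The archive just built becomes the payload of the next level.
        payload, name = buf.getvalue(), "wrap%d.zip" % level
    path = os.path.join(directory, name)
    with open(path, "wb") as handle:
        handle.write(payload)
    return path


def read_nested(outer_path, inner_names, member):
    """Open outer_path, descend through inner_names in order, then read member."""
    archive = zipfile.ZipFile(outer_path)
    try:
        for inner in inner_names:
            data = archive.read(inner)
            archive.close()
            # Each inner archive is reopened from a seekable in-memory copy.
            archive = zipfile.ZipFile(io.BytesIO(data))
        return archive.read(member)
    finally:
        archive.close()


with tempfile.TemporaryDirectory() as tmp:
    wrap3 = make_fixtures(tmp)
    result = read_nested(wrap3, ["wrap2.zip", "wrap1.zip"], "hi.txt")
print(result)  # b'hello world\n'
```

On Python 3 the content comes back as bytes, so decode it if text is needed.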