
Downloading and unzipping a .zip file without writing to disk


Below is a code snippet I used to fetch a zipped CSV file:

Python 2:

    from StringIO import StringIO
    from zipfile import ZipFile
    from urllib import urlopen

    resp = urlopen("http://www.test.com/file.zip")
    zipfile = ZipFile(StringIO(resp.read()))
    for line in zipfile.open(file).readlines():
        print line

Python 3:

    from io import BytesIO
    from zipfile import ZipFile
    from urllib.request import urlopen
    # or: requests.get(url).content

    resp = urlopen("http://www.test.com/file.zip")
    zipfile = ZipFile(BytesIO(resp.read()))
    for line in zipfile.open(file).readlines():
        print(line.decode('utf-8'))

Here file is a string naming a member of the archive. To find the name you want to pass, you can use zipfile.namelist(). For instance:

    resp = urlopen('http://mlg.ucd.ie/files/datasets/bbc.zip')
    zipfile = ZipFile(BytesIO(resp.read()))
    zipfile.namelist()
    # ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']
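As a rough sketch of how namelist() can be used to pick the member to read, the snippet below filters the names for a .csv file and parses it with the csv module. To keep it runnable without a network connection, it builds a small zip in memory as a stand-in for resp.read(); the member names and contents (readme.txt, data.csv) are made up for illustration, and UTF-8 is an assumed encoding.

```python
import csv
from io import BytesIO
from zipfile import ZipFile

# Stand-in for resp.read(): a small zip built in memory (hypothetical contents).
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("readme.txt", "not a csv\n")
    archive.writestr("data.csv", "name,count\nfoo,1\nbar,2\n")
zip_bytes = buffer.getvalue()

with ZipFile(BytesIO(zip_bytes)) as archive:
    # namelist() returns the member names; keep only the CSV files.
    csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
    # read() returns bytes, so decode before handing the text to csv.
    text = archive.read(csv_names[0]).decode("utf-8")

rows = list(csv.reader(text.splitlines()))
print(csv_names)
print(rows)
```

With a real download, zip_bytes would simply be resp.read() from the snippets above.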


My suggestion would be to use a StringIO object. StringIO objects emulate files but reside in memory. So you could do something like this:

    # get_zip_data() gets a zip archive containing 'foo.txt', reading 'hey, foo'
    import zipfile
    from StringIO import StringIO

    zipdata = StringIO()
    zipdata.write(get_zip_data())
    myzipfile = zipfile.ZipFile(zipdata)
    foofile = myzipfile.open('foo.txt')
    print foofile.read()
    # output: "hey, foo"

Or more simply (apologies to Vishal):

    myzipfile = zipfile.ZipFile(StringIO(get_zip_data()))
    for name in myzipfile.namelist():
        [ ... ]

In Python 3 use BytesIO instead of StringIO:

    import zipfile
    from io import BytesIO

    filebytes = BytesIO(get_zip_data())
    myzipfile = zipfile.ZipFile(filebytes)
    for name in myzipfile.namelist():
        [ ... ]


I'd like to offer an updated Python 3 version of Vishal's excellent answer, which was using Python 2, along with some explanation of the adaptations / changes, which may have been already mentioned.

    from io import BytesIO
    from zipfile import ZipFile
    import urllib.request

    url = urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/loc162txt.zip")

    with ZipFile(BytesIO(url.read())) as my_zip_file:
        for contained_file in my_zip_file.namelist():
            # with open(("unzipped_and_read_" + contained_file + ".file"), "wb") as output:
            for line in my_zip_file.open(contained_file).readlines():
                print(line)
                # output.write(line)

Necessary changes:

  • There's no StringIO module in Python 3 (it's been moved to io.StringIO). Instead, I use io.BytesIO, because we will be handling a bytestream (see the docs, also this thread).
  • urlopen: in Python 3 it lives in urllib.request; the Python 2 urllib.urlopen no longer exists.

Note:

  • In Python 3, the printed output lines will look like so: b'some text'. This is expected, as they aren't strings - remember, we're reading a bytestream. Have a look at Dan04's excellent answer.
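If you would rather get str lines than b'...' bytes, one option is to wrap the opened member in io.TextIOWrapper, which decodes as it reads. This is only a sketch: the in-memory archive and the member name notes.txt are invented for the example, and UTF-8 is an assumption about the file's encoding.

```python
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile

# Stand-in for the downloaded bytes: a small zip built in memory (hypothetical contents).
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("notes.txt", "first line\nsecond line\n")

decoded_lines = []
with ZipFile(BytesIO(buffer.getvalue())) as archive:
    with archive.open("notes.txt") as raw:
        # TextIOWrapper turns the byte stream into text, line by line.
        for line in TextIOWrapper(raw, encoding="utf-8"):
            decoded_lines.append(line.rstrip("\n"))

print(decoded_lines)
```

Calling line.decode('utf-8') on each bytes line, as in the first answer, does the same job; the wrapper just avoids repeating the decode call.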

A few minor changes I made:

  • I use with ... as instead of zipfile = ... according to the Docs.
  • The script now uses .namelist() to cycle through all the files in the zip and print their contents.
  • I moved the creation of the ZipFile object into the with statement, although I'm not sure if that's better.
  • I added (and commented out) an option to write the bytestream to file (per file in the zip), in response to NumenorForLife's comment; it adds "unzipped_and_read_" to the beginning of the filename and a ".file" extension (I prefer not to use ".txt" for files with bytestrings). The indenting of the code will, of course, need to be adjusted if you want to use it.
    • Need to be careful here -- because we have a byte string, we use binary mode, so "wb"; I have a feeling that writing binary opens a can of worms anyway...
  • I am using an example file, the UN/LOCODE text archive.
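For reference, here is roughly what the commented-out write option looks like with the indentation adjusted. It is a sketch under a few assumptions: a small in-memory zip (with an invented foo.txt member) stands in for url.read(), the files go into a temporary directory rather than the working directory, and directory entries are skipped.

```python
import os
import tempfile
from io import BytesIO
from zipfile import ZipFile

# Stand-in for url.read(): a small zip built in memory (hypothetical contents).
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("foo.txt", "hey, foo\n")

written = {}
with tempfile.TemporaryDirectory() as out_dir:
    with ZipFile(BytesIO(buffer.getvalue())) as my_zip_file:
        for contained_file in my_zip_file.namelist():
            if contained_file.endswith("/"):
                continue  # directory entry, nothing to write
            out_name = "unzipped_and_read_" + os.path.basename(contained_file) + ".file"
            out_path = os.path.join(out_dir, out_name)
            # "wb" because ZipFile.open() yields bytes, not str.
            with my_zip_file.open(contained_file) as member, open(out_path, "wb") as output:
                for line in member:
                    output.write(line)
            # Read the file back so the result can be inspected after cleanup.
            with open(out_path, "rb") as check:
                written[out_name] = check.read()

print(written)
```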

What I didn't do:

  • NumenorForLife asked about saving the zip to disk. I'm not sure what he meant by it -- downloading the zip file? That's a different task; see Oleh Prypin's excellent answer.

Here's a way:

    import urllib.request
    import shutil

    # Open the output in binary mode ('wb'): the response yields bytes, not str.
    with urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/2015-2_UNLOCODE_SecretariatNotes.pdf") as response, open("downloaded_file.pdf", 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
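shutil.copyfileobj copies between file-like objects in chunks, and since an HTTP response yields bytes, the destination has to be opened in binary mode. The sketch below shows that without touching the network: a BytesIO with a made-up payload stands in for the urlopen response, and a temporary file stands in for downloaded_file.pdf.

```python
import os
import shutil
import tempfile
from io import BytesIO

# Stand-in for the urlopen(...) response (hypothetical payload).
response = BytesIO(b"%PDF-1.4 fake payload")

fd, path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as out_file:  # binary mode, matching the bytes source
    shutil.copyfileobj(response, out_file)

# Read the file back to confirm the bytes arrived unchanged, then clean up.
with open(path, "rb") as check:
    copied = check.read()
os.remove(path)

print(copied)
```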