ZipFile.testzip() returning different results on Python 2 and Python 3


The CRC value is OK. The CRC of 'vertnet_latest_amphibians.csv' recorded in the zip is 0x87203305. After extraction, this is indeed the CRC of the file.
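To repeat this check yourself, a minimal sketch along these lines should do. It streams the extracted file through zlib.crc32 so that a multi-gigabyte CSV never has to fit in memory. The helper name crc32_of_file and the chunk size are my own choices, not anything from the question.

```python
import zlib


def crc32_of_file(path, chunk_size=1 << 20):
    """Return the CRC-32 of the file at `path`, read in chunks."""
    crc = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    # Mask to an unsigned 32-bit value (matters on Python 2).
    return crc & 0xFFFFFFFF
```

With the CSV already extracted (for example by unzip), comparing crc32_of_file("vertnet_latest_amphibians.csv") against zipfile.ZipFile("vertnet_latest_amphibians.zip").getinfo("vertnet_latest_amphibians.csv").CRC should show whether the recorded CRC matches.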

However, the given uncompressed size is incorrect. The zip file records compressed size of 309,723,024 bytes, and uncompressed size of 292,198,614 bytes (that's smaller!). In reality, the uncompressed file is 4,587,165,910 bytes (4.3 GiB). This is bigger than the 4 GiB threshold where 32-bit counters break.
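The numbers line up with a 32-bit wraparound: the recorded size is exactly the real size modulo 2**32. A quick check, using only the figures quoted above:

```python
recorded_size = 292198614   # uncompressed size stored in the zip
actual_size = 4587165910    # size of the extracted CSV on disk

# A 32-bit field can only hold the size modulo 2**32.
wrapped = actual_size % 2**32
difference = actual_size - recorded_size

print(wrapped == recorded_size)  # True
print(difference == 2**32)       # True
```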

You can fix it like this (this worked in Python 3.5.2, at least):

    import zipfile

    archive = zipfile.ZipFile("vertnet_latest_amphibians.zip")
    # The recorded size wrapped around at 2**32, so add it back (in memory only).
    archive.getinfo("vertnet_latest_amphibians.csv").file_size += 2**32
    archive.testzip()  # now passes
    archive.extract("vertnet_latest_amphibians.csv")  # now works
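Why this helps, as far as I can tell from how Python 3's zipfile behaves: the reader stops after file_size bytes and then checks the CRC of what it has read, so an understated size produces a CRC failure even though the stored CRC is correct. Here is a sketch on a small synthetic in-memory archive. The member name data.csv and the payload are made up for illustration, and the size is understated by hand to mimic the wrapped 32-bit field.

```python
import io
import zipfile

payload = b"amphibian,record\n" * 1000  # made-up stand-in for the CSV

# Build a small, valid archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as out:
    out.writestr("data.csv", payload)

archive = zipfile.ZipFile(buffer)
info = archive.getinfo("data.csv")
true_size = info.file_size

# Understate the size in the in-memory metadata only; the bytes are untouched.
info.file_size = 100
bad_result = archive.testzip()   # name of the first failing member, if any

# Restore the real size, as the += 2**32 does for the real archive.
info.file_size = true_size
good_result = archive.testzip()  # None when every member checks out

print(bad_result, good_result)
```

Nothing is written back to the archive in either case; only the ZipInfo object held by this ZipFile instance changes.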


I was unable to get Python 3 to extract from the archive. Some results from an investigation (on Mac OS X) that might be helpful.

Check the health of the archive

Make the file read-only in order to prevent accidental changes:

    $ chmod -w vertnet_latest_amphibians.zip
    $ ls -lh vertnet_latest_amphibians.zip
    -r--r--r-- 1 lawh 2045336417 296M Jan  6 10:10 vertnet_latest_amphibians.zip

Check the archive using zip and unzip:

    $ zip -T vertnet_latest_amphibians.zip
    test of vertnet_latest_amphibians.zip OK
    $ unzip -t vertnet_latest_amphibians.zip
    Archive:  vertnet_latest_amphibians.zip
        testing: VertNet_Amphibia_eml.xml   OK
        testing: __MACOSX/                OK
        testing: __MACOSX/._VertNet_Amphibia_eml.xml   OK
        testing: vertnet_latest_amphibians.csv   OK
        testing: __MACOSX/._vertnet_latest_amphibians.csv   OK
    No errors detected in compressed data of vertnet_latest_amphibians.zip

As also found by @sam-mussmann, 7z reports a CRC error:

    $ 7z t vertnet_latest_amphibians.zip
    7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
    p7zip Version 16.02 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,4 CPUs x64)
    Scanning the drive for archives:
    1 file, 309726398 bytes (296 MiB)
    Testing archive: vertnet_latest_amphibians.zip
    --
    Path = vertnet_latest_amphibians.zip
    Type = zip
    Physical Size = 309726398
    ERROR: CRC Failed : vertnet_latest_amphibians.csv
    Sub items Errors: 1
    Archives with Errors: 1
    Sub items Errors: 1

My zip and unzip are both rather old; 7z is pretty new:

    $ zip -v | head -2
    Copyright (c) 1990-2008 Info-ZIP - Type 'zip "-L"' for software license.
    This is Zip 3.0 (July 5th 2008), by Info-ZIP.
    $ unzip -v | head -1
    UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
    $ 7z --help |head -3
    7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
    p7zip Version 16.02 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,4 CPUs x64)

Extract

Using unzip:

    $ time unzip vertnet_latest_amphibians.zip vertnet_latest_amphibians.csv
    Archive:  vertnet_latest_amphibians.zip
      inflating: vertnet_latest_amphibians.csv

    real    0m17.201s
    user    0m14.281s
    sys     0m2.460s

Extract using Python 2.7.13, using zipfile's command-line interface for brevity:

    $ time ~/local/python-2.7.13/bin/python2 -m zipfile -e vertnet_latest_amphibians.zip .

    real    0m19.491s
    user    0m12.996s
    sys     0m5.897s

As you found, Python 3.6.0 (also 3.4.5 and 3.5.2) reports a bad CRC.

Hypothesis 1: The archive contains a bad CRC that zip, unzip and Python 2.7.13 are failing to detect; 7z and Python 3.4-3.6 are all doing the right thing.

Hypothesis 2: The archive is fine; 7z and Python 3.4-3.6 all contain a bug.

Given the relative ages of these tools, I would guess that H1 is correct.
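One way to see what the archive itself claims, independent of any tool's verdict, is to dump the sizes and CRC recorded for each member. A sketch (the function name describe_members is mine). A recorded uncompressed size below the compressed size is not proof of corruption, but for a large text file it is a hint worth following up.

```python
import zipfile


def describe_members(zip_path):
    """Return (name, compressed size, recorded size, CRC as hex) per member."""
    rows = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            rows.append((info.filename, info.compress_size,
                         info.file_size, "0x%08x" % info.CRC))
    return rows
```

Printing each row of describe_members("vertnet_latest_amphibians.zip") lists the values that the various tools are comparing against.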

Workaround

If you are not using Windows and trust the contents of the archive, it might be more straightforward to use regular shell commands. Something like:

    wget <the-long-url> -O /tmp/vertnet_latest_amphibians.zip
    unzip /tmp/vertnet_latest_amphibians.zip vertnet_latest_amphibians.csv
    rm -rf /tmp/vertnet_latest_amphibians.zip

Or you could execute unzip from within Python:

    import os

    os.system('unzip vertnet_latest_amphibians.zip vertnet_latest_amphibians.csv')
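If you go this route, subprocess is usually a better fit than os.system, since it bypasses the shell and raises on a non-zero exit status. A sketch, assuming unzip is on the PATH; the helper names are hypothetical.

```python
import subprocess


def unzip_command(archive_path, member, dest_dir="."):
    """Build the argument list for extracting one member with unzip."""
    # -o: overwrite without prompting; -d: destination directory.
    return ["unzip", "-o", archive_path, member, "-d", dest_dir]


def unzip_member(archive_path, member, dest_dir="."):
    """Run unzip; raises CalledProcessError if it exits with an error."""
    subprocess.check_call(unzip_command(archive_path, member, dest_dir))
```

Calling unzip_member("vertnet_latest_amphibians.zip", "vertnet_latest_amphibians.csv") would then do the same job as the os.system line, with file names passed as separate arguments rather than interpolated into a shell string.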

Incidental

It is slightly neater to catch ImportError than to check the version of the Python interpreter:

    try:
        from urllib.request import urlretrieve
    except ImportError:
        from urllib import urlretrieve


As @Kundor showed, correcting file_size works. However, setting it to the 32-bit maximum (2**32 - 1, i.e. 4 GiB minus 1 byte) would still fail for any file larger than 4 GiB, so set it to the maximum size for ZIP64 instead (2**64 - 1, i.e. 16 EiB minus 1 byte).

Tested on a 927 MB compressed archive whose file_to_extract is 11 GB:

url:https://de.iplantcollaborative.org/anon-files//iplant/home/shared/commons_repo/curated/Vertnet_Aves_Sep2016/VertNet_Aves_Sept2016.zip

file: vertnet_latest_birds.csv

    import sys
    import zipfile

    # urlretrieve lives in different modules on Python 2 and Python 3.
    if sys.version_info >= (3, 0, 0):
        from urllib.request import urlretrieve
    else:
        from urllib import urlretrieve

    url = "https://de.iplantcollaborative.org/anon-files//iplant/home/shared/commons_repo/curated/Vertnet_Amphibia_Sep2016/VertNet_Amphibia_Sept2016.zip"
    zip_path = "vertnet_latest_amphibians.zip"
    file_to_extract = "vertnet_latest_amphibians.csv"

    urlretrieve(url, zip_path)

    archive = zipfile.ZipFile(zip_path)
    if archive.testzip():
        # reset the uncompressed size to the ZIP64 maximum (in memory only)
        archive.getinfo(file_to_extract).file_size = (2 ** 64) - 1
        open_archive_file = archive.open(file_to_extract, 'r')
        # or archive.extract(file_to_extract)
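For a member this large you probably want to stream it to disk rather than hold it in memory. A sketch of that, combining the size override with shutil.copyfileobj. The function name and the dest_path parameter are my own, and this relies on the CRC check at the end of the stream to catch genuinely corrupt data.

```python
import shutil
import zipfile

ZIP64_MAX_SIZE = 2 ** 64 - 1  # 16 EiB minus 1 byte


def extract_ignoring_recorded_size(zip_path, member, dest_path):
    """Stream `member` to `dest_path` without trusting its recorded size."""
    with zipfile.ZipFile(zip_path) as archive:
        # Only the in-memory metadata changes; the archive is opened read-only.
        archive.getinfo(member).file_size = ZIP64_MAX_SIZE
        with archive.open(member, "r") as source, open(dest_path, "wb") as target:
            shutil.copyfileobj(source, target)
```

For the archive above this would be called as extract_ignoring_recorded_size(zip_path, file_to_extract, file_to_extract).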