ZipFile.testzip() returning different results on Python 2 and Python 3
The CRC value is OK. The CRC of 'vertnet_latest_amphibians.csv' recorded in the zip is 0x87203305. After extraction, this is indeed the CRC of the file.
However, the recorded uncompressed size is incorrect. The zip file records a compressed size of 309,723,024 bytes and an uncompressed size of 292,198,614 bytes (that's smaller!). In reality, the uncompressed file is 4,587,165,910 bytes (4.3 GiB). This is above the 4 GiB threshold where 32-bit counters wrap around.
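The relationship between the two sizes can be checked with a little arithmetic. This is only a sketch using the byte counts quoted above; the variable names are made up for illustration.

```python
recorded_size = 292198614    # uncompressed size stored in the zip
actual_size = 4587165910     # size of the extracted CSV

# A 32-bit size field can only hold the true size modulo 2**32.
wrapped = actual_size % 2**32

print(wrapped == recorded_size)              # True
print(actual_size - recorded_size == 2**32)  # True
```

The recorded size is exactly 2**32 bytes short, which is what you would expect if the writer truncated the size to 32 bits.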
You can fix it like this (this worked in Python 3.5.2, at least):
    import zipfile

    archive = zipfile.ZipFile("vertnet_latest_amphibians.zip")
    # add the 2**32 bytes that the 32-bit size field dropped
    archive.getinfo("vertnet_latest_amphibians.csv").file_size += 2**32
    archive.testzip()  # now passes
    archive.extract("vertnet_latest_amphibians.csv")  # now works
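If you want to confirm the CRC of the extracted file yourself, something like the following should do it. It is a minimal sketch: `crc32_of_file` is a hypothetical helper name, and it reads in chunks because the file is larger than 4 GiB.

```python
import zlib

def crc32_of_file(path, chunk_size=2**20):
    """Return the CRC-32 of a file, read in chunks to keep memory use low."""
    crc = 0
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    # mask so the result is an unsigned 32-bit value on Python 2 as well
    return crc & 0xFFFFFFFF
```

The result can then be compared with `archive.getinfo("vertnet_latest_amphibians.csv").CRC`.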
I was unable to get Python 3 to extract from the archive. Here are some results from an investigation (on Mac OS X) that might be helpful.
Check the health of the archive
Make the file read-only in order to prevent accidental changes:
    $ chmod -w vertnet_latest_amphibians.zip
    $ ls -lh vertnet_latest_amphibians.zip
    -r--r--r-- 1 lawh 2045336417 296M Jan 6 10:10 vertnet_latest_amphibians.zip

Check the archive using zip and unzip:

    $ zip -T vertnet_latest_amphibians.zip
    test of vertnet_latest_amphibians.zip OK
    $ unzip -t vertnet_latest_amphibians.zip
    Archive: vertnet_latest_amphibians.zip
     testing: VertNet_Amphibia_eml.xml OK
     testing: __MACOSX/ OK
     testing: __MACOSX/._VertNet_Amphibia_eml.xml OK
     testing: vertnet_latest_amphibians.csv OK
     testing: __MACOSX/._vertnet_latest_amphibians.csv OK
    No errors detected in compressed data of vertnet_latest_amphibians.zip
As also found by @sam-mussmann, 7z reports a CRC error:

    $ 7z t vertnet_latest_amphibians.zip
    7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
    p7zip Version 16.02 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,4 CPUs x64)
    Scanning the drive for archives:
    1 file, 309726398 bytes (296 MiB)
    Testing archive: vertnet_latest_amphibians.zip
    --
    Path = vertnet_latest_amphibians.zip
    Type = zip
    Physical Size = 309726398
    ERROR: CRC Failed : vertnet_latest_amphibians.csv
    Sub items Errors: 1
    Archives with Errors: 1
    Sub items Errors: 1
My zip and unzip are both rather old; 7z is pretty new:

    $ zip -v | head -2
    Copyright (c) 1990-2008 Info-ZIP - Type 'zip "-L"' for software license.
    This is Zip 3.0 (July 5th 2008), by Info-ZIP.
    $ unzip -v | head -1
    UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
    $ 7z --help |head -3
    7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
    p7zip Version 16.02 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,4 CPUs x64)
Extract
Using unzip:

    $ time unzip vertnet_latest_amphibians.zip vertnet_latest_amphibians.csv
    Archive: vertnet_latest_amphibians.zip
     inflating: vertnet_latest_amphibians.csv
    real 0m17.201s
    user 0m14.281s
    sys 0m2.460s
Extract using Python 2.7.13, using zipfile's command-line interface for brevity:

    $ time ~/local/python-2.7.13/bin/python2 -m zipfile -e vertnet_latest_amphibians.zip .
    real 0m19.491s
    user 0m12.996s
    sys 0m5.897s
As you found, Python 3.6.0 (also 3.4.5 and 3.5.2) reports a bad CRC.
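For reference, this is how `ZipFile.testzip()` behaves on a healthy archive and on a damaged one. It is a sketch on a tiny in-memory archive written for Python 3, not the VertNet file; the member name `sample.txt` and its contents are invented for the example.

```python
import io
import zipfile

# Build a tiny uncompressed archive in memory (hypothetical member name).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("sample.txt", b"hello world")

good = zipfile.ZipFile(io.BytesIO(buf.getvalue()))
good_result = good.testzip()   # None when every member passes

# Change one byte of the stored data so it no longer matches the recorded CRC.
damaged_bytes = buf.getvalue().replace(b"hello world", b"jello world")
bad = zipfile.ZipFile(io.BytesIO(damaged_bytes))
bad_result = bad.testzip()     # name of the first member that fails

print(good_result, bad_result)
```

`testzip()` returns the name of the first bad member rather than raising, so a truthy result means the check failed.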
Hypothesis 1: The archive contains a bad CRC that zip, unzip and Python 2.7.13 are failing to detect; 7z and Python 3.4-3.6 are all doing the right thing.
Hypothesis 2: The archive is fine; 7z and Python 3.4-3.6 all contain a bug.
Given the relative ages of these tools, I would guess that H1 is correct.
Workaround
If you are not using Windows and trust the contents of the archive, it might be more straightforward to use regular shell commands. Something like:
    wget <the-long-url> -O /tmp/vertnet_latest_amphibians.zip
    unzip /tmp/vertnet_latest_amphibians.zip vertnet_latest_amphibians.csv
    rm -rf /tmp/vertnet_latest_amphibians.zip

Or you could execute unzip from within Python:

    import os
    os.system('unzip vertnet_latest_amphibians.zip vertnet_latest_amphibians.csv')
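A variant using `subprocess` avoids going through the shell and raises an exception if unzip fails. This is a sketch only: the helper names `build_unzip_command` and `extract_with_unzip` are hypothetical, and it assumes an `unzip` executable is on the PATH.

```python
import subprocess

def build_unzip_command(archive, member):
    """Return the argument list for extracting one member with unzip."""
    return ["unzip", archive, member]

def extract_with_unzip(archive, member):
    # check_call raises CalledProcessError if unzip exits with a non-zero status
    subprocess.check_call(build_unzip_command(archive, member))
```

Calling `extract_with_unzip("vertnet_latest_amphibians.zip", "vertnet_latest_amphibians.csv")` would then run the same command as the `os.system` line above.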
Incidental
It is slightly neater to catch ImportError than to check the version of the Python interpreter:

    try:
        from urllib.request import urlretrieve
    except ImportError:
        from urllib import urlretrieve
Following @Kundor's answer: setting file_size to the 32-bit maximum (2**32 - 1, i.e. 4 GiB minus 1 byte) will work here but fail for any file larger than 4 GiB, so set it to the maximum size for ZIP64 instead (2**64 - 1, i.e. 16 EiB minus 1 byte).
Tested on vertnet_latest_birds.csv (927 MB compressed, 11 GB extracted).
    import zipfile

    # urlretrieve lives in different modules on Python 2 and Python 3
    try:
        from urllib.request import urlretrieve
    except ImportError:
        from urllib import urlretrieve

    url = "https://de.iplantcollaborative.org/anon-files//iplant/home/shared/commons_repo/curated/Vertnet_Amphibia_Sep2016/VertNet_Amphibia_Sept2016.zip"
    zip_path = "vertnet_latest_amphibians.zip"
    file_to_extract = "vertnet_latest_amphibians.csv"

    urlretrieve(url, zip_path)
    archive = zipfile.ZipFile(zip_path)

    if archive.testzip():
        # testzip() returned a bad member name:
        # reset the uncompressed size header value to the ZIP64 maximum
        archive.getinfo(file_to_extract).file_size = (2 ** 64) - 1

    open_archive_file = archive.open(file_to_extract, 'r')
    # or archive.extract(file_to_extract)