Compute hash of only the core image data (excluding metadata) for an image
It is much easier to use the Python Imaging Library to extract the picture data (example in iPython):
In [1]: from PIL import Image
In [2]: import hashlib
In [3]: im = Image.open('foo.jpg')
In [4]: hashlib.md5(im.tobytes()).hexdigest()
Out[4]: '171e2774b2549bbe0e18ed6dcafd04d5'
This works on any type of image that PIL can handle. The tobytes method returns a bytes object containing the pixel data.
BTW, the MD5 hash is now seen as pretty weak. Better to use SHA512:
In [6]: hashlib.sha512(im.tobytes()).hexdigest()
Out[6]: '6361f4a2722f221b277f81af508c9c1d0385d293a12958e2c56a57edf03da16f4e5b715582feef3db31200db67146a4b52ec3a8c445decfc2759975a98969c34'
On my machine, calculating the MD5 checksum for a 2500x1600 JPEG takes around 0.07 seconds. Using SHA512, it takes 0.10 seconds. Complete example:
#!/usr/bin/env python3
from PIL import Image
import hashlib
import sys

im = Image.open(sys.argv[1])
print(hashlib.sha512(im.tobytes()).hexdigest(), end="")
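If you want to check the relative cost of MD5 and SHA512 on your own machine, here is a rough sketch using only the standard library. It times just the hashing step, not the JPEG decoding, and it uses a zero-filled buffer as a stand-in for what im.tobytes() would return for a 2500x1600 RGB image. The helper name time_hash is my own, not something from hashlib.

```python
import hashlib
import timeit

def time_hash(algorithm, data, repeats=5):
    """Return the best wall-clock time (seconds) to hash `data` once."""
    def run():
        hashlib.new(algorithm, data).hexdigest()
    return min(timeit.repeat(run, number=1, repeat=repeats))

# Stand-in for im.tobytes() of a 2500x1600 RGB image (3 bytes per pixel).
# Real pixel data would come from Pillow; zeros are enough to measure throughput.
pixels = bytes(2500 * 1600 * 3)

for name in ("md5", "sha512"):
    print(f"{name}: {time_hash(name, pixels):.3f} s")
```

The numbers will differ from the ones quoted above, since those most likely include opening and decoding the file.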
For movies, you can extract frames from them with e.g. ffmpeg, and then process them as shown above.
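As a sketch of the movie case: instead of writing frames to disk, you could let ffmpeg decode to raw RGB on stdout and hash each frame as it arrives. This assumes ffmpeg is on your PATH and that you already know the frame width and height (a wrong size gives meaningless frame boundaries). The function names hash_frames and hash_video are made up for this example.

```python
import hashlib
import io
import subprocess

def hash_frames(stream, width, height, algorithm="sha512"):
    """Yield one hex digest per raw RGB24 frame read from a binary stream."""
    frame_size = width * height * 3  # 3 bytes per pixel for rgb24
    while True:
        frame = stream.read(frame_size)
        if len(frame) < frame_size:
            break  # end of stream (a trailing partial frame is ignored)
        yield hashlib.new(algorithm, frame).hexdigest()

def hash_video(path, width, height):
    """Decode `path` with ffmpeg and return a list of per-frame digests."""
    # Assumption: ffmpeg is installed; width/height must match the video.
    cmd = ["ffmpeg", "-v", "error", "-i", path,
           "-f", "rawvideo", "-pix_fmt", "rgb24", "-"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        return list(hash_frames(proc.stdout, width, height))

# Demo on synthetic data: two 2x2 frames, no ffmpeg needed.
demo = io.BytesIO(bytes(12) + bytes([255]) * 12)
demo_digests = list(hash_frames(demo, 2, 2))
print(len(demo_digests))  # 2
```

For a real file you would call something like hash_video("clip.mp4", 1920, 1080), with the file name and dimensions replaced by your own.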
One simple way to do it is to hash the core image data. For PNG, you could do this by hashing only the "critical chunks" (i.e. the ones whose names start with a capital letter). JPEG has a similar but simpler file structure.
The visual hash in ImageMagick decompresses the image as it hashes it. In your case, you could hash the compressed image data right away, so (if implemented correctly) it should be just as quick as hashing the raw file.
This is a small Python script illustrating the idea. It may or may not work for you, but it should at least give an indication of what I mean :)
import hashlib
import os
import struct

def png(fh):
    # Hash only the compressed pixel data stored in IDAT chunks.
    digest = hashlib.md5()
    assert fh.read(8)[1:4] == b"PNG"
    while True:
        try:
            length, = struct.unpack(">I", fh.read(4))
        except struct.error:
            break  # end of file
        if fh.read(4) == b"IDAT":
            digest.update(fh.read(length))
            fh.read(4)  # skip CRC
        else:
            fh.seek(length + 4, os.SEEK_CUR)  # skip payload and CRC
    print("Hash: %s" % digest.hexdigest())

def jpeg(fh):
    # Skip the header segments, then hash everything after the start-of-scan marker.
    digest = hashlib.md5()
    assert fh.read(2) == b"\xff\xd8"
    while True:
        marker, length = struct.unpack(">2H", fh.read(4))
        assert marker & 0xff00 == 0xff00
        if marker == 0xFFDA:  # start of scan
            digest.update(fh.read())
            break
        else:
            fh.seek(length - 2, os.SEEK_CUR)
    print("Hash: %s" % digest.hexdigest())

if __name__ == '__main__':
    with open("sample.png", "rb") as fh:
        png(fh)
    with open("sample.jpg", "rb") as fh:
        jpeg(fh)
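The script above only looks at IDAT. A variant that covers all critical chunks (IHDR, PLTE, IDAT, IEND), as described earlier, might look like the sketch below. It works on bytes already read into memory, and the helpers make_chunk and make_png are hypothetical, there only to build two tiny test images that differ in their tEXt metadata.

```python
import hashlib
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def hash_critical_chunks(data, algorithm="sha512"):
    """Hash type and payload of every critical chunk in PNG bytes."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    digest = hashlib.new(algorithm)
    pos = 8
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        payload = data[pos + 8:pos + 8 + length]
        # Critical chunks have an uppercase first letter (bit 5 clear).
        if not ctype[0] & 0x20:
            digest.update(ctype)
            digest.update(payload)
        pos += 12 + length  # length + type + payload + CRC
    return digest.hexdigest()

def make_chunk(ctype, payload):
    """Build one PNG chunk (hypothetical helper, used only for the demo)."""
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)

def make_png(comment):
    """Build a 1x1 grayscale PNG carrying `comment` in a tEXt chunk."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"tEXt", b"Comment\x00" + comment)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))

a = hash_critical_chunks(make_png(b"first"))
b = hash_critical_chunks(make_png(b"second, longer comment"))
print(a == b)  # True: metadata differs, critical chunks do not
```

Keep in mind this hashes the compressed data, so the same picture saved with a different compression level will give a different hash.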
You can use stream which is part of the ImageMagick suite:
$ stream -map rgb -storage-type short image.tif - | sha256sum
d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64 -
or
$ sha256sum <(stream -map rgb -storage-type short image.tif -)
d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64 /dev/fd/63
This example is for a TIFF file which is RGB with 16 bits per sample (i.e. 48 bits per pixel). So I use map to rgb and a short storage-type (you can use char here if the RGB values are 8 bits).
This method reports the same signature hash that the verbose ImageMagick identify command reports:
$ identify -verbose image.tif | grep signature
signature: d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64
(for ImageMagick v6.x; the hash reported by identify on version 7 is different to that obtained using stream, but the latter may be reproduced by any tool capable of extracting the raw bitmap data - such as dcraw for some image types.)