Compute hash of only the core image data (excluding metadata) for an image


It is much easier to use the Python Imaging Library (PIL) to extract the picture data (example in IPython):

    In [1]: from PIL import Image
    In [2]: import hashlib
    In [3]: im = Image.open('foo.jpg')
    In [4]: hashlib.md5(im.tobytes()).hexdigest()
    Out[4]: '171e2774b2549bbe0e18ed6dcafd04d5'

This works on any type of image that PIL can handle. The tobytes method returns a bytes object containing the pixel data.
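One caveat worth keeping in mind: the pixel bytes on their own do not record the image's dimensions or mode, so a 2x3 and a 3x2 image with the same byte stream would hash identically. A minimal sketch of folding those into the digest follows. `pixel_digest` is a hypothetical helper, not part of PIL; with PIL you would pass it `im.mode`, `im.size` and `im.tobytes()`.

```python
import hashlib

def pixel_digest(mode, size, data, algorithm="sha512"):
    """Hash raw pixel bytes together with the image mode and dimensions.

    Hypothetical helper: with PIL, call it as
    pixel_digest(im.mode, im.size, im.tobytes()).
    """
    h = hashlib.new(algorithm)
    # Prefix the layout so equal byte streams with different shapes differ.
    h.update("{}:{}x{}:".format(mode, size[0], size[1]).encode("ascii"))
    h.update(data)
    return h.hexdigest()

if __name__ == "__main__":
    pixels = bytes(range(6))
    print(pixel_digest("L", (2, 3), pixels))
    print(pixel_digest("L", (3, 2), pixels))
```

The two printed digests differ even though the pixel bytes are identical, because the size prefix differs.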

BTW, the MD5 hash is now seen as pretty weak. Better to use SHA512:

    In [6]: hashlib.sha512(im.tobytes()).hexdigest()
    Out[6]: '6361f4a2722f221b277f81af508c9c1d0385d293a12958e2c56a57edf03da16f4e5b715582feef3db31200db67146a4b52ec3a8c445decfc2759975a98969c34'

On my machine, calculating the MD5 checksum for a 2500x1600 JPEG takes around 0.07 seconds. Using SHA512, it takes 0.10 seconds. Complete example:

    #!/usr/bin/env python3
    from PIL import Image
    import hashlib
    import sys

    im = Image.open(sys.argv[1])
    print(hashlib.sha512(im.tobytes()).hexdigest(), end="")
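To get a feel for the relative cost of the two digests on your own machine, something like the following could be used. It is only a sketch: it hashes a zero-filled buffer the size of a decoded 2500x1600 RGB image (an assumption of 3 bytes per pixel), so it leaves out the time PIL needs to decode the file, and `time_digest` is a hypothetical helper.

```python
import hashlib
import timeit

# Stand-in for the decoded pixels of a 2500x1600 RGB image (3 bytes per pixel).
# Real timings will also include the time PIL needs to decode the file.
data = bytes(2500 * 1600 * 3)

def time_digest(name, payload, repeat=3):
    """Return the best wall-clock time in seconds for one digest of payload."""
    timer = timeit.Timer(lambda: hashlib.new(name, payload).digest())
    return min(timer.repeat(repeat=repeat, number=1))

if __name__ == "__main__":
    for name in ("md5", "sha512"):
        print("{}: {:.3f} s".format(name, time_digest(name, data)))
```

The numbers will vary with hardware and Python build, so treat them as a rough comparison only.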

For movies, you can extract frames from them with e.g. ffmpeg, and then process them as shown above.
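A rough sketch of the movie case: let ffmpeg decode the video to raw RGB frames on stdout and hash each fixed-size frame as it arrives. `hash_frames` and `hash_movie` are hypothetical helpers; the sketch assumes ffmpeg is on the PATH and that you already know the frame width and height.

```python
import hashlib
import subprocess

def hash_frames(stream, width, height, bytes_per_pixel=3):
    """Yield one SHA-512 hex digest per raw video frame read from stream."""
    frame_size = width * height * bytes_per_pixel
    while True:
        frame = stream.read(frame_size)
        if len(frame) < frame_size:  # end of stream (or a truncated frame)
            break
        yield hashlib.sha512(frame).hexdigest()

def hash_movie(path, width, height):
    """Decode path with ffmpeg to raw rgb24 frames and hash each one.

    Assumes ffmpeg is installed and width/height match the video.
    """
    cmd = ["ffmpeg", "-loglevel", "error", "-i", path,
           "-f", "rawvideo", "-pix_fmt", "rgb24", "-"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        yield from hash_frames(proc.stdout, width, height)
```

Usage would be along the lines of `for digest in hash_movie("movie.mp4", 1920, 1080): print(digest)`, where the file name and dimensions are placeholders.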


One simple way to do it is to hash the core image data. For PNG, you could do this by hashing only the "critical chunks" (i.e. the ones whose type starts with a capital letter). JPEG has a similar but simpler file structure.

The visual hash in ImageMagick decompresses the image as it hashes it. In your case, you could hash the compressed image data right away, so (if implemented correctly) it should be just as quick as hashing the raw file.

This is a small Python script illustrating the idea. It may or may not work for you, but it should at least give an indication of what I mean :)

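    import hashlib
    import os
    import struct

    def png(fh):
        """Hash the compressed image data (IDAT chunks) of a PNG file."""
        digest = hashlib.md5()
        assert fh.read(8)[1:4] == b"PNG"
        while True:
            try:
                # Chunk length is an unsigned 32-bit big-endian integer.
                length, = struct.unpack(">I", fh.read(4))
            except struct.error:
                break  # end of file
            if fh.read(4) == b"IDAT":
                digest.update(fh.read(length))
                fh.read(4)  # skip CRC
            else:
                fh.seek(length + 4, os.SEEK_CUR)  # skip data and CRC
        print("Hash: %s" % digest.hexdigest())

    def jpeg(fh):
        """Hash everything from the start-of-scan marker onwards in a JPEG file."""
        digest = hashlib.md5()
        assert fh.read(2) == b"\xff\xd8"  # SOI marker
        while True:
            marker, length = struct.unpack(">2H", fh.read(4))
            assert marker & 0xff00 == 0xff00
            if marker == 0xFFDA:  # start of scan
                digest.update(fh.read())
                break
            else:
                fh.seek(length - 2, os.SEEK_CUR)  # length includes its own 2 bytes
        print("Hash: %s" % digest.hexdigest())

    if __name__ == '__main__':
        with open("sample.png", "rb") as fh:
            png(fh)
        with open("sample.jpg", "rb") as fh:
            jpeg(fh)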
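The script above only looks at IDAT chunks. A variant that follows the "critical chunks" description more literally might look like the sketch below; it hashes every chunk whose type starts with an uppercase letter (so IHDR and PLTE are included) and skips ancillary ones such as tEXt. `png_core_digest` and `make_chunk` are hypothetical names, and the tiny 1x1 image is built in memory purely for the demonstration. Note that because the compressed data is hashed, the same pixels saved with different compression settings will still give different digests.

```python
import hashlib
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_core_digest(data):
    """Return the MD5 hex digest of the critical chunks in PNG bytes."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    digest = hashlib.md5()
    pos = 8
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        # Critical chunk types start with an uppercase ASCII letter.
        if 65 <= ctype[0] <= 90:
            digest.update(ctype)
            digest.update(body)
        pos += 12 + length  # length field + type + body + CRC
    return digest.hexdigest()

def make_chunk(ctype, body):
    """Build one PNG chunk (helper used only for the demonstration)."""
    crc = zlib.crc32(ctype + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)

if __name__ == "__main__":
    # A 1x1 greyscale image, with and without a tEXt metadata chunk.
    ihdr = make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    idat = make_chunk(b"IDAT", zlib.compress(b"\x00\x7f"))
    iend = make_chunk(b"IEND", b"")
    text = make_chunk(b"tEXt", b"Comment\x00hello")
    plain = PNG_SIGNATURE + ihdr + idat + iend
    tagged = PNG_SIGNATURE + ihdr + text + idat + iend
    print(png_core_digest(plain) == png_core_digest(tagged))
```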


You can use stream which is part of the ImageMagick suite:

    $ stream -map rgb -storage-type short image.tif - | sha256sum
    d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64  -

or

    $ sha256sum <(stream -map rgb -storage-type short image.tif -)
    d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64  /dev/fd/63

This example is for a TIFF file which is RGB with 16 bits per sample (i.e. 48 bits per pixel). So I use -map rgb and -storage-type short (you can use char here if the RGB values are 8-bit).

This method reports the same signature hash that the verbose ImageMagick identify command reports:

    $ identify -verbose image.tif | grep signature
    signature: d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64

(for ImageMagick v6.x; the hash reported by identify on version 7 is different to that obtained using stream, but the latter may be reproduced by any tool capable of extracting the raw bitmap data - such as dcraw for some image types.)
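To see why the storage type has to match the data (and has to be the same on both sides of any comparison): the digest is taken over the packed samples, so the same values written as 8-bit chars and as 16-bit shorts produce different byte streams and therefore different hashes. The sketch below is only an illustration of that point. `raw_rgb_digest` is a hypothetical helper, native byte order is an assumption, and it makes no attempt to reproduce the exact output of stream.

```python
import hashlib
import struct

def raw_rgb_digest(pixels, storage_type="short"):
    """SHA-256 over RGB samples packed as 8-bit chars or 16-bit shorts."""
    fmt = {"char": "B", "short": "H"}[storage_type]
    flat = [sample for pixel in pixels for sample in pixel]
    # "=" selects native byte order with standard sizes and no padding.
    raw = struct.pack("={}{}".format(len(flat), fmt), *flat)
    return hashlib.sha256(raw).hexdigest()

if __name__ == "__main__":
    pixels = [(255, 0, 0), (0, 255, 0)]  # two pixels, values fit in 8 bits
    print(raw_rgb_digest(pixels, "char"))
    print(raw_rgb_digest(pixels, "short"))
```

The two digests differ although the sample values are the same, which is why a hash is only comparable with another one taken with the same map and storage type.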