How to hash a large object (dataset) in Python?



Thanks to John Montgomery I think I have found a solution, and it should have less overhead than converting every number in possibly huge arrays to strings:

I can create a byte view of an array and use that to update the hash. This seems to give the same digest as hashing the array directly, which makes sense because the view shares the array's underlying memory, so the same bytes end up being hashed:

>>> import hashlib
>>> import numpy
>>> a = numpy.random.rand(10, 100)
>>> b = a.view(numpy.uint8)
>>> print a.dtype, b.dtype   # a and b have different data types
float64 uint8
>>> hashlib.sha1(a).hexdigest()   # hash of the array itself
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'
>>> hashlib.sha1(b).hexdigest()   # hash of the byte view -- same digest
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'


What's the format of the data in the arrays? Couldn't you just iterate through the arrays, convert each value into a string (via some reproducible means) and then feed that into your hash via update?

e.g.

import hashlib

m = hashlib.md5()  # or hashlib.sha1(), etc.
for value in array:  # array contains the data
    m.update(str(value))

Don't forget though that numpy arrays won't provide __hash__() because they are mutable. So be careful not to modify the arrays after you've calculated your hash (as the hash will no longer be the same).
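As a quick illustration (my own sketch, using the byte-view approach from above), modifying the array in place immediately invalidates a previously computed digest:

import hashlib
import numpy

a = numpy.random.rand(10, 100)
before = hashlib.sha1(a.view(numpy.uint8)).hexdigest()

a[0, 0] += 1.0  # modify the array in place after hashing
after = hashlib.sha1(a.view(numpy.uint8)).hexdigest()

print(before == after)  # False: the old digest no longer describes the data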


There is a package, joblib, for memoizing functions that take numpy arrays as inputs. I found it via this question.
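A minimal sketch of how joblib's memoization is typically used (the cache directory path and the example function are arbitrary choices of mine, and the exact Memory constructor arguments can vary between joblib versions):

import numpy
from joblib import Memory

memory = Memory("./joblib_cache", verbose=0)  # on-disk cache directory

@memory.cache
def expensive_computation(data):
    # stand-in for a slow function of a large numpy array
    return data.sum()

a = numpy.random.rand(10, 100)
print(expensive_computation(a))  # computed and written to the cache
print(expensive_computation(a))  # served from the cache on the second call

joblib hashes the input arrays itself to decide whether a cached result can be reused, which is essentially the problem discussed above.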