
Hashing a file in Python


TL;DR: read the file in buffered chunks so you don't use tons of memory.

We get to the crux of your problem, I believe, when we consider the memory implications of working with very large files. We don't want this bad boy to churn through 2 gigs of RAM for a 2-gigabyte file, so, as pasztorpisti points out, we gotta deal with those bigger files in chunks!

import sys
import hashlib

# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # let's read stuff in 64 KB chunks!

md5 = hashlib.md5()
sha1 = hashlib.sha1()

with open(sys.argv[1], 'rb') as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:
            break
        md5.update(data)
        sha1.update(data)

print("MD5: {0}".format(md5.hexdigest()))
print("SHA1: {0}".format(sha1.hexdigest()))

What we've done is update our hashes of this bad boy in 64 KB chunks as we go along, using hashlib's handy dandy update method. This way we use a lot less memory than the 2 GB it would take to hash the guy all at once!

You can test this with:

$ mkfile 2g bigfile
$ python hashes.py bigfile
MD5: a981130cf2b7e09f4686dc273cf7187e
SHA1: 91d50642dd930e9542c39d36f0516d45f4e1af0d
$ md5 bigfile
MD5 (bigfile) = a981130cf2b7e09f4686dc273cf7187e
$ shasum bigfile
91d50642dd930e9542c39d36f0516d45f4e1af0d  bigfile

Hope that helps!

Also, all of this is outlined in the linked question: Get MD5 hash of big files in Python
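
If you want this as a reusable helper rather than a one-off script, here's a minimal sketch of the same chunked-update idea wrapped in a function. The name hash_file and the algorithm parameter are my own choices for illustration, not anything from the script above:

import hashlib

def hash_file(path, algorithm='sha1', buf_size=65536):
    """Hash a file in buf_size chunks and return the hex digest."""
    h = hashlib.new(algorithm)  # e.g. 'md5', 'sha1', 'sha256'
    with open(path, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            h.update(data)
    return h.hexdigest()

# Example: print(hash_file('bigfile', 'md5'))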


Addendum!

In general when writing Python it helps to get into the habit of following PEP 8. For example, in Python variables are typically underscore_separated, not camelCased. But that's just style, and no one really cares about those things except people who have to read bad style... which might be you reading this code years from now.


For the correct and efficient computation of the hash value of a file (in Python 3):

  • Open the file in binary mode (i.e. add 'b' to the filemode) to avoid character encoding and line-ending conversion issues.
  • Don't read the complete file into memory, since that is a waste of memory. Instead, sequentially read it block by block and update the hash for each block.
  • Eliminate double buffering, i.e. don't use buffered IO, because we already use an optimal block size.
  • Use readinto() to avoid buffer churning.

Example:

import hashlib

def sha256sum(filename):
    h  = hashlib.sha256()
    b  = bytearray(128 * 1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        for n in iter(lambda: f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()
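
As a quick sanity check (just a sketch, assuming the sha256sum function above is in scope; the temporary-file setup is only for illustration), the chunked digest should match hashing the same content in one shot:

import hashlib
import tempfile

# Write a tiny throwaway file to hash.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world\n")
    path = tmp.name

# Hashing block by block must give the same result as hashing everything at once.
whole = hashlib.sha256(open(path, 'rb').read()).hexdigest()
assert sha256sum(path) == whole
print(whole)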


I would propose simply:

import hashlib

def get_digest(file_path):
    h = hashlib.sha256()

    with open(file_path, 'rb') as file:
        while True:
            # Reading is buffered, so we can read smaller chunks.
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)

    return h.hexdigest()

All the other answers here seem to complicate things too much. Python is already buffering when reading (in an ideal manner, or you can configure that buffering if you have more information about the underlying storage), so it is better to read in chunks the hash function finds ideal, which makes it faster or at least less CPU-intensive to compute the hash. So instead of disabling buffering and trying to emulate it yourself, use Python's buffering and control what you should be controlling: what the consumer of your data finds ideal, the hash block size.
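
To make the sizes involved concrete (a small sketch; the exact io.DEFAULT_BUFFER_SIZE value is platform-dependent), you can inspect both the chunk size get_digest() asks for and the buffer size Python's buffered reader uses underneath:

import hashlib
import io

# The chunk size get_digest() requests on each read: 64 bytes for SHA-256.
print(hashlib.sha256().block_size)

# The buffer size open() uses for buffered binary reads, so each small
# 64-byte read is usually served from memory rather than hitting the disk.
print(io.DEFAULT_BUFFER_SIZE)  # typically 8192, but platform-dependent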