
Get MD5 hash of big files in Python


You need to read the file in chunks of a suitable size:

import hashlib

def md5_for_file(f, block_size=2**20):
    """Hash an already-open (binary-mode) file object in 1 MiB chunks."""
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()

NOTE: Make sure you open your file in binary mode by passing 'rb' to open() - otherwise you will get the wrong result.
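
Here is a minimal sketch of what goes wrong without 'rb' (the filename is just a placeholder, and the whole file is read at once only to keep the demo short): in Python 3, text mode returns str and hashlib rejects it outright, while on Python 2 under Windows text mode would silently translate line endings and corrupt the digest.

import hashlib

with open("your_filename.txt", "r") as f:   # text mode: read() returns str
    try:
        hashlib.md5().update(f.read())
    except TypeError as e:
        print(e)  # e.g. "Strings must be encoded before hashing"

with open("your_filename.txt", "rb") as f:  # binary mode: read() returns bytes
    print(hashlib.md5(f.read()).hexdigest())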

So, to do the whole lot in one function, use something like:

import hashlib
import os

def generate_file_md5(rootdir, filename, blocksize=2**20):
    """Open the file itself, hash it block by block, and return the hex digest."""
    m = hashlib.md5()
    with open(os.path.join(rootdir, filename), "rb") as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            m.update(buf)
    return m.hexdigest()
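
For example (the directory and filename here are placeholders):

print(generate_file_md5(".", "example.bin"))  # prints a 32-character hex string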

The update above is based on the comments provided by Frerich Raabe - I tested this and found it to be correct on my Python 2.7.2 Windows installation.

I cross-checked the results using the 'jacksum' tool.

jacksum -a md5 <filename>

http://www.jonelo.de/java/jacksum/
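
If you want to automate that cross-check, here is a rough sketch reusing generate_file_md5 from above. It assumes jacksum is on your PATH and that the hex digest is the first whitespace-separated token of its output, which may differ between jacksum versions:

import subprocess

def cross_check(filename):
    ours = generate_file_md5(".", filename)
    theirs = subprocess.run(
        ["jacksum", "-a", "md5", filename],
        capture_output=True, text=True, check=True,
    ).stdout.split()[0]
    print("match" if ours == theirs else "MISMATCH:", ours, theirs)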


Break the file into 8192-byte chunks (or some other multiple of MD5's 64-byte block size) and feed them to MD5 consecutively using update().

This takes advantage of the fact that MD5 processes its input in 64-byte blocks (8192 is 64×128). Since you're not reading the entire file into memory, this won't use much more than 8192 bytes of memory.
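
You can confirm the block-size arithmetic from hashlib itself:

import hashlib

print(hashlib.md5().block_size)           # 64 (bytes per internal MD5 block)
print(8192 // hashlib.md5().block_size)   # 128 blocks per chunk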

In Python 3.8+ you can do:

import hashlib

with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes
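
As an aside, if you can require Python 3.11 or newer, the standard library now bundles this exact loop as hashlib.file_digest():

import hashlib

with open("your_filename.txt", "rb") as f:
    print(hashlib.file_digest(f, "md5").hexdigest())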


Below I've incorporated suggestions from the comments. Thank you all!

Python 3.7 and below

import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_num_blocks*h.block_size), b''):
            h.update(chunk)
    return h.digest()

Python 3.8 and above

import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename, 'rb') as f:
        while chunk := f.read(chunk_num_blocks*h.block_size):
            h.update(chunk)
    return h.digest()
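
One nice property of the hash_factory parameter is that the same function handles any hashlib algorithm; the filename below is a placeholder:

print(checksum("some_file.bin").hex())                               # MD5
print(checksum("some_file.bin", hash_factory=hashlib.sha256).hex())  # SHA-256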

Original post

If you want a more Pythonic (no while True) way of reading the file, check this code:

import hashlib

def checksum_md5(filename):
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            md5.update(chunk)
    return md5.digest()

Note that the iter() function needs an empty byte string for the returned iterator to halt at EOF, since read() returns b'' (not just '').
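
To see the two-argument iter() form in isolation (io.BytesIO stands in for a real file here):

import io

f = io.BytesIO(b"abcdefgh")
# iter(callable, sentinel) keeps calling f.read(3) until it returns b''
print(list(iter(lambda: f.read(3), b'')))  # [b'abc', b'def', b'gh']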