Get MD5 hash of big files in Python

You need to read the file in chunks of suitable size:

    import hashlib

    def md5_for_file(f, block_size=2**20):
        # f must be a file object opened in binary mode
        md5 = hashlib.md5()
        while True:
            data = f.read(block_size)
            if not data:  # an empty result means end of file
                break
            md5.update(data)
        return md5.digest()

NOTE: Make sure you open the file in binary mode, i.e. pass 'rb' to open(), otherwise you will get the wrong result.
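
As a usage sketch for the note above: the helper takes an already-open file object, so the caller is responsible for opening it with 'rb'. The temporary file and its contents are assumptions made only so the example runs anywhere.

```python
import hashlib
import os
import tempfile


def md5_for_file(f, block_size=2**20):
    """Hash an already-open binary file object in block_size chunks."""
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()


# Create a small throwaway file (hypothetical content) to hash.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world\n")
    path = tmp.name

try:
    # Open in binary mode so the raw bytes on disk are what gets hashed.
    with open(path, "rb") as f:
        digest = md5_for_file(f)
finally:
    os.remove(path)

print(digest.hex())
```

In Python 3, a file opened in text mode returns str, which hashlib rejects with a TypeError. In Python 2 on Windows, text mode translates line endings and so changes the bytes being hashed.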

To do the whole lot in one function, use something like:

    import hashlib
    import os

    def generate_file_md5(rootdir, filename, blocksize=2**20):
        m = hashlib.md5()
        # Open in binary mode and hash the file one block at a time
        with open(os.path.join(rootdir, filename), "rb") as f:
            while True:
                buf = f.read(blocksize)
                if not buf:
                    break
                m.update(buf)
        return m.hexdigest()

The update above is based on the comments provided by Frerich Raabe. I tested it and found it to be correct on my Python 2.7.2 Windows installation.

I cross-checked the results using the 'jacksum' tool.

jacksum -a md5 <filename>

http://www.jonelo.de/java/jacksum/
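
If jacksum is not at hand, a rough sanity check is possible in Python alone: for a file small enough to fit in memory, the chunked result should equal the digest of the whole content hashed in one call. The sketch below assumes a throwaway temporary directory, a hypothetical file name, and a deliberately small blocksize so that several chunks are read.

```python
import hashlib
import os
import tempfile


def generate_file_md5(rootdir, filename, blocksize=2**20):
    m = hashlib.md5()
    with open(os.path.join(rootdir, filename), "rb") as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            m.update(buf)
    return m.hexdigest()


with tempfile.TemporaryDirectory() as rootdir:
    name = "sample.bin"  # hypothetical file name
    payload = os.urandom(100000)  # random test data
    with open(os.path.join(rootdir, name), "wb") as f:
        f.write(payload)

    # Hash the file in 4096-byte chunks, then hash the same bytes in one call.
    chunked = generate_file_md5(rootdir, name, blocksize=4096)
    one_shot = hashlib.md5(payload).hexdigest()

print(chunked == one_shot)
```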

Break the file into 8192-byte chunks (or some other multiple of 64 bytes) and feed them to MD5 consecutively using update().

This takes advantage of the fact that MD5 processes its input in 64-byte blocks (8192 is 64×128). Since you're not reading the entire file into memory, this won't use much more than 8192 bytes of memory.
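
hashlib exposes the block and digest sizes on the hash object: for MD5, block_size is 64 bytes and digest_size is 16 bytes (128 bits). The sketch below prints both and also illustrates that the chunk size affects only memory use and speed, not the resulting digest. The helper md5_in_chunks and the sample data are made up for illustration.

```python
import hashlib

h = hashlib.md5()
print(h.block_size)   # 64
print(h.digest_size)  # 16

data = bytes(range(256)) * 1000  # 256,000 bytes of sample data


def md5_in_chunks(payload, chunk_size):
    """Feed payload to MD5 in slices of chunk_size bytes."""
    m = hashlib.md5()
    for start in range(0, len(payload), chunk_size):
        m.update(payload[start:start + chunk_size])
    return m.hexdigest()


# Aligned and unaligned chunk sizes all yield the same digest.
digests = {md5_in_chunks(data, size) for size in (7, 64, 100, 8192, 2**20)}
print(len(digests))  # 1
```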

In Python 3.8+ you can use an assignment expression:

    import hashlib

    with open("your_filename.txt", "rb") as f:
        file_hash = hashlib.md5()
        while chunk := f.read(8192):
            file_hash.update(chunk)

    print(file_hash.digest())
    print(file_hash.hexdigest())  # to get a printable str instead of bytes

Below I've incorporated suggestions from the comments. Thank you all!

Python 3.7 and below

    import hashlib

    def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
        h = hash_factory()
        with open(filename, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_num_blocks*h.block_size), b''):
                h.update(chunk)
        return h.digest()

Python 3.8 and above

    import hashlib

    def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
        h = hash_factory()
        with open(filename, 'rb') as f:
            while chunk := f.read(chunk_num_blocks*h.block_size):
                h.update(chunk)
        return h.digest()
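
Because hash_factory is a parameter, the same function can compute other digests that hashlib provides. A minimal usage sketch, assuming a throwaway temporary file with made-up content; it uses the iter() form so that it also runs on interpreters older than 3.8.

```python
import hashlib
import os
import tempfile


def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename, "rb") as f:
        # Read a whole number of hash blocks per chunk until EOF (b"").
        for chunk in iter(lambda: f.read(chunk_num_blocks * h.block_size), b""):
            h.update(chunk)
    return h.digest()


# Hypothetical sample file, created only for this example.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example data" * 10000)
    path = tmp.name

try:
    md5_digest = checksum(path)  # default factory: MD5
    sha256_digest = checksum(path, hash_factory=hashlib.sha256)
finally:
    os.remove(path)

print(md5_digest.hex())
print(sha256_digest.hex())
```

The chunk size follows the chosen algorithm's block_size, so no extra tuning is needed when the factory is swapped.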

Original post

If you want a more Pythonic (no while True) way of reading the file, check this code:

    import hashlib

    def checksum_md5(filename):
        md5 = hashlib.md5()
        with open(filename, 'rb') as f:
            for chunk in iter(lambda: f.read(8192), b''):
                md5.update(chunk)
        return md5.digest()

Note that iter() needs an empty byte string as its sentinel for the returned iterator to halt at EOF, since read() on a file opened in binary mode returns b'' (not '').
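
The two-argument form iter(callable, sentinel) calls the callable repeatedly until it returns a value equal to the sentinel. A small sketch of that behavior, using io.BytesIO as a stand-in for a file opened with 'rb':

```python
import io

stream = io.BytesIO(b"abcdefghij")  # stands in for a binary file

# Each call reads up to 4 bytes; iteration stops when read() returns b"".
chunks = list(iter(lambda: stream.read(4), b""))
print(chunks)  # [b'abcd', b'efgh', b'ij']

# At EOF, read() keeps returning the empty byte string.
at_eof = stream.read(4)
print(at_eof)  # b''
```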
