Get MD5 hash of big files in Python
You need to read the file in chunks of suitable size:
def md5_for_file(f, block_size=2**20): md5 = hashlib.md5() while True: data = f.read(block_size) if not data: break md5.update(data) return md5.digest()
NOTE: Make sure you open your file with the 'rb' to the open - otherwise you will get the wrong result.
So to do the whole lot in one method - use something like:
def generate_file_md5(rootdir, filename, blocksize=2**20): m = hashlib.md5() with open( os.path.join(rootdir, filename) , "rb" ) as f: while True: buf = f.read(blocksize) if not buf: break m.update( buf ) return m.hexdigest()
The update above was based on the comments provided by Frerich Raabe - and I tested this and found it to be correct on my Python 2.7.2 windows installation
I cross-checked the results using the 'jacksum' tool.
jacksum -a md5 <filename>
Break the file into 8192-byte chunks (or some other multiple of 128 bytes) and feed them to MD5 consecutively using update()
.
This takes advantage of the fact that MD5 has 128-byte digest blocks (8192 is 128×64). Since you're not reading the entire file into memory, this won't use much more than 8192 bytes of memory.
In Python 3.8+ you can do
import hashlibwith open("your_filename.txt", "rb") as f: file_hash = hashlib.md5() while chunk := f.read(8192): file_hash.update(chunk)print(file_hash.digest())print(file_hash.hexdigest()) # to get a printable str instead of bytes
Below I've incorporated suggestion from comments. Thank you all!
Python < 3.7
import hashlibdef checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128): h = hash_factory() with open(filename,'rb') as f: for chunk in iter(lambda: f.read(chunk_num_blocks*h.block_size), b''): h.update(chunk) return h.digest()
Python 3.8 and above
import hashlibdef checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128): h = hash_factory() with open(filename,'rb') as f: while chunk := f.read(chunk_num_blocks*h.block_size): h.update(chunk) return h.digest()
Original post
If you want a more Pythonic (no while True
) way of reading the file check this code:
import hashlibdef checksum_md5(filename): md5 = hashlib.md5() with open(filename,'rb') as f: for chunk in iter(lambda: f.read(8192), b''): md5.update(chunk) return md5.digest()
Note that the iter()
function needs an empty byte string for the returned iterator to halt at EOF, since read()
returns b''
(not just ''
).