Calculating MD5 Hash of Large Files in Python
When working with extremely large files, computing an MD5 hash by reading the whole file into memory at once (for example, hashlib.md5(f.read())) becomes impractical: the entire file must fit in RAM, which can exhaust system resources and lead to errors or severe slowdowns.
Solution: Chunked Hashing
To address this issue, a technique called chunked hashing can be employed to compute the MD5 hash incrementally, without loading the entire file into memory. This involves reading the file in fixed-size blocks and feeding each block to the hash object's update() method until the end of the file is reached.
Code Implementation:
The following Python function md5_for_file() implements chunked hashing:
<code class="python">import hashlib

def md5_for_file(f, block_size=2**20):
    """Compute the MD5 digest of an already-open binary file, reading it in 1 MiB chunks."""
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()</code>
To use this function, ensure you open the file in binary mode ("rb"); reading in text mode would decode the bytes and corrupt the hash. Note that md5.digest() returns the raw 16-byte digest; use md5.hexdigest() if you want the familiar hex string instead.
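As a quick illustration (a self-contained sketch that repeats the md5_for_file() function above and hashes a throwaway temporary file; the payload size is arbitrary), the chunked digest matches a one-shot hash of the same data:

```python
import hashlib
import tempfile

def md5_for_file(f, block_size=2**20):
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()

# Write sample data to a temporary binary file, then hash it in chunks.
with tempfile.TemporaryFile() as f:
    payload = b"x" * (3 * 2**20 + 123)  # a bit over 3 MiB, deliberately not block-aligned
    f.write(payload)
    f.seek(0)  # rewind before reading
    chunked = md5_for_file(f)

# The chunked digest is identical to hashing the whole payload at once.
assert chunked == hashlib.md5(payload).digest()
```

Because MD5 is updated incrementally, the block size only affects memory use and I/O granularity, never the resulting digest.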
Complete Method:
For convenience, here's a complete method generate_file_md5() that combines chunked hashing with file opening in one step:
<code class="python">import hashlib
import os

def generate_file_md5(rootdir, filename, blocksize=2**20):
    """Open rootdir/filename in binary mode and return its MD5 hash as a hex string."""
    m = hashlib.md5()
    with open(os.path.join(rootdir, filename), "rb") as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            m.update(buf)
    return m.hexdigest()</code>
This method returns the MD5 hash of the specified file as a hex-encoded string. You can verify the result against external tools such as jacksum for comparison.
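A verification can also be done entirely in Python. The sketch below (which repeats generate_file_md5() and uses a hypothetical file name, sample.bin, in a temporary directory) checks the chunked hex digest against hashlib's one-shot result:

```python
import hashlib
import os
import tempfile

def generate_file_md5(rootdir, filename, blocksize=2**20):
    m = hashlib.md5()
    with open(os.path.join(rootdir, filename), "rb") as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            m.update(buf)
    return m.hexdigest()

with tempfile.TemporaryDirectory() as rootdir:
    payload = b"hello world\n"
    with open(os.path.join(rootdir, "sample.bin"), "wb") as f:
        f.write(payload)
    # The chunked hex digest matches a one-shot hash of the same bytes.
    assert generate_file_md5(rootdir, "sample.bin") == hashlib.md5(payload).hexdigest()
```

The same hex string should also match what command-line tools report for the file, which is what makes hexdigest() the more convenient return type for file checksums.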