Checksum files in a directory with Python, memory-efficiently

Question 1

This is the implementation of a function hash_dir_contents that walks through a directory and for every file calculates the hash of its content in a memory-efficient manner (without loading the whole files into memory - which very surprisingly isn't already implemented in the standard library?).

hash_dir_contents can be used with different types of hash algorithms, e.g hashlib.md5 or hashlib.sha256. So the hasher_builder argument must be a constructor returning an object with update and digest methods.

There sadly seems to be no hashlib function for CRC32 that conforms to this, so the requisite class had to be manually created.

In tests, this so far worked, but the suspicion that an edge case was missed remains ...

import os
import typing
import zlib
class CRC32:
 def __init__(self) -> None:
 self.crc: int = 0
 def update(self, chunk: bytes) -> None:
 self.crc = zlib.crc32(chunk, self.crc)
 def digest(self) -> bytes:
 return self.crc.to_bytes(4, "big")
def file_chunks(file, chunksize=2**16):
 with file:
 chunk = file.read(chunksize)
 while len(chunk) > 0:
 yield chunk
 chunk = file.read(chunksize)
def hash_file(file: typing.IO[bytes], hasher) -> bytes:
 for chunk in file_chunks(file):
 hasher.update(chunk)
 return hasher.digest()
def hash_dir_contents(path: str, hasher_builder: typing.Callable) -> dict[str, bytes]:
 ret: dict[str, bytes] = {}
 for dirpath, _, filenames in os.walk(path):
 for filename in filenames:
 filepath = os.path.join(dirpath, filename)
 with open(filepath, "rb") as f:
 ret[filepath] = hash_file(f, hasher_builder())
 return ret

Question 2

Memory efficiency is sometimes sacrificed for time efficiency. Buffers, even large ones, can help with speed. So what do you really care about?

Question 3

@Reinderien difficult to precisely quantify. But the performance of this code is good enough, and AFAICS works correctly. This is Code Review, after all, not Stack Overflow. The naive implementation of loading the whole file first uses an insane amount of memory, obviously, and is even slower (because it triggers swapping or whatever).

Question 4

The hasher_builder Argument Should Not Be a "Constructor"

We know your interface requires that there be a class instance somewhere that implements two special methods. Your code expects that the hasher_builder argument to function hash_dir_contents is either:

The actual class. In this case the class's __init__ method must have no arguments other than self or arguments with default values. This is not very flexible.
A "factory" function taking zero arguments or arguments with default values and returning an instance of the required class. This strikes me as over-engineering or at least more indirect than is necessary.

It would be simpler and more flexible to require this argument to be an already constructed instance of a class that implements the required methods. We shouldn't care how this instance is created by the client.

Define an "Interface" Describing What `hash_file` Expects

We know that the required class must implement two special methods. How do we know that? We could look at your implementation line by line to see what methods are called. We don't really want a client to have to do that. You have chosen to include a class CRC32 that presumably is an example of a class that implements the required methods. But how do we know that since there are no comments or docstrings? You might need such a class definition if you want it to serve as a default implementation if the client doesn't supply their own. For example:

def hash_dir_contents(path: str,
 hasher_builder: typing.Callable = None
 ) -> dict[str, bytes]:
 if hasher_builder is None:
 hasher_builder = CRC32()
 ...

But since you are not doing that or including tests that use CRC32, having an unused class is not serving a particularly useful purpose and does not relieve you from having to provide informative docstrings. For example, you should add a docstring to hash_dir_contents describing what the expected argument types are and their semantics.

It might be useful to create an abstract base class that looks like your current CRC32 class except that all of the required methods are abstract and consist of only docstrings describing their semantics. Then the docstring for the hasher_builder arguments would specify the type as anything that implements the methods of this interface class without necessarily inheriting from that class.

Visibility

Any client of this module will only be interested in the hash_dir_contents function and the interface (abstract base class) you might have created as suggested. The other functions should probably be named so that they start with an underscore character to denote their private visibility and so that they are not imported if the client codes, for example, import * from contents_hasher.

Consider Converting Your Code to An Abstract Base Class

An alternate approach would be to create an abstract base class, for example, named ContentsHasher that provides abstract methods for the update and digest methods. For example:

from abc import ABC, abstractmethod
class ContentsHasher(ABC):
 def __init__(self, path: str):
 """Appropriate docstring."""
 
 self._path = path
 
 def hash_dir_contents(self) -> dict[str, bytes]:
 """Appropriate docstring."""
 ret: dict[str, bytes] = {}
 for dirpath, _, filenames in os.walk(self._path):
 for filename in filenames:
 filepath = os.path.join(dirpath, filename)
 with open(filepath, "rb") as f:
 ret[filepath] = self._hash_file(f)
 return ret
 
 def _file_chunks(self, file, chunksize=2**16):
 """Appropriate docstring."""
 with file:
 chunk = file.read(chunksize)
 while len(chunk) > 0:
 yield chunk
 chunk = file.read(chunksize)
 def _hash_file(self, file: typing.IO[bytes]) -> bytes:
 """Appropriate docstring."""
 
 for chunk in self._file_chunks(file):
 self.update(chunk)
 return self.digest()
 @abstractmethod
 def update(self, chunk: bytes) -> None:
 """Appropriate docstring."""
 @abstractmethod
 def digest(self) -> bytes:
 """Appropriate docstring."""

If you adopt this approach a client creates a subclass of this class and provides implementations for the abstract methods.

Question 5

I made some research online and I found this answer: Compute crc of file in python.

Your code seems correct, but here an optimized version inspired to the answer:

from os import walk
from os.path import join
from zlib import crc32 as zlib_crc32
def crc32(fileName):
 with open(fileName, 'rb') as fh:
 hash = 0
 while True:
 s = fh.read(65536)
 if not s:
 break
 hash = zlib_crc32(s, hash)
 return hash.to_bytes(4)
def hash_dir_contents(path: str) -> dict[str, bytes]:
 ret: dict[str, bytes] = {}
 for dirpath, _, filenames in walk(path):
 for filename in filenames:
 filepath = join(dirpath, filename)
 ret[filepath] = crc32(filepath)
 return ret

Question 6

Is quite efficient, for 3,38 GB folder it takes around 14 s and never mor than 4 MB of RAM.

Question 7

Is it really more efficient than the original solution? You also lose the flexibility that you can use another hash. BTW, I've seen implementations that differ and use a 32-bit mask, see here.

Booboo Booboo 3,1764 silver badges14 bronze badges · Accepted Answer · 2024-10-16 21:25:03Z

The hasher_builder Argument Should Not Be a "Constructor"

We know your interface requires that there be a class instance somewhere that implements two special methods. Your code expects that the hasher_builder argument to function hash_dir_contents is either:

The actual class. In this case the class's __init__ method must have no arguments other than self or arguments with default values. This is not very flexible.
A "factory" function taking zero arguments or arguments with default values and returning an instance of the required class. This strikes me as over-engineering or at least more indirect than is necessary.

It would be simpler and more flexible to require this argument to be an already constructed instance of a class that implements the required methods. We shouldn't care how this instance is created by the client.

Define an "Interface" Describing What `hash_file` Expects

We know that the required class must implement two special methods. How do we know that? We could look at your implementation line by line to see what methods are called. We don't really want a client to have to do that. You have chosen to include a class CRC32 that presumably is an example of a class that implements the required methods. But how do we know that since there are no comments or docstrings? You might need such a class definition if you want it to serve as a default implementation if the client doesn't supply their own. For example:

def hash_dir_contents(path: str,
 hasher_builder: typing.Callable = None
 ) -> dict[str, bytes]:
 if hasher_builder is None:
 hasher_builder = CRC32()
 ...

But since you are not doing that or including tests that use CRC32, having an unused class is not serving a particularly useful purpose and does not relieve you from having to provide informative docstrings. For example, you should add a docstring to hash_dir_contents describing what the expected argument types are and their semantics.

It might be useful to create an abstract base class that looks like your current CRC32 class except that all of the required methods are abstract and consist of only docstrings describing their semantics. Then the docstring for the hasher_builder arguments would specify the type as anything that implements the methods of this interface class without necessarily inheriting from that class.

Visibility

Any client of this module will only be interested in the hash_dir_contents function and the interface (abstract base class) you might have created as suggested. The other functions should probably be named so that they start with an underscore character to denote their private visibility and so that they are not imported if the client codes, for example, import * from contents_hasher.

Consider Converting Your Code to An Abstract Base Class

An alternate approach would be to create an abstract base class, for example, named ContentsHasher that provides abstract methods for the update and digest methods. For example:

from abc import ABC, abstractmethod
class ContentsHasher(ABC):
 def __init__(self, path: str):
 """Appropriate docstring."""
 
 self._path = path
 
 def hash_dir_contents(self) -> dict[str, bytes]:
 """Appropriate docstring."""
 ret: dict[str, bytes] = {}
 for dirpath, _, filenames in os.walk(self._path):
 for filename in filenames:
 filepath = os.path.join(dirpath, filename)
 with open(filepath, "rb") as f:
 ret[filepath] = self._hash_file(f)
 return ret
 
 def _file_chunks(self, file, chunksize=2**16):
 """Appropriate docstring."""
 with file:
 chunk = file.read(chunksize)
 while len(chunk) > 0:
 yield chunk
 chunk = file.read(chunksize)
 def _hash_file(self, file: typing.IO[bytes]) -> bytes:
 """Appropriate docstring."""
 
 for chunk in self._file_chunks(file):
 self.update(chunk)
 return self.digest()
 @abstractmethod
 def update(self, chunk: bytes) -> None:
 """Appropriate docstring."""
 @abstractmethod
 def digest(self) -> bytes:
 """Appropriate docstring."""

If you adopt this approach a client creates a subclass of this class and provides implementations for the abstract methods.

Stack Exchange Network

Checksum files in a directory with Python, memory-efficiently

2 Answers 2

The hasher_builder Argument Should Not Be a "Constructor"

Define an "Interface" Describing What `hash_file` Expects

Visibility

Consider Converting Your Code to An Abstract Base Class

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Checksum files in a directory with Python, memory-efficiently

2 Answers 2

The hasher_builder Argument Should Not Be a "Constructor"

Define an "Interface" Describing What hash_file Expects

Visibility

Consider Converting Your Code to An Abstract Base Class

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Define an "Interface" Describing What `hash_file` Expects