This is the implementation of a function hash_dir_contents
that walks through a directory and for every file calculates the hash of its content in a memory-efficient manner (without loading the whole files into memory - which very surprisingly isn't already implemented in the standard library?).
hash_dir_contents
can be used with different types of hash algorithms, e.g hashlib.md5
or hashlib.sha256
. So the hasher_builder
argument must be a constructor returning an object with update
and digest
methods.
There sadly seems to be no hashlib
function for CRC32 that conforms to this, so the requisite class had to be manually created.
In tests, this so far worked, but the suspicion that an edge case was missed remains ...
import os
import typing
import zlib
class CRC32:
def __init__(self) -> None:
self.crc: int = 0
def update(self, chunk: bytes) -> None:
self.crc = zlib.crc32(chunk, self.crc)
def digest(self) -> bytes:
return self.crc.to_bytes(4, "big")
def file_chunks(file, chunksize=2**16):
with file:
chunk = file.read(chunksize)
while len(chunk) > 0:
yield chunk
chunk = file.read(chunksize)
def hash_file(file: typing.IO[bytes], hasher) -> bytes:
for chunk in file_chunks(file):
hasher.update(chunk)
return hasher.digest()
def hash_dir_contents(path: str, hasher_builder: typing.Callable) -> dict[str, bytes]:
ret: dict[str, bytes] = {}
for dirpath, _, filenames in os.walk(path):
for filename in filenames:
filepath = os.path.join(dirpath, filename)
with open(filepath, "rb") as f:
ret[filepath] = hash_file(f, hasher_builder())
return ret
-
\$\begingroup\$ Memory efficiency is sometimes sacrificed for time efficiency. Buffers, even large ones, can help with speed. So what do you really care about? \$\endgroup\$Reinderien– Reinderien2024年10月17日 02:58:36 +00:00Commented Oct 17, 2024 at 2:58
-
\$\begingroup\$ @Reinderien difficult to precisely quantify. But the performance of this code is good enough, and AFAICS works correctly. This is Code Review, after all, not Stack Overflow. The naive implementation of loading the whole file first uses an insane amount of memory, obviously, and is even slower (because it triggers swapping or whatever). \$\endgroup\$viuser– viuser2024年11月01日 23:09:56 +00:00Commented Nov 1, 2024 at 23:09
2 Answers 2
The hasher_builder Argument Should Not Be a "Constructor"
We know your interface requires that there be a class instance somewhere that implements two special methods. Your code expects that the hasher_builder argument to function hash_dir_contents
is either:
- The actual class. In this case the class's
__init__
method must have no arguments other thanself
or arguments with default values. This is not very flexible. - A "factory" function taking zero arguments or arguments with default values and returning an instance of the required class. This strikes me as over-engineering or at least more indirect than is necessary.
It would be simpler and more flexible to require this argument to be an already constructed instance of a class that implements the required methods. We shouldn't care how this instance is created by the client.
Define an "Interface" Describing What hash_file
Expects
We know that the required class must implement two special methods. How do we know that? We could look at your implementation line by line to see what methods are called. We don't really want a client to have to do that. You have chosen to include a class CRC32
that presumably is an example of a class that implements the required methods. But how do we know that since there are no comments or docstrings? You might need such a class definition if you want it to serve as a default implementation if the client doesn't supply their own. For example:
def hash_dir_contents(path: str,
hasher_builder: typing.Callable = None
) -> dict[str, bytes]:
if hasher_builder is None:
hasher_builder = CRC32()
...
But since you are not doing that or including tests that use CRC32
, having an unused class is not serving a particularly useful purpose and does not relieve you from having to provide informative docstrings. For example, you should add a docstring to hash_dir_contents
describing what the expected argument types are and their semantics.
It might be useful to create an abstract base class that looks like your current CRC32
class except that all of the required methods are abstract and consist of only docstrings describing their semantics. Then the docstring for the hasher_builder arguments would specify the type as anything that implements the methods of this interface class without necessarily inheriting from that class.
Visibility
Any client of this module will only be interested in the hash_dir_contents
function and the interface (abstract base class) you might have created as suggested. The other functions should probably be named so that they start with an underscore character to denote their private visibility and so that they are not imported if the client codes, for example, import * from contents_hasher
.
Consider Converting Your Code to An Abstract Base Class
An alternate approach would be to create an abstract base class, for example, named ContentsHasher
that provides abstract methods for the update
and digest
methods. For example:
from abc import ABC, abstractmethod
class ContentsHasher(ABC):
def __init__(self, path: str):
"""Appropriate docstring."""
self._path = path
def hash_dir_contents(self) -> dict[str, bytes]:
"""Appropriate docstring."""
ret: dict[str, bytes] = {}
for dirpath, _, filenames in os.walk(self._path):
for filename in filenames:
filepath = os.path.join(dirpath, filename)
with open(filepath, "rb") as f:
ret[filepath] = self._hash_file(f)
return ret
def _file_chunks(self, file, chunksize=2**16):
"""Appropriate docstring."""
with file:
chunk = file.read(chunksize)
while len(chunk) > 0:
yield chunk
chunk = file.read(chunksize)
def _hash_file(self, file: typing.IO[bytes]) -> bytes:
"""Appropriate docstring."""
for chunk in self._file_chunks(file):
self.update(chunk)
return self.digest()
@abstractmethod
def update(self, chunk: bytes) -> None:
"""Appropriate docstring."""
@abstractmethod
def digest(self) -> bytes:
"""Appropriate docstring."""
If you adopt this approach a client creates a subclass of this class and provides implementations for the abstract methods.
I made some research online and I found this answer: Compute crc of file in python.
Your code seems correct, but here an optimized version inspired to the answer:
from os import walk
from os.path import join
from zlib import crc32 as zlib_crc32
def crc32(fileName):
with open(fileName, 'rb') as fh:
hash = 0
while True:
s = fh.read(65536)
if not s:
break
hash = zlib_crc32(s, hash)
return hash.to_bytes(4)
def hash_dir_contents(path: str) -> dict[str, bytes]:
ret: dict[str, bytes] = {}
for dirpath, _, filenames in walk(path):
for filename in filenames:
filepath = join(dirpath, filename)
ret[filepath] = crc32(filepath)
return ret
-
\$\begingroup\$ Is quite efficient, for 3,38 GB folder it takes around 14 s and never mor than 4 MB of RAM. \$\endgroup\$AsrtoMichi– AsrtoMichi2024年10月05日 16:07:32 +00:00Commented Oct 5, 2024 at 16:07
-