A friend of mine has the following issue. He has a nearly-full 1TB hard drive on which he keeps copies of all the pictures from his digital camera(s). However, he knows that he has copied the entire contents of the camera to this drive multiple times, with slightly different--but mostly identical--contents each time. He would like to be able to find out which files (mostly pictures) are duplicates.
While I started writing this for his application, I would like this to be a general-purpose script for my own use as well. Here is what I came up with.
Main Ideas
Building on the ideas of some posts on this site and on SO, I create a hash (albeit non-integer) for each file, consisting of (a) the size of the file, (b) some bytes read from the start of the file, and (c) an incrementable counter. This enables me to quickly check that two files are not identical, and then only compare the entire contents if the hashes are equivalent. In the case that two dissimilar files have the same hash, I increment the counter and try again.
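To make the idea concrete, here is a purely illustrative picture of what the resulting dictionary might contain (the sizes, bytes, and paths are invented for this example; the real construction happens in hash_by_tuple and find_dups below):

# Hypothetical contents of the `dups` dict built below (all values invented):
dups = {
    # two files with identical content share one key:
    (2151942, b'\x12\x34\x56\x78', 0): ['/photos/2017/img_0001.jpg',
                                        '/backup/2017/img_0001.jpg'],
    # same size and same sampled bytes, but a full comparison showed different
    # content, so the counter was bumped from 0 to 1 to give it its own key:
    (2151942, b'\x12\x34\x56\x78', 1): ['/photos/2017/img_0002.jpg'],
}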
- I realize that it may seem wasteful to open, read, and close every file when a simple size comparison might initially be sufficient; but I have found that this is generally not the case. In situations where there are many different similar-sized files, the algorithm approaches \$O(n^2)\$, since each new file must be compared against all previous files of the same size. Granted, these situations only arise in special cases (usually when traversing program files), but hitting one by accident tremendously slows down the program. Thus, taking the time initially to read some bytes can really pay off later. (YMMV, but if you use Chromium, ~/.cache/chromium/Default/Cache clearly demonstrates this principle.) Of course, nothing is set in stone, and different hash functions may be used depending on the situation; one possible alternative is sketched right after this list. In addition, other methods (see the end of the Feedback section) can be used to attempt to circumvent this issue.
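For instance, a heavier-weight hash could digest a whole leading block instead of sampling a few bytes. This is only a sketch of that idea, not something used in the code below; the helper name hash_by_digest is made up here, and I have not benchmarked it:

import hashlib
import os

def hash_by_digest (fpath, nbytes=4096):
    """ Alternative hash: digest the first `nbytes` of the file.
        One sequential read and no seek; collisions are far less
        likely than with a 4-byte sample, at the cost of reading
        (and hashing) more data per file.
    """
    with open(fpath,'rb') as f:
        block = f.read(nbytes)   # may be shorter than nbytes for small files
    digest = hashlib.blake2b(block, digest_size=16).hexdigest()
    # same 3-tuple shape (size, sample, counter) as hash_by_tuple below.
    return os.stat(fpath).st_size, digest, 0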
The full-file comparison is done by filecmp, which is very fast, although a slight modification could make it faster (e.g. a larger cache, or skipping some initial checks whose results we already know).
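As a rough illustration of that kind of modification (not what the code below actually uses), a hand-rolled deep comparison might look like this; the name _files_identical and the 64 KiB buffer size are arbitrary choices for the sketch:

def _files_identical (path1, path2, bufsize=64*1024):
    """ Compare two files byte-for-byte with a larger buffer than
        filecmp's default, and without the preliminary os.stat
        signature check (useful when the sizes are already known
        to match, as they are here).
    """
    with open(path1,'rb') as f1, open(path2,'rb') as f2:
        while True:
            b1 = f1.read(bufsize)
            b2 = f2.read(bufsize)
            if b1 != b2:
                return False
            if not b1:   # both files exhausted at the same point
                return True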
Feedback
Of course, the main concern for this kind of program is speed. Is my algorithm efficient? Can it be improved? Are there edge or special cases that I have overlooked?
- I considered that my hash method of using tuples may not be efficient, especially since very many look-ups need to be performed. However, I did not find a significant difference in speed using an equivalent integer hash method. (I tried a few things, but calling the built-in hash on the 3-tuple was the most straightforward.)
I am not an expert Python programmer. Is this program written and implemented well? Are there any code-smells or bad practices? Do I make good use of Python structures and idioms? What could I do better? What have I done well?
- One thing in particular: I am not fond of the large try block. However, since there would otherwise have to be two nearly-identical try blocks very close to each other (around fhash = hash_by_tuple(...) and around if filecmp.cmp(...)), I thought it was better to pull it out around both. The split version I decided against is sketched right after this list.
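For comparison, the split version would look roughly like the following. This is only a sketch: the helper name process_one_file is made up, it reuses hash_by_tuple and filecmp from the code below, and the surrounding numfiles/totsize bookkeeping is omitted:

def process_one_file (fpath, dups, **hashkwargs):
    """ Sketch of the two-try-block alternative. Returns True if the
        file was recorded in `dups`, False if it had to be skipped.
    """
    try:                                      # first try block
        fhash = hash_by_tuple(fpath, **hashkwargs)
    except OSError:
        return False
    while True:
        if fhash not in dups:
            dups[fhash] = [fpath]             # new, unique file
            return True
        try:                                  # second, nearly-identical try block
            same = filecmp.cmp(fpath, dups[fhash][0], shallow=False)
        except OSError:
            return False
        if same:
            dups[fhash].append(fpath)         # duplicate
            return True
        # hash collision on actually different files; bump the counter.
        fhash = (fhash[0], fhash[1], fhash[2]+1)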
Any other feedback.
I think there are certainly things that could improve the usability of the code. For example, adding support for a list of directories, files, or even regex matches to ignore would allow us to skip certain expensive searches, like ~/.local and ~/.cache. Or, conversely, a regex to include only certain kinds of files (e.g., r'[.]png$'). However, this seems relatively straightforward to implement, and it would clutter the code that I would really like to present, so I have not included it; a rough sketch of what it might look like follows.
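If it were added, a minimal version might look like the following. The function name iter_files and the parameter names include_re/exclude_re are invented for this sketch and are not part of the code in the next section:

import os
import re

def iter_files (top_dir, include_re=None, exclude_re=None):
    """ Yield file paths under `top_dir`, optionally filtered by regex.
        `exclude_re` drops matching paths (checked first); `include_re`
        keeps only matching paths. Both are plain `re` pattern strings.
    """
    include = re.compile(include_re) if include_re else None
    exclude = re.compile(exclude_re) if exclude_re else None
    for dirpath,_,fnames in os.walk(top_dir):
        for fname in fnames:
            fpath = os.path.join(dirpath, fname)
            if exclude and exclude.search(fpath):
                continue
            if include and not include.search(fpath):
                continue
            yield fpath

# e.g. iter_files('/some/dir', include_re=r'[.]png$', exclude_re=r'/[.]cache/')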
Code
import filecmp
import os
from time import time

def head (fpath, nbytes, offset=0):
    """ Read `nbytes` from the beginning of a file, starting at
        `offset`.

        If there are less than `nbytes`, return whatever is there
        (may be empty).

        If there is not enough room for the `offset`, move back so
        that we still return `nbytes` number of bytes.
    """
    size = os.stat(fpath).st_size
    with open(fpath,'rb') as f:
        if size <= nbytes:
            pass
        elif size < offset+nbytes:
            f.seek(-nbytes,2)
        else:
            f.seek(offset)
        head = f.read(nbytes)
    return head

def hash_by_tuple (fpath, nbytes=4, offset=16):
    """ Create a 3-tuple hash of the file at `fpath`, consisting of
        the file size, `nbytes` read from the file, and an
        incrementable counter. Start reading the bytes at some
        `offset` so that common headers (e.g. png/pdf) are not
        included. The counter is always initially 0, and is intended
        to be incremented on hash collisions to create a new, unique
        hash.
    """
    return os.stat(fpath).st_size, head(fpath,nbytes,offset), 0

def find_dups (top_dir, **hashkwargs):
    """ Group duplicate files recursively in and below `top_dir`.

        Returns a 6-tuple with the main return value and some metadata
        as follows:

        (0) The main dictionary of all files, with the hash value
            from `hash_by_tuple` as the keys, and a list of the
            file paths with identical content as the values.
        (1) The total number of files included in the dict.
        (2) The number of files skipped for various reasons.
            Usually broken symlinks that are listed by `os.walk`,
            but are not `os.stat`-able.
        (3) The total size, in bytes, of all files in the dict.
        (4) The `top_dir` input, for output formatting.
        (5) The time it took to run the function, in seconds.

        Optional `hashkwargs` can be passed to modify the behavior of
        `hash_by_tuple`.

        Note 1: The output of this function is best viewed using the
                complementary function `format_find_dups_output`.

        Note 2: All empty files are considered "identical" and are
                mapped to `(0, b'', 0)`.
    """
    t0 = time()
    top_dir = os.path.abspath(os.path.expanduser(top_dir))

    dups = {}
    numfiles = numskipped = totsize = 0
    for dirpath,_,fnames in os.walk(top_dir):
        for fname in fnames:
            fpath = os.path.join(dirpath,fname)
            try:
                fhash = hash_by_tuple(fpath,**hashkwargs)
                while True:
                    if fhash not in dups:
                        # a new, unique file has been found.
                        dups[fhash] = [fpath]
                        break
                    # file is a duplicate, or a hash collision occurred.
                    if filecmp.cmp(fpath,dups[fhash][0],shallow=False):
                        # duplicate.
                        dups[fhash].append(fpath)
                        break
                    # hash collision on actually different files; rehash.
                    fhash = (fhash[0], fhash[1], fhash[2]+1)
            except OSError:
                numskipped += 1
                continue
            numfiles += 1
            totsize += fhash[0]

    return dups, numfiles, numskipped, totsize, top_dir, time()-t0

def format_find_dups_output (output, min_dups=1):
    """ Return a human-readable formatted string directly from the
        6-tuple `output` from `find_dups`. It can then either be
        `print`-ed, or written to a file and read.

        Set `min_dups` (default=1) to control the minimum number of
        duplicates a file must have to be included in the returned
        string. 0 will print every file found.
    """
    dups, numfiles, numskipped, totsize, top_dir, runtime = output

    header = ( 'In "{}", {} files were analyzed, totaling {} bytes, taking '
               + '{:.3g} seconds.\n'
               + '{} files were skipped because they failed analysis (typically '
               + 'broken symlinks).\n'
               + '{} unique files were found, with {} duplicates, averaging '
               + '{:.3g} duplicates per unique file.\n\n'
               + 'There are {} unique files with >= {} duplicates. In some '
               + 'particular order:\n\n' )

    main_str = ''
    numuniq_printing = 0
    for fhash,fpaths in sorted(dups.items()):
        if len(fpaths) > min_dups:
            numuniq_printing += 1
            if len(fpaths) == 1:
                main_str += ( 'The following file, with the signature {}, '
                              + 'is unique:\n ' ).format(fhash)
            else:
                main_str += ( 'The following {} files, with the signature {}, '
                              + 'are identical:\n ' ).format(len(fpaths),fhash)
            main_str += '\n '.join(fpaths) + '\n\n'
    main_str += 'Done.'

    header = header.format( top_dir, numfiles, totsize, runtime, numskipped
                          , len(dups), numfiles-len(dups)
                          , (numfiles-len(dups))/len(dups)
                          , numuniq_printing, min_dups )

    return header+main_str

if __name__ == '__main__':
    dups_output = find_dups('/a/path/with/maybe/1000/files/for/testing')
    print(format_find_dups_output(dups_output,min_dups=1))
Answer
Others are better equipped than I to talk about the quality of the Python, but I can tell you that seeks are expensive, that all random reads cost about the same as long as you're reading less than a block, and that read+stat is worse than reading two sequential blocks, because the stat reads a directory and not the file.
I see about a 10% speedup if you hash the first 4k and use that as your key; another 5% from not computing total size (the only thing stat is needed for, once you drop size from the key):
def find_dups (top_dir, **hashkwargs):
    t0 = time()
    top_dir = os.path.abspath(os.path.expanduser(top_dir))

    dups = {}
    numfiles = numskipped = totsize = 0
    for dirpath,_,fnames in os.walk(top_dir):
        for fname in fnames:
            fpath = os.path.join(dirpath,fname)
            try:
                with open(fpath,'rb') as f:
                    fhash = hash( f.read(4096) )
                while True:
                    if fhash not in dups:
                        # a new, unique file has been found.
                        dups[fhash] = [fpath]
                        break
                    # file is a duplicate, or a hash collision occurred.
                    if filecmp.cmp(fpath,dups[fhash][0],shallow=False):
                        # duplicate.
                        dups[fhash].append(fpath)
                        break
                    # hash collision on actually different files; rehash.
                    fhash += 1
            except OSError:
                numskipped += 1
                continue
            numfiles += 1

    return dups, numfiles, numskipped, totsize, top_dir, time()-t0
Comments
- Wow, this is excellent, I had no idea that stat and seek were so expensive. I would have thought they were among the least expensive operations. Can you link any references that go into more detail? And your approach is even faster in my tests, where it is 20% to 3x faster. Could there be any issue with collisions causing dict keys to exceed 2**64 in this solution? (nivk, Jan 29, 2019)
- Python integers have arbitrary precision and don't overflow. Is your data on a spinning drive? I was testing on an SSD, which have better seek:read speed ratios than HDD. The HDD would see more improvement from the elimination of 2 seeks per file. Seek is expensive on RAM and flash because it breaks locality (by definition) and thus defeats caching; it's expensive on HDD for the same reasons PLUS the physical need to have the platter spin to your data; it spins at the same speed whether reading or not. Stat is expensive simply because it implies a seek. (Oh My Goodness, Jan 29, 2019)
- I've personally seen it a bunch of places. Basically, on a spinning drive, the drive heads are the slowest moving part - and they all move in tandem. Seek and stat make the heads move, more often than not. OMG is right about the disk spinning, but it's going much faster than the drive heads. (Ed Grimm, Jan 29, 2019)
- @OhMyGoodness Yes, Python ints are not limited to 64 bits, and I didn't see any performance decrease for >2**64 keys. But I'm not sure if Python's dict doesn't do some perhaps-64-bit hashing on the keys behind the scenes--it seems unlikely, but one cannot be too careful about these things. Most of my testing is on an SSD, but my friend's drive is an external HDD. I would think it's favorable to optimize for HDDs, since they tend to be much larger. Although, there is no need to have a one-size-fits-all solution. (nivk, Jan 29, 2019)
- @nivk Python dict's internal hashes are calculated from the keys: softwaremaniacs.org/blog/2020/02/05/dicts-ordered (endolith, Feb 24, 2021)