A friend of mine has the following issue. He has a nearly-full 1TB hard drive on which he keeps copies of all the pictures from his digital camera(s). However, he knows that he has copied the entire contents of the camera to this drive multiple times, with slightly different--but mostly identical--contents each time. He would like to be able to find out which files (mostly pictures) are duplicates.
While I started writing this for his application, I would like this to be a general-purpose script for my own use as well. Here is what I came up with.
Main Ideas
Building on the ideas of some posts on this site and on SO, I create a hash (albeit non-integer) for each file, consisting of (a) the size of the file, (b) some bytes read from the start of the file, and (c) an incrementable counter. This enables me to quickly check that two files are not identical, and then only compare the entire contents if the hashes are equivalent. In the case that two dissimilar files have the same hash, I increment the counter and try again.
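To make the idea concrete, here is a purely illustrative picture of what the resulting dictionary might contain (the sizes, bytes, and paths are invented for this example; the real construction happens in hash_by_tuple and find_dups below):

# Hypothetical contents of the `dups` dict built below (all values invented):
dups = {
    # two files with identical content share one key:
    (2151942, b'\x12\x34\x56\x78', 0): ['/photos/2017/img_0001.jpg',
                                        '/backup/2017/img_0001.jpg'],
    # same size and same sampled bytes, but a full comparison showed different
    # content, so the counter was bumped from 0 to 1 to give it its own key:
    (2151942, b'\x12\x34\x56\x78', 1): ['/photos/2017/img_0002.jpg'],
}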
- I realize that it may seem wasteful to open, read, and close every file when a simple size comparison might initially be sufficient; but I have found that this is generally not the case. In situations where there are many different similar-sized files, the algorithm approaches \$O(n^2)\$, since each new file must be compared against all previous files of the same size. Granted, these situations only arise in special cases (usually when traversing program files), but hitting one by accident tremendously slows down the program. Thus, taking the time initially to read some bytes can really pay off later. (YMMV, but if you use Chromium, ~/.cache/chromium/Default/Cache clearly demonstrates this principle.) Of course, nothing is set in stone, and different hash functions may be used depending on the situation; one possible alternative is sketched right after this list. In addition, other methods (see the end of the Feedback section) can be used to attempt to circumvent this issue.
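For instance, a heavier-weight hash could digest a whole leading block instead of sampling a few bytes. This is only a sketch of that idea, not something used in the code below; the helper name hash_by_digest is made up here, and I have not benchmarked it:

import hashlib
import os

def hash_by_digest (fpath, nbytes=4096):
    """ Alternative hash: digest the first `nbytes` of the file.
        One sequential read and no seek; collisions are far less
        likely than with a 4-byte sample, at the cost of reading
        (and hashing) more data per file.
    """
    with open(fpath,'rb') as f:
        block = f.read(nbytes)   # may be shorter than nbytes for small files
    digest = hashlib.blake2b(block, digest_size=16).hexdigest()
    # same 3-tuple shape (size, sample, counter) as hash_by_tuple below.
    return os.stat(fpath).st_size, digest, 0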
The full-file comparison is done by filecmp, which is very fast, although a slight modification could make it faster (e.g. a larger cache, or skipping some initial checks whose results we already know).
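As a rough illustration of that kind of modification (not what the code below actually uses), a hand-rolled deep comparison might look like this; the name _files_identical and the 64 KiB buffer size are arbitrary choices for the sketch:

def _files_identical (path1, path2, bufsize=64*1024):
    """ Compare two files byte-for-byte with a larger buffer than
        filecmp's default, and without the preliminary os.stat
        signature check (useful when the sizes are already known
        to match, as they are here).
    """
    with open(path1,'rb') as f1, open(path2,'rb') as f2:
        while True:
            b1 = f1.read(bufsize)
            b2 = f2.read(bufsize)
            if b1 != b2:
                return False
            if not b1:   # both files exhausted at the same point
                return True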
Feedback
Of course, the main concern for this kind of program is speed. Is my algorithm efficient? Can it be improved? Are there edge or special cases that I have overlooked?
- I considered that my hash method of using tuples may not be efficient, especially since very many look-ups need to be performed. However, I did not find a significant difference in speed using an equivalent integer hash method. (I tried a few things, but calling the built-in hash on the 3-tuple was the most straightforward.)
I am not an expert Python programmer. Is this program written and implemented well? Are there any code-smells or bad practices? Do I make good use of Python structures and idioms? What could I do better? What have I done well?
- One thing in particular: I am not fond of the large try block. However, since there would otherwise have to be two nearly-identical try blocks very close to each other (around fhash = hash_by_tuple(...) and around if filecmp.cmp(...)), I thought it was better to pull it out around both. The split version I decided against is sketched right after this list.
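For comparison, the split version would look roughly like the following. This is only a sketch: the helper name process_one_file is made up, it reuses hash_by_tuple and filecmp from the code below, and the surrounding numfiles/totsize bookkeeping is omitted:

def process_one_file (fpath, dups, **hashkwargs):
    """ Sketch of the two-try-block alternative. Returns True if the
        file was recorded in `dups`, False if it had to be skipped.
    """
    try:                                      # first try block
        fhash = hash_by_tuple(fpath, **hashkwargs)
    except OSError:
        return False
    while True:
        if fhash not in dups:
            dups[fhash] = [fpath]             # new, unique file
            return True
        try:                                  # second, nearly-identical try block
            same = filecmp.cmp(fpath, dups[fhash][0], shallow=False)
        except OSError:
            return False
        if same:
            dups[fhash].append(fpath)         # duplicate
            return True
        # hash collision on actually different files; bump the counter.
        fhash = (fhash[0], fhash[1], fhash[2]+1)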
Any other feedback.
I think there are certainly things that could improve the usability of the code. For example, adding support for a list of directories, files, or even regex matches to ignore would allow us to skip certain expensive searches, like ~/.local and ~/.cache. Or, conversely, a regex to include only certain kinds of files (e.g., r'[.]png$'). However, this seems relatively straightforward to implement, and it would clutter the code that I would really like to present, so I have not included it; a rough sketch of what it might look like follows.
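If it were added, a minimal version might look like the following. The function name iter_files and the parameter names include_re/exclude_re are invented for this sketch and are not part of the code in the next section:

import os
import re

def iter_files (top_dir, include_re=None, exclude_re=None):
    """ Yield file paths under `top_dir`, optionally filtered by regex.
        `exclude_re` drops matching paths (checked first); `include_re`
        keeps only matching paths. Both are plain `re` pattern strings.
    """
    include = re.compile(include_re) if include_re else None
    exclude = re.compile(exclude_re) if exclude_re else None
    for dirpath,_,fnames in os.walk(top_dir):
        for fname in fnames:
            fpath = os.path.join(dirpath, fname)
            if exclude and exclude.search(fpath):
                continue
            if include and not include.search(fpath):
                continue
            yield fpath

# e.g. iter_files('/some/dir', include_re=r'[.]png$', exclude_re=r'/[.]cache/')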
Code
import filecmp
import os
from time import time

def head (fpath, nbytes, offset=0):
    """ Read `nbytes` from the beginning of a file, starting at
        `offset`.

        If there are less than `nbytes`, return whatever is there
        (may be empty).

        If there is not enough room for the `offset`, move back so
        that we still return `nbytes` number of bytes.
    """
    size = os.stat(fpath).st_size
    with open(fpath,'rb') as f:
        if size <= nbytes:
            pass
        elif size < offset+nbytes:
            f.seek(-nbytes,2)
        else:
            f.seek(offset)
        head = f.read(nbytes)
    return head

def hash_by_tuple (fpath, nbytes=4, offset=16):
    """ Create a 3-tuple hash of the file at `fpath`, consisting of
        the file size, `nbytes` read from the file, and an
        incrementable counter. Start reading the bytes at some
        `offset` so that common headers (e.g. png/pdf) are not
        included. The counter is always initially 0, and is intended
        to be incremented on hash collisions to create a new, unique
        hash.
    """
    return os.stat(fpath).st_size, head(fpath,nbytes,offset), 0

def find_dups (top_dir, **hashkwargs):
    """ Group duplicate files recursively in and below `top_dir`.

        Returns a 6-tuple with the main return value and some metadata
        as follows:

        (0) The main dictionary of all files, with the hash value
            from `hash_by_tuple` as the keys, and a list of the
            file paths with identical content as the values.
        (1) The total number of files included in the dict.
        (2) The number of files skipped for various reasons.
            Usually broken symlinks that are listed by `os.walk`,
            but are not `os.stat`-able.
        (3) The total size, in bytes, of all files in the dict.
        (4) The `top_dir` input, for output formatting.
        (5) The time it took to run the function, in seconds.

        Optional `hashkwargs` can be passed to modify the behavior of
        `hash_by_tuple`.

        Note 1: The output of this function is best viewed using the
                complementary function `format_find_dups_output`.

        Note 2: All empty files are considered "identical" and are
                mapped to `(0, b'', 0)`.
    """
    t0 = time()
    top_dir = os.path.abspath(os.path.expanduser(top_dir))

    dups = {}
    numfiles = numskipped = totsize = 0
    for dirpath,_,fnames in os.walk(top_dir):
        for fname in fnames:
            fpath = os.path.join(dirpath,fname)
            try:
                fhash = hash_by_tuple(fpath,**hashkwargs)
                while True:
                    if fhash not in dups:
                        # a new, unique file has been found.
                        dups[fhash] = [fpath]
                        break
                    # file is a duplicate, or a hash collision occurred.
                    if filecmp.cmp(fpath,dups[fhash][0],shallow=False):
                        # duplicate.
                        dups[fhash].append(fpath)
                        break
                    # hash collision on actually different files; rehash.
                    fhash = (fhash[0], fhash[1], fhash[2]+1)
            except OSError:
                numskipped += 1
                continue
            numfiles += 1
            totsize += fhash[0]

    return dups, numfiles, numskipped, totsize, top_dir, time()-t0

def format_find_dups_output (output, min_dups=1):
    """ Return a human-readable formatted string directly from the
        6-tuple `output` from `find_dups`. It can then either be
        `print`-ed, or written to a file and read.

        Set `min_dups` (default=1) to control the minimum number of
        duplicates a file must have to be included in the returned
        string. 0 will print every file found.
    """
    dups, numfiles, numskipped, totsize, top_dir, runtime = output

    header = ( 'In "{}", {} files were analyzed, totaling {} bytes, taking '
               + '{:.3g} seconds.\n'
               + '{} files were skipped because they failed analysis (typically '
               + 'broken symlinks).\n'
               + '{} unique files were found, with {} duplicates, averaging '
               + '{:.3g} duplicates per unique file.\n\n'
               + 'There are {} unique files with >= {} duplicates. In some '
               + 'particular order:\n\n' )

    main_str = ''
    numuniq_printing = 0
    for fhash,fpaths in sorted(dups.items()):
        if len(fpaths) > min_dups:
            numuniq_printing += 1
            if len(fpaths) == 1:
                main_str += ( 'The following file, with the signature {}, '
                              + 'is unique:\n ' ).format(fhash)
            else:
                main_str += ( 'The following {} files, with the signature {}, '
                              + 'are identical:\n ' ).format(len(fpaths),fhash)
            main_str += '\n '.join(fpaths) + '\n\n'
    main_str += 'Done.'

    header = header.format( top_dir, numfiles, totsize, runtime, numskipped
                          , len(dups), numfiles-len(dups)
                          , (numfiles-len(dups))/len(dups)
                          , numuniq_printing, min_dups )

    return header+main_str

if __name__ == '__main__':
    dups_output = find_dups('/a/path/with/maybe/1000/files/for/testing')
    print(format_find_dups_output(dups_output,min_dups=1))
Answer
Others are better equipped than I to talk about the quality of the Python, but I can tell you that seeks are expensive, that all random reads cost about the same as long as you're reading less than a block, and that read+stat is worse than reading two sequential blocks, because the stat reads a directory and not the file.
I see about a 10% speedup if you hash the first 4k and use that as your key; another 5% from not computing total size (the only thing stat is needed for, once you drop size from the key):
def find_dups (top_dir, **hashkwargs):
    t0 = time()
    top_dir = os.path.abspath(os.path.expanduser(top_dir))

    dups = {}
    numfiles = numskipped = totsize = 0
    for dirpath,_,fnames in os.walk(top_dir):
        for fname in fnames:
            fpath = os.path.join(dirpath,fname)
            try:
                with open(fpath,'rb') as f:
                    fhash = hash( f.read(4096) )
                while True:
                    if fhash not in dups:
                        # a new, unique file has been found.
                        dups[fhash] = [fpath]
                        break
                    # file is a duplicate, or a hash collision occurred.
                    if filecmp.cmp(fpath,dups[fhash][0],shallow=False):
                        # duplicate.
                        dups[fhash].append(fpath)
                        break
                    # hash collision on actually different files; rehash.
                    fhash += 1
            except OSError:
                numskipped += 1
                continue
            numfiles += 1

    return dups, numfiles, numskipped, totsize, top_dir, time()-t0
Comments
- Wow, this is excellent, I had no idea that stat and seek were so expensive. I would have thought they were among the least expensive operations. Can you link any references that go into more detail? And your approach is even faster in my tests, where it is 20% to 3x faster. Could there be any issue with collisions causing dict keys to exceed 2**64 in this solution? (nivk, Jan 29, 2019)
- Python integers have arbitrary precision and don't overflow. Is your data on a spinning drive? I was testing on an SSD, which have better seek:read speed ratios than HDD. The HDD would see more improvement from the elimination of 2 seeks per file. Seek is expensive on RAM and flash because it breaks locality (by definition) and thus defeats caching; it's expensive on HDD for the same reasons PLUS the physical need to have the platter spin to your data; it spins at the same speed whether reading or not. Stat is expensive simply because it implies a seek. (Oh My Goodness, Jan 29, 2019)
- I've personally seen it a bunch of places. Basically, on a spinning drive, the drive heads are the slowest moving part - and they all move in tandem. Seek and stat make the heads move, more often than not. OMG is right about the disk spinning, but it's going much faster than the drive heads. (Ed Grimm, Jan 29, 2019)
- @OhMyGoodness Yes, Python ints are not limited to 64 bits, and I didn't see any performance decrease for >2**64 keys. But I'm not sure if Python's dict doesn't do some perhaps-64-bit hashing on the keys behind the scenes--it seems unlikely, but one cannot be too careful about these things. Most of my testing is on an SSD, but my friend's drive is an external HDD. I would think it's favorable to optimize for HDDs, since they tend to be much larger. Although, there is no need to have a one-size-fits-all solution. (nivk, Jan 29, 2019)
- @nivk Python dict's internal hashes are calculated from the keys: softwaremaniacs.org/blog/2020/02/05/dicts-ordered (endolith, Feb 24, 2021)