
Here is my source code, written in Python 2.7, which groups duplicate files together. Any advice on smarter ways to group duplicate files more efficiently would be great. General advice on bugs and code style is also appreciated.

Problem statement:

I am given a list of files, e.g. ["1.txt", "2.txt", "3.txt", "4.txt", "5.txt", "6.txt"]. I want to group all files which have exactly the same content. Suppose in this example "1.txt", "2.txt" and "3.txt" are the same, and "4.txt", "5.txt" and "6.txt" share a common header, but only "4.txt" and "6.txt" have exactly the same full content. Then the output should be two groups: "1.txt", "2.txt", "3.txt" and "4.txt", "6.txt".

My major idea:

  1. To avoid reading the full content of each file, I generate a hash code for the file header (in my example, I define the file header to be the first ten bytes of a file, as read in hash_header).
  2. If more than 2 (>= 2) files have the same header, I will read the full content and check whether the files are exactly the same.
from collections import defaultdict
import itertools


def group_duplicate_files(files):
    hash_buf = defaultdict(list)
    for f in files:
        hash_buf[hash_header(f)].append(f)
    result = []
    for file_list in hash_buf.values():
        if len(file_list) < 2:
            continue
        for file_pair in itertools.combinations(file_list, 2):
            if compare_full_content(file_pair[0], file_pair[1]):
                n = set()
                n.add(file_pair[0])
                n.add(file_pair[1])
                found = False
                # try to merge with existing set
                for i, r in enumerate(result):
                    if len(r.intersection(n)) > 0:
                        found = True
                        break
                if found:
                    n = n.union(result[i])
                    result.remove(result[i])
                    result.append(n)
                else:
                    result.append(n)
    return result


def hash_header(filename):
    f = open(filename, "r")
    header = f.read(10)
    return hash(header)


def compare_full_content(filename1, filename2):
    file1 = open(filename1, 'r')
    file2 = open(filename2, 'r')
    content1 = file1.read()
    content2 = file2.read()
    i = 0
    j = 0
    while i < len(content1) and j < len(content2):
        if content1[i] != content2[j]:
            return False
        i += 1
        j += 1
    if i < len(content1):
        return False
    if j < len(content2):
        return False
    return True


if __name__ == "__main__":
    files = ["1.txt", "2.txt", "3.txt", "4.txt", "5.txt", "6.txt"]
    print group_duplicate_files(files)
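
For anyone who wants to run the example, the six sample files from the problem statement can be created with a short helper. The exact contents below are only placeholder assumptions, chosen so that "1.txt", "2.txt" and "3.txt" are identical, and "4.txt", "5.txt" and "6.txt" share the same first ten bytes while only "4.txt" and "6.txt" match in full:

# Placeholder contents for the six example files; only the duplicate/header
# relationships matter, not the actual text.
sample_contents = {
    "1.txt": "hello world\n",
    "2.txt": "hello world\n",
    "3.txt": "hello world\n",
    "4.txt": "same-header content A\n",
    "5.txt": "same-header content B\n",
    "6.txt": "same-header content A\n",
}
for name, content in sample_contents.items():
    with open(name, "w") as f:
        f.write(content)
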
asked Mar 20, 2017 at 0:22
  • In the "major idea" part, you write the contradictory statement "If more than 2 (>=2) files have the same header". I assume it is meant to be "if at least 2 (>=2) ...". (Commented Mar 20, 2017 at 9:09)
  • @Graipher, yes, this is what I mean, thanks for the correction. (Commented Mar 23, 2017 at 20:29)

1 Answer


As far as I can tell, your algorithm is \$\mathcal{O}(n^2)\,ドル because you compare all possible pairs of files. It is possible to do this in \$\mathcal{O}(n)\$.

You just have to iterate over the files and sort them into a hashmap (i.e. a dictionary) keyed by their header hash. Then resolve the possible conflicts of files that share a header but differ in content, which takes \$\mathcal{O}(k)\,ドル where \$k\$ is the number of files with that header. So in the worst case (all files have the same header), every file is read twice, which is still \$\mathcal{O}(n)\$.

from collections import defaultdict
import sys


def hash_content(file_name, length=-1):
    with open(file_name) as f:
        return hash(f.read(length))


def group_by_header(*file_names):
    file_groups = defaultdict(list)
    for file_name in file_names:
        file_groups[hash_content(file_name, length=10)].append(file_name)
    return file_groups


def group_by_content(*file_names):
    file_groups = group_by_header(*file_names)
    for file_names in file_groups.values():
        if len(file_names) > 1:
            while file_names:
                file_name = file_names.pop()
                file_groups[hash_content(file_name)].append(file_name)
    return filter(None, file_groups.values())


if __name__ == "__main__":
    file_names = sys.argv[1:]
    print group_by_content(*file_names)

This code has two caveats:

  1. It assumes hash is collision-free (a fair assumption in practice). We could pass the string content directly as a key and let Python do the hashing; then we would also get a strategy for the rare case that two strings have the same hash without being equal. But this way we avoid storing the possibly large file contents as strings. (A collision-resistant alternative using hashlib is sketched after these caveats.)

  2. This uses the fact that dict.values() returns a list and not an iterator, so we can modify the dictionary in the loop here. This breaks in Python 3.x, but you specifically tagged this as Python 2.7. You can remedy this by introducing a new dictionary:

def group_by_content(*file_names):
    header_hashs = group_by_header(*file_names)
    content_hashs = defaultdict(list)
    for hash_, file_names in header_hashs.items():
        if len(file_names) > 1:
            while len(file_names):
                file_name = file_names.pop()
                content_hashs[hash_content(file_name)].append(file_name)
        else:
            content_hashs[hash_] = file_names
    return (val for val in content_hashs.values() if val)
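
If relying on the built-in hash still feels risky, one possible compromise is a cryptographic digest: the file is read in chunks, so memory usage stays low, and collisions are practically impossible. This is only a sketch, not part of the code above; the name digest_content and its parameters are made up for illustration, but it accepts the same calls as hash_content (length=10 for the header, no argument for the full file), so it could be dropped in without further changes:

import hashlib

def digest_content(file_name, length=None, chunk_size=64 * 1024):
    # SHA-256 hex digest of the first `length` bytes of the file (or of the
    # whole file when `length` is None), read chunk by chunk so the full
    # content is never held in memory.
    hasher = hashlib.sha256()
    remaining = length
    with open(file_name, "rb") as f:
        while remaining is None or remaining > 0:
            size = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = f.read(size)
            if not chunk:
                break
            hasher.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return hasher.hexdigest()
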
answered Mar 20, 2017 at 8:58
  • @LinMa Because you do for file_pair in itertools.combinations(file_list,2) when comparing the full content. (Commented Mar 23, 2017 at 20:30)
  • @LinMa You could do file_groups[open(file_name).read(10)].append(file_name), but as I said this would be big in memory if you also did it for the full text... (Commented Mar 23, 2017 at 20:34)
  • @LinMa When comparing files by content you have the advantage that you can stop as soon as you find a difference. On the other hand, you have to compare all combinations of pairs, which could possibly be a lot of file reads and also a large number of comparisons. (Imagine n long files with the same header, all differing only in the last character. You would have to read all files completely before knowing whether they are different.) (Commented Mar 24, 2017 at 14:22)
  • @LinMa After thinking some more, I realized that combinations of length 2 will only generate binom(n, 2) = 0.5*(n^2 - n) comparisons. But this is still O(n^2). (Commented Mar 24, 2017 at 14:24)
  • @LinMa I think my function will not be significantly slower, unless most of your files differ at the 11th byte and you don't have very many. But some testing would be needed to confirm this. (Commented Mar 24, 2017 at 14:27)
