Here is my source code, written in Python 2.7, which groups duplicate files together. Any advice on smarter or more efficient ways to group duplicate files would be great. General advice on bugs and code style is also appreciated.
Problem statement:
I am given a list of files, e.g. ["1.txt", "2.txt", "3.txt", "4.txt", "5.txt", "6.txt"]. I want to group all files which have the exact same content. Suppose in this example files "1.txt", "2.txt", "3.txt" are the same, and files "4.txt", "5.txt", "6.txt" have a common header, but only "4.txt" and "6.txt" are exactly the same in their whole content. Then the output should be two groups: "1.txt", "2.txt", "3.txt" and "4.txt", "6.txt".
My major idea:
- To avoid reading the full content of every file, I generate a hash code for the file header (in my example, I define the file header to be the first 10 bytes of a file)
- If at least 2 (>=2) files have the same header hash, I read their full content and check whether it is exactly the same
from collections import defaultdict
import itertools

def group_duplicate_files(files):
    hash_buf = defaultdict(list)
    for f in files:
        hash_buf[hash_header(f)].append(f)
    result = []
    for file_list in hash_buf.values():
        if len(file_list) < 2:
            continue
        for file_pair in itertools.combinations(file_list, 2):
            if compare_full_content(file_pair[0], file_pair[1]):
                n = set()
                n.add(file_pair[0])
                n.add(file_pair[1])
                found = False
                # try to merge with existing set
                for i, r in enumerate(result):
                    if len(r.intersection(n)) > 0:
                        found = True
                        break
                if found:
                    n = n.union(result[i])
                    result.remove(result[i])
                    result.append(n)
                else:
                    result.append(n)
    return result
def hash_header(filename):
    f = open(filename, "r")
    header = f.read(10)
    return hash(header)
def compare_full_content(filename1, filename2):
    file1 = open(filename1, 'r')
    file2 = open(filename2, 'r')
    content1 = file1.read()
    content2 = file2.read()
    i = 0
    j = 0
    while i < len(content1) and j < len(content2):
        if content1[i] != content2[j]:
            return False
        i += 1
        j += 1
    if i < len(content1):
        return False
    if j < len(content2):
        return False
    return True
if __name__ == "__main__":
    files = ["1.txt", "2.txt", "3.txt", "4.txt", "5.txt", "6.txt"]
    print group_duplicate_files(files)
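For reference, here is a minimal sketch that reproduces the example from the problem statement; the file contents are assumptions (only the grouping pattern matters: 1-3 identical, 4-6 sharing a header, 4 and 6 fully identical):

# Sketch: create the six example files and run the grouping function.
contents = {
    "1.txt": "headerAAA common body",
    "2.txt": "headerAAA common body",
    "3.txt": "headerAAA common body",
    "4.txt": "headerBBB body one",
    "5.txt": "headerBBB body two",
    "6.txt": "headerBBB body one",
}
for name, text in contents.items():
    with open(name, "w") as f:
        f.write(text)

print group_duplicate_files(contents.keys())
# Expected: two groups, e.g. [set(['1.txt', '2.txt', '3.txt']), set(['4.txt', '6.txt'])]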
- In the "My major idea" part, you write the contradictory statement "If more than 2 (>=2) files have the same header". I assume it is meant to be "if at least 2 (>=2) ...". – Graipher, Mar 20, 2017
- @Graipher, yes, this is what I mean, thanks for the correction. – Lin Ma, Mar 23, 2017
1 Answer
As far as I can tell, your algorithm is roughly \$\mathcal{O}(n^2)\,ドル because within each header group you compare all possible pairs of files. It is possible to do this in \$\mathcal{O}(n)\$.
You just have to iterate over the files and sort them into a hashmap (i.e. a dictionary) based on their header hash. Then resolve the possible conflicts of files that have the same header but different content by hashing their full content, which takes \$\mathcal{O}(k)\,ドル where \$k\$ is the number of files with the same header. So in the worst case (all files have the same header), every file is read twice, i.e. about \2ドルn\$ reads, which is still \$\mathcal{O}(n)\$.
from collections import defaultdict
import sys

def hash_content(file_name, length=-1):
    with open(file_name) as f:
        return hash(f.read(length))

def group_by_header(*file_names):
    file_groups = defaultdict(list)
    for file_name in file_names:
        file_groups[hash_content(file_name, length=10)].append(file_name)
    return file_groups

def group_by_content(*file_names):
    file_groups = group_by_header(*file_names)
    for file_names in file_groups.values():
        if len(file_names) > 1:
            while file_names:
                file_name = file_names.pop()
                file_groups[hash_content(file_name)].append(file_name)
    return filter(None, file_groups.values())

if __name__ == "__main__":
    file_names = sys.argv[1:]
    print group_by_content(*file_names)
This code has two caveats:

- It assumes hash is collision-free (a fair assumption). We could pass the string content directly as a key and let Python do the hashing. Then we would also get a strategy for the rare case that two strings have the same hash without being equal. But this way we avoid storing the possibly large file contents as strings. (A collision-robust variant using hashlib is sketched after the code below.)
- It uses the fact that dict.values() returns a list and not an iterator, so we can modify the dictionary in the loop here. This breaks in Python 3.x, but you specifically tagged this as Python 2.7. You can remedy it by introducing a new dictionary:
def group_by_content(*file_names):
    header_hashs = group_by_header(*file_names)
    content_hashs = defaultdict(list)
    for hash_, file_names in header_hashs.items():
        if len(file_names) > 1:
            while len(file_names):
                file_name = file_names.pop()
                content_hashs[hash_content(file_name)].append(file_name)
        else:
            content_hashs[hash_] = file_names
    return (val for val in content_hashs.values() if val)
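Regarding the first caveat: a stable cryptographic digest from hashlib can stand in for the built-in hash, so collisions stop being a practical concern and the digest stays the same across runs (unlike hash()). This is only a minimal sketch, not part of the answer above; the helper names, the md5 choice and the "unique:" key prefix are my own:

from collections import defaultdict
import hashlib

def digest_content(file_name, length=-1):
    # Read up to `length` bytes (-1 means the whole file) and return a stable
    # cryptographic digest instead of the process-dependent built-in hash().
    with open(file_name, "rb") as f:
        return hashlib.md5(f.read(length)).hexdigest()

def group_by_digest(*file_names):
    # Same two-stage idea as above: group by header digest first, then split
    # the ambiguous groups by full-content digest.
    header_groups = defaultdict(list)
    for file_name in file_names:
        header_groups[digest_content(file_name, length=10)].append(file_name)

    content_groups = defaultdict(list)
    for names in header_groups.values():
        if len(names) > 1:
            for name in names:
                content_groups[digest_content(name)].append(name)
        else:
            # A single file per header cannot have duplicates; store it under a
            # key that cannot clash with a hex digest.
            content_groups["unique:" + names[0]] = names
    return list(content_groups.values())

It is called the same way, e.g. group_by_digest(*sys.argv[1:]). md5 is used here only for grouping, not for security; sha256 would work identically.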
- @LinMa Because you do for file_pair in itertools.combinations(file_list, 2) when comparing with the full content. – Graipher, Mar 23, 2017
- @LinMa You could do file_groups[open(file_name).read(10)].append(file_name), but as I said this would be big in memory if you also did it for the full text... – Graipher, Mar 23, 2017
- @LinMa When comparing files by content you have the advantage that you can stop as soon as you find a difference (see the sketch after these comments). On the other hand, you have to compare all combinations of pairs, which could possibly be a lot of file reads and also a large number of comparisons. (Imagine n long files with the same header, all differing only in the last character. You would have to read all files completely before knowing whether they are different.) – Graipher, Mar 24, 2017
- @LinMa After thinking some more, I realized that combinations with length 2 will only generate binom(n, 2) = 0.5*(n^2 - n) comparisons. But this is still O(n^2). – Graipher, Mar 24, 2017
- @LinMa I think my function will not be significantly slower, unless most of your files differ by the 11th byte and you don't have very many. But some testing would be needed for this. – Graipher, Mar 24, 2017
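Picking up the early-exit point from the comments: stopping at the first difference does not require reading whole files into memory, as compare_full_content in the question does. Here is a minimal sketch of a block-wise comparison; the function name and block size are my own choices, not from the original code:

def files_equal(filename1, filename2, block_size=64 * 1024):
    # Compare two files block by block and stop at the first difference, so
    # large files that differ early are never read completely into memory.
    with open(filename1, "rb") as f1, open(filename2, "rb") as f2:
        while True:
            block1 = f1.read(block_size)
            block2 = f2.read(block_size)
            if block1 != block2:
                return False
            if not block1:
                # Both files are exhausted at the same point: they are equal.
                return True

The standard library's filecmp.cmp(filename1, filename2, shallow=False) does essentially the same thing and could be dropped in for compare_full_content.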