Here is my source code, written in Python 2.7, which groups duplicate files together. Any advice on smarter or more efficient ways to group duplicate files would be great. General advice on bugs and code style is also appreciated.
Problem statement:
I am given a list of files, e.g. ["1.txt", "2.txt", "3.txt", "4.txt", "5.txt", "6.txt"]. I want to group all files which have the exact same content. Suppose in this example files "1.txt", "2.txt", "3.txt" are the same, and files "4.txt", "5.txt", "6.txt" have a common header, but only "4.txt" and "6.txt" are exactly the same in their whole content. Then the output should be two groups: "1.txt", "2.txt", "3.txt" and "4.txt", "6.txt".
My major idea:
- To avoid reading the full content of every file, I generate a hash code for the file header (in my example, I define the file header to be the first 10 bytes of a file)
- If at least 2 (>=2) files have the same header hash, I read their full content and check whether it is exactly the same
from collections import defaultdict
import itertools

def group_duplicate_files(files):
    hash_buf = defaultdict(list)
    for f in files:
        hash_buf[hash_header(f)].append(f)
    result = []
    for file_list in hash_buf.values():
        if len(file_list) < 2:
            continue
        for file_pair in itertools.combinations(file_list, 2):
            if compare_full_content(file_pair[0], file_pair[1]):
                n = set()
                n.add(file_pair[0])
                n.add(file_pair[1])
                found = False
                # try to merge with existing set
                for i, r in enumerate(result):
                    if len(r.intersection(n)) > 0:
                        found = True
                        break
                if found:
                    n = n.union(result[i])
                    result.remove(result[i])
                    result.append(n)
                else:
                    result.append(n)
    return result
def hash_header(filename):
    f = open(filename, "r")
    header = f.read(10)
    return hash(header)
def compare_full_content(filename1, filename2):
    file1 = open(filename1, 'r')
    file2 = open(filename2, 'r')
    content1 = file1.read()
    content2 = file2.read()
    i = 0
    j = 0
    while i < len(content1) and j < len(content2):
        if content1[i] != content2[j]:
            return False
        i += 1
        j += 1
    if i < len(content1):
        return False
    if j < len(content2):
        return False
    return True
if __name__ == "__main__":
    files = ["1.txt", "2.txt", "3.txt", "4.txt", "5.txt", "6.txt"]
    print group_duplicate_files(files)
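For reference, here is a minimal sketch that reproduces the example from the problem statement; the file contents are assumptions (only the grouping pattern matters: 1-3 identical, 4-6 sharing a header, 4 and 6 fully identical):

# Sketch: create the six example files and run the grouping function.
contents = {
    "1.txt": "headerAAA common body",
    "2.txt": "headerAAA common body",
    "3.txt": "headerAAA common body",
    "4.txt": "headerBBB body one",
    "5.txt": "headerBBB body two",
    "6.txt": "headerBBB body one",
}
for name, text in contents.items():
    with open(name, "w") as f:
        f.write(text)

print group_duplicate_files(contents.keys())
# Expected: two groups, e.g. [set(['1.txt', '2.txt', '3.txt']), set(['4.txt', '6.txt'])]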
- In the "My major idea" part, you write the contradictory statement "If more than 2 (>=2) files have the same header". I assume it is meant to be "if at least 2 (>=2) ...". – Graipher, Mar 20, 2017
- @Graipher, yes, this is what I mean, thanks for the correction. – Lin Ma, Mar 23, 2017
1 Answer
As far as I can tell, your algorithm is roughly \$\mathcal{O}(n^2)\,ドル because within each header group you compare all possible pairs of files. It is possible to do this in \$\mathcal{O}(n)\$.
You just have to iterate over the files and sort them into a hashmap (i.e. a dictionary) based on their header hash. Then resolve the possible conflicts of files that have the same header but different content by hashing their full content, which takes \$\mathcal{O}(k)\,ドル where \$k\$ is the number of files with the same header. So in the worst case (all files have the same header), every file is read twice, i.e. about \2ドルn\$ reads, which is still \$\mathcal{O}(n)\$.
from collections import defaultdict
import sys

def hash_content(file_name, length=-1):
    with open(file_name) as f:
        return hash(f.read(length))

def group_by_header(*file_names):
    file_groups = defaultdict(list)
    for file_name in file_names:
        file_groups[hash_content(file_name, length=10)].append(file_name)
    return file_groups

def group_by_content(*file_names):
    file_groups = group_by_header(*file_names)
    for file_names in file_groups.values():
        if len(file_names) > 1:
            while file_names:
                file_name = file_names.pop()
                file_groups[hash_content(file_name)].append(file_name)
    return filter(None, file_groups.values())

if __name__ == "__main__":
    file_names = sys.argv[1:]
    print group_by_content(*file_names)
This code has two caveats:

- It assumes hash is collision-free (a fair assumption). We could pass the string content directly as a key and let Python do the hashing. Then we would also get a strategy for the rare case that two strings have the same hash without being equal. But this way we avoid storing the possibly large file contents as strings. (A collision-robust variant using hashlib is sketched after the code below.)
- It uses the fact that dict.values() returns a list and not an iterator, so we can modify the dictionary in the loop here. This breaks in Python 3.x, but you specifically tagged this as Python 2.7. You can remedy it by introducing a new dictionary:
def group_by_content(*file_names):
    header_hashs = group_by_header(*file_names)
    content_hashs = defaultdict(list)
    for hash_, file_names in header_hashs.items():
        if len(file_names) > 1:
            while len(file_names):
                file_name = file_names.pop()
                content_hashs[hash_content(file_name)].append(file_name)
        else:
            content_hashs[hash_] = file_names
    return (val for val in content_hashs.values() if val)
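Regarding the first caveat: a stable cryptographic digest from hashlib can stand in for the built-in hash, so collisions stop being a practical concern and the digest stays the same across runs (unlike hash()). This is only a minimal sketch, not part of the answer above; the helper names, the md5 choice and the "unique:" key prefix are my own:

from collections import defaultdict
import hashlib

def digest_content(file_name, length=-1):
    # Read up to `length` bytes (-1 means the whole file) and return a stable
    # cryptographic digest instead of the process-dependent built-in hash().
    with open(file_name, "rb") as f:
        return hashlib.md5(f.read(length)).hexdigest()

def group_by_digest(*file_names):
    # Same two-stage idea as above: group by header digest first, then split
    # the ambiguous groups by full-content digest.
    header_groups = defaultdict(list)
    for file_name in file_names:
        header_groups[digest_content(file_name, length=10)].append(file_name)

    content_groups = defaultdict(list)
    for names in header_groups.values():
        if len(names) > 1:
            for name in names:
                content_groups[digest_content(name)].append(name)
        else:
            # A single file per header cannot have duplicates; store it under a
            # key that cannot clash with a hex digest.
            content_groups["unique:" + names[0]] = names
    return list(content_groups.values())

It is called the same way, e.g. group_by_digest(*sys.argv[1:]). md5 is used here only for grouping, not for security; sha256 would work identically.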
- @LinMa Because you do for file_pair in itertools.combinations(file_list, 2) when comparing with the full content. – Graipher, Mar 23, 2017
- @LinMa You could do file_groups[open(file_name).read(10)].append(file_name), but as I said this would be big in memory if you also did it for the full text... – Graipher, Mar 23, 2017
- @LinMa When comparing files by content you have the advantage that you can stop as soon as you find a difference (see the sketch after these comments). On the other hand, you have to compare all combinations of pairs, which could possibly be a lot of file reads and also a large number of comparisons. (Imagine n long files with the same header, all differing only in the last character. You would have to read all files completely before knowing whether they are different.) – Graipher, Mar 24, 2017
- @LinMa After thinking some more, I realized that combinations with length 2 will only generate binom(n, 2) = 0.5*(n^2 - n) comparisons. But this is still O(n^2). – Graipher, Mar 24, 2017
- @LinMa I think my function will not be significantly slower, unless most of your files differ by the 11th byte and you don't have very many. But some testing would be needed for this. – Graipher, Mar 24, 2017
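Picking up the early-exit point from the comments: stopping at the first difference does not require reading whole files into memory, as compare_full_content in the question does. Here is a minimal sketch of a block-wise comparison; the function name and block size are my own choices, not from the original code:

def files_equal(filename1, filename2, block_size=64 * 1024):
    # Compare two files block by block and stop at the first difference, so
    # large files that differ early are never read completely into memory.
    with open(filename1, "rb") as f1, open(filename2, "rb") as f2:
        while True:
            block1 = f1.read(block_size)
            block2 = f2.read(block_size)
            if block1 != block2:
                return False
            if not block1:
                # Both files are exhausted at the same point: they are equal.
                return True

The standard library's filecmp.cmp(filename1, filename2, shallow=False) does essentially the same thing and could be dropped in for compare_full_content.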