This is a continuation of an earlier discussion (from here => Group duplicate files) with new code and new thoughts/questions (see the special question section for the new questions), so I decided to make a new post.
Here is my source code, written in Python 2.7, which groups duplicate files together. Any advice on smarter or more efficient ways to group duplicate files would be great. General advice on bugs and code style is also appreciated.
Problem statement:
I am given a list of files, e.g. ["1.txt", "2.txt", "3.txt", "4.txt", "5.txt", "6.txt"]. I want to group together all files that have exactly the same content. Suppose that in this example "1.txt", "2.txt" and "3.txt" are identical, while "4.txt", "5.txt" and "6.txt" share a common header but only "4.txt" and "6.txt" have exactly the same whole content. Then the output should be two groups: ("1.txt", "2.txt", "3.txt") and ("4.txt", "6.txt").
My major idea:
- To avoid reading the full content of every file, I generate a hash of each file's header (in this example I define the header to be the first 10 bytes of the file).
- If two or more files share the same header hash, I read their full content to see whether they are exact duplicates. I still hash the whole content, because files that collide on the header hash can split into several different duplicate sub-groups, and hashing makes it easy to group the files with truly equal content together.
Special question
The implementation below still has to read the whole content of every potentially duplicate file. Are there any better ideas for improving this from a performance perspective?
Source code in Python 2.7
from collections import defaultdict
def read_whole_file(filename):
    with open(filename) as f:
        return hash(f.read())
def read_file_header(filename, header_length):
    with open(filename) as f:
        return hash(f.read(header_length))
def group_duplicate_files(filenames):
    header_buf = defaultdict(list)
    whole_buf = defaultdict(list)
    for f in filenames:
        header_buf[read_file_header(f,10)].append(f)
    for files in header_buf.values():
        if len(files) == 1:
            continue
        for f in files:
            whole_buf[read_whole_file(f)].append(f)
    return whole_buf.values()
if __name__ == "__main__":
    files = ["1.txt", "2.txt", "3.txt", "4.txt", "5.txt", "6.txt"]
    print group_duplicate_files(files)
1 Answer
To formalize my earlier comments as an answer (even though it will be an incomplete one as far as possible optimizations go):
You can check other "significant" parts of the files before hashing the whole thing: for example, not only the first 10 bytes, as you already do, but also the final 10 (and even a middle 10 after that, in case the files also share identical closing sections).
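For instance (a sketch only, sticking with the question's built-in hash(); read_file_tail is a hypothetical counterpart to read_file_header, not something from the original code):

import os

def read_file_tail(filename, tail_length):
    # Hypothetical helper: hash the last `tail_length` bytes of the file
    # (or the whole file, if it is shorter than that).
    with open(filename, "rb") as f:
        f.seek(max(os.path.getsize(filename) - tail_length, 0))
        return hash(f.read(tail_length))

The grouping key then becomes the pair (read_file_header(f, 10), read_file_tail(f, 10)) rather than the header hash alone, so only files that agree on both ends go on to the full-content comparison.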
To do the above, and as a bonus to optimize the I/O side, you can avoid re-opening each file, which is quite slow. Instead, keep the files open and read from them as necessary, or read a large chunk up front and then close them, but don't reopen them.
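A sketch of the "keep them open" variant, keeping the question's built-in hash() and list-based grouping (one possible arrangement, not the only one):

from collections import defaultdict

def group_duplicate_files(filenames, header_length=10):
    # Every file is opened exactly once; its handle stays open only
    # until its group has been resolved.
    handles = {}
    header_buf = defaultdict(list)
    for name in filenames:
        handles[name] = open(name, "rb")
        header_buf[hash(handles[name].read(header_length))].append(name)
    whole_buf = defaultdict(list)
    for group in header_buf.values():
        if len(group) == 1:
            handles[group[0]].close()   # unique header, cannot be a duplicate
            continue
        for name in group:
            f = handles[name]
            f.seek(0)                   # rewind the already-open handle...
            whole_buf[hash(f.read())].append(name)   # ...instead of reopening the file
            f.close()
    return whole_buf.values()

Note that this holds one open file descriptor per file until its group is resolved, so for very long file lists the "read a large chunk up front, then close" variant may be the safer choice.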
For style commentary, since you requested it in version 1 of this thread:
- Add a little whitespace between functions and for-loop blocks, and generally around any self-contained block of 3-4 lines or more.
- You presumably named the variable files because you'd already used filenames, but files actually refers to groups of filenames. What's more, the variable name doesn't indicate what's special about those "files". One suggestion might be collision_set.
- Speaking of sets, for a minor improvement, you could store the values in sets instead of appending to lists (see the sketch below).
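Applied to the grouping function, those two suggestions might look something like this (a sketch that reuses read_file_header and read_whole_file from the question; collision_set is only the suggested name):

from collections import defaultdict

def group_duplicate_files(filenames):
    header_buf = defaultdict(set)
    for name in filenames:
        header_buf[read_file_header(name, 10)].add(name)
    whole_buf = defaultdict(set)
    for collision_set in header_buf.values():
        if len(collision_set) > 1:   # only header collisions need a full read
            for name in collision_set:
                whole_buf[read_whole_file(name)].add(name)
    return whole_buf.values()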