This is a continuation of an earlier discussion (from here => Group duplicate files) with new code and new thoughts/questions (see the special question section for the new questions), so I decided to make a new post.
Here is my source code, written in Python 2.7, which groups duplicate files together. Any advice on smarter or more efficient ways to group duplicate files would be great. General advice on bugs and code style is also appreciated.
Problem statement:
I am given a list of files, e.g. ["1.txt", "2.txt", "3.txt", "4.txt", "5.txt", "6.txt"]. I want to group together all files that have exactly the same content. Suppose that in this example "1.txt", "2.txt" and "3.txt" are identical, while "4.txt", "5.txt" and "6.txt" share a common header but only "4.txt" and "6.txt" have exactly the same whole content. Then the output should be two groups: ("1.txt", "2.txt", "3.txt") and ("4.txt", "6.txt").
My major idea:
- To avoid reading the full content of every file, I generate a hash of each file's header (in this example I define the header to be the first 10 bytes of the file).
- If two or more files share the same header hash, I read their full content to see whether they are exact duplicates. I still hash the whole content, because files that collide on the header hash can split into several different duplicate sub-groups, and hashing makes it easy to group the files with truly equal content together.
Special question
The implementation below still has to read the whole content of every potentially duplicate file. Are there any better ideas for improving this from a performance perspective?
Source code in Python 2.7
from collections import defaultdict
def read_whole_file(filename):
    with open(filename) as f:
        return hash(f.read())
def read_file_header(filename, header_length):
    with open(filename) as f:
        return hash(f.read(header_length))
def group_duplicate_files(filenames):
    header_buf = defaultdict(list)
    whole_buf = defaultdict(list)
    for f in filenames:
        header_buf[read_file_header(f,10)].append(f)
    for files in header_buf.values():
        if len(files) == 1:
            continue
        for f in files:
            whole_buf[read_whole_file(f)].append(f)
    return whole_buf.values()
if __name__ == "__main__":
    files = ["1.txt", "2.txt", "3.txt", "4.txt", "5.txt", "6.txt"]
    print group_duplicate_files(files)
1 Answer
To formalize my earlier comments as an answer (even though it will be an incomplete one as far as possible optimizations go):
You can check other "significant" parts of the files before hashing the whole thing: for example, not only the first 10 bytes, as you already do, but also the final 10 (and even a middle 10 after that, in case the files also share identical closing sections).
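For instance (a sketch only, sticking with the question's built-in hash(); read_file_tail is a hypothetical counterpart to read_file_header, not something from the original code):

import os

def read_file_tail(filename, tail_length):
    # Hypothetical helper: hash the last `tail_length` bytes of the file
    # (or the whole file, if it is shorter than that).
    with open(filename, "rb") as f:
        f.seek(max(os.path.getsize(filename) - tail_length, 0))
        return hash(f.read(tail_length))

The grouping key then becomes the pair (read_file_header(f, 10), read_file_tail(f, 10)) rather than the header hash alone, so only files that agree on both ends go on to the full-content comparison.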
To do the above, and as a bonus to optimize the I/O side, you can avoid re-opening each file, which is quite slow. Instead, keep the files open and read from them as necessary, or read a large chunk up front and then close them, but don't reopen them.
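A sketch of the "keep them open" variant, keeping the question's built-in hash() and list-based grouping (one possible arrangement, not the only one):

from collections import defaultdict

def group_duplicate_files(filenames, header_length=10):
    # Every file is opened exactly once; its handle stays open only
    # until its group has been resolved.
    handles = {}
    header_buf = defaultdict(list)
    for name in filenames:
        handles[name] = open(name, "rb")
        header_buf[hash(handles[name].read(header_length))].append(name)
    whole_buf = defaultdict(list)
    for group in header_buf.values():
        if len(group) == 1:
            handles[group[0]].close()   # unique header, cannot be a duplicate
            continue
        for name in group:
            f = handles[name]
            f.seek(0)                   # rewind the already-open handle...
            whole_buf[hash(f.read())].append(name)   # ...instead of reopening the file
            f.close()
    return whole_buf.values()

Note that this holds one open file descriptor per file until its group is resolved, so for very long file lists the "read a large chunk up front, then close" variant may be the safer choice.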
For style commentary, since you requested it in version 1 of this thread:
- Add a little whitespace between functions and for-loop blocks, and generally around any self-contained block of 3-4 lines or more.
- You presumably named the variable files because you'd already used filenames, but files actually refers to groups of filenames. What's more, the variable name doesn't indicate what's special about those "files". One suggestion might be collision_set.
- Speaking of sets, for a minor improvement, you could store the values in sets instead of appending to lists (see the sketch below).
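Applied to the grouping function, those two suggestions might look something like this (a sketch that reuses read_file_header and read_whole_file from the question; collision_set is only the suggested name):

from collections import defaultdict

def group_duplicate_files(filenames):
    header_buf = defaultdict(set)
    for name in filenames:
        header_buf[read_file_header(name, 10)].add(name)
    whole_buf = defaultdict(set)
    for collision_set in header_buf.values():
        if len(collision_set) > 1:   # only header collisions need a full read
            for name in collision_set:
                whole_buf[read_whole_file(name)].add(name)
    return whole_buf.values()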