Edit: it seems a flow-based (or reactive) programming approach would help here; there are some Python libraries that attempt that.
I tried to follow the generator pipeline style (see David Beazley's famous presentation) for finding duplicate files (similar to answers here). It seems pretty straightforward with MapReduce, so I thought it should also be possible to produce clean, simple code with a generator pipeline. I tried both lambdas and named functions, but couldn't figure out how.
My code is especially ugly in get_digests, where it detects that the digest has been completed for a particular file. Also annoying is the boilerplate code I use to propagate the source filepath through the pipeline.
Of course, I could just rewrite everything with nested loops, but maybe I'm missing an obvious approach? Perhaps I should try to invert the data flow direction by using coroutines, or use some itertools-style techniques to create a non-tree-shaped data flow?
# python 3.5 but nothing important lost if I port it to python 2.7
import os
import glob
import collections
import hashlib
import functools

BUFFER_SIZE = 2 ** 20

def get_files(filepaths):
    for filepath in filepaths:
        yield open(filepath, mode='rb'), filepath

def read_files(files):
    for file, filepath in files:
        for data in iter(functools.partial(file.read, BUFFER_SIZE), b''):
            yield data, filepath

def get_digests(data_iter):
    current_filepath = None
    hash_obj = hashlib.sha256()
    for data, filepath in data_iter:
        if filepath != current_filepath:
            if current_filepath is not None:
                yield hash_obj.digest(), current_filepath
            current_filepath = filepath
        hash_obj.update(data)
    yield hash_obj.digest(), current_filepath

def find_duplicates(root_folder):
    '''
    Args:
        root_folder: folder to start searching from

    Returns:
        a list of lists of paths that correspond to duplicate files
    '''
    # combine generators into a pipeline
    paths = glob.iglob(os.path.join(root_folder, '**'), recursive=True)
    filepaths = filter(os.path.isfile, paths)
    files = get_files(filepaths)
    data_iter = read_files(files)
    digests = get_digests(data_iter)

    # collect data into a dictionary, then list
    # I feel this part is ok
    duplicates = collections.defaultdict(list)
    for digest, filepath in digests:
        duplicates[digest].append(filepath)
    return [v for v in duplicates.values() if len(v) >= 2]

# print duplicate files found in the current folder or below
duplicates = find_duplicates('.')
for group in duplicates:
    print('the following files are duplicates:')
    for filename in group:
        print(filename)
    print('\n')
Update:
Here's the slightly modified code from @ferada's answer (which fixed a bug in my code and made it much cleaner). Per @ferada's suggestion, I made get_digest deal only with digest calculation and factored out the grouping code.
import pprint, os, glob, collections, hashlib, functools, itertools, sys, operator

BUFFER_SIZE = 2 ** 20

def read_files(filepaths):
    for filepath in filepaths:
        with open(filepath, mode='rb') as file:
            for data in iter(functools.partial(file.read, BUFFER_SIZE), b''):
                yield data, filepath

def get_digest(hash_obj, iterator):
    for data in iterator:
        hash_obj.update(data)
    return hash_obj.digest()

def get_digests(data_iter):
    for filepath, group in itertools.groupby(data_iter, key=lambda x: x[1]):
        yield get_digest(hashlib.sha256(), map(operator.itemgetter(0), group)), filepath

def scantree(path):
    """Recursively yield DirEntry objects for given directory.

    From https://stackoverflow.com/a/33135143/336527
    """
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                yield from scantree(entry.path)  # see below for Python 2.x
            else:
                yield entry

def find_files(root_folder):
    '''Yields full paths of all files starting with root_folder, recursively'''
    for entry in scantree(root_folder):
        if entry.is_file():
            yield entry.path

def find_duplicates(root_folder):
    '''
    Args:
        root_folder: folder to start searching from

    Yields:
        Tuples of paths that correspond to duplicate files
    '''
    filepaths = find_files(root_folder)
    data_iter = read_files(filepaths)
    digests = get_digests(data_iter)
    for _, group in itertools.groupby(digests, key=lambda x: x[0]):
        _, filepaths = zip(*group)
        if len(filepaths) >= 2:
            yield filepaths

def main():
    folder = sys.argv[1]
    for dup in find_duplicates(folder):
        pprint.pprint(dup)
The passing around of filepath as a second element in every yield remains an annoyance to be fixed.
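For what it's worth, one way to shrink that boilerplate (my own sketch, not code from the post or the answers; read_files_grouped and digest_chunks are hypothetical names) is to yield one (filepath, chunk_iterator) pair per file, so only the per-file stage mentions the path and the per-chunk stages stay path-free:

import functools
import hashlib

BUFFER_SIZE = 2 ** 20

def read_files_grouped(filepaths):
    # Yield one (filepath, chunk_iterator) pair per file, so the per-chunk
    # stages no longer have to thread the path through every yield.
    for filepath in filepaths:
        with open(filepath, mode='rb') as file:
            chunks = iter(functools.partial(file.read, BUFFER_SIZE), b'')
            # The file stays open while this generator is suspended here, so
            # the consumer must finish (or abandon) `chunks` before advancing.
            yield filepath, chunks

def digest_chunks(chunks):
    hash_obj = hashlib.sha256()
    for chunk in chunks:
        hash_obj.update(chunk)
    return hash_obj.digest()

def get_digests(files):
    for filepath, chunks in files:
        yield digest_chunks(chunks), filepath

The trade-off is that each chunk iterator is only valid until the outer generator is advanced, which is fine here because every file is fully hashed before the next one is opened.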
2 Answers
- That has already been mentioned, but for completeness' sake: always take care to clean up resources, including open file handles. Based on that I'd suggest merging get_files into read_files.
- Globbing and then filtering for files can be replaced with an os.scandir call, which might be a bit more efficient because the file information isn't fetched twice; see the link for details on that. N.b. I'm not using a with here because it was only added in Python 3.6 - that is, please do use a with as shown in the docs.
- I'd also suggest adding docstrings to the other functions - it's not immediately obvious what they do and, more importantly, what the return values are going to be.
- list(get_digests([])) gives me a non-empty list. I'd say that's not the best interface there.
- The duplicate-detection part of find_duplicates could also be made generic, given how the grouping works by a key, e.g. using itertools.groupby.
- get_digests is buggy: a new digest object needs to be used for each file (a short illustration follows this list).
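To make that last point concrete, here is a minimal sketch (mine, not part of the answer) of how a shared hash object bleeds the first file's bytes into the second file's digest:

import hashlib

# Reusing one hash object: the second digest covers A's bytes *and* B's bytes.
shared = hashlib.sha256()
shared.update(b'contents of file A')
digest_a = shared.digest()
shared.update(b'contents of file B')
digest_b_buggy = shared.digest()

# A fresh object per file gives the digest of file B alone.
digest_b_correct = hashlib.sha256(b'contents of file B').digest()

assert digest_b_buggy != digest_b_correct

The version shown below avoids this by constructing hashlib.sha256() inside the per-file loop.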
I find this disentangled structure harder to understand than if it was a single function that produced the digests. That said, since you explicitly want to use this kind of pipeline, perhaps consider having a (reusable) grouping step separate from get_digests, and then produce the digests using a simpler function that doesn't have both grouping and hashing in it.
FWIW it looks like this now:
# python 3.5 but nothing important lost if I port it to python 2.7
import os
import glob
import collections
import hashlib
import functools
import itertools

BUFFER_SIZE = 2 ** 20

def read_files(filepaths):
    for filepath in filepaths:
        with open(filepath, mode='rb') as file:
            for data in iter(functools.partial(file.read, BUFFER_SIZE), b''):
                yield data, filepath

def get_digests(data_iter):
    for filepath, group in itertools.groupby(data_iter, lambda x: x[1]):
        hash_obj = hashlib.sha256()
        for data, _ in group:
            hash_obj.update(data)
        yield hash_obj.digest(), filepath

def find_files(root_folder):
    for entry in os.scandir(root_folder):
        if entry.is_file():
            yield entry.name

def find_duplicates(root_folder):
    '''
    Args:
        root_folder: folder to start searching from

    Returns:
        a list of lists of paths that correspond to duplicate files
    '''
    # combine generators into a pipeline
    digests = get_digests(read_files(find_files(root_folder)))
    for digest, group in itertools.groupby(digests, lambda x: x[0]):
        filepaths = list(group)
        if len(filepaths) >= 2:
            yield filepaths

if __name__ == "__main__":
    # print duplicate files found in the current folder or below
    for group in find_duplicates('.'):
        print('the following files are duplicates:')
        for filename in group:
            print(filename)
        print('\n')
- This is great, I didn't think of using groupby; it works so well in both cases. There are a couple of changes needed (yield entry.name --> yield entry.path, adding recursive traversal, removing the hash from the final output), but they are straightforward. I'll include the final code in the edit to my question. But as for my original question, I guess there's no simple way to avoid carrying around filepath in each generator of the pipeline? – max, Feb 18, 2017 at 0:46
- Right, good catch, was a bit hasty when answering it seems -.- Well, you're using the filepath to group on, I don't think you can get rid of it? – ferada, Feb 18, 2017 at 11:24
- I meant, is there any way to get access to filepath in a more elegant way than manually adding it to the yield of every step of the pipeline? And I guess the answer with simple Python generators is no -- but perhaps some libraries that deal with data flow programming might provide access to the results of earlier steps of the pipeline in a more elegant way. – max, Feb 18, 2017 at 18:21
I think you're not closing files.
- No I'm not. I rely on the garbage collector to do it: when find_duplicates returns, all the file objects become unreachable and will be (eventually) closed when they are garbage collected. To explicitly close them would seem to require even more ugly coding? – max, Dec 19, 2016 at 8:16
- This answer could be useful if you pointed out the locations and file handles that are not closed properly. – janos, Dec 19, 2016 at 8:37
- Relying on the garbage collector to close files is expecting interpreter-dependent behaviour. I would still consider it to be a leak, and in any case, poor design. – 200_success, Dec 19, 2016 at 9:01
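To illustrate the interpreter-dependence that 200_success mentions, here is a small sketch (mine, not from the thread); it relies on the fact that CPython's file objects emit a ResourceWarning when they are garbage-collected while still open:

import warnings

# ResourceWarning is ignored by default; make it visible.
warnings.simplefilter('always', ResourceWarning)

def leaky_read(path):
    f = open(path, 'rb')  # never closed explicitly, like the original get_files
    return f.read(16)

leaky_read(__file__)
# On CPython, reference counting collects `f` as soon as leaky_read returns and
# prints something like: ResourceWarning: unclosed file <_io.BufferedReader ...>
# On interpreters without reference counting (e.g. PyPy) the file may stay open
# until a later garbage collection pass, or until interpreter shutdown.

Under CPython, running the original code with `python -W always::ResourceWarning` should surface one such warning for each file opened by get_files.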
- … 2 ** 20 with 2 ** 30 or something.
- … yield open() now, because it's much easier to close the file in the same loop than somewhere further down in the pipeline.
now, because it's much easier to close the file in the same loop than somewhere further down in the pipeline. \$\endgroup\$