Edit: it seems a flow-based (or reactive) programming approach would help here; there are some Python libraries that attempt that.
I tried to follow the generator pipeline style (see David Beazley's famous presentation) for finding duplicate files (similar to answers here). It seems pretty straightforward with MapReduce, so I thought it should also be possible to produce clean, simple code with a generator pipeline. I tried both lambdas and named functions, but couldn't figure out how.
My code is especially ugly in get_digests, where it detects that the digest has been completed for a particular file. Also annoying is the boilerplate code I use to propagate the source filepath through the pipeline.
Of course, I could just rewrite everything with nested loops, but maybe I'm missing an obvious approach? Perhaps I should try to invert the data flow direction by using coroutines, or use some itertools-style techniques to create a non-tree-shaped data flow?
# python 3.5 but nothing important lost if I port it to python 2.7
import os
import glob
import collections
import hashlib
import functools

BUFFER_SIZE = 2 ** 20

def get_files(filepaths):
    for filepath in filepaths:
        yield open(filepath, mode='rb'), filepath

def read_files(files):
    for file, filepath in files:
        for data in iter(functools.partial(file.read, BUFFER_SIZE), b''):
            yield data, filepath

def get_digests(data_iter):
    current_filepath = None
    hash_obj = hashlib.sha256()
    for data, filepath in data_iter:
        if filepath != current_filepath:
            if current_filepath is not None:
                yield hash_obj.digest(), current_filepath
            current_filepath = filepath
        hash_obj.update(data)
    yield hash_obj.digest(), current_filepath

def find_duplicates(root_folder):
    '''
    Args:
        root_folder: folder to start searching from

    Returns:
        a list of lists of paths that correspond to duplicate files
    '''
    # combine generators into a pipeline
    paths = glob.iglob(os.path.join(root_folder, '**'), recursive=True)
    filepaths = filter(os.path.isfile, paths)
    files = get_files(filepaths)
    data_iter = read_files(files)
    digests = get_digests(data_iter)

    # collect data into a dictionary, then list
    # I feel this part is ok
    duplicates = collections.defaultdict(list)
    for digest, filepath in digests:
        duplicates[digest].append(filepath)
    return [v for v in duplicates.values() if len(v) >= 2]

# print duplicate files found in the current folder or below
duplicates = find_duplicates('.')
for group in duplicates:
    print('the following files are duplicates:')
    for filename in group:
        print(filename)
    print('\n')
Update:
Here's the slightly modified code from @ferada's answer (which fixed a bug in my code and made it much cleaner). Per @ferada's suggestion, I made get_digest deal only with digest calculation and factored out the grouping code.
import pprint, os, glob, collections, hashlib, functools, itertools, sys, operator

BUFFER_SIZE = 2 ** 20

def read_files(filepaths):
    for filepath in filepaths:
        with open(filepath, mode='rb') as file:
            for data in iter(functools.partial(file.read, BUFFER_SIZE), b''):
                yield data, filepath

def get_digest(hash_obj, iterator):
    for data in iterator:
        hash_obj.update(data)
    return hash_obj.digest()

def get_digests(data_iter):
    for filepath, group in itertools.groupby(data_iter, key=lambda x: x[1]):
        yield get_digest(hashlib.sha256(), map(operator.itemgetter(0), group)), filepath

def scantree(path):
    """Recursively yield DirEntry objects for given directory.

    From https://stackoverflow.com/a/33135143/336527
    """
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                yield from scantree(entry.path)  # see below for Python 2.x
            else:
                yield entry

def find_files(root_folder):
    '''Yields full paths of all files starting with root_folder, recursively'''
    for entry in scantree(root_folder):
        if entry.is_file():
            yield entry.path

def find_duplicates(root_folder):
    '''
    Args:
        root_folder: folder to start searching from

    Yields:
        Tuples of paths that correspond to duplicate files
    '''
    filepaths = find_files(root_folder)
    data_iter = read_files(filepaths)
    digests = get_digests(data_iter)
    for _, group in itertools.groupby(digests, key=lambda x: x[0]):
        _, filepaths = zip(*group)
        if len(filepaths) >= 2:
            yield filepaths

def main():
    folder = sys.argv[1]
    for dup in find_duplicates(folder):
        pprint.pprint(dup)
The passing around of filepath as a second element in every yield remains an annoyance to be fixed.
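For what it's worth, one way to shrink that boilerplate (my own sketch, not code from the post or the answers; read_files_grouped and digest_chunks are hypothetical names) is to yield one (filepath, chunk_iterator) pair per file, so only the per-file stage mentions the path and the per-chunk stages stay path-free:

import functools
import hashlib

BUFFER_SIZE = 2 ** 20

def read_files_grouped(filepaths):
    # Yield one (filepath, chunk_iterator) pair per file, so the per-chunk
    # stages no longer have to thread the path through every yield.
    for filepath in filepaths:
        with open(filepath, mode='rb') as file:
            chunks = iter(functools.partial(file.read, BUFFER_SIZE), b'')
            # The file stays open while this generator is suspended here, so
            # the consumer must finish (or abandon) `chunks` before advancing.
            yield filepath, chunks

def digest_chunks(chunks):
    hash_obj = hashlib.sha256()
    for chunk in chunks:
        hash_obj.update(chunk)
    return hash_obj.digest()

def get_digests(files):
    for filepath, chunks in files:
        yield digest_chunks(chunks), filepath

The trade-off is that each chunk iterator is only valid until the outer generator is advanced, which is fine here because every file is fully hashed before the next one is opened.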
2 Answers
- That has already been mentioned, but for completeness' sake: always take care to clean up resources, including open file handles. Based on that I'd suggest merging get_files into read_files.
- Globbing and then filtering for files can be replaced with an os.scandir call, which might be a bit more efficient because the file information isn't fetched twice; see the link for details on that. N.b. I'm not using a with here because it was only added in Python 3.6 - that is, please do use a with as shown in the docs.
- I'd also suggest adding docstrings to the other functions - it's not immediately obvious what they do and, more importantly, what the return values are going to be.
- list(get_digests([])) gives me a non-empty list. I'd say that's not the best interface there.
- The duplicate-detection part of find_duplicates could also be made generic, given how the grouping works by a key, e.g. using itertools.groupby.
- get_digests is buggy: a new digest object needs to be used for each file (a short illustration follows this list).
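To make that last point concrete, here is a minimal sketch (mine, not part of the answer) of how a shared hash object bleeds the first file's bytes into the second file's digest:

import hashlib

# Reusing one hash object: the second digest covers A's bytes *and* B's bytes.
shared = hashlib.sha256()
shared.update(b'contents of file A')
digest_a = shared.digest()
shared.update(b'contents of file B')
digest_b_buggy = shared.digest()

# A fresh object per file gives the digest of file B alone.
digest_b_correct = hashlib.sha256(b'contents of file B').digest()

assert digest_b_buggy != digest_b_correct

The version shown below avoids this by constructing hashlib.sha256() inside the per-file loop.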
I find this disentangled structure harder to understand than if it was a single function that produced the digests. That said, since you explicitly want to use this kind of pipeline, perhaps consider having a (reusable) grouping step separate from get_digests, and then produce the digests using a simpler function that doesn't have both grouping and hashing in it.
FWIW it looks like this now:
# python 3.5 but nothing important lost if I port it to python 2.7
import os
import glob
import collections
import hashlib
import functools
import itertools

BUFFER_SIZE = 2 ** 20

def read_files(filepaths):
    for filepath in filepaths:
        with open(filepath, mode='rb') as file:
            for data in iter(functools.partial(file.read, BUFFER_SIZE), b''):
                yield data, filepath

def get_digests(data_iter):
    for filepath, group in itertools.groupby(data_iter, lambda x: x[1]):
        hash_obj = hashlib.sha256()
        for data, _ in group:
            hash_obj.update(data)
        yield hash_obj.digest(), filepath

def find_files(root_folder):
    for entry in os.scandir(root_folder):
        if entry.is_file():
            yield entry.name

def find_duplicates(root_folder):
    '''
    Args:
        root_folder: folder to start searching from

    Returns:
        a list of lists of paths that correspond to duplicate files
    '''
    # combine generators into a pipeline
    digests = get_digests(read_files(find_files(root_folder)))
    for digest, group in itertools.groupby(digests, lambda x: x[0]):
        filepaths = list(group)
        if len(filepaths) >= 2:
            yield filepaths

if __name__ == "__main__":
    # print duplicate files found in the current folder or below
    for group in find_duplicates('.'):
        print('the following files are duplicates:')
        for filename in group:
            print(filename)
        print('\n')
- This is great, I didn't think of using groupby; it works so well in both cases. There are a couple of changes needed (yield entry.name --> yield entry.path, adding recursive traversal, removing the hash from the final output), but they are straightforward. I'll include the final code in the edit to my question. But as for my original question, I guess there's no simple way to avoid carrying around filepath in each generator of the pipeline? – max, Feb 18, 2017 at 0:46
- Right, good catch, was a bit hasty when answering it seems -.- Well, you're using the filepath to group on, I don't think you can get rid of it? – ferada, Feb 18, 2017 at 11:24
- I meant, is there any way to get access to filepath in a more elegant way than manually adding it to the yield of every step of the pipeline? And I guess the answer with simple Python generators is no -- but perhaps some libraries that deal with data flow programming might provide access to the results of earlier steps of the pipeline in a more elegant way. – max, Feb 18, 2017 at 18:21
I think you're not closing files.
- No I'm not. I rely on the garbage collector to do it: when find_duplicates returns, all the file objects become unreachable and will be (eventually) closed when they are garbage collected. To explicitly close them would seem to require even more ugly coding? – max, Dec 19, 2016 at 8:16
- This answer could be useful if you pointed out the locations and file handles that are not closed properly. – janos, Dec 19, 2016 at 8:37
- Relying on the garbage collector to close files is expecting interpreter-dependent behaviour. I would still consider it to be a leak, and in any case, poor design. – 200_success, Dec 19, 2016 at 9:01
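To illustrate the interpreter-dependence that 200_success mentions, here is a small sketch (mine, not from the thread); it relies on the fact that CPython's file objects emit a ResourceWarning when they are garbage-collected while still open:

import warnings

# ResourceWarning is ignored by default; make it visible.
warnings.simplefilter('always', ResourceWarning)

def leaky_read(path):
    f = open(path, 'rb')  # never closed explicitly, like the original get_files
    return f.read(16)

leaky_read(__file__)
# On CPython, reference counting collects `f` as soon as leaky_read returns and
# prints something like: ResourceWarning: unclosed file <_io.BufferedReader ...>
# On interpreters without reference counting (e.g. PyPy) the file may stay open
# until a later garbage collection pass, or until interpreter shutdown.

Under CPython, running the original code with `python -W always::ResourceWarning` should surface one such warning for each file opened by get_files.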
- … 2 ** 20 with 2 ** 30 or something.
- … yield open() now, because it's much easier to close the file in the same loop than somewhere further down in the pipeline.
now, because it's much easier to close the file in the same loop than somewhere further down in the pipeline. \$\endgroup\$