This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2011年10月20日 02:05 by ncoghlan, last changed 2022年04月11日 14:57 by admin.
| Messages (19) | |||
|---|---|---|---|
| msg145999 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2011年10月20日 02:05 | |
I needed a depth-limited, filtered search of a directory tree recently and came up with the following wrapper around os.walk that brings in a few niceties like glob-style filtering, depth limiting and symlink traversal that is safe from infinite loops. It also emits a named tuple rather than the bare tuple emitted by os.walk: http://code.activestate.com/recipes/577913-selective-directory-walking/ I think this would make a nice addition to 3.3 as shutil.filter_walk, but it need tests, documentation and reformatting as a patch before it can go anywhere. |
|||
| msg147093 - (view) | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2011年11月05日 16:07 | |
During the discussion about adding a chowntree function to shutil (http://mail.python.org/pipermail/python-dev/2011-May/111661.html and ), Victor suggested this: > I don't like the idea of a recursive flag. I would prefer a "map-like" function to "apply" a > function on all files of a directory. Something like shutil.apply_recursive(shutil.chown)... > ... maybe with options to choose between deep-first search and breadth-first search, filter > (filenames, file size, files only, directories only, other attributes?), directory before > files (may be need for chmod(0o000)), etc. Such a function removes the need for copytee, rmtree, dosomethingelsetree, etc. When I first read this feature request I thought it was the same idea, but after reading it again I see that you propose a function that walks and returns, not a function that walks and applies a callback. Both use cases are important, so an apply_tree function could be implemented on top of your filter_walk (I’ll open another report if we reach agreement on the idea of these two functions working together). |
|||
| msg147094 - (view) | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2011年11月05日 16:08 | |
s/and /and #13033/ |
|||
| msg147128 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2011年11月06日 00:43 | |
I should probably update that posted recipe to my latest version (which adds "excluded_files" and "excluded_dirs" parameters).
However, since I've been dealing with remote filesystems where os.listdir() and os.stat() calls from the local machine aren't possible lately, I also think we may need to reconsider how this is structured and look at the idea of building a more effective pipeline model that permits more efficient modes of interaction.
Let's take 'os.walk' as the base primitive - the basis of the pipeline will always be an iterator that produces 3-tuples of a base name, a list of subdirectories and a list of files. The filtering pipeline elements will require that the underlying walk include "topdown=True" and pay attention to changes in the subdirectory list.
Then consider the following possible pipeline elements:
def filter_dirs(walk_iter, *include_filters, exclude_filters=()):
def should_include(dirname):
return any(fnmatch(dirname, include) for include in include_filters)
def should_exclude(dirname):
return any(fnmatch(dirname, include) for exclude in exclude_filters)
for dirpath, subdirs, files in walk_iter:
subdirs[:] = [subdir for subdir in subdirs
if should_include(subdir) and not should_exclude(subdir)]
yield dirpath, subdirs, files
def filter_files(walk_iter, *include_filters, exclude_filters=()):
def should_include(dirname):
return any(fnmatch(dirname, include) for include in include_filters)
def should_exclude(dirname):
return any(fnmatch(dirname, include) for exclude in exclude_filters)
for dirpath, subdirs, files in walk_iter:
files[:] = [fname for fname in files
if should_include(fname) and not should_exclude(fname)]
yield dirpath, subdirs, files
def limit_depth(walk_iter, depth):
if depth < 0:
msg = "Depth limit greater than 0 ({!r} provided)"
raise ValueError(msg.format(depth))
sep = os.sep
for top, subdirs, files in walk_iter:
yield top, subdirs, files
initial_depth = top.count(sep)
if depth == 0:
subdirs[:] = []
break
for dirpath, subdirs, files in walk_iter:
yield dirpath, subdirs, files
current_depth = dirpath.count(sep) - initial_depth
if current_depth >= depth:
subdirs[:] = []
def detect_symlink_loops(walk_iter, onloop=None):
if onloop is None:
def onloop(path):
msg = "Symlink {!r} refers to a parent directory, skipping\n"
sys.stderr.write(msg.format(path))
sys.stderr.flush()
for top, subdirs, files in walk_iter:
yield top, subdirs, files
real_top = os.path.abspath(os.path.realpath(top))
break
for dirpath, subdirs, files in walk_iter:
if os.path.islink(dirpath):
# We just descended into a directory via a symbolic link
# Check if we're referring to a directory that is
# a parent of our nominal directory
relative = os.path.relpath(dirpath, top)
nominal_path = os.path.join(real_top, relative)
real_path = os.path.abspath(os.path.realpath(dirpath))
path_fragments = zip(nominal_path.split(sep), real_path.split(sep))
for nominal, real in path_fragments:
if nominal != real:
break
else:
if not onloop(dirpath):
subdirs[:] = []
continue
yield dirpath, subdirs, files
And pipeline terminators:
def walk_dirs(walk_iter):
for dirpath, subdirs, files in walk_iter:
yield dirpath
def walk_files(walk_iter):
for dirpath, subdirs, files in walk_iter:
for fname in files:
yield os.path.join(dirpath, fname)
def walk_all(walk_iter):
for dirpath, subdirs, files in walk_iter:
yield dirpath
for fname in files:
yield os.path.join(dirpath, fname)
The pipeline terminators could then be combined with ordinary iterable consumers like comprehensions:
base_walk = detect_symlink_loops(os.walk(os.path.abspath(base_dir, followlinks=True)))
depth_limited_walk = limit_depth(base_walk, 2)
filtered_walk = filter_dirs(filter_files(depth_limited_walk, "*.py"), "*.pyp")
tree_info = {path, os.stat(path) for path in walk_all(filtered_walk)}
|
|||
| msg147129 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2011年11月06日 00:46 | |
This needs more thought - pypi package coming soon :) |
|||
| msg147131 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2011年11月06日 01:11 | |
Nick, perhaps you want to have a look at http://hg.python.org/features/pathlib/ (it doesn't have a filter_walk equivalent but it could grow one :-)) |
|||
| msg147132 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2011年11月06日 01:25 | |
That's one of the nicer attempts I've seen at an object-oriented path library, but I have a core problem with OOP path APIs, and it relates to the Unicode encoding/decoding problem: the ultimate purpose of path objects is almost always to either pass them to the OS, or else to another application. That exchange will almost *always* happen as a plain string. So when your approach is based on a more sophisticated type internally, you end up having to be very careful about all of your system boundaries, making sure that "paths" are correctly being turned into "Paths". However, one of my hopes for iterwalk will be that it *won't* care if the underlying walk iterator produces Path objects instead of ordinary strings, so long as those objects can be passed to fnmatch, os, etc and work correctly. |
|||
| msg147135 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2011年11月06日 03:01 | |
Initial version available at: https://bitbucket.org/ncoghlan/iterwalk/src I'll get it published to PyPI once the test suite is in a slightly better state (no filesystem based tests as yet). |
|||
| msg147335 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2011年11月09日 07:24 | |
A related idea is updating os.walk tuples to be a custom object offering a namedtuple style API (similar to sys.float_info). I created a separate issue for that: #13375 I've also updated the title of this issue to describe the problem it aims to address rather than the specific solution that first occurred to me. |
|||
| msg147389 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2011年11月10日 02:40 | |
I changed the package name to walkdir: https://bitbucket.org/ncoghlan/walkdir/overview |
|||
| msg147580 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2011年11月13日 23:27 | |
And walkdir is now a published package: http://walkdir.readthedocs.org My plan for this issue now is to maintain walkdir as a standalone package for 2.7 and 3.2, but still add the functionality to shutil for 3.3+. However, I'll gather feedback on the PyPI module for a while before making any changes to the 3.3 repo. |
|||
| msg152379 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2012年01月31日 12:53 | |
With the release of 0.3, I'm pretty happy with the WalkDir design (previous versions coerced the output values to ordinary 3-tuples, now it will pass along whatever the underlying iterable produces with changing the type at all). |
|||
| msg152381 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2012年01月31日 12:57 | |
(speaking of which, I'd just mention I've published pathlib as a standalone package: http://pypi.python.org/pypi/pathlib/ ) |
|||
| msg153294 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2012年02月13日 20:12 | |
Not sure this is the right place to mention this, but I'm realizing handling symlink loops will be interesting. It is "interesting" because, when e.g. you are globbing, your glob's results may not include the symlinks' target path; so there are cases where you need to keep the symlink in the results. (and in the general case you may or may not want to keep some duplicates, depending on what you do with the links) I am adding rglob() to pathlib and currently ignoring (i.e. crashing :-)) symlink loops. I would be interested in any ideas as to how handle them correctly. |
|||
| msg153295 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2012年02月13日 20:16 | |
Oh, forgot to mention, the term "symlink loop" itself is ambiguous. There are direct symlink loops: an example is a "some_dir/linky" link pointing to "../some_dir/linky". These will fail when resolving them. There are indirect symlink loops: "some_dir/linky" pointing to "../some_dir". Resolving them works fine, but recursively walking them produces an infinite recursion. Lots of fun :) |
|||
| msg153316 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2012年02月14日 03:22 | |
WalkDir attempts to handle symlink loops, but the test suite doesn't currently ensure that that handling works as expected (I did some initial manual tests and haven't updated it since, though). It's... not trivial: https://bitbucket.org/ncoghlan/walkdir/src/1b239056e9d5/walkdir.py#cl-163 |
|||
| msg179290 - (view) | Author: Charles-François Natali (neologix) * (Python committer) | Date: 2013年01月07日 22:35 | |
Nick, I think this would be a great addition (I have often seen people trying to implement this in their own code, and I certainly did it myself). What's the status of walkdir, do you think it's mature enough to be merged? |
|||
| msg179293 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013年01月07日 22:44 | |
pathlib and walkdir are two nice piece of code, but do we need two modules? It would be nice to merge them into one unique module :-) (Or is it a stupid idea?) |
|||
| msg179303 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2013年01月08日 01:55 | |
The problem with the current walkdir implementation is that without a rich path object you end up making a lot of redundant system calls. Combined with the inherent overhead of nested generators, it just *feels* bad (even in situations where worrying about the speed is truly a case of premature optimisation) The caching in pathlib should deal with the problem of redundant system calls, so rebasing walkdir on top of that would probably be a good idea (i.e. all walkdir APIs would produce pathlib paths, and implicit convert strings they encounter to paths). |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:57:22 | admin | set | github: 57438 |
| 2013年03月06日 20:01:03 | gregory.p.smith | set | versions: + Python 3.4, - Python 3.3 |
| 2013年01月08日 01:55:56 | ncoghlan | set | messages: + msg179303 |
| 2013年01月07日 22:44:06 | vstinner | set | messages: + msg179293 |
| 2013年01月07日 22:35:13 | neologix | set | messages: + msg179290 |
| 2012年02月14日 03:22:09 | ncoghlan | set | messages: + msg153316 |
| 2012年02月13日 20:16:35 | pitrou | set | messages: + msg153295 |
| 2012年02月13日 20:12:57 | pitrou | set | nosy:
+ neologix messages: + msg153294 |
| 2012年01月31日 12:57:51 | pitrou | set | messages: + msg152381 |
| 2012年01月31日 12:53:57 | ncoghlan | set | messages: + msg152379 |
| 2011年12月16日 12:44:35 | giampaolo.rodola | set | nosy:
+ giampaolo.rodola |
| 2011年11月13日 23:27:13 | ncoghlan | set | messages: + msg147580 |
| 2011年11月10日 02:40:33 | ncoghlan | set | messages: + msg147389 |
| 2011年11月09日 07:24:24 | ncoghlan | set | messages:
+ msg147335 title: Add shutil.filter_walk -> Improve tools for iterating over filesystem directories |
| 2011年11月06日 03:01:11 | ncoghlan | set | messages: + msg147135 |
| 2011年11月06日 01:25:02 | ncoghlan | set | messages: + msg147132 |
| 2011年11月06日 01:11:59 | pitrou | set | nosy:
+ pitrou messages: + msg147131 |
| 2011年11月06日 00:46:21 | ncoghlan | set | messages: + msg147129 |
| 2011年11月06日 00:43:33 | ncoghlan | set | messages: + msg147128 |
| 2011年11月05日 16:08:22 | eric.araujo | set | messages: + msg147094 |
| 2011年11月05日 16:07:58 | eric.araujo | set | nosy:
+ vstinner messages: + msg147093 |
| 2011年10月21日 13:52:40 | eric.araujo | set | nosy:
+ eric.araujo |
| 2011年10月20日 02:05:30 | ncoghlan | create | |