homepage

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Add a generic directory walker method to avoid symlink attacks
Type: security Stage: resolved
Components: Library (Lib) Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, hynek, jcea, ncoghlan, neologix, pitrou, python-dev, rosslagerwall, tarek
Priority: normal Keywords: patch

Created on 2012年01月08日 00:14 by hynek, last changed 2022年04月11日 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
fdwalk-2.diff neologix, 2012年01月12日 06:52 review
fwalk-3.diff neologix, 2012年01月31日 21:42 review
fwalk-single_fd.diff neologix, 2012年01月31日 21:42 review
fwalk-doc.diff neologix, 2012年05月12日 15:06 review
Messages (29)
msg150833 - (view) Author: Hynek Schlawack (hynek) * (Python committer) Date: 2012年01月08日 00:14
This is an offspring of #4489 (which is a security bug). The method is AFAIU intended to be private. 
As shown in the discussion of the mentioned #4489, there is a whole family of attacks that exploit the time window between gathering path names and executing a function on them. A general description of this problem can be found in: https://www.securecoding.cert.org/confluence/display/seccode/POS35-C.+Avoid+race+conditions+while+checking+for+the+existence+of+a+symbolic+link
While the consequences in rmtree() are probably most dramatic, other recursive functions could benefit too (chmodtree() and chowntree() were mentioned) so Charles-François suggested to write a "generic walker method that would take as argument the methods to call on a directory and on a file (or link)".
Some (probably) necessary helper functions has been already implemented in #4761 (*at()) and #10755 (fdlistdir()).
Has there already been done any work? Ross mentioned he wanted to take a stab?
msg150838 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2012年01月08日 01:08
I'm working on a library of general directory walking tools that will hopefully make their way back into the stdlib at some point (http://walkdir.readthedocs.org).
They're designed to filter and transform the output of os.walk (and similar iterators) in various ways.
It may provide a good environment for prototyping a general purpose "tree_map" for applying an operation to a filesystem tree without being vulnerable to symlink attacks.
msg150845 - (view) Author: Ross Lagerwall (rosslagerwall) (Python committer) Date: 2012年01月08日 04:03
> Has there already been done any work? Ross mentioned he wanted to take a stab?
Unfortunately, I'm rather busy at the moment but when I get some free time and if no one else wants to work on it then I'll take a look.
msg150855 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2012年01月08日 10:13
Since walkdir is currently entirely based on returning filesystem paths as strings (just like os.walk()) and hence shares the pervasive symlink attack vulnerability, I'm particularly interested in the question of whether or not the various *at APIs can be used to avoid symlink attacks if we just have a os.walkfd() API that emits a (dirfd, subdirs, files) triple instead of the os.walk style (dirpath, subdirs, files).
The reason I'd find that interesting is that many of walkdir's filtering steps (notably those for including and excluding directories) don't care about the value of dirpath, so they could still be used with such an API.
Thoughts?
msg150858 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2012年01月08日 11:44
Another, possibly better, alternative would be to produce a tuple-subclass that adds a separate "dirfd" attribute to the (dirpath, subdirs, files) triple.
I'll stop talking about the walkdir implications here. Instead, I've created a corresponding issue on walkdir's own tracker:
https://bitbucket.org/ncoghlan/walkdir/issue/8/add-walkfd 
msg150864 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012年01月08日 13:55
> Since walkdir is currently entirely based on returning filesystem
> paths as strings (just like os.walk()) and hence shares the pervasive
> symlink attack vulnerability, I'm particularly interested in the
> question of whether or not the various *at APIs can be used to avoid
> symlink attacks if we just have a os.walkfd() API that emits a (dirfd,
> subdirs, files) triple instead of the os.walk style (dirpath, subdirs,
> files).
Be aware that you have to manage dirfd's lifetime, which can make things
interesting.
Also be aware that symlinks mean sometimes you won't have a dirfd: if
you have a symlink that points to another directory, you can't open that
directory using openat from the symlink's directory. So if you follow
symlinks (or have an option to do so) you must also take that case into
account.
msg150870 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年01月08日 15:25
Here's a possible walkfd() implementation.
Example:
"""
$ cat /home/cf/testwalkfd.py
import os
import sys
topfd = os.open(sys.argv[1], os.O_RDONLY)
for rootfd, dirs, files in os.walkfd(topfd):
 print(rootfd, dirs, files)
$ ./python ~/testwalkfd.py /etc/apt/
3 ['sources.list.d', 'preferences.d', 'trusted.gpg.d', 'apt.conf.d']
['trustdb.gpg', 'trusted.gpg~', 'sources.list', 'trusted.gpg']
4 [] []
4 [] []
4 [] []
4 [] ['70debconf', '01autoremove', '00trustcdrom']
[44194 refs]
"""
AFAICT, a safe rmtree could be implemented simply with
walkfd(topdown=False), but Antoine's remarks make me thing I missed
something.
> Be aware that you have to manage dirfd's lifetime, which can make things
> interesting.
Basically, this means that doing:
for rootfd, dirs, files in walkfd(topfd):
 print(fstat(rootfd), dirs, files))
is valid whereas
print([(fstat(rootfd), dirs, files) for (rootfd, dirs, files) in
walkfd(topfd)]) isn't.
> Also be aware that symlinks mean sometimes you won't have a dirfd: if
> you have a symlink that points to another directory, you can't open that
> directory using openat from the symlink's directory. So if you follow
> symlinks (or have an option to do so) you must also take that case into
> account.
I'm not sure I understand this. Why "you can't open that directory
using openat from the symlink's directory". Could you elaborate?
msg150873 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012年01月08日 15:48
> > Also be aware that symlinks mean sometimes you won't have a dirfd: if
> > you have a symlink that points to another directory, you can't open that
> > directory using openat from the symlink's directory. So if you follow
> > symlinks (or have an option to do so) you must also take that case into
> > account.
> 
> I'm not sure I understand this. Why "you can't open that directory
> using openat from the symlink's directory". Could you elaborate?
Hmm, sorry, I must have misremembered. I thought openat didn't follow
symlinks.
As for the patch, I think there's a problem with the API:
+ This behaves exactly like walk(), except that it accepts a file descriptor
+ as top directory, and yields a 3-tuple
+
+ dirfd, dirnames, filenames
It doesn't tell you to which dirname corresponds dirfd, so you don't
know the path of the directory you are handed (which can be useful for
progress report, error report, or anything else where you need the name
- e.g. making a zip archive of the directory). Also giving the dirnames
without their fds encourages using them by name, not by fd ;-)
Also, walkfd would be easier to use if callable with a str or bytes path
rather than an int fd.
msg150916 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2012年01月08日 23:40
Thanks for that Charles-François - do you mind if I adapt that for walkdir?
The changes I would make are basically those that Antoine pointed out:
- rather than replacing the dirpath entry, instead yield a 4-tuple that appends the dirfd attribute at the end*
- if the dup(fd) fails, fall back to assuming top is a string or bytes path
*I'm still interested in opinions on this aspect. I see 5 main possibilities:
- (dirfd, subdirs, files) triple (problematic for the reasons Antoine pointed out)
- (dirpath, subdirs, files, dirfd) 4-tuple
- ((dirpath, dirfd), subdirs, files) nested tuple
- (dirpath, subdirs, files) tuple subclass with separate dirfd attribute
- (dirpath, subdirs, files) triple with dirpath as a str subclass with a separate fd attribute
I'm currently leaning towards the simple 4-tuple approach - it's simple, explicit and the walkdir pipeline operations can easily accept either underlying iterable by using indexing operations rather than tuple unpacking (a change I already planned to make so that the pipeline passed along the objects from the underlying iterable, anyway)
(I'll also need to create ctypes-based variants of all the relevant *at functions, since the stdlib wrappers won't be available in existing versions of Python)
msg150919 - (view) Author: Ross Lagerwall (rosslagerwall) (Python committer) Date: 2012年01月09日 03:54
> I'm currently leaning towards the simple 4-tuple approach
I would also take that approach. It seems simplest to me.
msg150925 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年01月09日 08:05
> Hmm, sorry, I must have misremembered. I thought openat didn't follow
> symlinks.
OK, I was afraid I had missed something.
 > As for the patch, I think there's a problem with the API
Yes, it was really a proof-of-concept, the directory names are missing.
> Also, walkfd would be easier to use if callable with a str or bytes path
> rather than an int fd.
Agreed.
> Also giving the dirnames without their fds encourages using them
> by name, not by fd ;-)
Well, that's not easy:
- right now, the code uses O(depth of directory hierarchy) FDs
- returning FDs for sub-directories would require O(number of nodes in
the hierarchy), or at least O(max number of child nodes): I fear we'll
run out of FDs quite easily
Also, I don't think it's really a problem, since you have to use the
names anyway. The *at() family accepts a FD as a pointer to the
containing directory, but the target entries are accessed by name.
For example, to perform a safe rmtree, you would do something like:
for dirfd, dirs, files in os.walkfd(topfd, topdown=False):
 for file in files:
 os.unlinkat(dirfd, file)
 for dir in dirs:
 os.unlinkat(dirfd, dir, os.AT_REMOVEDIR)
> Thanks for that Charles-François - do you mind if I adapt that for walkdir?
Of course not, go ahead. I'll update walkfd() accordingly, and write
doc and test for it.
By the way, do you plan to get walkdir merged in 3.3?
I've been doing a lot of sys-admin scripts lately, and this would be
really helpful.
> I'm currently leaning towards the simple 4-tuple approach
Sounds good to me.
msg150991 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2012年01月10日 00:33
OK, os.walkfd is sounding good:
- accepts a file descriptor, byte sequence or string for "top"
- produces 4-tuples, with the dirfd added at the end
- documents clearly that the dirfd is normally only valid until the next iteration step, so you need to call os.dup() if you want to hang onto it
As far as walkdir integration goes, I currently plan to add it as a "Directory Walking" subsection in shutil before the first alpha. However, it needs a few more updates in PyPI first (e.g. preserving the tuples produced by the underlying iterators, making sure it behaves itself when handed binary paths). I'll post to python-dev about it before I actually commit anything.
msg151032 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年01月10日 23:06
Here's a patch with tests and documentation.
I noticed something surprising:
walk() with followlinks=False returns links to directories as
directories (in dirnames).
I find this surprising, since if you don't follow symlinks, those are
just files (and you don't recurse into it). Also, it's a pain when you
want to remove dirnames, since you have to distinguish between a link
and a directory (unlink()/rmdir() or unlinkat() without/with
AT_REMOVEDIR)
To be consistent with this behavior, I had to change fdwalk() (I
renamed it to be consistent with fdlistdir()) to perform a call to
fstatat() without AT_SYMLINK_NOFOLLOW, since otherwise it would report
such links as files.
So the bottom line is that because of this, you can have up to 3
stat() calls per entry:
- fstatat(rootfd, name)
- fstatat(rootfd, name, AT_SYMLINK_NOFOLLOW) right before opening the directory
- fstat(dirfd) right after open to check that we're dealing with the same file
(walk() currently uses two stat() per entry, so it's not too bad).
msg151077 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年01月11日 18:58
Here's an updated version.
Note that I'm not pushing towards changing the current behavior
pertaining to symlinks to directories, because if we change this, this
will break code.
For example to count the number of lines of all the files under a
directory, a code could go like this:
for root, dirs, files in os.walk(top):
 for file in files:
 f = open(file)
 for n, l in enumerate(f, 1):
 pass
 print(file, n)
If, suddently, a symlink to a directory appeared in files, this will
break. So I'm not convinced it's worth changing this. A symlink to a
directory is not much closer to a file than to a directory, it really
depends on the use case.
I'm also fine with keeping fdwalk() consistent with this to make
porting easier (and also because it makes it easy to test, I just have
to compare fdwlak()'s output to walk()'s output).
msg152376 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2012年01月31日 12:08
Hmm, given the various *at() APIs that sort alphabetically next to their path based counterparts, perhaps we can make the naming consistency change go the other way? (i.e. listdirfd() and walkfd()). Even if POSIX puts the fd at the front, do we really have to follow them in a silly naming scheme?
And as per the python-dev discussion, +1 for being consistent with os.walk() when it comes to symlinks to directories.
Aside from the naming question, is there anything else holding this up?
msg152377 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2012年01月31日 12:27
Taking a closer look at the current naming scheme in the os module, fdopen() appears to be the only current function that uses the 'fd' prefix. All the other operations that accept a file descriptor just use 'f' as the prefix (fchmod, fchown, faccess, etc).
Given that, flistdir() and fwalk() seem like the most consistent choices of name for APIs that aren't directly matching an underlying POSIX function name.
msg152378 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012年01月31日 12:31
There's something I don't understand in the patch: why does _are_same_file examine st_mode?
msg152385 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年01月31日 13:45
> Given that, flistdir() and fwalk() seem like the most consistent choices of name for APIs that aren't directly
> matching an underlying POSIX function name.
Well, that seems OK for me.
I guess the only reason fdlistdir() is named that way is because of
fdopendir(3).
I can make the change for fwalk(), and since 3.3 hasn't been released
yet, I guess we can rename fdlistdir() too.
> There's something I don't understand in the patch: why does _are_same_file examine st_mode?
It doesn't have to, that's actually useless.
The only thing that bothers me is that it needs O(height of directory
tree), so with really deep directory trees, we could run out of FDs.
Not sure that could be a problem in practice, but that's something to
keep in mind.
msg152386 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012年01月31日 13:51
One other thing: is it deliberate to silence errors in the following snippet?
+ try:
+ names = fdlistdir(topfd)
+ except error as err:
+ if onerror is not None:
+ onerror(err)
+ return
msg152387 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年01月31日 14:04
It's to mimic os.walk()'s behavior:
http://hg.python.org/cpython/file/bf31815548c9/Lib/os.py#l268 
msg152408 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年01月31日 21:42
Here are two new versions, both addressing Antoine's and Nick's comments:
- fwalk-3.diff is just an updated version
- fwalk-single_fd.diff doesn't use more than 2 FDs to walk a directory
tree, instead of the depth of directory tree. It's not as simple and
clean as I'd like it to be, but it should be much more robust, and
still safe (please make sure about that :-).
I was a little worried about the performance impact, so I did some
trivial benchmarks:
- O(depth) fwalk() is actually a tiny bit faster than walk() (it may
be because we don't do as much path lookup)
- O(1) fwalk() is around 20% slower, on a pure-traversal benchmark (so
in a realistic use case where we would actually do something with the
values returned by fwalk() the difference shouldn't be that
noticeable)
msg152410 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012年01月31日 21:56
> I was a little worried about the performance impact, so I did some
> trivial benchmarks:
> - O(depth) fwalk() is actually a tiny bit faster than walk() (it may
> be because we don't do as much path lookup)
> - O(1) fwalk() is around 20% slower, on a pure-traversal benchmark (so
> in a realistic use case where we would actually do something with the
> values returned by fwalk() the difference shouldn't be that
> noticeable)
I think the O(depth) version is fine. The O(1) version is quite more
complicated, difficult to follow, and it seems less robust (it doesn't
use try/finally and therefore might leak fds if the generator isn't
exhausted before being destroyed).
On modern systems you have at least 1024 fds, so the restriction
shouldn't be a problem.
msg152425 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年02月01日 08:36
> I think the O(depth) version is fine. The O(1) version is quite more
> complicated, difficult to follow, and it seems less robust (it doesn't
> use try/finally and therefore might leak fds if the generator isn't
> exhausted before being destroyed).
I agree.
> On modern systems you have at least 1024 fds, so the restriction
> shouldn't be a problem.
Actually, I think you're much more likely to run above the max
recursion limit than RLIMIT_NOFILE (OTOH, you don't know how many FDs
are already open at the time of the call).
msg152426 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2012年02月01日 10:08
I'm also a fan of using the simpler approach unless/until we have solid evidence that the file descriptor limit could be a problem in practice.
A comment in the code mentioning the concern, along with the fact that there's an alternate algorithm attached to this tracker issue that avoids it would probably be appropriate though.
msg152689 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012年02月05日 14:16
New changeset 773a97b3927d by Charles-François Natali in branch 'default':
Issue #13734: Add os.fwalk(), a directory walking function yielding file
http://hg.python.org/cpython/rev/773a97b3927d 
msg152691 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年02月05日 15:19
Committed, thanks for the comments.
Note to myself (and others that might be interested in the O(1)) version):
we can't simply call openat(dirfd, "..", O_RDONLY) to re-open the current directory's file descriptor after having walked a into one of its subdirectories because if this subdirectory is actually a link, we'll open the parent directory of the target directory, instead of the current (toppath) directory. OTOH, if the user passes followlinks=True, then we don't have to bother with openat() and friends in which case we don't have to bother passing FDs between calls to fwalk().
msg160376 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012年05月10日 21:20
It would be nice if the documentation of fwalk() explained why you would want to use it over walk().
msg160472 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年05月12日 15:06
> It would be nice if the documentation of fwalk() explained why you would
> want to use it over walk().
How does the attached patch look?
msg160490 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012年05月12日 19:52
2012年5月12日 Charles-François Natali <report@bugs.python.org>:
>
> Charles-François Natali <neologix@free.fr> added the comment:
>
>> It would be nice if the documentation of fwalk() explained why you would
>> want to use it over walk().
>
> How does the attached patch look?
Okay, but explain what a "symlink attack" is.
History
Date User Action Args
2022年04月11日 14:57:25adminsetgithub: 57943
2012年05月12日 19:52:12benjamin.petersonsetmessages: + msg160490
2012年05月12日 15:06:13neologixsetfiles: + fwalk-doc.diff

messages: + msg160472
2012年05月10日 21:20:56benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg160376
2012年02月05日 15:19:30neologixsetstatus: open -> closed
resolution: fixed
messages: + msg152691

stage: resolved
2012年02月05日 14:16:01python-devsetnosy: + python-dev
messages: + msg152689
2012年02月01日 10:08:46ncoghlansetmessages: + msg152426
2012年02月01日 08:36:08neologixsetmessages: + msg152425
2012年01月31日 21:56:14pitrousetmessages: + msg152410
2012年01月31日 21:42:02neologixsetfiles: + fwalk-3.diff, fwalk-single_fd.diff

messages: + msg152408
2012年01月31日 14:04:54neologixsetmessages: + msg152387
2012年01月31日 13:51:37pitrousetmessages: + msg152386
2012年01月31日 13:45:50neologixsetmessages: + msg152385
2012年01月31日 12:31:47pitrousetmessages: + msg152378
2012年01月31日 12:27:16ncoghlansetmessages: + msg152377
2012年01月31日 12:09:00ncoghlansetmessages: + msg152376
2012年01月12日 06:52:47neologixsetfiles: + fdwalk-2.diff
2012年01月12日 06:52:27neologixsetfiles: - fdwalk-1.diff
2012年01月12日 06:52:09neologixsetfiles: - fdwalk.diff
2012年01月12日 06:51:58neologixsetfiles: - walkfd.diff
2012年01月11日 18:58:53neologixsetfiles: + fdwalk-1.diff

messages: + msg151077
2012年01月10日 23:06:21neologixsetfiles: + fdwalk.diff

messages: + msg151032
2012年01月10日 00:33:48ncoghlansetmessages: + msg150991
2012年01月09日 14:28:05jceasetnosy: + jcea
2012年01月09日 08:05:31neologixsetmessages: + msg150925
2012年01月09日 03:54:38rosslagerwallsetmessages: + msg150919
2012年01月08日 23:40:21ncoghlansetmessages: + msg150916
2012年01月08日 15:48:23pitrousetmessages: + msg150873
2012年01月08日 15:25:48neologixsetfiles: + walkfd.diff
keywords: + patch
messages: + msg150870
2012年01月08日 13:55:20pitrousetmessages: + msg150864
2012年01月08日 11:44:27ncoghlansetmessages: + msg150858
2012年01月08日 10:13:34ncoghlansetmessages: + msg150855
2012年01月08日 06:20:24rosslagerwalllinkissue4489 dependencies
2012年01月08日 04:03:02rosslagerwallsetmessages: + msg150845
2012年01月08日 01:08:23ncoghlansetnosy: + ncoghlan
messages: + msg150838
2012年01月08日 00:14:02hynekcreate

AltStyle によって変換されたページ (->オリジナル) /