homepage

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Faster os.walk
Type: performance Stage: resolved
Components: Library (Lib) Versions: Python 3.4
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, larry, neologix, pitrou, rosslagerwall, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2012年06月27日 08:11 by serhiy.storchaka, last changed 2022年04月11日 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
faster_walk.patch serhiy.storchaka, 2012年06月27日 08:11 review
Messages (7)
msg164127 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012年06月27日 08:11
Using os.fwalk (if it is available) we can make os.walk more fast.
Microbenchmark:
./python -m timeit -s "from os import walk" "for x in walk('Lib'): pass"
Results:
Vanilla: 112 msec
Patched: 90.5 msec
msg164137 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2012年06月27日 09:58
It's amusing that using fwalk and throwing away the last argument is faster than a handwritten implementation. On the other hand, fwalk also uses a lot of file descriptors. Users with processes which were already borderline on max file descriptors might not appreciate upgrading to find their os.walk calls suddenly failing.
Can you figure out why fwalk is faster, and apply that advantage to walk *without* consuming so many file descriptors?
No rush... :)
msg164141 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年06月27日 10:20
> On the other hand, fwalk also uses a lot of file descriptors. Users 
> with processes which were already borderline on max file descriptors 
> might not appreciate upgrading to find their os.walk calls suddenly 
> failing.
It doesn't have to.
Right now, it uses O(depth of the directory tree) FDs. It can be changed to only require O(1) FDs, see http://bugs.python.org/issue13734.
For example, GNU coreutils "rm -rf" uses *at() syscalls and only requires a constant number of FDs.
> Can you figure out why fwalk is faster, and apply that advantage to 
> walk *without* consuming so many file descriptors?
I didn't run any benchmark or test, but one reason why fwalk() is faster could be simply because it doesn't do as much path resolution - which is a somewhat expensive operation - thanks to the relative FD being passed.
I guess your mileage will vary with the FS in use, and the kernel version (there's been a lot of work to speed up path resolution by Nick Piggin during the last years or so).
Anyway, I think that such optimization is useless, because this micro-benchmark doesn't make much sense: when you walk a directory tree, it's usually to do something with the files/directories encountered, and as soon as you do something with them - stat(), unlink(), etc - the gain on the walking time will become negligible.
msg164143 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2012年06月27日 10:31
> It doesn't have to.
> Right now, it uses O(depth of the directory tree) FDs. 
> It can be changed to only require O(1) FDs
But closing and reopening those file descriptors seems like it might slow it down; would it still be a performance win?
Also, I'm not a security expert, but would the closing/reopening allow the possibility of timing attacks? If so, that might still be okay for walk which makes no guarantees about safety. (But obviously it would be unacceptable for fwalk.)
> Anyway, I think that such optimization is useless, because this
> micro-benchmark doesn't make much sense: when you walk a
> directory tree, it's usually to do something with the
> files/directories encountered, and as soon as you do something
> with them - stat(), unlink(), etc - the gain on the walking
> time will become negligible.
I'm not sure that "usually" is true here. I suggest that "usually" people use os.walk to find *particular files* in a directory tree, generally by filename. So most of the time os.walk really is quickly iterating over directory trees doing very little.
I think 20% is a respectable gain, and it's hard for me to say "no" to functions that make Python faster for free. (Well, for the possible cost of a slightly more expensive algorithm.) So I'm +x for now.
msg164170 - (view) Author: Ross Lagerwall (rosslagerwall) (Python committer) Date: 2012年06月27日 16:53
This looks like the kind of optimization that depends hugely on what kernel you're using. Maybe on FreeBSD/Solaris/whatever, standard os.walk() is faster?
If this micro-optimization were to be accepted, someone would have to be keen enough to test it is different ways on many different versions of every platform to conclusively prove that it is faster in general.
msg164172 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012年06月27日 17:00
> This looks like the kind of optimization that depends hugely on what
> kernel you're using.
Agreed.
Also, I'm worried that there might be subtle differences between walk() and fwalk() which could come and bite users if we silently redirect the former to the latter.
(for the record, I get a 15% speedup on this Linux box)
msg173170 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012年10月17日 13:20
Timing of walk depends on how deep we dive into the directories.
$ ./python -m timeit -s "from os import walk" "for x in walk('/home/serhiy/py/1/2/3/4/5/6/7/8/9/cpython/'): pass"
10 loops, best of 3: 398 msec per loop
$ ./python -m timeit -s "from os import fwalk" "for x in fwalk('/home/serhiy/py/1/2/3/4/5/6/7/8/9/cpython/'): pass"
10 loops, best of 3: 249 msec per loop
Given the above mentioned objections (consuming a lot of file descriptors, OS/FS dependency, testing burden) I withdraw my patch and close the issue. Thanks all for discussion.
History
Date User Action Args
2022年04月11日 14:57:32adminsetgithub: 59405
2012年10月17日 13:20:58serhiy.storchakasetstatus: open -> closed
resolution: rejected
messages: + msg173170

stage: resolved
2012年06月27日 17:00:45pitrousetnosy: + pitrou
messages: + msg164172
2012年06月27日 16:53:55rosslagerwallsetnosy: + rosslagerwall
messages: + msg164170
2012年06月27日 15:45:30Arfreversetnosy: + Arfrever
2012年06月27日 10:31:18larrysetmessages: + msg164143
2012年06月27日 10:20:52neologixsetnosy: + neologix
messages: + msg164141
2012年06月27日 09:58:10larrysetmessages: + msg164137
2012年06月27日 09:33:52serhiy.storchakasetnosy: + larry
2012年06月27日 08:11:22serhiy.storchakacreate

AltStyle によって変換されたページ (->オリジナル) /