homepage

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Use scandir() to speed up the glob module
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.6
process
Status: closed Resolution: fixed
Dependencies: 16620 25994 Superseder:
Assigned To: serhiy.storchaka Nosy List: Dima.Tisnek, benhoyt, gvanrossum, python-dev, serhiy.storchaka, vstinner, xdegaye
Priority: normal Keywords: patch

Created on 2015年11月10日 11:18 by xdegaye, last changed 2022年04月11日 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
glob_isdir.patch xdegaye, 2015年11月10日 11:18 review
glob_isdir_2.patch xdegaye, 2015年11月10日 15:35 review
glob_scandir.patch serhiy.storchaka, 2016年01月02日 15:22 review
glob_scandir_2.patch serhiy.storchaka, 2016年01月03日 08:23 review
glob_scandir_doc.patch xdegaye, 2016年01月08日 13:53 review
glob_scandir_2_diff.patch xdegaye, 2016年01月10日 14:53
glob_scandir_3.patch xdegaye, 2016年01月10日 14:53 review
glob_scandir_4.patch serhiy.storchaka, 2016年01月11日 08:33 review
glob_scandir_5.patch serhiy.storchaka, 2016年02月11日 11:57 review
Messages (30)
msg254440 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2015年11月10日 11:18
The glob module happily joins names of regular files together with os.path.join() or attempts to list the files contained into a regular file (sic). The same 'except os.error' statement is used to handle both these cases and the case of a non readable directory.
The attached patch makes the code more correct and easier to understand.
msg254449 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2015年11月10日 15:35
This second patch rewrites the conditionals that decide whether to call _iglob() recursively, in a more natural way and without changing the behavior from the first patch.
Note that when 'dirname == pathname', then basename is empty and glob2() or glob1() are not called.
The patch also refactors the code with a new _listdir() function.
msg257347 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016年01月02日 15:22
In general the patch LGTM. It allows to speed up glob by 5-10% in warn test. But much more gain we can achieve by using os.scandir(). Here are results of microbenchmarks:
$ ./python -m timeit -s "from glob import glob" -- "glob('**/*', recursive=True)"
Unpatched: 201 msec per loop
Using isdir(): 181 msec per loop
Using scandir(): 65.2 msec per loop
$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/*', recursive=True)"
Unpatched: 2.06 sec per loop
Using isdir(): 1.92 sec per loop
Using scandir(): 820 msec per loop
$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/', recursive=True)"
Unpatched: 1.77 sec per loop
Using isdir(): 1.61 sec per loop
Using scandir(): 431 msec per loop
Yet one benefit is that iglob() now yields path names without the delay for reading the full content of a directory (see issue22167).
msg257354 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016年01月02日 19:49
I don't think it is a good idea to break backward compatibility here even if the globN are internal functions. Better would be to continue to have globN functions that support the existing API and emit deprecation warnings.
msg257403 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016年01月03日 08:23
I think so too. I just wanted someone to confirmed that it is not overcautiousness. For now glob1() is used only in one place in the stdlib (issue16620).
But there was other problem with previous patch, the same as with current implementation of os.walk() (issue25995). It makes glob to use a lot of file descriptors.
Updated patch lefts deprecated glob1() and glob2() and makes glob to consume all scandir iterator before starting to yield values (but the problem with fd leaks on non-refcounted implementations is still left, issue25994). This doesn't affect performance, but lefts the issue with delaying (issue22167).
msg257421 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016年01月03日 14:55
os.scandir() is called recursively in the last patch and the file descriptors are not closed until returning from the recursion.
The glob functions should fail explicitly when scandir() raises OSERROR with posix errno set to EMFILE (The process has too many files open), or even better, silently ignore only the PermissionError exception instead of ignoring OSERROR.
msg257423 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016年01月03日 16:30
I may be missing something, anyway here are few comments on the patch:
Is the call to entry.is_symlink() in _iterdir() necessary since is_dir() follows symlinks ?
If _iterdir() would yield the DirEntry instance instead of DirEntry.name, then _rlistdir() could use x.is_dir() to know whether it is correct to iterate over _rlistdir(x.path, dironly)
msg257424 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016年01月03日 17:20
> os.scandir() is called recursively in the last patch and the file descriptors are not closed until returning from the recursion.
No, os.scandir() is called non-recursively in _iterdir(), and since _iterdir() results are always accumulated in a list, a recursion starts only after exhausting the os.scandir() iterator and closing the file descriptor. We need first to resolve issue25994 to close the file descriptor explicitly.
> The glob functions should fail explicitly when scandir() raises OSERROR with posix errno set to EMFILE (The process has too many files open), or even better, silently ignore only the PermissionError exception instead of ignoring OSERROR.
Patched code passes existing tests. Do you have additional tests?
> Is the call to entry.is_symlink() in _iterdir() necessary since is_dir() follows symlinks ?
Ah, I thought is_dir() doesn't follow symlinks. Yes, now the call to entry.is_symlink() is not necessary.
> If _iterdir() would yield the DirEntry instance instead of DirEntry.name, then _rlistdir() could use x.is_dir() to know whether it is correct to iterate over _rlistdir(x.path, dironly)
Yes, but this can complicate the rest of the code. _rlistdir() is called with dironly=False only when the pattern ends with '**'. I'm not sure this is enough important case for optimization. In most cases '**' is used in the middle of the pattern, and all names yielded by _iterdir() are directory names (or broken symlinks).
msg257456 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016年01月04日 11:20
> and since _iterdir() results are always accumulated in a list
Right, I failed to note that point. And so, since the file descriptor opened by os.scandir() is closed within each call to recursive _rlistdir(), then my other comment about EMFILE does not stand as well.
msg257655 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016年01月06日 22:59
Related issue: #26032 "Use scandir() to speed up pathlib globbing".
msg257659 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016年01月06日 23:08
(IOW once this patch has been applied maybe you can do the same for globbing in pathlib as requested in issue #26032.)
msg257758 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016年01月08日 13:53
Adding a doc patch.
msg257912 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016年01月10日 14:53
FWIW I have followed the idea of having _iterdir() yielding the DirEntry entry instead of the name in Serhiy's patch. There is a slight performance gain. Now _glob0() and _glob1() do return a list of directories when dironly is true but there is now another place where OSError must be tracked, so it is not clear if this is worth it.
glob_scandir_2_diff.patch is the differential patch between glob_scandir_2.patch and glob_scandir_3.patch.
Here are the performance tests run with both patches.
$ ./python -m timeit -s "from glob import glob" -- "glob('**/*', recursive=True)"
glob_scandir_2.patch: 33.1 msec per loop
glob_scandir_3.patch: 33.8 msec per loop
$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/*', recursive=True)"
glob_scandir_2.patch: 927 msec per loop
glob_scandir_3.patch: 850 msec per loop
$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/', recursive=True)"
glob_scandir_2.patch: 423 msec per loop
glob_scandir_3.patch: 337 msec per loop
msg257930 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016年01月10日 20:04
Oops, the list comprehension in _glob1() of glob_scandir_3.patch should be instead: names = [x.name for x in _iterdir(dirname, dironly)]
This does not change the tests results except it improves glob('/usr/lib*/**/*', True) that gives now 820 msec instead of 927 msec with glob_scandir_2.patch.
msg257938 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016年01月10日 21:09
Actually the microbenchmarks comparison between glob_scandir_2.patch and glob_scandir_3.patch must be made by removing the call to entry.is_symlink() also in glob_scandir_2.patch.
In that case the microbenchmarks give about the same results for each patch.
Sorry for the noise.
msg257947 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016年01月11日 03:14
@serhiy, I assume this is in your capable hands?
Personally I'd like to have seen the refactoring in a separate diff from the switch to scandir(), just to ensure that the refactoring does not change anything and to make it easier to read the changes for the scandir() switch.
msg257953 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016年01月11日 08:33
Thank you Xavier for correcting the documentation.
Here is minimal patch that switches the glob module to scandir().
msg257954 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016年01月11日 12:18
Results of microbenchmarks for glob_scandir_4.patch:
$ ./python -m timeit -s "from glob import glob" -- "glob('**/*', recursive=True)"
Unpatched: 275 msec per loop
Patched: 79.4 msec per loop
$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/*', recursive=True)"
Unpatched: 2.11 sec per loop
Patched: 624 msec per loop
$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/', recursive=True)"
Unpatched: 1.79 sec per loop
Patched: 281 msec per loop
msg257958 - (view) Author: Ben Hoyt (benhoyt) * Date: 2016年01月11日 14:25
Nice! Good to see scandir() getting used to speed up other functions, functions that folks are using a ton already, like glob().
msg258009 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016年01月11日 20:41
Looks like switching to scandir will also resolve http://bugs.python.org/issue22167 without further action.
msg258015 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016年01月11日 20:56
Not this version of the path. It accumulates all scandir results in a list before yielding values. Otherwise globbing will consume a lot of open file descriptors and could leak them (see issue25994, issue25995).
msg258016 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016年01月11日 20:58
Oh well. Too bad. Never mind then.
msg260098 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016年02月11日 11:57
The patch is completed now. The with statement is used to avoid FD leaks.
msg273912 - (view) Author: Dima Tisnek (Dima.Tisnek) * Date: 2016年08月30日 12:18
Who can review Serhiy's patch?
P.S. core of the patch seems good to me, but I'm not an expert on stdlib.
msg274611 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016年09月06日 19:33
I think the change in general is approved by GvR. If there are some implementation bugs, we can fix them later.
msg274612 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016年09月06日 19:35
New changeset cb7ee9d9cddd by Serhiy Storchaka in branch 'default':
Issue #25596: Optimized glob() and iglob() functions in the
https://hg.python.org/cpython/rev/cb7ee9d9cddd 
msg274696 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016年09月07日 01:26
Thanks Serhiy!
--Guido (mobile)
On Sep 6, 2016 12:35 PM, "Roundup Robot" <report@bugs.python.org> wrote:
>
> Roundup Robot added the comment:
>
> New changeset cb7ee9d9cddd by Serhiy Storchaka in branch 'default':
> Issue #25596: Optimized glob() and iglob() functions in the
> https://hg.python.org/cpython/rev/cb7ee9d9cddd
>
> ----------
> nosy: +python-dev
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue25596>
> _______________________________________
>
msg274770 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016年09月07日 06:50
New changeset 48e573e0a610 by Serhiy Storchaka in branch 'default':
Issue #25596: Falls back to listdir in glob for bytes paths on Windows.
https://hg.python.org/cpython/rev/48e573e0a610 
msg274783 - (view) Author: Dima Tisnek (Dima.Tisnek) * Date: 2016年09月07日 09:09
@Serhiy please comment the implications / limitations of the fallback on Windows.
Is it that scandir cannot handle bytes argument only?
If argument is unicode, but response set contains bytes paths, will that work?
msg274794 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016年09月07日 11:10
os.scandir() cannot handle bytes argument on Windows. If an argument is string, os.scandir() yields entries with string names, if an argument is bytes object, os.scandir() yields entries with bytes names. Opened issue27998 for adding support of bytes paths in os.scandir(). After this will be implemented, a workaround cb7ee9d9cddd can be reverted.
History
Date User Action Args
2022年04月11日 14:58:23adminsetgithub: 69782
2017年02月07日 19:05:40serhiy.storchakalinkissue19240 superseder
2016年09月07日 11:10:52serhiy.storchakasetstatus: open -> closed
resolution: fixed
messages: + msg274794

stage: patch review -> resolved
2016年09月07日 09:09:36Dima.Tisneksetmessages: + msg274783
2016年09月07日 06:50:02python-devsetmessages: + msg274770
2016年09月07日 01:26:13gvanrossumsetmessages: + msg274696
2016年09月06日 19:35:28python-devsetnosy: + python-dev
messages: + msg274612
2016年09月06日 19:33:31serhiy.storchakasetmessages: + msg274611
2016年08月30日 12:18:16Dima.Tisneksetnosy: + Dima.Tisnek
messages: + msg273912
2016年02月11日 11:57:18serhiy.storchakasetfiles: + glob_scandir_5.patch

messages: + msg260098
2016年01月11日 20:58:48gvanrossumsetmessages: + msg258016
2016年01月11日 20:56:44serhiy.storchakasetmessages: + msg258015
2016年01月11日 20:41:10gvanrossumsetmessages: + msg258009
2016年01月11日 14:25:58benhoytsetmessages: + msg257958
2016年01月11日 12:22:00serhiy.storchakasettitle: Use scandir() to speed the glob module -> Use scandir() to speed up the glob module
2016年01月11日 12:18:20serhiy.storchakasetmessages: + msg257954
2016年01月11日 08:33:02serhiy.storchakasetfiles: + glob_scandir_4.patch

messages: + msg257953
title: regular files handled as directories in the glob module -> Use scandir() to speed the glob module
2016年01月11日 03:14:18gvanrossumsetmessages: + msg257947
2016年01月10日 21:09:31xdegayesetmessages: + msg257938
2016年01月10日 20:04:58xdegayesetmessages: + msg257930
2016年01月10日 14:53:47xdegayesetfiles: + glob_scandir_3.patch
2016年01月10日 14:53:35xdegayesetfiles: + glob_scandir_2_diff.patch

messages: + msg257912
2016年01月08日 13:53:03xdegayesetfiles: + glob_scandir_doc.patch

messages: + msg257758
2016年01月07日 12:46:07serhiy.storchakalinkissue26032 dependencies
2016年01月07日 04:11:13r.david.murraysetnosy: - r.david.murray
2016年01月06日 23:08:37gvanrossumsetnosy: + gvanrossum
messages: + msg257659
2016年01月06日 22:59:13vstinnersetmessages: + msg257655
2016年01月04日 11:20:36xdegayesetmessages: + msg257456
2016年01月03日 17:20:34serhiy.storchakasetmessages: + msg257424
2016年01月03日 16:30:28xdegayesetmessages: + msg257423
2016年01月03日 14:55:45xdegayesetmessages: + msg257421
2016年01月03日 08:23:29serhiy.storchakasetfiles: + glob_scandir_2.patch

dependencies: + File descriptor leaks in os.scandir()
messages: + msg257403
2016年01月02日 19:49:16r.david.murraysetnosy: + r.david.murray
messages: + msg257354
2016年01月02日 15:52:32serhiy.storchakasetdependencies: + Avoid using private function glob.glob1() in msi module and tools
2016年01月02日 15:22:53serhiy.storchakasetfiles: + glob_scandir.patch
priority: low -> normal

nosy: + vstinner, benhoyt
messages: + msg257347
2015年11月10日 15:58:18serhiy.storchakasetpriority: normal -> low
assignee: serhiy.storchaka

nosy: + serhiy.storchaka
stage: patch review
2015年11月10日 15:35:35xdegayesetfiles: + glob_isdir_2.patch

messages: + msg254449
2015年11月10日 11:18:07xdegayecreate

AltStyle によって変換されたページ (->オリジナル) /