Created on 2016-01-06 22:55 by gvanrossum, last changed 2022-04-11 14:58 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| pathlib_glob_scandir.patch | serhiy.storchaka, 2016-01-11 12:20 | |
| Pull Requests | | |
|---|---|---|
| URL | Status | Linked |
| PR 25701 | | barneygale, 2021-05-13 01:37 |
| Messages (11) | |||
|---|---|---|---|
| msg257653 | Author: Guido van Rossum (gvanrossum) * (Python committer) | Date: 2016-01-06 22:55 | |
The globbing functionality in pathlib (Path.glob() and Path.rglob()) might benefit from using the new optimized os.scandir() interface. It currently just uses os.listdir(). The Path.iterdir() method might also benefit (though less so). There's also a sideways connection with http://bugs.python.org/issue26031 (adding an optional stat cache) -- the cache could possibly keep the DirEntry objects and use their (hopefully cached) attributes. This is more speculative though (and what if the platform's DirEntry doesn't cache?) |
|||
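As a rough illustration of the difference raised in the message above, the sketch below lists the subdirectories of a directory twice: once with os.listdir() plus one os.path.isdir() call per name, and once with os.scandir(), whose DirEntry objects can usually answer is_dir() from data already returned by the directory read. The helper names (subdirs_listdir, subdirs_scandir, make_demo_tree) are hypothetical and are not taken from pathlib or from the patch attached to this issue.

```python
import os
import tempfile


def subdirs_listdir(path):
    """Hypothetical helper: one os.path.isdir() call (a stat) per name."""
    return sorted(
        name for name in os.listdir(path)
        if os.path.isdir(os.path.join(path, name))
    )


def subdirs_scandir(path):
    """Hypothetical helper: DirEntry.is_dir() usually needs no extra system call."""
    with os.scandir(path) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())


def make_demo_tree():
    """Create a temporary directory holding two subdirectories and one file."""
    root = tempfile.mkdtemp()
    os.mkdir(os.path.join(root, "a"))
    os.mkdir(os.path.join(root, "b"))
    with open(os.path.join(root, "file.txt"), "w") as f:
        f.write("x")
    return root
```

Both helpers return the same names; only the number of system calls needed to get there is expected to differ, and that depends on the platform and file system.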
| msg257654 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2016-01-06 22:58 | |
Related issue: issue #25596 "regular files handled as directories in the glob module". |
|||
| msg257656 | Author: Ethan Furman (ethan.furman) * (Python committer) | Date: 2016-01-06 23:00 | |
As I recall, if the platform's DirEntry doesn't provide the cacheable attributes when first called, those attributes will be looked up (and cached) on first access. |
|||
| msg257657 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2016-01-06 23:03 | |
> As I recall, if the platform's DirEntry doesn't provide the cacheable attributes when first called, those attributes will be looked up (and cached) on first access.
scandir() is not magic. It simply provides info given by the OS: see readdir() on UNIX and FindFirstFile()/FindNextFile() on Windows. DirEntry calls os.stat() if needed, but it caches the result. The DirEntry doc tries to explain when syscalls are required or not, depending on the requested information and the platform: https://docs.python.org/dev/library/os.html#os.DirEntry |
|||
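The caching described above can be observed with a small experiment. This is a minimal sketch, not code from the issue: it stats an entry, appends to the file, and stats the same DirEntry again. The helper names (make_single_file_dir, stale_and_fresh_sizes) and the file name data.bin are made up for the illustration.

```python
import os
import tempfile


def make_single_file_dir():
    """Create a temporary directory containing one 5-byte file."""
    root = tempfile.mkdtemp()
    with open(os.path.join(root, "data.bin"), "wb") as f:
        f.write(b"12345")
    return root


def stale_and_fresh_sizes(path):
    """Return (size before, cached size after append, fresh os.stat size)."""
    with os.scandir(path) as entries:
        entry = next(entries)
        cached_before = entry.stat().st_size
        # Grow the file behind the DirEntry's back.
        with open(entry.path, "ab") as f:
            f.write(b"67890")
        cached_after = entry.stat().st_size   # answered from the DirEntry cache
        fresh = os.stat(entry.path).st_size   # a new system call sees the change
    return cached_before, cached_after, fresh
```

The second entry.stat() still reports the old size because the DirEntry reuses its cached result, while a fresh os.stat() on the same path sees the appended bytes.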
| msg257660 | Author: Guido van Rossum (gvanrossum) * (Python committer) | Date: 2016-01-06 23:12 | |
The DirEntry docs say for most methods "In most cases, no system call is required" which is pretty non-committal. :-( The only firm promise is for inode(), which is pretty useless. |
|||
| msg257664 | Author: Ben Hoyt (benhoyt) * | Date: 2016-01-07 01:06 | |
Guido, it's true that in almost all cases you get the speedup (no system call), and it's very much worth using. But the docs are non-committal because being specific would make them fairly complex. I believe it's as follows for is_file/is_dir/is_symlink:
* no system call required on Windows or Unix if the entry is not a symlink
* unless you're on Unix with some different file system (maybe a network FS?) where d_type is DT_UNKNOWN
* some other edge case which I've probably forgotten :-)
Do you think the docs should try to make this more specific? |
|||
| msg257665 | Author: Guido van Rossum (gvanrossum) * (Python committer) | Date: 2016-01-07 01:29 | |
Ben, I think it's worth calling out what the rules are around symlinks. I'm guessing the info that is initially present is a subset of lstat(), so if that indicates it's a symlink, is_dir() and is_file() will need a stat() call, *unless* follow_symlinks is False.
Another question: for symlinks, there are two different possible stat results: one for stat() and one for lstat(). Are these both cached separately? Or is only one of them? (Experimentally, they are either both cached or the cache remembers the follow_symlinks flag and re-fetches the other result.)
Relatedly, the remark "this method always requires a system call" seems to disregard the cache. I'd be happy to review a doc update patch if you make one. |
|||
| msg257679 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2016-01-07 08:53 | |
"Another question: for symlinks, there are two different possible stat results: one for stat() and one for lstat(). Are these both cached separately?" Hopefully, both are cached. It's directly the results of stat() and stat(follow_symlinks=False) which are cached (so an os.stat_result object for each). |
|||
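The symlink behaviour discussed in the last few messages can be checked with a short sketch. It assumes a POSIX system where os.symlink() works without special privileges; the helper names (make_symlink_tree, describe_entries) and the entry names real_dir and link_to_dir are hypothetical.

```python
import os
import stat
import tempfile


def make_symlink_tree():
    """Create a temporary directory with one real directory and a symlink to it.

    Assumption: os.symlink() is permitted (typical on POSIX; may fail on Windows).
    """
    root = tempfile.mkdtemp()
    os.mkdir(os.path.join(root, "real_dir"))
    os.symlink(os.path.join(root, "real_dir"), os.path.join(root, "link_to_dir"))
    return root


def describe_entries(path):
    """Summarize what each DirEntry reports with and without following symlinks."""
    summary = {}
    with os.scandir(path) as entries:
        for entry in entries:
            followed = entry.stat()                            # like os.stat()
            not_followed = entry.stat(follow_symlinks=False)   # like os.lstat()
            summary[entry.name] = {
                "is_symlink": entry.is_symlink(),
                "is_dir": entry.is_dir(),
                "is_dir_nofollow": entry.is_dir(follow_symlinks=False),
                "stat_is_dir": stat.S_ISDIR(followed.st_mode),
                "lstat_is_link": stat.S_ISLNK(not_followed.st_mode),
            }
    return summary
```

For the symlink, is_dir() follows the link and reports a directory, while is_dir(follow_symlinks=False) looks at the link itself. The two stat() variants likewise return different results for the same entry.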
| msg257955 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2016-01-11 12:20 | |
The proposed minimal patch implements globbing in pathlib using os.scandir(). Here are the results of microbenchmarks:
$ ./python -m timeit -s "from pathlib import Path; p = Path()" -- "list(p.glob('**/*'))"
Unpatched: 598 msec per loop
Patched: 372 msec per loop
$ ./python -m timeit -s "from pathlib import Path; p = Path('/usr/')" -- "list(p.glob('lib*/**/*'))"
Unpatched: 1.33 sec per loop
Patched: 804 msec per loop
$ ./python -m timeit -s "from pathlib import Path; p = Path('/usr/')" -- "list(p.glob('lib*/**/'))"
Unpatched: 750 msec per loop
Patched: 180 msec per loop
See msg257954 in issue25596 for comparison with the glob module.
|
|||
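A comparable measurement can be scripted from Python with the timeit module instead of the command line. The sketch below is only an approximation of the benchmarks above: it builds a small throwaway tree rather than scanning /usr/, and the helpers make_tree and time_glob, with their width, depth and number parameters, are hypothetical.

```python
import tempfile
import timeit
from pathlib import Path


def make_tree(width=3, depth=3):
    """Create a small hypothetical directory tree and return its root."""
    root = Path(tempfile.mkdtemp())
    level = [root]
    for _ in range(depth):
        next_level = []
        for parent in level:
            for i in range(width):
                child = parent / f"d{i}"
                child.mkdir()
                (child / "f.txt").write_text("x")
                next_level.append(child)
        level = next_level
    return root


def time_glob(root, pattern="**/*", number=5):
    """Return the best per-call time in seconds for list(root.glob(pattern))."""
    timer = timeit.Timer(lambda: list(root.glob(pattern)))
    return min(timer.repeat(repeat=3, number=number)) / number
```

Running time_glob(make_tree()) under an unpatched and a patched interpreter would give a rough before/after comparison. Absolute numbers depend heavily on the file system and on caching.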
| msg259283 | Author: Ben Hoyt (benhoyt) * | Date: 2016-01-31 16:46 | |
Guido, I've made some tweaks and improvements to the DirEntry docs here: http://bugs.python.org/issue26248 -- the idea is to address the issues you mentioned: clarifying when system calls are required with symlinks, mentioning that the results are cached separately for follow_symlinks True and False, etc. |
|||
| msg274776 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2016-09-07 07:58 | |
New changeset 927665c4aaab by Serhiy Storchaka in branch 'default': Issue #26032: Optimized globbing in pathlib by using os.scandir(); it is now https://hg.python.org/cpython/rev/927665c4aaab |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:58:25 | admin | set | github: 70220 |
| 2021-05-14 08:09:53 | vstinner | set | nosy: - vstinner |
| 2021-05-13 01:37:33 | barneygale | set | nosy: + barneygale; pull_requests: + pull_request24730 |
| 2016-09-07 11:18:35 | serhiy.storchaka | set | status: open -> closed; resolution: fixed; stage: patch review -> resolved |
| 2016-09-07 07:58:31 | python-dev | set | nosy: + python-dev; messages: + msg274776 |
| 2016-01-31 16:46:05 | benhoyt | set | messages: + msg259283 |
| 2016-01-11 12:20:21 | serhiy.storchaka | set | files: + pathlib_glob_scandir.patch; messages: + msg257955; assignee: serhiy.storchaka; keywords: + patch; stage: patch review |
| 2016-01-07 17:55:31 | brett.cannon | set | nosy: + brett.cannon |
| 2016-01-07 12:46:07 | serhiy.storchaka | set | dependencies: + Use scandir() to speed up the glob module, File descriptor leaks in os.scandir(); components: + Library (Lib); versions: - Python 3.5 |
| 2016-01-07 08:53:40 | vstinner | set | messages: + msg257679 |
| 2016-01-07 01:29:30 | gvanrossum | set | messages: + msg257665 |
| 2016-01-07 01:06:53 | benhoyt | set | messages: + msg257664 |
| 2016-01-06 23:12:11 | gvanrossum | set | messages: + msg257660 |
| 2016-01-06 23:03:57 | vstinner | set | messages: + msg257657 |
| 2016-01-06 23:00:01 | ethan.furman | set | nosy: + ethan.furman; messages: + msg257656 |
| 2016-01-06 22:58:53 | vstinner | set | nosy: + vstinner; messages: + msg257654 |
| 2016-01-06 22:57:35 | vstinner | set | nosy: + benhoyt |
| 2016-01-06 22:57:12 | vstinner | set | nosy: + serhiy.storchaka |
| 2016-01-06 22:56:03 | gvanrossum | set | nosy: + pitrou |
| 2016-01-06 22:55:27 | gvanrossum | create | |