This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2012年02月23日 10:46 by kasal, last changed 2022年04月11日 14:57 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| Proposed-fix-of-issue14099.patch | kasal, 2012年02月23日 18:50 | proposed fix, against latest source tree | ||
| Proposed-fix-of-issue14099-second.patch | kasal, 2012年02月24日 08:48 | Updated patch | review | |
| zipfile_concurrent_read_1.diff | dw, 2014年11月13日 18:53 | Draft patch (from dw) against 524a004e93dd | review | |
| zipfile_share_file.patch | serhiy.storchaka, 2014年11月14日 20:03 | review | ||
| bench_zip.py | serhiy.storchaka, 2014年11月14日 20:04 | |||
| zipfile_tellable.patch | serhiy.storchaka, 2015年01月16日 19:35 | Support tellable but unseekable files in ZipFile.writestr() | review | |
| zipfile_threadsafe.patch | serhiy.storchaka, 2015年01月16日 19:37 | Makes ZipFile threadsafe | review | |
| Messages (30) | |||
|---|---|---|---|
| msg154058 - (view) | Author: Stepan Kasal (kasal) | Date: 2012年02月23日 10:46 | |
When a file inside a zip is open, the underlying zip file is open again. (Unless the file name is unknown, because the ZipFile object was created with fp only.) This design is incorrect, insecure, and ineffective: - the reopen uses the same string as file name, but on unix-like systems that file name may no longer exist, or may point to a different file - opening n files from the same zip archive consumes n OS file descriptors, wasting resources I believe that the parent ZipFile object and all the child ZipExtFile objects should keep the same fp. The last one would close it. I'm working on a patch currently. |
|||
| msg154077 - (view) | Author: Stepan Kasal (kasal) | Date: 2012年02月23日 18:50 | |
Attached please find a patch that fixes this issue by reusing the original fp from ZipFile object. Two of the test cases attempted to read a file from a zip as soon as write() was called. I believe that this is not correct usage: zip file is not even fully written to disk at that stage! So I took the liberty to change these two test cases so that they first write the file and then read it. Let me thank to Martin Sikora for discovering the issue and to Matej Cepl for testing it on current source tree. |
|||
| msg154110 - (view) | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2012年02月24日 04:42 | |
Thanks for the report and patch. I’m afraid changing the constructor signature is not an option, due to our backward compatibility policy. Do you think the bug can be fixed without changing the signature, or with new arguments added after the existing ones? |
|||
| msg154123 - (view) | Author: Stepan Kasal (kasal) | Date: 2012年02月24日 08:48 | |
Attached please find a second iteration of the fix. This time the signature of ZipExtFile is kept backward compatible, with one new parameter added. |
|||
| msg176844 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年12月03日 14:09 | |
This issue looks as a duplicate or a superseder of issue10631. See also issue16569. seek() for every read should significantly decrease performance. It may be worth to prohibit the simultaneous reading of different files from the archive. In any case the children counting in the patch looks doubtful. |
|||
| msg176853 - (view) | Author: Stepan Kasal (kasal) | Date: 2012年12月03日 16:45 | |
Re: children counting
You need to know the number of open children and whether the parent ZipFile object is still open.
As soon as both all children and the parent ZipFile are closed, the underlying fp (corresponding to the file name given initially) shall be closed as well.
The code submitted in the patch ensures that. But other implementations are possible.
In any case, it is necessary to ensure that the children stay usable even if the parent ZipFile is closed, because of code like this:
def datafile(self):
with ZipFile(self.datafilezip, "r") as f:
return f.open("data.txt")
This idiom currently works and should not be broken.
Re: seek()
The read can interfere not only with a parallel file expansion, but also with a ZipFile metadata read (user can list the contents of the zip again). Both of these would have to be forbidden by the documentation, and, ideally, also enforced. (As disscussed issue #16569)
OTOH, zipfile.py is already slow, because the decompression is implemented in Python as interpreted code. I guess that the slowdown by seek() is neglectable compared to this.
Also note that we most often seek to the current position; the OS should notice that and return swiftly.
|
|||
| msg176855 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年12月03日 16:50 | |
I think some benchmarks will needed to see how it will affect the performance. Please update your patch to current sources. The module code was changed last months. |
|||
| msg176857 - (view) | Author: Stepan Kasal (kasal) | Date: 2012年12月03日 17:03 | |
I'm not sure when I'll get to this, sorry. Hopefully sometime soon. |
|||
| msg176858 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年12月03日 17:12 | |
> This idiom currently works and should not be broken. Hmm. This seems doubtful to me, but if it is used, then I agree, it shouldn't be broken. > I guess that the slowdown by seek() is neglectable compared to this. Even one function call can have effect on performance of short reads (issue10376, issue16304). Fortunately in this corner case the read buffer will be used. > Also note that we most often seek to the current position; the OS should notice that and return swiftly. This may affect the buffered Python file (I did not check). The OS also doesn't notice this if the OS is Windows (issue8745). I want to see and test an updated patch. |
|||
| msg176859 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年12月03日 17:16 | |
Sorry, not issue16304, but issue16034. The commit messages were wrong. |
|||
| msg231132 - (view) | Author: David Wilson (dw) * | Date: 2014年11月13日 18:53 | |
Per my comment on issue16569, the overhead of performing one seek before each (raw file data) read is quite minimal. I have attached a new (but incomplete) patch, on which the following microbenchmarks are based. The patch is essentially identical to Stepan's 2012 patch, except I haven't yet decided how best to preserve the semantics of ZipFile.close(). "my.zip" is the same my.zip from issue22842. It contains 10,000 files each containing 10 bytes over 2 lines. "my2.zip" contains 8,000 files each containing the same copy of 64kb of /dev/urandom output. The resulting ZIP is 500mb. For each test, the first run is the existing zipfile module, and the second run is with the patch. In summary: * There is a 35% perf increase in str mode when handling many small files (on OS X at least) * There is a 6% perf decrease in file mode when handling small sequential reads. * There is a 2.4% perf decrease in file mode when handling large sequential reads. From my reading of zipfile.py, it is clear there are _many_ ways to improve its performance (probably starting with readline()), and rejection of a functional fix should almost certainly be at the bottom of that list. For each of the tests below, the functions used were: def a(): """ Test concurrent line reads to a str mode ZipFile. """ zf = zipfile.ZipFile('my2.zip') members = [zf.open(n) for n in zf.namelist()] for m in members: m.readline() for m in members: m.readline() def c(): """ Test sequential small reads to a str mode ZipFile. """ zf = zipfile.ZipFile('my2.zip') for name in zf.namelist(): with zf.open(name) as zfp: zfp.read(1000) def d(): """ Test sequential small reads to a file mode ZipFile. """ fp = open('my2.zip', 'rb') zf = zipfile.ZipFile(fp) for name in zf.namelist(): with zf.open(name) as zfp: zfp.read(1000) def e(): """ Test sequential large reads to a file mode ZipFile. """ fp = open('my2.zip', 'rb') zf = zipfile.ZipFile(fp) for name in zf.namelist(): with zf.open(name) as zfp: zfp.read() ---- my.zip ---- $ python3.4 -m timeit -s 'import my' 'my.a()' 10 loops, best of 3: 1.47 sec per loop $ python3.4 -m timeit -s 'import my' 'my.a()' 10 loops, best of 3: 950 msec per loop --- $ python3.4 -m timeit -s 'import my' 'my.c()' 10 loops, best of 3: 1.3 sec per loop $ python3.4 -m timeit -s 'import my' 'my.c()' 10 loops, best of 3: 865 msec per loop --- $ python3.4 -m timeit -s 'import my' 'my.d()' 10 loops, best of 3: 800 msec per loop $ python3.4 -m timeit -s 'import my' 'my.d()' 10 loops, best of 3: 851 msec per loop ---- my2.zip ---- $ python3.4 -m timeit -s 'import my' 'my.a()' 10 loops, best of 3: 1.46 sec per loop $ python3.4 -m timeit -s 'import my' 'my.a()' 10 loops, best of 3: 1.16 sec per loop --- $ python3.4 -m timeit -s 'import my' 'my.c()' 10 loops, best of 3: 1.13 sec per loop $ python3.4 -m timeit -s 'import my' 'my.c()' 10 loops, best of 3: 892 msec per loop --- $ python3.4 -m timeit -s 'import my' 'my.d()' 10 loops, best of 3: 842 msec per loop $ python3.4 -m timeit -s 'import my' 'my.d()' 10 loops, best of 3: 882 msec per loop --- $ python3.4 -m timeit -s 'import my' 'my.e()' 10 loops, best of 3: 1.65 sec per loop $ python3.4 -m timeit -s 'import my' 'my.e()' 10 loops, best of 3: 1.69 sec per loop |
|||
| msg231181 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年11月14日 20:03 | |
Thank you David for your benchmarks and patch. There are several backward compatibility issues with the reading from ZipFile opened for write and from closed ZipFile. This behavior is mostly undocumented (except the reading from closed ZipFile), but even our tests depend on it and changing it could break user code with good chance. Here is a patch which preserves current behavior. Added new tests to check this behavior explicitly. Other advantage of the patch is that it doesn't change the signature of ZipExtFile constructor at all. Benchmarks don't show stable significant difference between patched and unpatched versions. |
|||
| msg231191 - (view) | Author: David Wilson (dw) * | Date: 2014年11月14日 22:00 | |
Hi Serhiy, Thanks for the new patch, it looks better than my attempt. :) |
|||
| msg231473 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年11月21日 14:27 | |
I hesitate about applying the patch to maintained releases. On one hand, besides interface (even non-documented details) left the same, the patch changes interiors too much for ordinal bug. I don't see how it can break something, but this doesn't guarantee that changes don't have unexpected effect. On other hand, this bug can be considered as security-related issue. Malicious local attacker could replace ZIP file between its open and read from it or between two reads, if he has write access to the directory containing ZIP file or there are symplinks under his control in ZIP file path. The danger of this may exceed hypothetical negative consequences of the applying of the patch. I appeal the matter to release managers. Should we apply this patch (the risk is pretty small) to 2.7 and 3.4? |
|||
| msg231475 - (view) | Author: David Wilson (dw) * | Date: 2014年11月21日 14:39 | |
While in spirit this is a bug fix, it's reasonably complex and affects a popular module -- I'm not sure it should be applied to 2.x, and probably not in a minor release of 3.x either. Would it make sense to include as part of 3.5? (That said, I'd love to see this fixed in 2.x ;)) |
|||
| msg231918 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年12月01日 09:21 | |
What your thoughts Benjamin? Should this patch be applied to 2.7.10 (this is not critical for 2.7.9)? |
|||
| msg232032 - (view) | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2014年12月02日 17:56 | |
Okay for 2.7.10. |
|||
| msg232035 - (view) | Author: David Wilson (dw) * | Date: 2014年12月02日 18:07 | |
Could we also make a small tweak to zipfile.rst indicating the new behaviour? I had made an initial attempt in my patch but wasn't particularly happy with the wording. |
|||
| msg232048 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年12月02日 20:48 | |
How about just "Objects returned by :meth:`.open` can operate independently of the ZipFile."? |
|||
| msg232055 - (view) | Author: David Wilson (dw) * | Date: 2014年12月02日 22:16 | |
Sounds great :) |
|||
| msg232068 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2014年12月03日 07:18 | |
New changeset c2c4cde55f6f by Serhiy Storchaka in branch '2.7': Issue #14099: ZipFile.open() no longer reopen the underlying file. Objects https://hg.python.org/cpython/rev/c2c4cde55f6f New changeset e5bb3044402b by Serhiy Storchaka in branch '3.4': Issue #14099: ZipFile.open() no longer reopen the underlying file. Objects https://hg.python.org/cpython/rev/e5bb3044402b New changeset 334c01aa7f93 by Serhiy Storchaka in branch 'default': Issue #14099: ZipFile.open() no longer reopen the underlying file. Objects https://hg.python.org/cpython/rev/334c01aa7f93 |
|||
| msg232070 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年12月03日 07:30 | |
Thanks Stepan for the idea. |
|||
| msg233553 - (view) | Author: Matt Mackall (Matt.Mackall) | Date: 2015年01月06日 22:27 | |
The committed fix breaks Mercurial. http://bz.selenic.com/show_bug.cgi?id=4492 The "underlying file-like object" in our case is a wsgirequest but anything else trying to serve a dynamically-generated zip file on the web will probably die. We wrapped wsgirequest to support tell() many years ago probably copying someone else's hack, and it's worked fine across Python 2.4-2.7, but we fundamentally can't support all the new seek()s that were added here. |
|||
| msg234142 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015年01月16日 19:35 | |
Thank you for your report Matt. There is other problem. It is nowhere documented and newer granted and newer mentioned when ZipFile.open() was added, but file-like objects returned by ZipFile.open() could be read in different threads simultaneously. It makes sense because decompressors release GIL and parallel reading compressed file can has benefit. It is easy to fix both issues (I prefer to do this in separate paths), but due to the overall complexity it is safer to withdraw committed changes in maintained releases and apply additional patches only in default branch. |
|||
| msg234143 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015年01月16日 19:37 | |
Adding locks almost not affects performance, because reads are done by relative large chunks and locking overhead is small. |
|||
| msg234146 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015年01月16日 19:56 | |
See also issue23252. |
|||
| msg234735 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2015年01月26日 12:02 | |
New changeset ae42c4576438 by Serhiy Storchaka in branch '2.7': Issue #14099: Backout changeset c2c4cde55f6f (except adapted tests). https://hg.python.org/cpython/rev/ae42c4576438 New changeset 680b47c96e08 by Serhiy Storchaka in branch '3.4': Issue #14099: Backout changeset e5bb3044402b (except adapted tests). https://hg.python.org/cpython/rev/680b47c96e08 New changeset 4973ccd46e32 by Serhiy Storchaka in branch 'default': Issue #14099: Writing to ZipFile and reading multiple ZipExtFiles is https://hg.python.org/cpython/rev/4973ccd46e32 New changeset 9cbf9f96920d by Serhiy Storchaka in branch 'default': Issue #14099: Restored support of writing ZIP files to tellable but https://hg.python.org/cpython/rev/9cbf9f96920d |
|||
| msg234736 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015年01月26日 12:05 | |
Sorry Stepan and David, but for this feature you need wait 3.5. |
|||
| msg235183 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2015年02月01日 17:04 | |
New changeset 4f96e9a8eee8 by Serhiy Storchaka in branch 'default': Don't seek to the start of the file when open ZipFile with the 'w' mode https://hg.python.org/cpython/rev/4f96e9a8eee8 |
|||
| msg235453 - (view) | Author: Eryk Sun (eryksun) * (Python triager) | Date: 2015年02月05日 21:24 | |
The changeset from 03 Dec is in the Windows 2.7.9 release. Python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> import zipfile >>> zipfile._SharedFile <class zipfile._SharedFile at 0x0000000000F707C8> |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:57:27 | admin | set | github: 58307 |
| 2015年02月05日 21:24:35 | eryksun | set | nosy:
+ eryksun messages: + msg235453 |
| 2015年02月01日 17:04:16 | python-dev | set | messages: + msg235183 |
| 2015年01月26日 12:05:57 | serhiy.storchaka | set | status: open -> closed resolution: fixed messages: + msg234736 stage: patch review -> resolved |
| 2015年01月26日 12:02:56 | python-dev | set | messages: + msg234735 |
| 2015年01月16日 19:56:18 | serhiy.storchaka | set | messages: + msg234146 |
| 2015年01月16日 19:38:05 | serhiy.storchaka | set | versions: - Python 2.7, Python 3.4 |
| 2015年01月16日 19:37:37 | serhiy.storchaka | set | files:
+ zipfile_threadsafe.patch messages: + msg234143 |
| 2015年01月16日 19:35:05 | serhiy.storchaka | set | files:
+ zipfile_tellable.patch messages: + msg234142 stage: patch review |
| 2015年01月07日 15:38:00 | durin42 | set | nosy:
+ durin42 |
| 2015年01月07日 00:17:18 | serhiy.storchaka | set | status: closed -> open resolution: fixed -> (no value) stage: resolved -> (no value) |
| 2015年01月06日 22:27:32 | Matt.Mackall | set | nosy:
+ Matt.Mackall messages: + msg233553 |
| 2014年12月03日 07:30:04 | serhiy.storchaka | set | status: open -> closed resolution: fixed messages: + msg232070 stage: patch review -> resolved |
| 2014年12月03日 07:18:41 | python-dev | set | nosy:
+ python-dev messages: + msg232068 |
| 2014年12月02日 22:16:17 | dw | set | messages: + msg232055 |
| 2014年12月02日 20:48:51 | serhiy.storchaka | set | messages: + msg232048 |
| 2014年12月02日 18:07:20 | dw | set | messages: + msg232035 |
| 2014年12月02日 17:56:03 | benjamin.peterson | set | messages: + msg232032 |
| 2014年12月01日 09:21:17 | serhiy.storchaka | set | messages: + msg231918 |
| 2014年11月21日 14:39:20 | dw | set | messages: + msg231475 |
| 2014年11月21日 14:27:41 | serhiy.storchaka | set | nosy:
+ larry, benjamin.peterson messages: + msg231473 |
| 2014年11月14日 22:00:57 | dw | set | messages: + msg231191 |
| 2014年11月14日 20:04:29 | serhiy.storchaka | set | files: + bench_zip.py |
| 2014年11月14日 20:03:05 | serhiy.storchaka | set | files:
+ zipfile_share_file.patch messages: + msg231181 versions: + Python 3.5, - Python 3.2, Python 3.3 |
| 2014年11月13日 18:53:05 | dw | set | files:
+ zipfile_concurrent_read_1.diff nosy: + dw messages: + msg231132 |
| 2012年12月27日 20:56:02 | serhiy.storchaka | set | assignee: serhiy.storchaka |
| 2012年12月03日 17:16:30 | serhiy.storchaka | set | messages: + msg176859 |
| 2012年12月03日 17:12:45 | serhiy.storchaka | set | messages: + msg176858 |
| 2012年12月03日 17:03:54 | kasal | set | messages: + msg176857 |
| 2012年12月03日 16:50:29 | serhiy.storchaka | set | messages: + msg176855 |
| 2012年12月03日 16:45:07 | kasal | set | messages: + msg176853 |
| 2012年12月03日 14:09:49 | serhiy.storchaka | set | nosy:
+ loewis, pitrou, ocean-city, Arfrever, r.david.murray, serhiy.storchaka messages: + msg176844 versions: + Python 3.4 |
| 2012年02月24日 08:48:42 | kasal | set | files:
+ Proposed-fix-of-issue14099-second.patch messages: + msg154123 |
| 2012年02月24日 04:42:10 | eric.araujo | set | versions:
- Python 3.1 type: resource usage -> behavior nosy: + eric.araujo, alanmcintyre title: zipfile: ZipFile.open() should not reopen the underlying file -> ZipFile.open() should not reopen the underlying file messages: + msg154110 stage: patch review |
| 2012年02月23日 18:50:32 | kasal | set | files:
+ Proposed-fix-of-issue14099.patch keywords: + patch messages: + msg154077 |
| 2012年02月23日 11:02:20 | mcepl | set | nosy:
+ mcepl |
| 2012年02月23日 10:46:31 | kasal | create | |