Created on 2012-10-28 16:25 by nadeem.vawda, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Files | | | |
|---|---|---|---|
| File name | Uploaded | Description | Edit |
| zlib_unused_data_test.py | nadeem.vawda, 2012-10-28 16:25 | | |
| zlib_accum_unused_data.patch | serhiy.storchaka, 2012-10-28 18:03 | | review |
| zlib_accum_unused_data_2.patch | serhiy.storchaka, 2012-10-29 12:11 | | review |
| zlib_unused_data_3.patch | serhiy.storchaka, 2012-11-07 06:23 | | review |
| Messages (18) | | | |
|---|---|---|---|
| msg174056 | Author: Nadeem Vawda (nadeem.vawda) * (Python committer) | Date: 2012-10-28 16:25 | |
From issue 5210, amaury.forgeotdarc wrote:

> Hm, I tried a modified version of your first test, and I found another
> problem with the current zlib library; starting with the input:
>
>     x = x1 + x2 + HAMLET_SCENE  # both compressed and uncompressed data
>
> The following scenario is OK:
>
>     dco.decompress(x)  # returns HAMLET_SCENE
>     dco.unused_data    # returns HAMLET_SCENE
>
> But this one:
>
>     for c in x:
>         dco.decompress(c)  # will return HAMLET_SCENE, in several pieces
>     dco.unused_data        # only one character, the last of (c in x)!
>
> This is a bug IMO: unused_data should accumulate all the extra uncompressed
> data.

Ideally, I would prefer to raise an EOFError if decompress() is called after end-of-stream is reached (for consistency with BZ2Decompressor). However, accumulating the data in unused_data is closer to being backward-compatible, so it's probably the better approach to take.
|||
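For reference, the scenario described above can be reproduced with a minimal sketch like the one below. It is illustrative only: it uses a short literal payload in place of HAMLET_SCENE and a single compressed stream followed by raw trailing bytes, and it is not the test file attached to this issue.

```python
import zlib

PAYLOAD = b"What a piece of work is a man!"   # stand-in for HAMLET_SCENE
data = zlib.compress(PAYLOAD) + PAYLOAD        # compressed stream + raw trailing bytes

dco = zlib.decompressobj()
out = b""
for i in range(len(data)):
    out += dco.decompress(data[i:i + 1])       # feed the input one byte at a time

print(out == PAYLOAD)   # True: the compressed part decompresses correctly
# Before the fix, unused_data held only the last chunk passed in after
# end-of-stream (a single byte here); with the fix it accumulates the whole
# trailing PAYLOAD.
print(dco.unused_data)
```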
| msg174058 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-10-28 17:13 | |
Yes, accumulating the data will be backward-compatible. But what if we added a special flag to zlib.decompressobj() that makes it a bz2- and lzma-compatible decompressor? I.e.:

1) unconsumed_tail is always empty, or even non-existent (the unconsumed data is accumulated in an internal buffer, so there is no need to manually prepend it to the next chunk of data).

2) EOFError is raised if decompress() is called after end-of-stream is reached.

Of course, this would be a new feature.
|||
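No such flag was added as part of this issue; the sketch below is only a hypothetical illustration, in Python, of what the proposed bz2/lzma-style behaviour would mean for a caller. The class name is made up for illustration, and it relies on the Decompress.eof attribute added in Python 3.3.

```python
import zlib

class _BZ2StyleDecompressor:
    # Hypothetical illustration of the proposal: no user-visible
    # unconsumed_tail, and EOFError once end-of-stream is reached,
    # mirroring bz2.BZ2Decompressor / lzma.LZMADecompressor.
    def __init__(self):
        self._dco = zlib.decompressobj()

    def decompress(self, data, max_length=0):
        if self._dco.eof:
            raise EOFError("End of stream already reached")
        # Prepend whatever the previous call could not consume, so the caller
        # never has to deal with unconsumed_tail manually.
        return self._dco.decompress(self._dco.unconsumed_tail + data, max_length)

    @property
    def eof(self):
        return self._dco.eof

    @property
    def unused_data(self):
        return self._dco.unused_data
```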
| msg174059 | Author: Nadeem Vawda (nadeem.vawda) * (Python committer) | Date: 2012-10-28 17:29 | |
Interesting idea, but I'm not sure it would be worth the effort. It would make the code and API more complicated, so it wouldn't really help users, and would be an added maintenance burden. |
|||
| msg174064 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-10-28 18:03 | |
Here is a patch for your suggestion. |
|||
| msg174065 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-10-28 18:10 | |
Actually, the current code contains a bug: if memory allocation for unused_data fails, unused_data becomes NULL and accessing this attribute can crash. The patch fixes this. |
|||
| msg174109 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-10-29 12:11 | |
From Nadeem Vawda's mail:
> I wasn't suggesting that you try to resize the existing unused_data
> object; I agree that this would be a bad idea. What I was thinking of
> was something like this:
>
> size_t old_unused_size = PyBytes_GET_SIZE(self->unused_data);
> size_t new_unused_size = old_unused_size + self->zst.avail_in;
> PyObject *new_unused_data = PyBytes_FromStringAndSize(NULL, 0);
> if (_PyBytes_Resize(&new_unused_data, new_unused_size) < 0) {
> Py_DECREF(RetVal);
> goto error;
> }
> memcpy(PyBytes_AS_STRING(new_unused_data),
> PyBytes_AS_STRING(self->unused_data),
> old_unused_size);
> memcpy(PyBytes_AS_STRING(new_unused_data) + old_unused_size,
> self->zst.next_in, self->zst.avail_in);
> Py_DECREF(self->unused_data);
> self->unused_data = new_unused_data;
>
> Basically, hacking around the fact that (AFAICT) there's no direct way to
> allocate an uninitialized bytes object. That way we only do one allocation,
> and only copy the new data once.
This hack is not needed: if the first argument of PyBytes_FromStringAndSize() is NULL, the contents of the bytes object are left uninitialized. Moreover, the hack is invalid, because the empty bytes object is shared and can't be resized.
Patch updated. The concatenation is optimized as you suggested, and one bug is fixed.
|
|||
| msg174131 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-10-29 16:50 | |
There is one more memory management bug (with unconsumed_tail). Also, flush() does not update unconsumed_tail and unused_data.

    >>> import zlib
    >>> x = zlib.compress(b'abcdefghijklmnopqrstuvwxyz') + b'0123456789'
    >>> dco = zlib.decompressobj()
    >>> dco.decompress(x, 1)
    b'a'
    >>> dco.flush()
    b'bcdefghijklmnopqrstuvwxyz'
    >>> dco.unconsumed_tail
    b'NIMK\xcf\xc8\xcc\xca\xce\xc9\xcd\xcb/(,*.)-+\xaf\xa8\xac\x02\x00\x90\x86\x0b 0123456789'
    >>> dco.unused_data
    b''

What should unconsumed_tail be equal to after EOF? b'' or unused_data?

    >>> import zlib
    >>> x = zlib.compress(b'abcdefghijklmnopqrstuvwxyz') + b'0123456789'
    >>> dco = zlib.decompressobj()
    >>> dco.decompress(x)
    b'abcdefghijklmnopqrstuvwxyz'
    >>> dco.flush()
    b''
    >>> dco.unconsumed_tail
    b''
    >>> dco.unused_data
    b'0123456789'
    >>> dco = zlib.decompressobj()
    >>> dco.decompress(x, 1000)
    b'abcdefghijklmnopqrstuvwxyz'
    >>> dco.flush()
    b''
    >>> dco.unconsumed_tail
    b'0123456789'
    >>> dco.unused_data
    b'0123456789'
|||
| msg174845 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2012-11-04 23:44 | |
New changeset be882735e0b6 by Nadeem Vawda in branch '3.2':
Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF.
http://hg.python.org/cpython/rev/be882735e0b6

New changeset 4182119c3f0a by Nadeem Vawda in branch '3.3':
Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF.
http://hg.python.org/cpython/rev/4182119c3f0a

New changeset bcdb3836be72 by Nadeem Vawda in branch 'default':
Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF.
http://hg.python.org/cpython/rev/bcdb3836be72
|||
| msg174848 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2012-11-04 23:57 | |
New changeset 1c54def5947c by Nadeem Vawda in branch '2.7':
Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF.
http://hg.python.org/cpython/rev/1c54def5947c
|||
| msg174849 | Author: Nadeem Vawda (nadeem.vawda) * (Python committer) | Date: 2012-11-04 23:58 | |
Fixed. Thanks for the patch!

> This hack is not needed: if the first argument of PyBytes_FromStringAndSize()
> is NULL, the contents of the bytes object are left uninitialized.

Oh, cool. I didn't know about that.

> What should unconsumed_tail be equal to after EOF? b'' or unused_data?

Definitely b''. unconsumed_tail is meant to hold compressed data that should be passed in to the next call to decompress(). If we are at EOF, then decompress() should not be called again, and so it would be misleading to have unconsumed_tail be non-empty.
|||
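A small sketch of the distinction being described, assuming a zlib module with the fix from this issue applied (the exact outputs are illustrative, not taken from the attached patches):

```python
import zlib

payload = b"abcdefghijklmnopqrstuvwxyz"
data = zlib.compress(payload) + b"0123456789"   # trailing non-stream bytes

dco = zlib.decompressobj()
part = dco.decompress(data, 5)            # cap the output; input is left over
print(part)                               # b'abcde' (at most 5 bytes)
print(dco.unconsumed_tail != b"")         # True: compressed input still pending

rest = dco.decompress(dco.unconsumed_tail)   # feed the leftover input back in
print(part + rest == payload)             # True
print(dco.unconsumed_tail)                # b'' - nothing to feed back after EOF
print(dco.unused_data)                    # b'0123456789' - data after the stream
```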
| msg174854 | Author: Nadeem Vawda (nadeem.vawda) * (Python committer) | Date: 2012-11-05 00:31 | |
> flush() does not update unconsumed_tail and unused_data.
>
> >>> import zlib
> >>> x = zlib.compress(b'abcdefghijklmnopqrstuvwxyz') + b'0123456789'
> >>> dco = zlib.decompressobj()
> >>> dco.decompress(x, 1)
> b'a'
> >>> dco.flush()
> b'bcdefghijklmnopqrstuvwxyz'
> >>> dco.unconsumed_tail
> b'NIMK\xcf\xc8\xcc\xca\xce\xc9\xcd\xcb/(,*.)-+\xaf\xa8\xac\x02\x00\x90\x86\x0b 0123456789'
> >>> dco.unused_data
> b''

I see another bug here - described in issue 16411.
|||
| msg174910 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-11-05 13:58 | |
These were not idle questions. I wrote the patch, and I had to know what behavior is correct.

Here's the patch. It fixes a potential memory bug (unconsumed_tail was set to NULL in case of out-of-memory), resets unconsumed_tail to b'' after EOF, and updates unconsumed_tail and unused_data in flush(). I also slightly changed the test for the previous case (it was O(N^2) for large data). I checked the test on an unfixed Python, and the bug was caught.
|||
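The flush() behaviour this patch aims at can be sketched as follows. This is an illustrative check only, not the actual test from zlib_unused_data_3.patch, and the asserted values assume a zlib module with the fix applied:

```python
import zlib

data = zlib.compress(b"abcdefghijklmnopqrstuvwxyz") + b"0123456789"

dco = zlib.decompressobj()
dco.decompress(data, 1)     # returns b'a'; the rest is left in unconsumed_tail
tail = dco.flush()          # decompresses the remaining buffered input

assert tail == b"bcdefghijklmnopqrstuvwxyz"
assert dco.unconsumed_tail == b""           # reset after end-of-stream
assert dco.unused_data == b"0123456789"     # trailing bytes now reported here
```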
| msg175031 | Author: Nadeem Vawda (nadeem.vawda) * (Python committer) | Date: 2012-11-06 23:34 | |
> These were not idle questions. I wrote the patch, and I had to know
> what behavior is correct.

Ah, sorry. I assumed you were going to submit a separate patch to fix the unconsumed_tail issues.

> Here's the patch. It fixes a potential memory bug (unconsumed_tail was set
> to NULL in case of out-of-memory), resets unconsumed_tail to b'' after EOF,
> and updates unconsumed_tail and unused_data in flush().

Did you perhaps forget to attach the patch? The only ones I see are those that you uploaded last week.
|||
| msg175045 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-11-07 06:23 | |
Sorry. Here is a patch. |
|||
| msg175116 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-11-07 18:01 | |
Also note that your variant of the overflow check causes undefined behavior. Python is gradually getting rid of such code. |
|||
| msg175308 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2012-11-11 01:26 | |
New changeset 94256d7804b8 by Nadeem Vawda in branch '2.7':
Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush().
http://hg.python.org/cpython/rev/94256d7804b8

New changeset 85886268c40b by Nadeem Vawda in branch '3.2':
Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush().
http://hg.python.org/cpython/rev/85886268c40b

New changeset 654eb5163ba1 by Nadeem Vawda in branch '3.3':
Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush().
http://hg.python.org/cpython/rev/654eb5163ba1

New changeset 4440e45c10f9 by Nadeem Vawda in branch 'default':
Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush().
http://hg.python.org/cpython/rev/4440e45c10f9
|||
| msg175309 | Author: Nadeem Vawda (nadeem.vawda) * (Python committer) | Date: 2012-11-11 01:37 | |
New patch committed. Once again, thanks for all your work on this issue! |
|||
| msg175336 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-11-11 09:49 | |
I see you adjusted the patch quite a bit. It looks good, thank you. |
|||
| History | | | |
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:37 | admin | set | github: 60554 |
| 2012-11-11 09:49:38 | serhiy.storchaka | set | messages: + msg175336 |
| 2012-11-11 01:37:56 | nadeem.vawda | set | status: open -> closed; messages: + msg175309; stage: patch review -> resolved |
| 2012-11-11 01:26:27 | python-dev | set | messages: + msg175308 |
| 2012-11-07 18:01:38 | serhiy.storchaka | set | messages: + msg175116 |
| 2012-11-07 06:23:30 | serhiy.storchaka | set | files: + zlib_unused_data_3.patch; messages: + msg175045 |
| 2012-11-06 23:34:12 | nadeem.vawda | set | messages: + msg175031 |
| 2012-11-05 13:58:46 | serhiy.storchaka | set | status: closed -> open; messages: + msg174910; stage: resolved -> patch review |
| 2012-11-05 00:31:40 | nadeem.vawda | set | messages: + msg174854 |
| 2012-11-04 23:58:58 | nadeem.vawda | set | status: open -> closed; resolution: fixed; messages: + msg174849; stage: patch review -> resolved |
| 2012-11-04 23:57:30 | python-dev | set | messages: + msg174848 |
| 2012-11-04 23:44:27 | python-dev | set | nosy: + python-dev; messages: + msg174845 |
| 2012-10-29 16:50:50 | serhiy.storchaka | set | messages: + msg174131 |
| 2012-10-29 12:11:33 | serhiy.storchaka | set | files: + zlib_accum_unused_data_2.patch; messages: + msg174109 |
| 2012-10-28 18:10:38 | serhiy.storchaka | set | messages: + msg174065 |
| 2012-10-28 18:03:23 | serhiy.storchaka | set | stage: needs patch -> patch review |
| 2012-10-28 18:03:11 | serhiy.storchaka | set | files: + zlib_accum_unused_data.patch; keywords: + patch; messages: + msg174064 |
| 2012-10-28 17:29:12 | nadeem.vawda | set | messages: + msg174059 |
| 2012-10-28 17:13:52 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg174058 |
| 2012-10-28 16:25:19 | nadeem.vawda | create | |