Created on 2012-10-28 16:25 by nadeem.vawda, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Files | | | |
|---|---|---|---|
| File name | Uploaded | Description | Edit |
| zlib_unused_data_test.py | nadeem.vawda, 2012-10-28 16:25 | | |
| zlib_accum_unused_data.patch | serhiy.storchaka, 2012-10-28 18:03 | | review |
| zlib_accum_unused_data_2.patch | serhiy.storchaka, 2012-10-29 12:11 | | review |
| zlib_unused_data_3.patch | serhiy.storchaka, 2012-11-07 06:23 | | review |
| Messages (18) | | | |
|---|---|---|---|
| msg174056 | Author: Nadeem Vawda (nadeem.vawda) * (Python committer) | Date: 2012-10-28 16:25 | |
From issue 5210, amaury.forgeotdarc wrote:

> Hm, I tried a modified version of your first test, and I found another
> problem with the current zlib library; starting with the input:
>
>     x = x1 + x2 + HAMLET_SCENE  # both compressed and uncompressed data
>
> The following scenario is OK:
>
>     dco.decompress(x)  # returns HAMLET_SCENE
>     dco.unused_data    # returns HAMLET_SCENE
>
> But this one:
>
>     for c in x:
>         dco.decompress(c)  # will return HAMLET_SCENE, in several pieces
>     dco.unused_data        # only one character, the last of (c in x)!
>
> This is a bug IMO: unused_data should accumulate all the extra uncompressed
> data.

Ideally, I would prefer to raise an EOFError if decompress() is called after end-of-stream is reached (for consistency with BZ2Decompressor). However, accumulating the data in unused_data is closer to being backward-compatible, so it's probably the better approach to take.
|||
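For reference, the scenario described above can be reproduced with a minimal sketch like the one below. It is illustrative only: it uses a short literal payload in place of HAMLET_SCENE and a single compressed stream followed by raw trailing bytes, and it is not the test file attached to this issue.

```python
import zlib

PAYLOAD = b"What a piece of work is a man!"   # stand-in for HAMLET_SCENE
data = zlib.compress(PAYLOAD) + PAYLOAD        # compressed stream + raw trailing bytes

dco = zlib.decompressobj()
out = b""
for i in range(len(data)):
    out += dco.decompress(data[i:i + 1])       # feed the input one byte at a time

print(out == PAYLOAD)   # True: the compressed part decompresses correctly
# Before the fix, unused_data held only the last chunk passed in after
# end-of-stream (a single byte here); with the fix it accumulates the whole
# trailing PAYLOAD.
print(dco.unused_data)
```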
| msg174058 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-10-28 17:13 | |
Yes, accumulating the data will be backward-compatible. But what if we added a special flag to zlib.decompressobj() that makes it a bz2- and lzma-compatible decompressor? I.e.:

1) unconsumed_tail is always empty, or even non-existent (the unconsumed data is accumulated in an internal buffer, so there is no need to manually prepend it to the next chunk of data).

2) EOFError is raised if decompress() is called after end-of-stream is reached.

Of course, this would be a new feature.
|||
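No such flag was added as part of this issue; the sketch below is only a hypothetical illustration, in Python, of what the proposed bz2/lzma-style behaviour would mean for a caller. The class name is made up for illustration, and it relies on the Decompress.eof attribute added in Python 3.3.

```python
import zlib

class _BZ2StyleDecompressor:
    # Hypothetical illustration of the proposal: no user-visible
    # unconsumed_tail, and EOFError once end-of-stream is reached,
    # mirroring bz2.BZ2Decompressor / lzma.LZMADecompressor.
    def __init__(self):
        self._dco = zlib.decompressobj()

    def decompress(self, data, max_length=0):
        if self._dco.eof:
            raise EOFError("End of stream already reached")
        # Prepend whatever the previous call could not consume, so the caller
        # never has to deal with unconsumed_tail manually.
        return self._dco.decompress(self._dco.unconsumed_tail + data, max_length)

    @property
    def eof(self):
        return self._dco.eof

    @property
    def unused_data(self):
        return self._dco.unused_data
```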
| msg174059 | Author: Nadeem Vawda (nadeem.vawda) * (Python committer) | Date: 2012-10-28 17:29 | |
Interesting idea, but I'm not sure it would be worth the effort. It would make the code and API more complicated, so it wouldn't really help users, and would be an added maintenance burden. |
|||
| msg174064 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-10-28 18:03 | |
Here is a patch for your suggestion. |
|||
| msg174065 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-10-28 18:10 | |
Actually, the current code contains a bug: if memory allocation for unused_data fails, unused_data becomes NULL and accessing this attribute can crash. The patch fixes this. |
|||
| msg174109 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-10-29 12:11 | |
From Nadeem Vawda's mail:
> I wasn't suggesting that you try to resize the existing unused_data
> object; I agree that this would be a bad idea. What I was thinking of
> was something like this:
>
> size_t old_unused_size = PyBytes_GET_SIZE(self->unused_data);
> size_t new_unused_size = old_unused_size + self->zst.avail_in;
> PyObject *new_unused_data = PyBytes_FromStringAndSize(NULL, 0);
> if (_PyBytes_Resize(&new_unused_data, new_unused_size) < 0) {
> Py_DECREF(RetVal);
> goto error;
> }
> memcpy(PyBytes_AS_STRING(new_unused_data),
> PyBytes_AS_STRING(self->unused_data),
> old_unused_size);
> memcpy(PyBytes_AS_STRING(new_unused_data) + old_unused_size,
> self->zst.next_in, self->zst.avail_in);
> Py_DECREF(self->unused_data);
> self->unused_data = new_unused_data;
>
> Basically, hacking around the fact that (AFAICT) there's no direct way to
> allocate an uninitialized bytes object. That way we only do one allocation,
> and only copy the new data once.
This hack is not needed: if the first argument of PyBytes_FromStringAndSize() is NULL, the contents of the bytes object are left uninitialized. Moreover, the hack is invalid, because the empty bytes object is shared and can't be resized.
Patch updated. The concatenation is optimized as you suggested, and one bug is fixed.
|
|||
| msg174131 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-10-29 16:50 | |
There is one more memory management bug (with unconsumed_tail). Also, flush() does not update unconsumed_tail and unused_data.

    >>> import zlib
    >>> x = zlib.compress(b'abcdefghijklmnopqrstuvwxyz') + b'0123456789'
    >>> dco = zlib.decompressobj()
    >>> dco.decompress(x, 1)
    b'a'
    >>> dco.flush()
    b'bcdefghijklmnopqrstuvwxyz'
    >>> dco.unconsumed_tail
    b'NIMK\xcf\xc8\xcc\xca\xce\xc9\xcd\xcb/(,*.)-+\xaf\xa8\xac\x02\x00\x90\x86\x0b 0123456789'
    >>> dco.unused_data
    b''

What should unconsumed_tail be equal to after EOF? b'' or unused_data?

    >>> import zlib
    >>> x = zlib.compress(b'abcdefghijklmnopqrstuvwxyz') + b'0123456789'
    >>> dco = zlib.decompressobj()
    >>> dco.decompress(x)
    b'abcdefghijklmnopqrstuvwxyz'
    >>> dco.flush()
    b''
    >>> dco.unconsumed_tail
    b''
    >>> dco.unused_data
    b'0123456789'
    >>> dco = zlib.decompressobj()
    >>> dco.decompress(x, 1000)
    b'abcdefghijklmnopqrstuvwxyz'
    >>> dco.flush()
    b''
    >>> dco.unconsumed_tail
    b'0123456789'
    >>> dco.unused_data
    b'0123456789'
|||
| msg174845 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2012-11-04 23:44 | |
New changeset be882735e0b6 by Nadeem Vawda in branch '3.2':
Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF.
http://hg.python.org/cpython/rev/be882735e0b6

New changeset 4182119c3f0a by Nadeem Vawda in branch '3.3':
Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF.
http://hg.python.org/cpython/rev/4182119c3f0a

New changeset bcdb3836be72 by Nadeem Vawda in branch 'default':
Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF.
http://hg.python.org/cpython/rev/bcdb3836be72
|||
| msg174848 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2012-11-04 23:57 | |
New changeset 1c54def5947c by Nadeem Vawda in branch '2.7':
Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF.
http://hg.python.org/cpython/rev/1c54def5947c
|||
| msg174849 | Author: Nadeem Vawda (nadeem.vawda) * (Python committer) | Date: 2012-11-04 23:58 | |
Fixed. Thanks for the patch!

> This hack is not needed: if the first argument of PyBytes_FromStringAndSize()
> is NULL, the contents of the bytes object are left uninitialized.

Oh, cool. I didn't know about that.

> What should unconsumed_tail be equal to after EOF? b'' or unused_data?

Definitely b''. unconsumed_tail is meant to hold compressed data that should be passed in to the next call to decompress(). If we are at EOF, then decompress() should not be called again, and so it would be misleading to have unconsumed_tail be non-empty.
|||
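A small sketch of the distinction being described, assuming a zlib module with the fix from this issue applied (the exact outputs are illustrative, not taken from the attached patches):

```python
import zlib

payload = b"abcdefghijklmnopqrstuvwxyz"
data = zlib.compress(payload) + b"0123456789"   # trailing non-stream bytes

dco = zlib.decompressobj()
part = dco.decompress(data, 5)            # cap the output; input is left over
print(part)                               # b'abcde' (at most 5 bytes)
print(dco.unconsumed_tail != b"")         # True: compressed input still pending

rest = dco.decompress(dco.unconsumed_tail)   # feed the leftover input back in
print(part + rest == payload)             # True
print(dco.unconsumed_tail)                # b'' - nothing to feed back after EOF
print(dco.unused_data)                    # b'0123456789' - data after the stream
```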
| msg174854 | Author: Nadeem Vawda (nadeem.vawda) * (Python committer) | Date: 2012-11-05 00:31 | |
> flush() does not update unconsumed_tail and unused_data.
>
> >>> import zlib
> >>> x = zlib.compress(b'abcdefghijklmnopqrstuvwxyz') + b'0123456789'
> >>> dco = zlib.decompressobj()
> >>> dco.decompress(x, 1)
> b'a'
> >>> dco.flush()
> b'bcdefghijklmnopqrstuvwxyz'
> >>> dco.unconsumed_tail
> b'NIMK\xcf\xc8\xcc\xca\xce\xc9\xcd\xcb/(,*.)-+\xaf\xa8\xac\x02\x00\x90\x86\x0b 0123456789'
> >>> dco.unused_data
> b''

I see another bug here - described in issue 16411.
|||
| msg174910 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-11-05 13:58 | |
These were not idle questions. I wrote the patch, and I had to know what behavior is correct.

Here's the patch. It fixes a potential memory bug (unconsumed_tail was set to NULL in case of out-of-memory), resets unconsumed_tail to b'' after EOF, and updates unconsumed_tail and unused_data in flush(). I also slightly changed the test for the previous case (it was O(N^2) for large data). I checked the test on an unfixed Python, and the bug was caught.
|||
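The flush() behaviour this patch aims at can be sketched as follows. This is an illustrative check only, not the actual test from zlib_unused_data_3.patch, and the asserted values assume a zlib module with the fix applied:

```python
import zlib

data = zlib.compress(b"abcdefghijklmnopqrstuvwxyz") + b"0123456789"

dco = zlib.decompressobj()
dco.decompress(data, 1)     # returns b'a'; the rest is left in unconsumed_tail
tail = dco.flush()          # decompresses the remaining buffered input

assert tail == b"bcdefghijklmnopqrstuvwxyz"
assert dco.unconsumed_tail == b""           # reset after end-of-stream
assert dco.unused_data == b"0123456789"     # trailing bytes now reported here
```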
| msg175031 | Author: Nadeem Vawda (nadeem.vawda) * (Python committer) | Date: 2012-11-06 23:34 | |
> These were not idle questions. I wrote the patch, and I had to know
> what behavior is correct.

Ah, sorry. I assumed you were going to submit a separate patch to fix the unconsumed_tail issues.

> Here's the patch. It fixes a potential memory bug (unconsumed_tail was set
> to NULL in case of out-of-memory), resets unconsumed_tail to b'' after EOF,
> and updates unconsumed_tail and unused_data in flush().

Did you perhaps forget to attach the patch? The only ones I see are those that you uploaded last week.
|||
| msg175045 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-11-07 06:23 | |
Sorry. Here is a patch. |
|||
| msg175116 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-11-07 18:01 | |
Also note that your variant of the overflow check causes undefined behavior. Python is gradually getting rid of such code. |
|||
| msg175308 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2012-11-11 01:26 | |
New changeset 94256d7804b8 by Nadeem Vawda in branch '2.7':
Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush().
http://hg.python.org/cpython/rev/94256d7804b8

New changeset 85886268c40b by Nadeem Vawda in branch '3.2':
Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush().
http://hg.python.org/cpython/rev/85886268c40b

New changeset 654eb5163ba1 by Nadeem Vawda in branch '3.3':
Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush().
http://hg.python.org/cpython/rev/654eb5163ba1

New changeset 4440e45c10f9 by Nadeem Vawda in branch 'default':
Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush().
http://hg.python.org/cpython/rev/4440e45c10f9
|||
| msg175309 | Author: Nadeem Vawda (nadeem.vawda) * (Python committer) | Date: 2012-11-11 01:37 | |
New patch committed. Once again, thanks for all your work on this issue! |
|||
| msg175336 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-11-11 09:49 | |
I see you adjusted the patch quite a bit. It looks good, thank you. |
|||
| History | | | |
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:37 | admin | set | github: 60554 |
| 2012-11-11 09:49:38 | serhiy.storchaka | set | messages: + msg175336 |
| 2012-11-11 01:37:56 | nadeem.vawda | set | status: open -> closed; messages: + msg175309; stage: patch review -> resolved |
| 2012-11-11 01:26:27 | python-dev | set | messages: + msg175308 |
| 2012-11-07 18:01:38 | serhiy.storchaka | set | messages: + msg175116 |
| 2012-11-07 06:23:30 | serhiy.storchaka | set | files: + zlib_unused_data_3.patch; messages: + msg175045 |
| 2012-11-06 23:34:12 | nadeem.vawda | set | messages: + msg175031 |
| 2012-11-05 13:58:46 | serhiy.storchaka | set | status: closed -> open; messages: + msg174910; stage: resolved -> patch review |
| 2012-11-05 00:31:40 | nadeem.vawda | set | messages: + msg174854 |
| 2012-11-04 23:58:58 | nadeem.vawda | set | status: open -> closed; resolution: fixed; messages: + msg174849; stage: patch review -> resolved |
| 2012-11-04 23:57:30 | python-dev | set | messages: + msg174848 |
| 2012-11-04 23:44:27 | python-dev | set | nosy: + python-dev; messages: + msg174845 |
| 2012-10-29 16:50:50 | serhiy.storchaka | set | messages: + msg174131 |
| 2012-10-29 12:11:33 | serhiy.storchaka | set | files: + zlib_accum_unused_data_2.patch; messages: + msg174109 |
| 2012-10-28 18:10:38 | serhiy.storchaka | set | messages: + msg174065 |
| 2012-10-28 18:03:23 | serhiy.storchaka | set | stage: needs patch -> patch review |
| 2012-10-28 18:03:11 | serhiy.storchaka | set | files: + zlib_accum_unused_data.patch; keywords: + patch; messages: + msg174064 |
| 2012-10-28 17:29:12 | nadeem.vawda | set | messages: + msg174059 |
| 2012-10-28 17:13:52 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg174058 |
| 2012-10-28 16:25:19 | nadeem.vawda | create | |