
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: gzip module failing to decompress valid compressed file
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Ericg, iritkatriel, martin.panter, nczeczulin, ned.deily, rhpvorderman
Priority: normal Keywords: patch

Created on 2015-05-27 15:59 by Ericg, last changed 2022-04-11 14:58 by admin.

Pull Requests
URL Status Linked Edit
PR 29847 open rhpvorderman, 2021-11-29 15:28
Messages (9)
msg244188 - Author: EricG (Ericg) Date: 2015-05-27 15:59
I have a file whose first four bytes are 1F 8B 08 00 and if I use gunzip from the command line, it outputs:
gzip: zImage_extracted.gz: decompression OK, trailing garbage ignored
and correctly decompresses the file. However, if I use the gzip module to read and decompress the data, I get the following exception thrown:
  File "/usr/lib/python3.4/gzip.py", line 360, in read
    while self._read(readsize):
  File "/usr/lib/python3.4/gzip.py", line 433, in _read
    if not self._read_gzip_header():
  File "/usr/lib/python3.4/gzip.py", line 297, in _read_gzip_header
    raise OSError('Not a gzipped file')
I believe the problem I am facing is the same one described here in this SO question and answer:
http://stackoverflow.com/questions/4928560/how-can-i-work-with-gzip-files-which-contain-extra-data
This would appear to be a serious bug in the gzip module that needs to be fixed.
msg244214 - Author: Ned Deily (ned.deily) * (Python committer) Date: 2015-05-27 18:47
Can you add a public copy of a file that fails this way? There are several open issues with gzip, like Issue1159051, that might cover this but it's hard to know for sure without a test case.
msg244230 - Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-05-28 00:26
I suspect Eric’s file has non-zero, non-gzip garbage bytes appended to the end of it. Assuming I am right, here is a way to reproduce that scenario:
>>> from gzip import GzipFile
>>> from io import BytesIO
>>> file = BytesIO()
>>> with GzipFile(fileobj=file, mode="wb") as z:
...     z.write(b"data")
... 
4
>>> file.write(b"garbage")
7
>>> file.seek(0)
0
>>> GzipFile(fileobj=file).read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/proj/python/cpython/Lib/gzip.py", line 274, in read
    return self._buffer.read(size)
  File "/home/proj/python/cpython/Lib/gzip.py", line 461, in read
    if not self._read_gzip_header():
  File "/home/proj/python/cpython/Lib/gzip.py", line 409, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'ga')
This is a bit different to Issue 1508475. That one is about cases where the "gzip" trailer has been truncated, although the compressed data is probably intact. This case is the converse: extra data has been added.
All of the "gzip", "bzip2" and XZ Utils (for LZMA) command-line decompressors happily extract the compressed data without an error exit status, but emit warning messages:
gzip: stdin: decompression OK, trailing garbage ignored
bzip2: (stdin): trailing garbage after EOF ignored
xz: (stdin): Unexpected end of input
In Python, the "bz2" and "lzma" modules successfully extract the compressed data, and ignore the non-compressed garbage at the end without even a warning. On the other hand, the "gzip" module has special code to ignore trailing zero bytes (Issue 2846), but treats any other trailing non-gzip data as an error.
So I think a strong argument could be made for the ability to extract all the compressed data from a file even if there is garbage appended. The question is, how would this support be added? Perhaps the mechanism chosen could also be integrated with a fix for Issue 1508475. Some options:
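The lenient side of that comparison can be checked with a one-shot call. This is only a minimal sketch: it assumes the `bz2.decompress()` helper treats trailing bytes the same way as the file interface discussed here, and the payload is made up for illustration.

```python
import bz2

# One valid bzip2 stream followed by bytes that are not a bzip2 stream.
payload = bz2.compress(b"data") + b"garbage"

# After at least one complete stream has been decoded, leftover bytes that
# do not form a valid stream are dropped without an error or a warning.
result = bz2.decompress(payload)
print(result)
```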
* Silently ignore the condition by default like the other compression modules (consistent, but could silently swallow real errors)
* An optional new GzipFile(strict=False) mode
* Perhaps an exception deferred until close() is called
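Until one of these options exists in the module, a caller can get the lenient behavior by driving zlib directly. The following is a rough sketch rather than a proposed patch: `split_gzip_and_trailer` is a hypothetical helper name, it works on an in-memory bytes object only, and it uses `gzip.BadGzipFile`, which needs Python 3.8 or later.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def split_gzip_and_trailer(raw):
    """Hypothetical helper: return (decompressed_data, trailing_bytes).

    Decodes consecutive gzip members and stops at the first position that
    does not start with the gzip magic number.
    """
    if raw[:2] != GZIP_MAGIC:
        raise gzip.BadGzipFile("Not a gzipped file (%r)" % raw[:2])
    chunks = []
    rest = raw
    while rest[:2] == GZIP_MAGIC:
        # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        member = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
        chunks.append(member.decompress(rest))
        if not member.eof:
            raise EOFError("Compressed member ended before its end-of-stream marker")
        # Whatever follows the member's trailer: another member or garbage.
        rest = member.unused_data
    return b"".join(chunks), rest


data, trailer = split_gzip_and_trailer(gzip.compress(b"data") + b"garbage")
print(data, trailer)
```

Returning the trailing bytes leaves the decision to the caller, who can ignore them, warn, or raise. Trailing bytes that happen to begin with the magic number but are not a valid member still raise `zlib.error` in this sketch, as does a CRC mismatch.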
msg245368 - Author: Nick Czeczulin (nczeczulin) Date: 2015-06-15 06:58
The spec allows for multi-member files. Some libraries and utilities seem to solve this problem (incorrectly?) by simply ignoring everything past the first member -- even when valid (e.g., DotNetZip, 7-Zip).
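For reference, Python's own module does read past the first member. A small illustration with made-up payloads:

```python
import gzip

# Two independently compressed members concatenated into one gzip stream.
two_members = gzip.compress(b"first,") + gzip.compress(b"second")

# Both members are decoded and their contents joined in order.
combined = gzip.decompress(two_members)
print(combined)
```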
For 2.7 and 3.4, the data that has been decompressed but not yet read before the exception was raised is still available:
Modifying Martin's example slightly:
>>> f = BytesIO()
>>> with GzipFile(fileobj=f, mode="wb") as z:
...     z.write(b"data")
...
4
>>> f.write(b"garbage")
7
>>> f.seek(0)
0
>>> with GzipFile(fileobj=f, mode="rb") as z:
...     try:
...         z.read(1)
...         z.read()
...     except OSError as e:
...         z.extrabuf[z.offset - z.extrastart:]
...         e
...
b'd'
b'ata'
OSError('Not a gzipped file',)
My issue is that catching and handling this specific exception is a little more involved because there are 3(?) different OSErrors (IOError on 2.7) that could potentially be raised during the read. But mostly:
OSError('CRC check failed 0x447ba3f9 != 0x225cb2a3',) -- would be a bad one to mistake for it.
Maybe a specific Exception type to catch for an invalid header, and a better method to read the remaining buffer when handling it?
msg245369 - Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-06-15 08:17
Just noticed in my previous message I mentioned Issue 1508475 a few times when I meant to say Issue 1159051.
In Python 3.5, a workaround is not so easy because we would need to access the internal buffer of a BufferedReader. One potential workaround is to use read1():
>>> z.read1(1)
b'd'
>>> z.read1()
b'ata'
>>> z.read1()
OSError: Not a gzipped file (b'ga')
The only practical way to allow for an exception and read() returning all the data is to defer the exception until close() is called. Another option might be to store a list of defects, similar to "email.message.Message.defects".
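The read1() workaround can be wrapped in a loop. This is a sketch under assumptions, not a recommendation: `read_until_bad_header` is a hypothetical name, `gzip.BadGzipFile` requires Python 3.8 or later, and the loop relies on read1() returning already-decompressed data before the next header is examined, as in the session above.

```python
import gzip
import io


def read_until_bad_header(fileobj):
    """Hypothetical helper: collect decompressed data until a later read fails."""
    chunks = []
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as z:
        while True:
            try:
                chunk = z.read1()
            except gzip.BadGzipFile:
                if not chunks:
                    raise  # Nothing was decoded, so this is not a gzip file.
                # Assumption: the error comes from the bytes after the last
                # member. A CRC failure is also reported as BadGzipFile, so
                # this loop cannot tell the two cases apart.
                break
            if not chunk:
                break  # Clean end of file.
            chunks.append(chunk)
    return b"".join(chunks)


raw = io.BytesIO(gzip.compress(b"data") + b"garbage")
recovered = read_until_bad_header(raw)
print(recovered)
```

The comment in the except branch is the weak point raised earlier in the thread: without a dedicated exception type for an invalid header, a corrupted member and trailing garbage look the same to the caller.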
msg407148 - Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-11-27 14:47
Reproduced on 3.11:
>>> from gzip import GzipFile
>>> from io import BytesIO
>>> file = BytesIO()
>>> with GzipFile(fileobj=file, mode="wb") as z:
...     z.write(b"data")
... 
4
>>> file.write(b"garbage")
7
>>> file.seek(0)
0
>>> GzipFile(fileobj=file).read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/iritkatriel/src/cpython-654/Lib/gzip.py", line 301, in read
    return self._buffer.read(size)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iritkatriel/src/cpython-654/Lib/_compression.py", line 118, in readall
    while data := self.read(sys.maxsize):
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iritkatriel/src/cpython-654/Lib/gzip.py", line 499, in read
    if not self._read_gzip_header():
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iritkatriel/src/cpython-654/Lib/gzip.py", line 468, in _read_gzip_header
    last_mtime = _read_gzip_header(self._fp)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iritkatriel/src/cpython-654/Lib/gzip.py", line 428, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
gzip.BadGzipFile: Not a gzipped file (b'ga')
msg407280 - Author: Ruben Vorderman (rhpvorderman) * Date: 2021-11-29 14:37
From the spec:
https://datatracker.ietf.org/doc/html/rfc1952
 2.2. File format
 A gzip file consists of a series of "members" (compressed data
 sets). The format of each member is specified in the following
 section. The members simply appear one after another in the file,
 with no additional information before, between, or after them.
Gzip files with garbage after them are corrupted or not spec compliant. Therefore the gzip module should raise an error in this case.
msg407282 - Author: Ruben Vorderman (rhpvorderman) * Date: 2021-11-29 14:53
Whoops. Sorry, I spoke too soon. If the gzip command-line tool ignores trailing garbage, it seems only logical that Python's *gzip* module should too.
I believe it can be fixed quite easily. The code should emit a warning though. I will make a PR.
msg409410 - Author: Ruben Vorderman (rhpvorderman) * Date: 2021-12-31 09:44
ping
History
Date                 User           Action  Args
2022-04-11 14:58:17  admin          set     github: 68489
2021-12-31 09:44:00  rhpvorderman   set     messages: + msg409410
2021-11-29 15:28:28  rhpvorderman   set     keywords: + patch
                                            stage: patch review
                                            pull_requests: + pull_request28076
2021-11-29 14:53:12  rhpvorderman   set     messages: + msg407282
2021-11-29 14:37:50  rhpvorderman   set     nosy: + rhpvorderman
                                            messages: + msg407280
2021-11-27 14:47:00  iritkatriel    set     versions: + Python 3.9, Python 3.10, Python 3.11, - Python 3.4
                                            nosy: + iritkatriel
                                            messages: + msg407148
                                            type: behavior
2015-06-15 08:17:58  martin.panter  set     messages: + msg245369
                                            components: + Library (Lib), - Extension Modules
2015-06-15 06:58:54  nczeczulin     set     nosy: + nczeczulin
                                            messages: + msg245368
2015-05-28 00:26:40  martin.panter  set     nosy: + martin.panter
                                            messages: + msg244230
2015-05-27 18:47:45  ned.deily      set     type: crash -> (no value)
                                            messages: + msg244214
                                            nosy: + ned.deily
2015-05-27 15:59:03  Ericg          create
