
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Unify buffered readers
Type: enhancement Stage: patch review
Components: IO, Library (Lib) Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: alanmcintyre, benjamin.peterson, martin.panter, nadeem.vawda, pitrou, serhiy.storchaka, stutzbach
Priority: normal Keywords: patch

Created on 2013-09-19 14:22 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin.

Files
File name Uploaded Description Edit
buffered_reader.diff serhiy.storchaka, 2013-09-19 14:22 review
read_bench.py serhiy.storchaka, 2013-09-19 14:33 Benchmark script
read_bench_cmp serhiy.storchaka, 2013-09-19 14:34 Benchmark results
Messages (9)
msg198075 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 14:22
There are some classes in the gzip, bz2, lzma, and zipfile modules which implement a buffered reader interface. They read chunks of data from an underlying file object, decompress them, save them in an internal buffer, and provide common methods to read from this buffer. Maintaining this duplicated code is cumbersome and error-prone. The proposed preliminary patch moves the common code into a new private class, _io2._BufferedReaderMixin. If the proposal is accepted in general, I'm going to write a C version and move it into the io module, and perhaps even merge it with io.BufferedIOBase.
The idea is that all buffered reading functions (read(), read1(), readline(), peek(), etc.) can be expressed in terms of one function which returns raw unbuffered data. Subclasses need to define only that one function and get the whole buffered reader interface. In the case of the classes mentioned above, this function reads and decompresses a chunk of data from the underlying file. The HTTPResponse class would perhaps benefit too (issue19009).
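The design Serhiy describes can be sketched as follows. This is an illustrative guess at the shape of the idea, not the actual code of the attached patch (the patch's class is the private _io2._BufferedReaderMixin; all names here are simplified):

```python
# Sketch: every buffered-reading method is expressed in terms of one
# primitive, _read_raw(), which subclasses must implement to return the
# next chunk of (already decompressed) bytes, or b"" at EOF.

class BufferedReaderMixin:
    def _init_buffer(self):
        self._buffer = b""
        self._eof = False

    def _fill(self):
        # Top up the internal buffer with one raw chunk, if any remains.
        if not self._buffer and not self._eof:
            chunk = self._read_raw()
            if chunk:
                self._buffer = chunk
            else:
                self._eof = True

    def read(self, size=-1):
        if size < 0:
            parts = []
            while True:
                self._fill()
                if not self._buffer:
                    return b"".join(parts)
                parts.append(self._buffer)
                self._buffer = b""
        self._fill()
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

    def peek(self, size=1):
        self._fill()
        return self._buffer

    def readline(self):
        parts = []
        while True:
            self._fill()
            if not self._buffer:
                return b"".join(parts)
            i = self._buffer.find(b"\n")
            if i >= 0:
                parts.append(self._buffer[:i + 1])
                self._buffer = self._buffer[i + 1:]
                return b"".join(parts)
            parts.append(self._buffer)
            self._buffer = b""


class ChunkReader(BufferedReaderMixin):
    # Hypothetical subclass: defines only _read_raw() and inherits the rest.
    def __init__(self, chunks):
        self._init_buffer()
        self._chunks = iter(chunks)

    def _read_raw(self):
        return next(self._chunks, b"")


r = ChunkReader([b"ab\nc", b"d"])
assert r.readline() == b"ab\n"
assert r.read() == b"cd"
```

In a real compressed-stream subclass, _read_raw() would read one compressed chunk from the underlying file and return its decompressed bytes.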
msg198077 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 14:33
Here are the benchmark script and its results.
msg198078 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 14:47
See issue12053 for a more flexible primitive.
msg198082 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 15:26
This primitive doesn't fit the case of compressed streams well. A chunk of compressed data read from the underlying file object can decompress into unpredictably large data, so we can't limit the size of the buffer.
Another point is that the buffer interface is not very appropriate for a Python implementation, and we want to leave as much Python code in gzip, bz2, lzma and zipfile as possible. Copying from bytes into a buffer and back would just waste resources.
msg198143 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-20 12:32
If you want this, I think it should be somehow folded into existing classes (for example BufferedIOBase). Yet another implementation of readline() isn't really a good idea.
msg233747 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-01-09 12:22
For what it’s worth, it would be better if compressed streams did limit the amount of data they decompressed, so that they are not susceptible to decompression bombs; see Issue 15955. But having a flexible-sized buffer could be useful in other cases.
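The decompressor-level max_length parameter Martin refers to did eventually land (in Python 3.5, via the Issue 15955 work), and it bounds each decompression step exactly as described. A small demonstration:

```python
import bz2

# One megabyte of highly compressible data, standing in for a
# potential decompression bomb.
blob = bz2.compress(b"x" * 1_000_000)

d = bz2.BZ2Decompressor()
# max_length (Python 3.5+) bounds how much decompressed data one call
# may return; unconsumed input stays buffered inside the decompressor.
out = d.decompress(blob, max_length=4096)
assert len(out) <= 4096
assert not d.needs_input          # buffered data remains to be drained

# Drain further bounded chunks by passing empty input.
more = d.decompress(b"", max_length=4096)
assert len(more) <= 4096
```

This keeps memory use proportional to max_length rather than to the (attacker-controlled) decompressed size.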
I haven’t looked closely at the code, but I wonder if there is much difference from the existing BufferedReader. Perhaps just that the underlying raw stream in this case can deliver data in arbitrary-sized chunks, but BufferedReader expects its raw stream to deliver data in limited-sized chunks?
If the buffer were exposed, it could be used to do many things more efficiently:
* readline() with custom newline or end-of-record codes, solving Issue 1152248, Issue 17083
* scan the buffer using string operations or regular expressions etc, e.g. to skip whitespace, read a run of unescaped symbols
* tentatively read data to see if a keyword is present, but roll back if the data doesn’t match the keyword
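The last point, tentative reads with rollback, can already be approximated with io.BufferedReader.peek(), which returns buffered bytes without consuming them:

```python
import io

buf = io.BufferedReader(io.BytesIO(b"keyword: value\n"))

# peek() returns data from the buffer without advancing the position,
# so "rolling back" is simply a matter of not calling read().
head = buf.peek(8)               # may return more (or fewer) than 8 bytes
matched = head.startswith(b"keyword:")
if matched:
    buf.read(len(b"keyword:"))   # commit: consume the matched keyword
rest = buf.read()

assert matched
assert rest == b" value\n"
```

Note that peek() gives no hard guarantee on how many bytes it returns, which is precisely the limitation that exposing the buffer directly would address.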
msg233806 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-01-10 08:57
Parts of the patch here actually do the same thing as my LZMAFile patch for Issue 15955. I wish I had looked at the patch earlier! The difference is I used a proposed max_length parameter for the decompressor rather than unlimited decompression, and I used the existing BufferedReader class rather than implementing a new custom one.
The changes for the "gzip" module could probably be merged with my GzipFile patch at Issue 15955 and made to use BufferedReader.
msg240695 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-13 18:18
Is it still relevant, now that #23529 has been committed?
msg241403 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-04-18 08:22
The LZMA, gzip and bzip modules now all use BufferedReader, so Serhiy’s patch is no longer relevant for them. Serhiy’s patch also changed the zipfile module, which may be still relevant. On the other hand, perhaps it would be more ideal to use BufferedReader for the zipfile module as well.
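The pattern those modules adopted after #23529 is a raw, unbuffered decompressing stream wrapped in io.BufferedReader. A rough sketch of that pattern using zlib (the class name and details here are illustrative, not CPython's internal implementation):

```python
import io
import zlib

class DecompressReader(io.RawIOBase):
    """Raw stream that decompresses on demand; meant to be wrapped in
    io.BufferedReader, which then supplies read()/readline()/peek()."""

    def __init__(self, fp, chunk_size=8192):
        self._fp = fp                      # file with zlib-compressed data
        self._decomp = zlib.decompressobj()
        self._chunk = chunk_size

    def readable(self):
        return True

    def readinto(self, b):
        while True:
            if self._decomp.eof:
                return 0                   # end of compressed stream
            compressed = self._decomp.unconsumed_tail
            if not compressed:
                compressed = self._fp.read(self._chunk)
                if not compressed:
                    return 0               # truncated input; real code would raise
            # Decompress at most len(b) bytes; leftover compressed input is
            # kept in self._decomp.unconsumed_tail for the next call.
            data = self._decomp.decompress(compressed, len(b))
            if data:
                b[:len(data)] = data
                return len(data)


payload = b"hello world\n" * 1000
reader = io.BufferedReader(DecompressReader(io.BytesIO(zlib.compress(payload))))
line = reader.readline()       # full buffered-reader interface for free
rest = reader.read()
assert line == b"hello world\n"
assert rest == b"hello world\n" * 999
```

A zipfile member reader built this way would only need the readinto() primitive; BufferedReader would provide everything else.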
History
Date User Action Args
2022-04-11 14:57:51 admin set github: 63251
2015-04-18 08:22:44 martin.panter set messages: + msg241403
2015-04-13 18:18:54 pitrou set messages: + msg240695
2015-01-10 08:57:31 martin.panter set messages: + msg233806
2015-01-09 12:22:56 martin.panter set messages: + msg233747
2014-12-21 03:08:24 martin.panter set nosy: + martin.panter
2013-09-20 12:32:24 pitrou set messages: + msg198143
2013-09-19 15:46:33 serhiy.storchaka set messages: - msg198083
2013-09-19 15:26:50 serhiy.storchaka set messages: + msg198083
2013-09-19 15:26:00 serhiy.storchaka set messages: + msg198082
2013-09-19 14:47:25 pitrou set messages: + msg198078
2013-09-19 14:34:26 serhiy.storchaka set files: + read_bench_cmp
2013-09-19 14:33:17 serhiy.storchaka set files: + read_bench.py, messages: + msg198077
2013-09-19 14:22:09 serhiy.storchaka create
