This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.
Created on 2013-05-17 22:27 by Michael.Fox, last changed 2022-04-11 14:57 by admin.
| Files | | | |
|---|---|---|---|
| File name | Uploaded | Description | Edit |
| decomp-optim.patch | martin.panter, 2015-06-01 13:51 | | review |
| decomp-optim.v2.patch | martin.panter, 2015-06-03 11:12 | | review |
| decomp-optim.v3.patch | serhiy.storchaka, 2015-06-06 14:19 | | review |
| decomp-optim.v4.patch | martin.panter, 2015-06-09 13:41 | | review |
| Messages (44) | | |
|---|---|---|
| msg189488 | Author: Michael Fox (Michael.Fox) | Date: 2013-05-17 22:27 |
import lzma
count = 0
f = lzma.LZMAFile('bigfile.xz', 'r')
for line in f:
count += 1
print(count)
Comparing python2 with pyliblzma to python3.3.1 with the new lzma:
m@air:~/q/topaz/parse_datalog$ time python lzmaperf.py
102368
real 0m0.062s
user 0m0.056s
sys 0m0.004s
m@air:~/q/topaz/parse_datalog$ time python3 lzmaperf.py
102368
real 0m7.506s
user 0m7.484s
sys 0m0.012s
Profiling shows most of the time is spent here:
102371 6.881 0.000 6.972 0.000 lzma.py:247(_read_block)
I also notice that reading the entire file into memory with f.read() is perfectly fast.
I think it has something to do with lack of buffering.
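For comparison, a minimal sketch of the whole-file read that is reported as fast above (same hypothetical bigfile.xz):

import lzma

# Reading the stream in one call avoids the per-line, Python-level
# buffer management that makes line iteration slow here.
with lzma.LZMAFile('bigfile.xz', 'r') as f:
    data = f.read()
print(data.count(b'\n'))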
|
|||
| msg189542 | Author: Nadeem Vawda (nadeem.vawda) * (Python committer) | Date: 2013-05-18 19:46 |
Have you tried running the benchmark against the default (3.4) branch? There was some significant optimization work done in issue 16034, but the changes were not backported to 3.3. |
|||
| msg189550 | Author: Michael Fox (Michael.Fox) | Date: 2013-05-18 21:12 |
3.4 is much better, but still 4x slower than 2.7:

m@air:~/q/topaz/parse_datalog$ time python2.7 lzmaperf.py
102368

real 0m0.053s
user 0m0.052s
sys 0m0.000s

m@air:~/q/topaz/parse_datalog$ time ~/tmp/cpython-23836f17e4a2/bin/python3.4 lzmaperf.py
102368

real 0m0.229s
user 0m0.212s
sys 0m0.012s

The bottleneck has moved here:

102369 0.151 0.000 0.226 0.000 lzma.py:333(readline)

I don't know if this is a strictly fair comparison. The lzma module and pyliblzma may not be of the same quality. I've just come across a real bug in pyliblzma. It doesn't apply to this test, but who knows what shortcuts it's taking.

Finally, here's a baseline:

m@air:~/q/topaz/parse_datalog$ time xzcat bigfile.xz | wc -l
102368

real 0m0.034s
user 0m0.024s
sys 0m0.016s
|
|||
| msg189552 | Author: Michael Fox (Michael.Fox) | Date: 2013-05-18 21:48 |
I looked into it a little, and it looks like pyliblzma is a pure C extension, whereas the new lzma library wraps liblzma but the rest is Python. In particular, this happens for every line:

if size < 0:
    end = self._buffer.find(b"\n", self._buffer_offset) + 1
    if end > 0:
        line = self._buffer[self._buffer_offset : end]
        self._buffer_offset = end
        self._pos += len(line)
        return line

And while that doesn't look like a lot of overhead, it's definitely something. So, unless someone thinks that a pure C extension is the right technical direction, lzma in 3.4 is probably as fast as it's ever going to be. I will just use the workaround of piping in unxz regardless.
|
|||
| msg189553 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013-05-18 22:11 |
Try `f = io.BufferedReader(f)`.
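A minimal sketch of the suggested wrapping, applied to the benchmark script from the original report (bigfile.xz assumed from there):

import io
import lzma

# BufferedReader's C-level readline() replaces LZMAFile's
# pure-Python per-line buffer slicing.
count = 0
f = io.BufferedReader(lzma.LZMAFile('bigfile.xz', 'r'))
for line in f:
    count += 1
print(count)
|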
|||
| msg189567 | Author: Raymond Hettinger (rhettinger) * (Python committer) | Date: 2013-05-19 03:38 |
> So, unless someone thinks that a pure C extension is the
> right technical direction, lzma in 3.4 is probably as fast
> as it's ever going to be.

I would support the inclusion of a C extension. Reasonable performance is a prerequisite for broader adoption.
|
|||
| msg189568 | Author: Raymond Hettinger (rhettinger) * (Python committer) | Date: 2013-05-19 03:47 |
Serhiy, would you like to take this one? |
|||
| msg189575 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013-05-19 09:02 |
I'm against implementing LZMAFile in pure C. It was a great win that LZMAFile was implemented in pure Python. However, maybe we could reuse the existing accelerated implementation of io.BufferedReader.
|
|||
| msg189592 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2013-05-19 14:07 |
I second Serhiy here. Wrapping the LZMAFile in a BufferedReader is the simple solution to the performance problem:
$ ./python -m timeit -s "import lzma, io" "f=lzma.LZMAFile('words.xz', 'r')" "for line in f: pass"
10 loops, best of 3: 148 msec per loop
$ ./python -m timeit -s "import lzma, io" "f=io.BufferedReader(lzma.LZMAFile('words.xz', 'r'))" "for line in f: pass"
10 loops, best of 3: 44.3 msec per loop
$ time xzcat words.xz | wc -l
99156
real 0m0.021s
user 0m0.016s
sys 0m0.004s
Perhaps the top-level lzma.open() should do the wrapping for you, though.
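A sketch of what that wrapping might look like (simplified, with a hypothetical helper name; the real lzma.open() also handles text mode, and this is not the committed patch):

import io
import lzma

def open_wrapped(filename, mode="rb", **kwargs):
    # Hypothetical helper: buffer reads, leave writes raw.
    binary_file = lzma.LZMAFile(filename, mode, **kwargs)
    if "r" in mode:
        return io.BufferedReader(binary_file)
    return binary_file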
Interestingly, opening in text (i.e. unicode) mode is almost as fast as with a BufferedReader:
$ ./python -m timeit -s "import lzma, io" "f=lzma.open('words.xz', 'rt')" "for line in f: pass"
10 loops, best of 3: 51.1 msec per loop
|
|||
| msg189605 | Author: Michael Fox (Michael.Fox) | Date: 2013-05-19 16:49 |
io.BufferedReader works well for me. Thanks for the good suggestion. Now Python 3.3 and 3.4 have similar performance to each other, and they are only 2x slower than pyliblzma.

From my perspective, default wrapping with io.BufferedReader is a great idea. I can't think of who would suffer. Maybe someone who wants to open thousands of simultaneous streams wouldn't appreciate the memory overhead. If that person exists, then he would want an option to turn it off.

m@air:~/q/topaz/parse_datalog$ time python2 lzmaperf.py
102368

real 0m0.049s
user 0m0.040s
sys 0m0.008s

m@air:~/q/topaz/parse_datalog$ time python3 lzmaperf.py
102368

real 0m0.109s
user 0m0.092s
sys 0m0.020s

m@air:~/q/topaz/parse_datalog$ time ~/tmp/cpython-23836f17e4a2/bin/python3 lzmaperf.py
102368

real 0m0.101s
user 0m0.084s
sys 0m0.012s
|
|||
| msg189611 | Author: Nadeem Vawda (nadeem.vawda) * (Python committer) | Date: 2013-05-19 17:52 |
I agree that making lzma.open() wrap its return value in a BufferedReader (or BufferedWriter, as appropriate) is the way to go. I'm currently travelling and don't have my SSH key with me - Serhiy, can you make the change? I'll put together a documentation patch that recommends using lzma.open() rather than LZMAFile directly, and mentions the performance implications.

> Interestingly, opening in text (i.e. unicode) mode is almost as fast as
> with a BufferedReader:

This is because opening in text mode returns a TextIOWrapper, which is written in C, and presumably does its own buffering on top of LZMAFile.read1() instead of calling LZMAFile.readline().

> From my perspective default wrapping with io.BufferedReader is a great
> idea. I can't think of who would suffer. Maybe someone who wants to
> open thousands of simultaneous streams wouldn't appreciate the memory
> overhead. If that person exists then he would want an option to turn
> it off.

If someone doesn't want the BufferedReader/BufferedWriter, they can create an LZMAFile directly; we don't plan to remove that possibility. So I don't think that should be a problem.
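Roughly what text mode does under the hood (a sketch; the encoding here is an assumption, since lzma.open() derives it from its arguments):

import io
import lzma

# Text mode is approximately equivalent to wrapping the binary file
# in a TextIOWrapper yourself; the wrapper buffers and splits lines in C.
binary = lzma.LZMAFile('words.xz')
text = io.TextIOWrapper(binary, encoding='utf-8')
for line in text:
    pass
|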
|||
| msg189616 | Author: Nadeem Vawda (nadeem.vawda) * (Python committer) | Date: 2013-05-19 18:50 |
> I agree that making lzma.open() wrap its return value in a BufferedReader
> (or BufferedWriter, as appropriate) is the way to go.
On second thoughts, there's no need to change the behavior for mode='wb'.
We can just return a BufferedReader for mode='rb', and leave the current
behavior (returning a raw LZMAFile) in place for mode='wb'.
I also ran some additional benchmarks for the bz2 and gzip modules. It
looks like those two modules would also benefit from having their open()
functions use io.BufferedReader:
[lzma]
$ time xzcat src.xz | wc -l
1057980
real 0m0.543s
user 0m0.556s
sys 0m0.024s
$ ../cpython/python -m timeit -s 'import lzma, io' 'f = lzma.open("src.xz", "r")' 'for line in f: pass'
10 loops, best of 3: 2.01 sec per loop
$ ../cpython/python -m timeit -s 'import lzma, io' 'f = io.BufferedReader(lzma.open("src.xz", "r"))' 'for line in f: pass'
10 loops, best of 3: 795 msec per loop
[bz2]
$ time bzcat src.bz2 | wc -l
1057980
real 0m1.322s
user 0m1.324s
sys 0m0.044s
$ ../cpython/python -m timeit -s 'import bz2, io' 'f = bz2.open("src.bz2", "r")' 'for line in f: pass'
10 loops, best of 3: 3.71 sec per loop
$ ../cpython/python -m timeit -s 'import bz2, io' 'f = io.BufferedReader(bz2.open("src.bz2", "r"))' 'for line in f: pass'
10 loops, best of 3: 2.04 sec per loop
[gzip]
$ time zcat src.gz | wc -l
1057980
real 0m0.310s
user 0m0.296s
sys 0m0.028s
$ ../cpython/python -m timeit -s 'import gzip, io' 'f = gzip.open("src.gz", "r")' 'for line in f: pass'
10 loops, best of 3: 1.94 sec per loop
$ ../cpython/python -m timeit -s 'import gzip, io' 'f = io.BufferedReader(gzip.open("src.gz", "r"))' 'for line in f: pass'
10 loops, best of 3: 556 msec per loop
|
|||
| msg189665 | Author: Michael Fox (Michael.Fox) | Date: 2013-05-20 14:28 |
I was thinking about this line:

end = self._buffer.find(b"\n", self._buffer_offset) + 1

Might this be a bug? For example, is there a Unicode encoding where one of the several bytes of a character is '\n'? In that case it splits the line in the middle of a character, right?
|
|||
| msg189668 | Author: Nadeem Vawda (nadeem.vawda) * (Python committer) | Date: 2013-05-20 14:42 |
No, that is the intended behavior for binary streams - they operate at the level of individual bytes. If you want to treat your input file as Unicode-encoded text, you should open it in text mode. This will return a TextIOWrapper which handles the decoding and line splitting properly. |
|||
| msg189675 | Author: Michael Fox (Michael.Fox) | Date: 2013-05-20 16:41 |
You're right. In fact, what doesn't make sense is to be doing
line-oriented reads on a binary file. Why was I doing that?
I do have another quibble though. The open() function is like this:
open(file, mode='r', buffering=-1, encoding=None,
errors=None, newline=None, closefd=True, opener=None) -> file object
The lzma.open() function is like this:
lzma.open = open(filename, mode='rb', *, format=None, check=-1,
preset=None, filters=None, encoding=None, errors=None, newline=None)
It seems to me that it would be best for them to be as congruent as
possible. Because people will try to do this (I did):
if filename.endswith('.xz'):
    f = lzma.open(filename)
else:
    f = open(filename)
for line in f: ...
And then they will be in for a surprise. Would you consider changing
the default mode of lzma.open() to 'rt' and implement the 'buffering'
parameter as it is implemented in open()? And further, can we discuss
whether "duck typing" is becoming generally problematic in an
expanding standard library and whether there should be some process,
language, testing or something to ensure the ducks really quack the
same?
For example, there could be a standard testsuite which everything
purporting to implement an open() function should be subject to.
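In the meantime, one defensive pattern (a sketch with a hypothetical helper name, not something from this thread) is to spell the mode out so both flavours of open() agree:

import lzma

def open_text(filename):
    # 'rt' is the default for built-in open(), but must be given
    # explicitly to lzma.open(), whose default mode is 'rb'.
    if filename.endswith('.xz'):
        return lzma.open(filename, 'rt')
    return open(filename, 'rt')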
|
|||
| msg189676 | Author: Michael Fox (Michael.Fox) | Date: 2013-05-20 16:50 |
I thought of an even more hazardous case:

if compression == 'gz':
    import gzip
    open = gzip.open
elif compression == 'xz':
    import lzma
    open = lzma.open
else:
    pass
|
|||
| msg189679 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013-05-20 17:32 |
Wrapping a raw LZMAFile in a BufferedReader is a simple solution. But I am thinking about extending BufferedReader so that LZMAFile and BufferedReader could use a common buffer. Perhaps add a new method to BufferedIOBase which would be called when the buffer underflows. In BufferedIOBase it would call raw.read(); in LZMAFile it would call _fill_buffer().
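A rough sketch of that design (hypothetical class and method names; nothing like this exists in the io module):

class HookedBufferedReader:
    """A buffered reader whose refill step is an overridable hook."""

    def __init__(self, raw, buffer_size=8192):
        self.raw = raw
        self.buffer_size = buffer_size
        self._buf = b''

    def _fill(self):
        # Default underflow behaviour: read more raw bytes. A
        # decompressing subclass would instead decompress here,
        # the way LZMAFile._fill_buffer() does.
        return self.raw.read(self.buffer_size)

    def readline(self):
        while b'\n' not in self._buf:
            chunk = self._fill()
            if not chunk:
                break
            self._buf += chunk
        end = self._buf.find(b'\n') + 1 or len(self._buf)
        line, self._buf = self._buf[:end], self._buf[end:]
        return line
|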
|||
| msg189680 | Author: Michael Fox (Michael.Fox) | Date: 2013-05-20 17:34 |
I thought about it some more, and the only bug here is mine: failing to explicitly set mode='rt'.

Maybe back when someone invented text and binary modes, they should have been clear which was to be the default for all things. Maybe when someone made the base class, io.IOBase, they should have defined an .open() with a mode that matched open(). Maybe when someone first implemented gzip.open(), they should have matched open(). But that's not what happened, and there's going to be lots of code out there relying on the default 'rt' for open() and 'rb' for gzip/bz2/lzma.open(). There's going to be lots of bugs in the future as people familiar with one thing assume the default is the same for the other. But oh well. It's too late to change.
|
|||
| msg189922 | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2013-05-24 16:57 |
A higher-level interface to abstract differences between gzip, xz and others is actually provided in the tarfile module. (zipfile is left out and its file objects have different methods, but that’s another issue. shutil provides even higher-level functions to work on top of tarfile or zipfile.) |
|||
| msg198144 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013-09-20 12:41 |
See issue19051. Even the preliminary Python implementation noticeably speeds up the reading of short lines:

$ ./python -m timeit -s "import lzma, io" "f=lzma.LZMAFile('words.xz', 'r')" "for line in f: pass"

Unpatched: 1.44 sec per loop
Patched: 1.06 sec per loop

With a C implementation it should be as fast as with BufferedReader.
|
|||
| msg198146 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2013-09-20 12:49 |
> With C implementation it should be as fast as with BufferedReader.

So why not simply use BufferedReader?
|
|||
| msg198160 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013-09-20 18:49 |
> So why not simply use BufferedReader?

Because we want good performance for LZMAFile and compatibility with older versions. And I guess that it will be even faster than wrapping in a BufferedReader (due to avoiding double buffering).
|
|||
| msg198161 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2013-09-20 18:53 |
> > So why not simply use BufferedReader?
>
> Because we want good performance for LZMAFile and compatibility with
> older versions.

You're reading me wrong. I'm simply suggesting that users interested in readline() performance wrap LZMAFile in a BufferedReader. The documentation can mention it.

> And I guess that it will be even faster than wrapping in
> BufferedReader (due to the avoiding of double buffering).

Let's wait for the numbers, then. The performance increase would have to be quite large to justify such code duplication.
|
|||
| msg233870 | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2015-01-12 01:35 |
I haven’t done any tests, but my LZMAFile patch to Issue 15955 uses BufferedReader, so it might satisfy this issue. |
|||
| msg244582 | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2015-06-01 13:51 |
This bug was originally raised against Python 3.3, and the speed has improved a lot since then. Perhaps this bug can be closed as it is, or maybe people would like to consider my decomp-optim.patch, which squeezes a bit more speed out. I don’t actually have a strong opinion either way.

Python 3.4 was apparently much faster than 3.3, courtesy of Issue 16034. In Python 3.5, all three decompression modules (lzma, gzip and bz2) now use a BufferedReader internally, due to my work in Issue 23529. The modules delegate method calls to the internal BufferedReader, rather than returning an instance directly, for backwards compatibility. I found that bypassing the readline() delegation speeds things up significantly, and adding a custom "closed" property on the underlying raw reader class also helps. However, I did not think it would be wise to bypass the locking in the bz2 module, so I didn’t bypass BZ2File.readline() in the patch.

Timing results and the test script I used to investigate different options below:

                          lzma     gzip      bz2
                         =======  ========  ========
Unpatched                 3.2 s    2.513 s   5.180 s
Custom __iter__()         1.31 s   1.317 s   2.433 s
__iter__() and closed     0.53 s*  0.543 s*  1.650 s
closed change only                           4.047 s*
External BufferedReader   0.64 s   0.597 s   1.750 s
Direct from BytesIO       0.33 s   0.370 s   1.280 s
Command-line tool         0.063 s  0.053 s   0.993 s

* Option implemented in decomp-optim.patch

---

import lzma, io
filename = "pacman.log.xz"  # 256206 lines; 389 kB -> 13 MB

# Basic case
reader = lzma.LZMAFile(filename)  # 3.2 s

# Add __iter__() optimization
def lzma_iter(self):
    self._check_can_read()
    return iter(self._buffer)
lzma.LZMAFile.__iter__ = lzma_iter  # 1.31 s

# Add "closed" optimization
def decompressor_closed(self):
    return self._decompressor is None
import _compression
_compression.DecompressReader.closed = property(decompressor_closed)  # 0.53 s

#~ # External BufferedReader baseline
#~ reader = io.BufferedReader(lzma.LZMAFile(filename))  # 0.64 s

#~ # Direct from BytesIO baseline
#~ with open(filename, "rb") as file:
#~     data = file.read()
#~ reader = io.BytesIO(lzma.decompress(data))  # 0.33 s

for line in reader:
    pass
|
|||
| msg244691 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2015-06-02 21:22 |
This looks good to me. Larry would probably have to validate it for 3.5, although we may try to sneak it in (he isn't reading :-D). |
|||
| msg244696 | Author: Larry Hastings (larry) * (Python committer) | Date: 2015-06-02 21:51 |
What? I only understand French. |
|||
| msg244698 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2015-06-02 21:51 |
We were saying that you would probably have to validate this change, but that we might also be able to slip it quietly into the codebase, seeing as you don't read these messages. |
|||
| msg244702 | Author: Larry Hastings (larry) * (Python committer) | Date: 2015-06-02 22:09 |
If I understand this correctly, I can ignore everything up to May 2015, as it has to do with line-reading a compressed binary file (!) being slow. Then, Martin Panter proposes a new optimization in May 2015, which is to simply add __iter__ methods to gzip.GzipFile and lzma.LZMAFile. Is this right? |
|||
| msg244729 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2015-06-03 07:19 |
Yes, this is right. |
|||
| msg244734 | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2015-06-03 10:04 |
Yes, that’s basically right, Larry. The __iter__() was previously inherited; now I am overriding it with a custom version. Similarly for the "closed" property, but that one is only a member of objects internal to the gzip, lzma and bz2 modules. |
|||
| msg244735 | Author: Larry Hastings (larry) * (Python committer) | Date: 2015-06-03 10:20 |
I don't see anything about "closed" in the patch you posted. |
|||
| msg244738 | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2015-06-03 11:07 |
Looking at <https://bugs.python.org/file39586/decomp-optim.patch>, the "closed" property is the first of the three hunks:

1. Adds @property / def closed(self) to Lib/_compression.py
2. Adds def __iter__(self) to Lib/gzip.py
3. Adds def __iter__(self) to Lib/lzma.py
|
|||
| msg244739 | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2015-06-03 11:12 |
The new patch just fixes the spelling error in the comment. |
|||
| msg244904 | Author: Larry Hastings (larry) * (Python committer) | Date: 2015-06-06 13:08 |
A small last-minute optimization is not a release-blocker. |
|||
| msg244909 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015-06-06 14:19 |
bz2 would gain great benefit from such an optimization too.
Microbenchmark results:
$ ./python -m timeit -s "import gzip" -- "f=gzip.GzipFile('words.gz', 'r')" "for line in f: pass"
2.7: 10 loops, best of 3: 374 msec per loop
3.2: 10 loops, best of 3: 325 msec per loop
3.3: 10 loops, best of 3: 311 msec per loop
3.4: 10 loops, best of 3: 328 msec per loop
3.5: 10 loops, best of 3: 325 msec per loop
3.5+decomp-optim.v3: 10 loops, best of 3: 61.2 msec per loop
$ ./python -m timeit -s "import bz2" -- "f=bz2.BZ2File('words.bz2', 'r')" "for line in f: pass"
2.7: 10 loops, best of 3: 92.1 msec per loop
3.2: 10 loops, best of 3: 92.4 msec per loop
3.3: 10 loops, best of 3: 567 msec per loop
3.4: 10 loops, best of 3: 535 msec per loop
3.5: 10 loops, best of 3: 603 msec per loop
3.5+decomp-optim.v2: 10 loops, best of 3: 525 msec per loop
3.5+decomp-optim.v3: 10 loops, best of 3: 131 msec per loop
$ python -m timeit -s "import lzma" -- "f=lzma.LZMAFile('words.xz', 'r')" "for line in f: pass"
2.7: 10 loops, best of 3: 49.4 msec per loop
3.3: 10 loops, best of 3: 1.67 sec per loop
3.4: 10 loops, best of 3: 400 msec per loop
3.5: 10 loops, best of 3: 423 msec per loop
3.5+decomp-optim.v3: 10 loops, best of 3: 89.6 msec per loop
The fact that bz2 and lzma have a 5-15% regression in 3.5 (compared to 3.4) makes applying this patch to 3.5 more desirable.
|
|||
| msg244967 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2015-06-07 18:58 |
This looks good to me. |
|||
| msg244969 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015-06-07 19:23 |
Perhaps this change is worth mentioning in What's New. Could you add this, Martin? It would also be nice to add tests to ensure that next() after closing the file always raises ValueError.
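A sketch of the kind of test being requested (self-contained; the temporary file name is assumed, and the real tests ended up in decomp-optim.v4.patch):

import lzma
import unittest

class ClosedIterationTest(unittest.TestCase):
    def test_next_after_close_raises(self):
        # Write a tiny .xz file so the test can run on its own.
        with lzma.open('tmp_test.xz', 'wt') as f:
            f.write('one\ntwo\n')
        f = lzma.LZMAFile('tmp_test.xz')
        f.close()
        with self.assertRaises(ValueError):
            next(f)

if __name__ == '__main__':
    unittest.main()
|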
|||
| msg245064 | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2015-06-09 13:41 |
This patch adds an entry to the What’s New for 3.5 (though maybe it will have to be 3.6), and adds three tests to check that next() raises ValueError when the files have been closed. |
|||
| msg245066 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015-06-09 14:51 |
Larry, do you accept the patch for 3.5? |
|||
| msg245067 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2015-06-09 15:16 |
He accepted it already: """A small last-minute optimization is not a release-blocker.""" |
|||
| msg245090 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015-06-09 20:16 |
The patch is not so harmless. First, my change in BZ2File is not correct, because reading every line should be guarded with a lock (BZ2File is thread-safe). Second, for now all three compressed file types are not only iterables, but iterators: iter(f) returns f, and changing this can have non-obvious effects. I think the patch is too complex for 3.5; we should have more time to analyze all the consequences.
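A small illustration of the protocol point (a sketch; the file name is assumed):

import lzma

f = lzma.LZMAFile('words.xz')
it = iter(f)
assert it is f  # Holds today: LZMAFile is its own iterator.
# With a custom __iter__() returning the internal buffer's iterator,
# `it` would be a different object, so code that relies on
# iter(f) being f, or that mixes next(it) with f.readline(),
# could change behaviour in non-obvious ways.
|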
|||
| msg245096 | Author: Larry Hastings (larry) * (Python committer) | Date: 2015-06-09 22:06 |
Sounds good to me. |
|||
| msg245107 | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2015-06-10 03:07 |
The BufferedReader class is documented as being thread safe: <https://docs.python.org/dev/library/io.html#multi-threading>. Some experimentation suggests that checking the "raw.closed" property is not actually serialized, but that raw.readinto() calls are serialized. I don’t think this is a big problem in practice, so I think BZ2File would remain as thread-safe as BufferedReader is. But Serhiy’s point is definitely valid about the classes breaking the iterator protocol.

FWIW, I originally made this patch to satisfy my personal curiosity about why wrapping a second BufferedReader made things so much faster. Now I accept it is partly due to the overhead of the extra LZMAFile.readline() to BufferedReader.readline() delegation. Maybe it is not even worth optimizing around this kind of overhead, so I would even be happy to drop this patch and close the issue.
|
|||
| History | | | |
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:45 | admin | set | github: 62203 |
| 2015-06-10 03:07:25 | martin.panter | set | messages: + msg245107 |
| 2015-06-09 22:06:58 | larry | set | messages: + msg245096 |
| 2015-06-09 20:16:29 | serhiy.storchaka | set | stage: commit review -> patch review; messages: + msg245090; versions: - Python 3.5 |
| 2015-06-09 15:16:13 | pitrou | set | messages: + msg245067 |
| 2015-06-09 14:51:16 | serhiy.storchaka | set | messages: + msg245066 |
| 2015-06-09 13:41:12 | martin.panter | set | files: + decomp-optim.v4.patch; messages: + msg245064 |
| 2015-06-07 19:23:37 | serhiy.storchaka | set | messages: + msg244969 |
| 2015-06-07 18:58:17 | pitrou | set | messages: + msg244967; stage: patch review -> commit review |
| 2015-06-06 14:19:58 | serhiy.storchaka | set | files: + decomp-optim.v3.patch; messages: + msg244909 |
| 2015-06-06 13:08:49 | larry | set | priority: release blocker -> normal; messages: + msg244904 |
| 2015-06-06 13:05:19 | pitrou | set | priority: normal -> release blocker |
| 2015-06-03 11:12:52 | martin.panter | set | files: + decomp-optim.v2.patch; messages: + msg244739; stage: needs patch -> patch review |
| 2015-06-03 11:07:23 | martin.panter | set | messages: + msg244738 |
| 2015-06-03 10:20:08 | larry | set | messages: + msg244735 |
| 2015-06-03 10:04:32 | martin.panter | set | messages: + msg244734 |
| 2015-06-03 07:19:57 | pitrou | set | messages: + msg244729 |
| 2015-06-02 22:09:35 | larry | set | messages: + msg244702 |
| 2015-06-02 21:51:50 | pitrou | set | messages: + msg244698 |
| 2015-06-02 21:51:00 | larry | set | messages: + msg244696 |
| 2015-06-02 21:22:40 | pitrou | set | nosy: + larry; messages: + msg244691 |
| 2015-06-01 13:51:15 | martin.panter | set | files: + decomp-optim.patch; keywords: + patch; messages: + msg244582; versions: + Python 3.5, Python 3.6, - Python 3.4 |
| 2015-01-12 01:35:50 | martin.panter | set | nosy: + martin.panter; messages: + msg233870 |
| 2013-09-20 18:53:39 | pitrou | set | messages: + msg198161 |
| 2013-09-20 18:49:32 | serhiy.storchaka | set | messages: + msg198160 |
| 2013-09-20 12:49:38 | pitrou | set | messages: + msg198146 |
| 2013-09-20 12:41:11 | serhiy.storchaka | set | messages: + msg198144 |
| 2013-05-24 16:57:47 | eric.araujo | set | nosy: + eric.araujo; messages: + msg189922; title: New lzma crazy slow with line-oriented reading. -> lzma module very slow with line-oriented reading. |
| 2013-05-20 17:34:46 | Michael.Fox | set | messages: + msg189680 |
| 2013-05-20 17:32:39 | serhiy.storchaka | set | messages: + msg189679 |
| 2013-05-20 16:50:52 | Michael.Fox | set | messages: + msg189676 |
| 2013-05-20 16:41:58 | Michael.Fox | set | messages: + msg189675 |
| 2013-05-20 14:42:43 | nadeem.vawda | set | messages: + msg189668 |
| 2013-05-20 14:28:26 | Michael.Fox | set | messages: + msg189665 |
| 2013-05-19 22:28:19 | Arfrever | set | nosy: + Arfrever |
| 2013-05-19 18:50:56 | nadeem.vawda | set | messages: + msg189616 |
| 2013-05-19 17:52:20 | nadeem.vawda | set | messages: + msg189611 |
| 2013-05-19 16:49:14 | Michael.Fox | set | messages: + msg189605 |
| 2013-05-19 14:07:00 | pitrou | set | nosy: + pitrou; messages: + msg189592 |
| 2013-05-19 09:02:53 | serhiy.storchaka | set | messages: + msg189575 |
| 2013-05-19 03:47:34 | rhettinger | set | assignee: serhiy.storchaka; stage: needs patch; messages: + msg189568; versions: + Python 3.4, - Python 3.3 |
| 2013-05-19 03:38:49 | rhettinger | set | nosy: + rhettinger; messages: + msg189567 |
| 2013-05-18 22:11:54 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg189553 |
| 2013-05-18 21:48:22 | Michael.Fox | set | messages: + msg189552 |
| 2013-05-18 21:12:40 | Michael.Fox | set | messages: + msg189550 |
| 2013-05-18 19:46:33 | nadeem.vawda | set | messages: + msg189542 |
| 2013-05-17 22:52:34 | vstinner | set | nosy: + vstinner |
| 2013-05-17 22:27:24 | Michael.Fox | create | |