
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: urllib.urlopen results.readline is slow
Type:
Stage:
Components: Library (Lib)
Versions: Python 2.6
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To: gstein
Nosy List: ajaksu2, akuchling, gstein, gvanrossum, kbdavidson, nobody, reacocard
Priority: normal
Keywords:

Created on 2002-01-24 21:48 by kbdavidson, last changed 2022-04-10 16:04 by admin. This issue is now closed.

Messages (9)
msg8975 - Author: Keith Davidson (kbdavidson) Date: 2002-01-24 21:48
The socket file object underlying the return from urllib.urlopen() is
opened without any buffering, resulting in very slow performance of
results.readline(). The specific problem is in the httplib.HTTPResponse
constructor: it calls sock.makefile() with 0 for the buffer size. Forcing
the buffer size to 4096 brings the time for calling readline() on a 60K
character line from 16 seconds down to 0.27 seconds (there is other
processing going on here, but the magnitude of the difference is correct).
I am using Python 2.0 so I cannot submit a patch easily, but the problem
appears to still be present in the 2.2 source. The specific change is to
change the 0 in sock.makefile() to 4096 or some other reasonable buffer
size:
class HTTPResponse:
    def __init__(self, sock, debuglevel=0):
        self.fp = sock.makefile('rb', 0)   # <= change the 0 to 4096
        self.debuglevel = debuglevel
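
For context, a minimal sketch (not part of the original report) of the kind of measurement being described. The host, port, and test path are assumptions, and a local server that returns one very long body line is presumed to be running; on Python 2, a bufsize of 0 makes the socket file object issue roughly one recv() per byte inside readline(), which is the slowdown being reported.

import socket
import time

def time_readline(bufsize):
    # Fetch a resource whose body is one very long line and time readline().
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect(('localhost', 8000))      # assumed local test server
    sock.sendall('GET /long-line.txt HTTP/1.0\r\nHost: localhost\r\n\r\n')
    fp = sock.makefile('rb', bufsize)      # 0 = unbuffered, as in httplib
    start = time.time()
    while fp.readline().strip():           # skip the status line and headers
        pass
    line = fp.readline()                   # the long body line
    elapsed = time.time() - start
    fp.close()
    sock.close()
    return len(line), elapsed

print time_readline(0)       # unbuffered, as httplib does today
print time_readline(4096)    # buffered, as the report suggests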
msg8976 - Author: Nobody/Anonymous (nobody) Date: 2002-01-24 21:54
What platform?
--Guido (not logged in)
msg8977 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-01-25 14:12
I wonder why the author explicitly turned off buffering.
There probably was a reason? Without knowing why, we can't
just change it.
msg8978 - Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2002-03-14 23:32
Greg Stein originally wrote it; I'll ping him.
I suspect it might be because of HTTP pipelining: if multiple responses will be
returned over a socket, you probably can't use buffering, because the buffer
might consume the end of response #1 and the start of response #2.
msg8979 - Author: Greg Stein (gstein) * (Python committer) Date: 2002-03-18 07:05
Andrew is correct. The buffering was turned off
(specifically) so that the reading of one response will not
consume a portion of the next response.
Jeremy first found the over-reading problem a couple years
ago, and we solved the problem then. To read the thread:
http://mail.python.org/pipermail/python-dev/2000-June/004409.html
After the HTTP response's headers have been read, then it
can be determined whether the connection will be closed at
the end of the response, or whether it will stay open for
more requests to be performed. If it is going to be closed,
then it is possible to use buffering. Of course, that is
*after* the headers, so you'd actually need to do a second
dup/makefile and turn on buffering. This also means that you
wouldn't get the buffering benefits while reading headers.
It could be possible to redesign the connection/response
classes to keep a buffer in the connection object, but that
is quite a bit more involved. It also complicates the
passing of the socket to the response object in some cases.
I'm going to close this as "invalid" since the proposed fix
would break the code.
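
A rough sketch of the arrangement Greg outlines above (a paraphrase, not his code): read the headers through the unbuffered file object, and only once it is known that the connection will close after this response, switch to a second, buffered file object for the body. The helper name and its placement inside the response class are hypothetical.

def body_file(sock, will_close, bufsize=4096):
    # Hypothetical helper sketching the "second makefile" idea.
    if will_close:
        # The connection carries nothing after this response, so a read-ahead
        # buffer cannot swallow bytes belonging to a later response.
        return sock.makefile('rb', bufsize)
    # The connection stays open for further requests: remain unbuffered so
    # readline()/read() never consume part of the next response.
    return sock.makefile('rb', 0)

As noted above, the headers themselves would still be read unbuffered, so this only speeds up reading the body.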
msg65019 - Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2008-04-06 04:46
Well, this issue is still hurting performance; the most recent example came
from a developer of a download manager.
I suggest adding a buffer size argument to HTTPResponse.__init__
(defaulting to zero), along with docs that explain the problems that may
arise from using a buffer. If there's any chance this might be accepted,
I'll write a patch.
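
Against the snippet quoted in msg8975, the proposed interface would look roughly like this (a sketch of the suggestion, not a merged patch):

class HTTPResponse:
    def __init__(self, sock, debuglevel=0, bufsize=0):
        # bufsize defaults to 0 so existing callers keep the unbuffered,
        # pipelining-safe behaviour; a caller that knows the connection will
        # close after this response could pass e.g. 4096.
        self.fp = sock.makefile('rb', bufsize)
        self.debuglevel = debuglevel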
msg65021 - Author: Aren Olson (reacocard) Date: 2008-04-06 06:07
I can indeed confirm that this change creates a HUGE speed difference.
Using the code found at [1] with python2.5 and apache2 under Ubuntu,
changing the buffer size to 4096 improved the time needed to download
10MB from 15.5s to 1.78s, almost 9x faster. Repeat downloads of the same
file (meaning the server now has the file cached in memory) yield times
of 15.5s and 0.03s, a 500x improvement. When fetching from a server on
the local network, rather than from localhost, these times become 15.5s
and 0.9s in both cases, a 17x speedup. Real-world situations will likely
be a mix of these; however, it is safe to say the speed improvement will
be substantial. Adding an option to adjust the buffer size would be very
welcome, though the default value should still be zero, to avoid the
issues already mentioned.
[1] - http://pastebin.ca/973578 
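
The pastebin above is no longer reachable; a minimal stand-in for that kind of timing run might look like the following (Python 2.5-era urllib; the URL and the ~10 MB test file are assumptions):

import time
import urllib

URL = 'http://localhost/10mb.bin'       # hypothetical test file

start = time.time()
resp = urllib.urlopen(URL)
nbytes = 0
for line in iter(resp.readline, ''):    # readline-driven read, as in the issue
    nbytes += len(line)
resp.close()
print '%d bytes in %.2fs' % (nbytes, time.time() - start)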
msg65082 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-04-07 17:53
Please don't add to a closed issue that old. If you still have an issue
with this, please open a new issue. If you have a patch, kindly upload
it to the issue.
msg65122 - Author: Aren Olson (reacocard) Date: 2008-04-07 20:56
new issue: http://bugs.python.org/issue2576 
History
Date                 User        Action  Args
2022-04-10 16:04:55  admin       set     github: 35974
2008-04-07 20:56:23  reacocard   set     messages: + msg65122
2008-04-07 17:53:18  gvanrossum  set     messages: + msg65082
2008-04-06 06:07:37  reacocard   set     nosy: + reacocard; messages: + msg65021
2008-04-06 04:46:54  ajaksu2     set     nosy: + ajaksu2; messages: + msg65019; versions: + Python 2.6, - Python 2.2
2002-01-24 21:48:44  kbdavidson  create
