This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: urllib fails to read URL contents, urllib2 crashes Python
Type: crash Stage:
Components: None Versions: Python 2.5
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: cosoleto, ggenellina, gvanrossum, jjlee, josm, orsenthil, torriem
Priority: normal Keywords:

Created on 2007-09-26 07:55 by cosoleto, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name        Uploaded                Description
httplib.diff     josm, 2007-09-28 03:18
httplib.py.diff  josm, 2007-12-01 12:16
Messages (13)
msg56143 - Author: Francesco Cosoleto (cosoleto) Date: 2007-09-26 07:55
urllib fails to read URL contents, urllib2 crashes Python
Python version:
-------------------------
Python 2.5.1 (r251:54863, May 18 2007, 16:56:43) 
[GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)]
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
Python 2.4.4 (#2, Aug 16 2007, 00:34:54) 
[GCC 4.1.3 20070812 (prerelease) (Debian 4.1.2-15)] on linux2
-------------------------
Working with GNU wget:
-------------------------
$ wget -S http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud
--08:42:21-- http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud
 => `Thomas-Robert_Bugeaud'
Resolving www.recherche.fr... 88.191.11.214
Connecting to www.recherche.fr|88.191.11.214:80... connected.
HTTP request sent, awaiting response... 
 HTTP/1.1 200 OK
 Date: Wed, 26 Sep 2007 06:42:53 GMT
 Server: Apache/2.2.3 (Debian) PHP/5.2.3-0.dotdeb.1 with Suhosin-Patch
 X-Powered-By: PHP/5.2.3-0.dotdeb.1
 Keep-Alive: timeout=15, max=100
 Connection: Keep-Alive
 Transfer-Encoding: chunked
 Content-Type: text/html; charset=UTF-8
Length: unspecified [text/html]
 [ <=> ] 267,080 --.--K/s 
08:42:42 (14.11 KB/s) - "Thomas-Robert_Bugeaud" saved [267080]
-------------------------
Python:
-------------------------
>>> import urllib
>>> a = urllib.urlopen('http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud')
>>> c = a.read(1024*1024*2)
>>> len(c) 
1035220
>>> c[63000:64000]
'he.fr en page d\'accueil</a><br>\n <span>Partenaires :</span> <a 
href="http://www.cartes.fr/" target="_blank">Cartes\n 
postales</a>&nbsp; <a href="http://www.deux.fr/script/" 
target="_blank">Rencontres\n gratuites\n </a>&nbsp; <a 
href="http://www.new.fr/" target="_blank">Noms\n de domaine 
gratuits</a>&nbsp; <a href="http://www.netencyclo.com/" 
target="_blank">Encyclopedia</a>&nbsp;</p>\n <p style="text-
align:center;"><a href="http://www.futureobject.com/" 
target="_blank"><img src="http://www.recherche.fr/images/logo_fo.gif" 
border="0" height="25" width="96"></a></p>\n\n </p>\n </div>\n 
</div><!-- site -->\n</body>\n</html>\n\r\n\x00\x00\x00\x00\x00\x00\x00
\x00\x00[...omission...]\x00\x00\x00\x00'
-------------------------
As above, but with urllib2 module instead of urllib:
-------------------------
  File "/usr/lib/python2.5/socket.py", line 291, in read
    data = self._sock.recv(recv_size)
  File "/usr/lib/python2.5/httplib.py", line 509, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.5/httplib.py", line 548, in _read_chunked
    chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00[...omission...]\x00\x00\x00\x00\x00\x00\x00\
-------------------------
As above, but with Python 2.4:
-------------------------
>>> import urllib2
>>> a = urllib2.urlopen('http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud')
>>> 
>>> c = a.read(1024*1024*2)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/socket.py", line 295, in read
    data = self._sock.recv(recv_size)
  File "/usr/lib/python2.4/httplib.py", line 460, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.4/httplib.py", line 499, in _read_chunked
    chunk_left = int(line, 16)
ValueError: invalid literal for int(): 
-------------------------
Regards,
Francesco Cosoleto
msg56144 - Author: Gabriel Genellina (ggenellina) Date: 2007-09-26 14:07
This is a server bug. Internet Explorer 6 can't show the page either. 
The response is malformed; it uses chunked transfer, and RFC2616 
section 3.6.1 says "The chunk-size field is a string of hex digits 
indicating the size of the chunk. The chunked encoding is ended by any 
chunk whose size is zero[...]"
After the (first and only) chunk of around 63K, should come a 0-length 
chunk: a line with one or more digits "0" followed by CR+LF. But the 
server is not sending that last chunk, instead it sends lots of nul 
bytes, until eventually a CR,LF sequence arrives.
Neither IE nor Python can handle that (IE keeps requesting the page 
again and again). wget is apparently a lot more relaxed and decides 
that the first chunk is good enough. Perhaps urllib/urllib2 could 
handle the error and raise a more meaningful exception in this case, 
but just ignoring the error doesn't appear to be the right thing IMHO.
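The chunk framing described above can be sketched as a small decoder. This is a minimal illustration written for current Python 3, not the httplib code under discussion; `read_chunked_strict` and `MalformedChunkError` are hypothetical names, and the sample byte strings are made-up stand-ins for the server's response. It shows the "more meaningful exception" idea: a chunk-size line that is not hexadecimal is reported as such instead of surfacing as a bare `ValueError` from `int()`.

```python
import io


class MalformedChunkError(ValueError):
    """Hypothetical exception: the chunk-size line was not valid hex."""


def read_chunked_strict(fp):
    """Decode a chunked body from a binary file-like object (sketch)."""
    parts = []
    while True:
        line = fp.readline()
        # Drop optional chunk extensions (";name=value") and the CRLF.
        size_field = line.split(b";", 1)[0].strip()
        try:
            size = int(size_field, 16)
            if size < 0:
                raise ValueError("negative chunk size")
        except ValueError:
            raise MalformedChunkError(
                "expected a hex chunk size, got %r" % line[:20])
        if size == 0:
            break  # the zero-length chunk ends the body
        parts.append(fp.read(size))
        fp.read(2)  # discard the CRLF that follows the chunk data
    # Skip any trailer lines up to the final blank line (or end of stream).
    while True:
        line = fp.readline()
        if line in (b"\r\n", b"\n", b""):
            break
    return b"".join(parts)


# Well-formed: two chunks, then the terminating zero-length chunk.
good = io.BytesIO(b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n")
print(read_chunked_strict(good))

# Malformed, in the spirit of this report: NUL bytes where a size belongs.
bad = io.BytesIO(b"5\r\nhello\r\n" + b"\x00" * 16 + b"\r\n")
try:
    read_chunked_strict(bad)
except MalformedChunkError as exc:
    print("malformed response:", exc)
```

The strict variant discards the first chunk when the error is raised; whether that is acceptable is exactly the question debated in the rest of this thread.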
msg56147 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-09-26 16:55
Maybe the French internet is incompatible with the rest of the world? :-)
msg56151 - Author: John Smith (josm) Date: 2007-09-27 03:21
Firefox 2.0.0.7 and Safari 2.0.4 can show this page.
In my opinion, Python urllib should be more practical and
provide a way to read this kind of page.
"In general, an implementation must be conservative
in its sending behavior, and liberal in its receiving behavior."
[RFC 791 3.2]
msg56162 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-09-27 14:31
> In my opinion, Python urllib "should" be more practical and
> provide a way to read this kind of page. [quotes mine]
Totally agreed. Someone "should" submit a patch.
msg56183 - Author: John Smith (josm) Date: 2007-09-28 03:18
Attached a patch for this problem.
This one just ignores the buggy chunk-size and closes the connection.
As ggenellina said earlier, this might not be a good way
to fix this, but I could not come up with a better solution.
msg56506 - Author: Michael Torrie (torriem) Date: 2007-10-16 19:11
I had a situation where I was talking to a Sharp MFD printer. Their web
server apparently does not serve chunked data properly. However the
patch posted here put it in an infinite loop.
Somewhere around line 525 in the Python 2.4 version of httplib.py, I had
to make it look like this:
    while True:
        line = self.fp.readline()
        if line == '\r\n' or not line:
            break
I added "or not line" to the if statement. The blank line in the
chunked http was confusing the _last_chunk thing, but even when it was
set to zero, since there was no more data, this loop to eat up crlfs was
never ending.
Is this really a proper fix? 
I'm in favor of changing urllib2 to be less strict because, despite the
RFCs, we're stuck talking to all kinds of web servers (embedded ones in
particular) that simply can't easily be changed.
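A tolerant reader along the lines discussed here might look like the following. It is a sketch under stated assumptions, written for current Python 3 and not taken from the attached patches: `read_chunked_lenient` is a hypothetical name, and returning a `(body, complete)` pair is an illustrative design choice. It keeps whatever was decoded before a bad chunk-size line, and its trailer loop includes the `or not line` guard so that end-of-stream cannot make it spin forever.

```python
import io


def read_chunked_lenient(fp):
    """Decode a chunked body, tolerating a broken terminator (sketch).

    Returns (body, complete). complete is False when the stream ended
    early or a chunk-size line could not be parsed as hexadecimal.
    """
    parts = []
    while True:
        line = fp.readline()
        try:
            size = int(line.split(b";", 1)[0].strip(), 16)
            if size < 0:
                raise ValueError("negative chunk size")
        except ValueError:
            # Buggy size line (or EOF): keep what we have and stop.
            return b"".join(parts), False
        if size == 0:
            break
        data = fp.read(size)
        parts.append(data)
        if len(data) < size:
            return b"".join(parts), False  # stream ended mid-chunk
        fp.read(2)  # discard the CRLF after the chunk data
    # Eat trailer lines. "not line" covers EOF, where readline() keeps
    # returning an empty string and the loop would otherwise never end.
    while True:
        line = fp.readline()
        if line == b"\r\n" or not line:
            break
    return b"".join(parts), True


# NUL bytes instead of the final "0" chunk: the first chunk is kept.
broken = io.BytesIO(b"5\r\nhello\r\n" + b"\x00" * 16 + b"\r\n")
print(read_chunked_lenient(broken))

# Zero-length chunk present, but the stream ends without the blank line.
no_blank_line = io.BytesIO(b"5\r\nhello\r\n0\r\n")
print(read_chunked_lenient(no_blank_line))
```

Reporting `complete` separately lets the caller decide whether a truncated body is good enough, rather than silently hiding the server error.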
msg58042 - Author: John Smith (josm) Date: 2007-12-01 12:16
Included torriem's fix.
IMHO, there is no clear solution for this,
because it is caused by an HTTP server "bug",
and a bug is something you can't predict accurately...
msg58999 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2007-12-26 16:46
Irrespective of the patch, this issue is reproducible with the code in the
trunk for Python 2.6. Should we close this then?
Python 2.6a0 (trunk:59600M, Dec 25 2007, 13:54:34)
[GCC 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> import urllib
>>> url = "http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud"
>>> a = urllib.urlopen(url)
>>> b = urllib2.urlopen(url)
>>> c = a.read(1024 * 1024 * 2)
>>> c[63000:64000]
'UA-321207-2";\nurchinTracker();\n</script>\n <div id="introFin">\n <p>\nLe
contenu de cette page (Thomas-Robert Bugeaud) est un minuscule extrait de
l\'encyclopi\xc3\xa9die gratuite en ligne <a
href="http://fr.wikipedia.org">WIKIPEDIA</a>\nle webmaster de ce site n\'est
pas l\'auteur de cet article (Thomas-Robert Bugeaud). Vous pouvez retrouver
l\'original de cet article (Thomas-Robert Bugeaud) &agrave; <a
href="http://fr.wikipedia.org/wiki/Thomas-Robert_Bugeaud">cette adresse</a> et
la liste des auteurs <a
href="http://fr.wikipedia.org/w/index.php?title=Thomas-Robert_Bugeaud&amp;action=history">ici</a>\nVous
pouvez <a
href="http://fr.wikipedia.org/w/index.php?title=Thomas-Robert_Bugeaud&amp;action=edit">modifier
ou compl\xc3\xa9ter</a> cet article mais \xc3\xa9galement <a
href="http://fr.wikipedia.org/w/index.php?title=Discuter:Thomas-Robert_Bugeaud&amp;action=edit">discuter</a>
de son contenu (Thomas-Robert Bugeaud) sur le site de <a
href="http://fr.wikipedia.org">WIKIPEDIA France</a> - Contenu (Thomas-Robert B'
>>> c = b.read(1024 * 1024 * 2)
>>> c[63000:64000]
'acct = "UA-321207-2";\nurchinTracker();\n</script>\n <div id="introFin">\n
<p>\nLe contenu de cette page (Thomas-Robert Bugeaud) est un minuscule extrait
de l\'encyclopi\xc3\xa9die gratuite en ligne <a
href="http://fr.wikipedia.org">WIKIPEDIA</a>\nle webmaster de ce site n\'est
pas l\'auteur de cet article (Thomas-Robert Bugeaud). Vous pouvez retrouver
l\'original de cet article (Thomas-Robert Bugeaud) &agrave; <a
href="http://fr.wikipedia.org/wiki/Thomas-Robert_Bugeaud">cette adresse</a> et
la liste des auteurs <a
href="http://fr.wikipedia.org/w/index.php?title=Thomas-Robert_Bugeaud&amp;action=history">ici</a>\nVous
pouvez <a
href="http://fr.wikipedia.org/w/index.php?title=Thomas-Robert_Bugeaud&amp;action=edit">modifier
ou compl\xc3\xa9ter</a> cet article mais \xc3\xa9galement <a
href="http://fr.wikipedia.org/w/index.php?title=Discuter:Thomas-Robert_Bugeaud&amp;action=edit">discuter</a>
de son contenu (Thomas-Robert Bugeaud) sur le site de <a
href="http://fr.wikipedia.org">WIKIPEDIA France</a> - Contenu (Thomas-'
>>>
msg59000 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2007-12-26 16:49
> Senthil added the comment:
> 
> Irrespective of the patch, this issue is reproducible with the code in the
> trunk for Python 2.6. Should we close this then?
Sorry, I meant to say "NOT reproducible".
msg59117 - Author: Francesco Cosoleto (cosoleto) Date: 2008-01-03 01:15
Sorry, but I don't understand the reason for closing this issue with 
resolution "wont fix". The problem was reproducible and its cause was 
explained by several developers. If the problem has been resolved, then 
please change the "resolution" field to "fixed"; otherwise a patch request 
is still pending (see msg56162). No? :-( Of course, predictably, the 
bug is no longer reproducible, even with the previous Python version: 
$ wget -c http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud
[...omission...]
02:08:34 (4.28 KB/s) - "Thomas-Robert_Bugeaud" saved [65107] 
----
Python 2.5.1 (r251:54863, May 18 2007, 16:56:43) 
>>> url = "http://www.recherche.fr/encyclopedie/Thomas-Robert_Bugeaud"
>>> a = urllib.urlopen(url) ; c = a.read(1024 * 1024 * 2)
>>> len(c)
65169
msg59118 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-01-03 01:50
I'm just following the last post's suggestion "Should we close this then?"
My message (somebody "should" submit a patch) was sarcastic --- it was
in reference to the comment that Python "should" be more practical.
Since no patch was applied, I don't know why "won't fix" isn't a
perfectly adequate description of the reason for closure.
If you want me to reopen this, please submit a patch.
msg76743 - Author: John J Lee (jjlee) Date: 2008-12-02 14:13
This is fixed in trunk r61034 by issue #900744. Please use that issue
for any discussion re whether this should be fixed in 2.5.
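For readers on current Python 3, the sketch below shows one way to recover the partial body from a response like the one in this report. It rests on assumptions not stated in this thread: that `http.client` reports an unparsable chunk size as `http.client.IncompleteRead` with the already-decoded data in its `partial` attribute, and whether that matches exactly what r61034 changed is not confirmed here. `FakeSocket` and `read_tolerantly` are hypothetical helpers, and the raw bytes are a made-up stand-in for the server's output.

```python
import http.client
import io


class FakeSocket:
    """Hypothetical stand-in for a socket; HTTPResponse only calls makefile()."""

    def __init__(self, raw):
        self._raw = raw

    def makefile(self, mode, *args, **kwargs):
        return io.BytesIO(self._raw)


def read_tolerantly(raw):
    """Parse a raw HTTP response; return (body, complete)."""
    resp = http.client.HTTPResponse(FakeSocket(raw))
    resp.begin()  # read the status line and headers
    try:
        return resp.read(), True
    except http.client.IncompleteRead as exc:
        # exc.partial holds the bytes decoded before the failure.
        return exc.partial, False


HEADERS = b"HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\n\r\n"

# One good chunk, then NUL bytes where the final "0" chunk should be.
broken = HEADERS + b"5\r\nhello\r\n" + b"\x00" * 32 + b"\r\n"
print(read_tolerantly(broken))

# The same body, properly terminated.
well_formed = HEADERS + b"5\r\nhello\r\n0\r\n\r\n"
print(read_tolerantly(well_formed))
```

With real traffic the same `try`/`except` would wrap the `read()` call on a response obtained from `urllib.request.urlopen` or `http.client.HTTPConnection.getresponse`.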
History
Date                 User        Action  Args
2022-04-11 14:56:27  admin       set     github: 45546
2008-12-02 14:13:43  jjlee       set     nosy: + jjlee; messages: + msg76743
2008-01-03 01:50:08  gvanrossum  set     messages: + msg59118
2008-01-03 01:15:02  cosoleto    set     messages: + msg59117
2008-01-02 23:16:53  gvanrossum  set     status: open -> closed; resolution: wont fix
2007-12-26 16:49:23  orsenthil   set     messages: + msg59000
2007-12-26 16:46:36  orsenthil   set     nosy: + orsenthil; messages: + msg58999
2007-12-01 12:16:34  josm        set     files: + httplib.py.diff; messages: + msg58042
2007-10-16 19:11:29  torriem     set     nosy: + torriem; messages: + msg56506
2007-09-28 03:18:04  josm        set     files: + httplib.diff; messages: + msg56183
2007-09-27 14:31:23  gvanrossum  set     messages: + msg56162
2007-09-27 03:21:30  josm        set     nosy: + josm; messages: + msg56151
2007-09-26 16:55:10  gvanrossum  set     nosy: + gvanrossum; messages: + msg56147
2007-09-26 14:07:54  ggenellina  set     nosy: + ggenellina; messages: + msg56144
2007-09-26 07:55:32  cosoleto    create
