
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: urllib2 can't handle http://www.wikispaces.com
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.0, Python 2.6, Python 2.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: facundobatista Nosy List: facundobatista, gregory.p.smith, jjlee, orsenthil, weijie90
Priority: low Keywords: patch

Created on 2008-03-23 14:41 by weijie90, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
issue2464-facundo.diff facundobatista, 2008-08-16 19:30
issue2464-py26-FINAL.diff orsenthil, 2008-08-17 02:42
issue2464-PATCH1.diff orsenthil, 2008-08-17 02:42
issue2463-py3k.diff orsenthil, 2008-08-17 02:56
Messages (18)
msg64363 - Author: Koh Wei Jie (weijie90) Date: 2008-03-23 14:41
Try the following code:
import urllib2
gmail = urllib2.urlopen("https://www.gmail.com").read()
wikispaces = urllib2.urlopen("http://www.wikispaces.com").read()
Getting the HTML over HTTPS from gmail.com works, but not over HTTP from
wikispaces. Here's the traceback:
>>> wikispaces = urllib2.urlopen("http://www.wikispaces.com").read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/urllib2.py", line 121, in urlopen
    return _opener.open(url, data)
  File "/usr/lib/python2.5/urllib2.py", line 380, in open
    response = meth(req, response)
  File "/usr/lib/python2.5/urllib2.py", line 491, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.5/urllib2.py", line 412, in error
    result = self._call_chain(*args)
  File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.5/urllib2.py", line 575, in http_error_302
    return self.parent.open(new)
  File "/usr/lib/python2.5/urllib2.py", line 380, in open
    response = meth(req, response)
  File "/usr/lib/python2.5/urllib2.py", line 491, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.5/urllib2.py", line 412, in error
    result = self._call_chain(*args)
  File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.5/urllib2.py", line 575, in http_error_302
    return self.parent.open(new)
  File "/usr/lib/python2.5/urllib2.py", line 374, in open
    response = self._open(req, data)
  File "/usr/lib/python2.5/urllib2.py", line 392, in _open
    '_open', req)
  File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.5/urllib2.py", line 1100, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.5/urllib2.py", line 1075, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error (104, 'Connection reset by peer')>
Note the two 302 redirects.
I tried accessing wikispaces.com with SSL turned off in Firefox
2.0.0.12, which didn't work because SSL was required, perhaps somewhere
within the redirects that wikispaces uses.
Why doesn't urllib2 handle the "hidden" SSL properly? (Not to be rude,
but httplib2 works.)
Thanks!
WJ
msg64676 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) Date: 2008-03-29 03:13
The problem does not appear to have anything to do with SSL. The
problem is that the chain of HTTP requests goes:
GET -> 302 -> 302 -> 301
On the final 301, urllib2's internal state is messed up such that by the
time it's in the http_error_302 method, it believes the Location header
contains the following:
'http://www.wikispaces.com/\x00/?responseToken=481aec3249f429316459e01c00b7e522'
The \x00 and everything after it should not be there, and is not there if
you look at what is sent over the socket itself. The ?responseToken=xxx
value is leftover from the previous 302 response. No idea where the
\x00 came from yet. I'm debugging...
msg64677 - Author: Koh Wei Jie (weijie90) Date: 2008-03-29 03:41
Please take your time, because this bug isn't critical. Thanks!
msg64679 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) Date: 2008-03-29 03:48
Instrumenting the code and looking closer at the tcpdump, it's true:
wikispaces.com is returning an invalid Location: header with a null byte
in the middle of it.
The "fix" on our end should be to handle such garbage from broken
web servers more gracefully. Other clients seem to treat the null as the
end of the string, or the end of that header.
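The truncate-at-null behavior described above can be sketched as follows. This is an illustrative helper written for this note, not the actual httplib code; the function name and the token value in the example are made up:

```python
def truncate_at_nul(header_value):
    """Treat an embedded NUL byte as a value terminator, the way
    other HTTP clients handle broken servers (see this issue)."""
    nul = header_value.find('\x00')
    return header_value if nul == -1 else header_value[:nul]

# A Location header corrupted the way wikispaces.com sent it
# (the token here is a placeholder, not the real one):
broken = 'http://www.wikispaces.com/\x00/?responseToken=deadbeef'
print(truncate_at_nul(broken))  # -> http://www.wikispaces.com/
```

Note that, as msg64680 below observes, truncation alone is not enough for this particular site, since the leftover "/" still yields an invalid URL.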
msg64680 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) Date: 2008-03-29 04:19
I'm not sure what the best solution for this is. If I truncate the
header values at a \x00 character, it ends in an infinite redirect loop
(which urllib2 detects and raises on). If I simply remove all \x00
characters, the resulting URL is not accepted by wikispaces.com due to
having an extra / in it.
Verdict: wikispaces.com is broken.
urllib2 could do better. wget and Firefox deal with it properly, but
I'll leave deciding which patch to use up to someone who cares about
handling broken sites.
Patch to implement either behavior of dealing with nulls where they
shouldn't be:
Index: Lib/httplib.py
===================================================================
--- Lib/httplib.py (revision 62033)
+++ Lib/httplib.py (working copy)
@@ -291,9 +291,18 @@
                 break
             headerseen = self.isheader(line)
             if headerseen:
+                # Some bad web servers reply with headers with a \x00 null
+                # embedded in the value. Other http clients deal with
+                # this by treating it as a value terminator, ignoring the
+                # rest so we will too. http://bugs.python.org/issue2464.
+                if '\x00' in line:
+                    line = line[:line.find('\x00')]
+                # if you want to just remove nulls instead use this:
+                #line = line.replace('\x00', '')
                 # It's a legal header line, save it.
                 hlist.append(line)
-                self.addheader(headerseen, line[len(headerseen)+1:].strip())
+                value = line[len(headerseen)+1:].strip()
+                self.addheader(headerseen, value)
                 continue
             else:
                 # It's not a header line; throw it back and stop here.
msg66889 - Author: Senthil Kumaran (orsenthil) (Python committer) Date: 2008-05-16 03:07
The issue is not just with the null character. If you observe now, the
redirection chain is 302-302-200 and there is no null character.
However, urllib2 is still unable to handle the multiple redirections
properly. (IIRC, there is a portion of code to handle multiple
redirections and exit on an infinite loop.)
>>> url = "http://www.wikispaces.com"
>>> opened = urllib.urlopen(url)
>>> print opened.geturl()
http://www.wikispaces.com?responseToken=344289da354a29c67d48928dbe72042a
>>> print opened.read()
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>nginx/0.6.30</center>
</body>
</html>
Needs a relook, IMO.
msg71231 - Author: Facundo Batista (facundobatista) (Python committer) Date: 2008-08-16 19:30
Senthil:
Look at the URL that the server returned in the second redirect:
http://www.wikispaces.com?responseToken=ee3fca88a9b0dc865152d8a9e5b6138d
See that the "?" appears without a path between the host and it.
Check item 3.2.2 of RFC 2616; it says that an HTTP URL should be:
 http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]
So we should fix the URL that the server returned. Guess what: if we
put in a "/" (as the RFC requires), everything works ok.
The patch I attach here does that. All tests pass.
What do you think?
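The transformation described above (insert the "/" that RFC 2616 section 3.2.2 makes mandatory when the path is empty) can be sketched with the modern urllib.parse module. The helper name and the token value are made up for illustration; they are not from the attached patch:

```python
from urllib.parse import urlsplit, urlunsplit

def ensure_abs_path(url):
    """Insert a "/" between the authority and the query when the
    path component is empty, per RFC 2616 section 3.2.2."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    if not path:
        path = '/'
    return urlunsplit((scheme, netloc, path, query, fragment))

broken = 'http://www.wikispaces.com?responseToken=abc123'
print(ensure_abs_path(broken))
# -> http://www.wikispaces.com/?responseToken=abc123
```

A URL that already has a path is passed through unchanged, so the fix is safe to apply unconditionally to redirect targets.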
msg71255 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) Date: 2008-08-17 01:20
looks good to me.
msg71259 - Author: Senthil Kumaran (orsenthil) (Python committer) Date: 2008-08-17 02:42
Ah, that was a simple fix. :) I had very much overlooked the problem,
even after being given so many hints at the web-sig.
I have some comments on the patch, Facundo.
1) I don't think it is a good idea to include that portion in the
http_error_302 method. That makes the fix "very" specific to "this"
issue only.
Another point is that fixing broken URLs should not be done in urllib2;
urlparse would be a better place.
So I came up with an approach wherein urllib2 does unparse(parse) of
the URL, and the parse methods fix the URL if it is broken. (See
attached issue2464-PATCH1.diff)
But if we handle it in the urlparse methods, then we are much more
susceptible to breaking RFC conformance and a lot of tests, which
is not a good idea.
So I introduced a fix_broken() method in urlparse and called it to solve
the issue, using the same logic as yours (issue2464-py26-FINAL.diff).
With a fix_broken() method in urlparse, we will have better control
whenever we want to implement a behavior which is RFC non-conforming but
implemented widely by browsers and clients.
All tests pass with issue2464-py26-FINAL.diff.
Comments, please?
msg71260 - Author: Senthil Kumaran (orsenthil) (Python committer) Date: 2008-08-17 02:56
Patch for py3k, but please test this before applying.
msg71261 - Author: Facundo Batista (facundobatista) (Python committer) Date: 2008-08-17 03:39
Senthil: I don't like that.
Creating a public method called "fix_broken", introducing new behaviours
now in beta, and actually not fixing the URL in every broken case (just
the path, when it's missing): it's way too much for this fix.
I committed the change I proposed. Maybe in the future we will have a
"generic URL fixing" function; now is not the moment.
msg71295 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) Date: 2008-08-17 22:38
I was pondering whether it should go in urlparse instead. If it did, I
think it should be part of urlparse.urlunparse, to ensure that there is
always a trailing slash after the host:port regardless of what the
inputs are.
Anyway, agreed, this fixes this specific bug. Should it be backported
to release25-maint?
msg71303 - Author: Facundo Batista (facundobatista) (Python committer) Date: 2008-08-18 00:08
Maybe we can put it in urlunparse... do you all agree with these test cases?
def test_alwayspath(self):
    u = urlparse.urlparse("http://netloc/path;params?query#fragment")
    self.assertEqual(urlparse.urlunparse(u),
                     "http://netloc/path;params?query#fragment")
    u = urlparse.urlparse("http://netloc?query#fragment")
    self.assertEqual(urlparse.urlunparse(u),
                     "http://netloc/?query#fragment")
    u = urlparse.urlparse("http://netloc#fragment")
    self.assertEqual(urlparse.urlunparse(u), "http://netloc/#fragment")
Maybe we could backport this more general fix...
msg71304 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) Date: 2008-08-18 00:51
That test case looks good to me for 2.6 and 3.0. Also add a note to the
documentation with a versionchanged 2.6 entry about urlunparse always
ensuring there is a / between the netloc and the rest of the URL.
I would not backport the more general urlunparse behavior change to 2.5.
msg71990 - Author: Facundo Batista (facundobatista) (Python committer) Date: 2008-08-26 17:51
Gregory... I tried to fill in the path in urlunparse, and other
functions that use it started to fail.
As we're so close to the final releases, I'll leave this as it is right
now; what is there actually fixed the bug...
msg72024 - Author: Senthil Kumaran (orsenthil) (Python committer) Date: 2008-08-27 12:50
That was the reason for making fix_broken in urlparse in my patch, Facundo.
I had thought it should be handled in the urlparse module; if we make
changes in urlparse.urlunparse/urlparse.urlparse, then we are stepping
into an area which will break a lot of tests.
I am kind of +0 on the current fix in urllib2. Should we think/plan
for something in urlparse, akin to fix_broken?
msg76736 - Author: John J Lee (jjlee) Date: 2008-12-02 13:00
This fix was applied in the wrong place.
URI path components, and HTTP URI path components in particular, *can*
be empty. See RFC 3986. So the comment in the code that was inserted
with the fix for this bug that says "possibly malformed" is incorrect,
and should instead just refer to section 3.2.2 of RFC 2616. Also,
because 3.2.2 says "If the abs_path is not present in the URL, it MUST
be given as "/" when used as a Request-URI for a resource (section
5.1.2)", it seems clear that this transformation (add the slash if
there's no path component) should always happen when retrieving URIs,
regardless of where the URI came from -- not only for redirects.
Note that RFC 2616 incorrectly claims to refer to the definition of
abs_path from RFC 2396. The fact that it's incorrect is made clear in
2616 itself, in section 3.2.3, when it says that abs_path can be empty.
In any case, RFC 2396 is obsoleted by RFC 3986, which is clear on this
issue and reflects actual usage of URIs. URIs like http://python.org
and http://python.org?spam=eggs have been in widespread use for a long
time, and typing the latter URL into Firefox (3.0.2) confirms that
what's actually sent is "/?spam", whereas urllib2 still sends "?spam".
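The point about what is actually sent on the wire can be sketched as a Request-URI builder. This is an illustrative helper, not urllib2's actual code; it simply applies RFC 2616 section 5.1.2, which says an absent abs_path MUST be sent as "/":

```python
from urllib.parse import urlsplit

def request_uri(url):
    """Build the origin-form Request-URI for an HTTP request,
    substituting "/" when the path component is absent."""
    parts = urlsplit(url)
    path = parts.path if parts.path else '/'
    return path + ('?' + parts.query if parts.query else '')

print(request_uri('http://python.org?spam=eggs'))  # -> /?spam=eggs
print(request_uri('http://python.org'))            # -> /
```

Applying this at request-building time, rather than only in the redirect handler, is exactly the "regardless of where the URI came from" behavior argued for above.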
No test was added with this fix, which makes it unnecessarily hard to
work out what exactly the fix was supposed to fix. For the record, this
is the sequence of redirections before the fix was applied (showing base
URI + redirect URI reference --> redirect URI):
'http://www.wikispaces.com' +
'https://session.wikispaces.com/session/auth?authToken=token' -->
'https://session.wikispaces.com/session/auth?authToken=token'
'https://session.wikispaces.com/session/auth?authToken=token' +
'http://www.wikispaces.com?responseToken=token' -->
'http://www.wikispaces.com?responseToken=token'
and after the fix was applied:
'http://www.wikispaces.com' +
'https://session.wikispaces.com/session/auth?authToken=token' -->
'https://session.wikispaces.com/session/auth?authToken=token'
'https://session.wikispaces.com/session/auth?authToken=token' +
'http://www.wikispaces.com/?responseToken=token' -->
'http://www.wikispaces.com/?responseToken=token'
'http://www.wikispaces.com/?responseToken=token' +
'http://www.wikispaces.com/' --> 'http://www.wikispaces.com/'
msg76779 - Author: John J Lee (jjlee) Date: 2008-12-02 20:47
I've raised #4493 about the issue I raised in my previous comment.
History
Date User Action Args
2022-04-11 14:56:32  admin            set   github: 46716
2009-02-12 19:14:38  ajaksu2          link  issue4493 dependencies
2008-12-02 20:47:14  jjlee            set   messages: + msg76779
2008-12-02 13:00:14  jjlee            set   nosy: + jjlee; messages: + msg76736
2008-08-27 12:50:15  orsenthil        set   messages: + msg72024
2008-08-26 17:51:47  facundobatista   set   messages: + msg71990
2008-08-18 00:51:46  gregory.p.smith  set   messages: + msg71304
2008-08-18 00:08:53  facundobatista   set   messages: + msg71303
2008-08-17 22:38:06  gregory.p.smith  set   messages: + msg71295
2008-08-17 03:39:08  facundobatista   set   status: open -> closed; resolution: fixed; messages: + msg71261
2008-08-17 02:56:26  orsenthil        set   files: + issue2463-py3k.diff; messages: + msg71260
2008-08-17 02:42:36  orsenthil        set   files: + issue2464-PATCH1.diff
2008-08-17 02:42:19  orsenthil        set   files: + issue2464-py26-FINAL.diff; messages: + msg71259
2008-08-17 01:20:47  gregory.p.smith  set   messages: + msg71255
2008-08-16 19:30:08  facundobatista   set   files: + issue2464-facundo.diff; keywords: + patch; messages: + msg71231
2008-07-03 18:16:02  facundobatista   set   assignee: facundobatista; nosy: + facundobatista
2008-05-16 03:07:50  orsenthil        set   nosy: + orsenthil; messages: + msg66889
2008-03-29 04:19:08  gregory.p.smith  set   priority: normal -> low; assignee: gregory.p.smith -> (no value); messages: + msg64680
2008-03-29 03:48:40  gregory.p.smith  set   messages: + msg64679
2008-03-29 03:41:53  weijie90         set   messages: + msg64677
2008-03-29 03:13:38  gregory.p.smith  set   priority: normal; assignee: gregory.p.smith; messages: + msg64676; nosy: + gregory.p.smith; versions: + Python 2.6, Python 3.0
2008-03-23 14:41:09  weijie90         create
