Created on 2013-02-16 10:52 by Mi.Zou, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Files | |
|---|---|
| File name | Uploaded |
| patch_to_urllib_handle_non_ascii_char_in_url.txt | vajrasky, 2013-07-18 10:29 |
| issue17214.patch | christian.heimes, 2013-07-19 09:23 |
| issue17214.redirect.v2.patch | martin.panter, 2015-10-30 23:50 |
| Messages (17) | |||
|---|---|---|---|
| msg182216 | Author: Mi Zou (Mi.Zou) | Date: 2013-02-16 10:52 | |
While urllib is following a redirect (302),
urllib.client.HTTPConnection.putrequest raises an error:
#----------------------------------------------------------
File "D:\Program Files\Python32\lib\http\client.py", line 1004, in _send_request
self.putrequest(method, url, **skips)
File "D:\Program Files\Python32\lib\http\client.py", line 868, in putrequest
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 108-111: ordinal not in range(128)
#----------------------------------------------------------
In the source code, at line 811, I found:
def putrequest(self, method, url, skip_host=0,skip_accept_encoding=0)...
The url argument may be a Unicode string, and it is not quoted.
I think we should replace:
request = '%s %s %s' % (method, url, self._http_vsn_str)
with:
import urllib.parse
request = '%s %s %s' % (method, urllib.parse.quote(url), self._http_vsn_str)
|
|||
| msg182218 | Author: Mi Zou (Mi.Zou) | Date: 2013-02-16 12:30 | |
While urllib is following a redirect (302),
http.client.HTTPConnection.putrequest raises an error:
#----------------------------------------------------------
...
File "D:\Program Files\Python32\lib\http\client.py", line 1004, in _send_request
self.putrequest(method, url, **skips)
File "D:\Program Files\Python32\lib\http\client.py", line 868, in putrequest
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 108-111: ordinal not in range(128)
#----------------------------------------------------------
In the source code, at line 811, I found:
def putrequest(self, method, url, skip_host=0,skip_accept_en...)
...
The url argument may be a Unicode string, and it is not quoted.
----------------------------note----------------------------------------
In my case:
...
purl = "http://bbs.dospy.com/1111258attachdown.php?aid=14361277&bbsid=349"
req = urllib.request.Request(purl, headers=headers)
response = urllib.request.urlopen(req)
...
The HTTP server then redirects me to a file download URL,
and that URL contains some Chinese characters.
I printed out the url argument:
/f/1ba1f70606223af2aa5c3aeff6c6a46a/511f7b4c/day_111015/20111015_5949e996881b2e28403d26Ch6dOfj6LZ.rar/p/ÒâÁÖ03-08.part1.rar
|
|||
| msg182679 | Author: Terry J. Reedy (terry.reedy) * (Python committer) | Date: 2013-02-22 18:16 | |
Please give us
1. the exact Python version used. 3.2.3? or something earlier?
2. A minimal but complete example that we can run. What is 'headers'?
3. The complete traceback, not just the last two entries.
4. The result of running with the newer 3.3.0, if you possibly can. Perhaps the problem has already been fixed.
While line numbers have changed, even in 3.2.4 in repository, 3.2-3.4 all have
request = '%s %s %s' % (method, url, self._http_vsn_str)
# Non-ASCII characters should have been eliminated earlier
self._output(request.encode('ascii'))
Since there is nothing earlier in the function that would eliminate non-ascii, there must be an assumption about what happens earlier in the call chain. That might have already been fixed, which is why we need an example to test.
|
|||
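The failure discussed above can be reproduced without any network access. What follows is only a minimal sketch, not the http.client code itself: `build_request_line` is a hypothetical helper that mirrors the two quoted lines, and the non-ASCII sample path is made up for illustration.

```python
def build_request_line(method, url, http_vsn_str="HTTP/1.1"):
    """Hypothetical helper mirroring the two lines quoted from putrequest()."""
    request = '%s %s %s' % (method, url, http_vsn_str)
    # Same assumption as the quoted code: the request line is pure ASCII.
    return request.encode('ascii')


# An ASCII-only selector encodes without trouble.
print(build_request_line("GET", "/libon/search/isbn/3499155443"))

# A selector with a non-ASCII character (made-up sample) fails in the same
# way as the tracebacks in this issue.
try:
    build_request_line("GET", "/dettaglio/K\u00e4fer")
except UnicodeEncodeError as exc:
    print("UnicodeEncodeError:", exc.reason)
```

Nothing in the helper removes non-ASCII characters, so the caller has to guarantee an ASCII-only selector before this point.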
| msg193183 | Author: Lars Ivarsson (LDTech) | Date: 2013-07-16 16:43 | |
This problem still exists in Python 3.3.2. The following code gives you an example:
import urllib.request
url = "http://www.libon.it/libon/search/isbn/3499155443"
req = urllib.request.Request(url)
response = urllib.request.urlopen(req, timeout=30)
the_page = response.read().decode('utf-8')
print(the_page)
Traceback (most recent call last):
  File "C:\X\webpy.py", line 4, in <module>
    response = urllib.request.urlopen(req, timeout=30)
  File "C:\Python33\lib\urllib\request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\lib\urllib\request.py", line 475, in open
    response = meth(req, response)
  File "C:\Python33\lib\urllib\request.py", line 587, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python33\lib\urllib\request.py", line 507, in error
    result = self._call_chain(*args)
  File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "C:\Python33\lib\urllib\request.py", line 692, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "C:\Python33\lib\urllib\request.py", line 469, in open
    response = self._open(req, data)
  File "C:\Python33\lib\urllib\request.py", line 487, in _open
    '_open', req)
  File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "C:\Python33\lib\urllib\request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Python33\lib\urllib\request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Python33\lib\http\client.py", line 1061, in request
    self._send_request(method, url, body, headers)
  File "C:\Python33\lib\http\client.py", line 1089, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Python33\lib\http\client.py", line 953, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 78-79: ordinal not in range(128)
|
|||
| msg193279 | Author: Vajrasky Kok (vajrasky) * | Date: 2013-07-18 10:29 | |
The script for demonstrating the bug can be simplified to:
-----------------------------------------------------------------------
import urllib.request
url = "http://www.libon.it/ricerca/7817940/3499155443/dettaglio/3102314/Onkel-Oswald-und-der-Sudan-Käfer/order/date_desc"
req = urllib.request.Request(url)
response = urllib.request.urlopen(req, timeout=30)
the_page = response.read().decode('utf-8')
print(the_page)
-----------------------------------------------------------------------
I have attached a simple patch to solve this problem. The question is whether we should fix this in urllib at all because, strictly speaking, a URL should contain ASCII characters only. But if Firefox can open this URL, why not urllib? I will think about this, and if I (or other people) decide that urllib should handle URLs containing non-ASCII characters, I will add a unit test. Until then, people can use a third-party package, the requests package from http://docs.python-requests.org/en/latest/
----------------------------------------------------------------
import requests
r = requests.get("http://www.libon.it/ricerca/7817940/3499155443/dettaglio/3102314/Onkel-Oswald-und-der-Sudan-Käfer/order/date_desc")
print(r.text)
----------------------------------------------------------------
|
|||
| msg193286 | Author: Christian Heimes (christian.heimes) * (Python committer) | Date: 2013-07-18 12:32 | |
The problem may not be a bug but a deliberate design choice. urllib is rather low level and doesn't implement some browser magic. Browsers handle stuff like 'ä' -> '%C3%A4', ' ' -> '%20' or IDNA, but urllib doesn't. I always saw it as my responsibility to quote and encode everything myself. Higher-level APIs such as requests are free to implement browser magic. Contrary to common belief, a URL with an umlaut or space is *not* a valid URI.
From http://docs.python.org/3/library/urllib.request.html#urllib.request.Request
> url should be a string containing a valid URL.
I suggest that this ticket be closed as "won't fix".
|
|||
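The "quote and encode everything myself" approach can be sketched with standard-library calls only. This is an illustration under assumptions, not a recommended recipe: `quote_path` and `encode_host` are hypothetical helper names, and the sample path and host are invented.

```python
from urllib.parse import quote


def quote_path(path):
    """Hypothetical helper: percent-encode a URL path as UTF-8.

    quote() leaves '/' alone by default, so the path structure survives.
    """
    return quote(path)


def encode_host(host):
    """Hypothetical helper: convert a host name to its ASCII (IDNA) form."""
    return host.encode("idna").decode("ascii")


# 'ä' becomes '%C3%A4' and ' ' becomes '%20', as browsers do.
print(quote_path("/dettaglio/Onkel Oswald und der Sudan-K\u00e4fer"))

# Host names use IDNA rather than percent-encoding.
print(encode_host("b\u00fccher.example"))
```

Paths and host names need different treatment, which is part of why a single automatic fix inside urllib is not obvious.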
| msg193306 | Author: Vajrasky Kok (vajrasky) * | Date: 2013-07-18 15:10 | |
I have no problem if this ticket is classified as "won't fix". I am writing this for the confused souls who want to use urllib to access a URL containing non-ASCII characters:
import urllib.request
from urllib.parse import quote
url = "http://www.libon.it/ricerca/7817940/3499155443/dettaglio/3102314/Onkel-Oswald-und-der-Sudan-Käfer/order/date_desc"
req = urllib.request.Request(url)
try:
    req.selector.encode('ascii')
except UnicodeEncodeError:
    req.selector = quote(req.selector)
response = urllib.request.urlopen(req, timeout=30)
the_page = response.read().decode('utf-8')
print(the_page)
|
|||
| msg193312 | Author: Lars Ivarsson (LDTech) | Date: 2013-07-18 16:41 | |
The problem isn't the originally requested URL, which is legitimate. The problem appears after the 302 redirect, when a new (malformed) URL is received from the server. There needs to be some kind of check of the validity of that second URL and, preferably, a URLError returned if something is wrong. |
|||
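One way to picture the check requested here is a custom redirect handler. This is a rough sketch under stated assumptions, not what was eventually committed: `StrictRedirectHandler` is a hypothetical class, and it relies only on the documented `HTTPRedirectHandler.redirect_request()` hook.

```python
import urllib.error
import urllib.request


class StrictRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hypothetical handler: reject redirect targets that are not pure ASCII."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        try:
            newurl.encode('ascii')
        except UnicodeEncodeError:
            # Report both URLs so the user can see where the bad one came from.
            raise urllib.error.URLError(
                "redirect from %s to a URL with non-ASCII characters: %r"
                % (req.full_url, newurl)) from None
        return super().redirect_request(req, fp, code, msg, headers, newurl)


# Passing the subclass to build_opener() replaces the default redirect handler.
opener = urllib.request.build_opener(StrictRedirectHandler)
```

On interpreters that already percent-encode the redirect target before calling `redirect_request()`, the check would not be expected to fire.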
| msg193346 | Author: Vajrasky Kok (vajrasky) * | Date: 2013-07-19 04:45 | |
Lars, I see. For the uninitiated, the issue is that the original URL (containing only ASCII characters) redirects to a URL containing non-ASCII characters, which upsets urllib. To handle that situation, you can do something like this:
---------------------
import urllib.request
from urllib.parse import quote
url = "http://www.libon.it/libon/search/isbn/3499155443"
req = urllib.request.Request(url)
req.selector = quote(req.selector)
response = urllib.request.urlopen(req, timeout=30)
the_page = response.read().decode('utf-8')
print(the_page)
---------------------
I admit that this code is clunky and not Pythonic. I also believe that the Python standard library should have a module that makes it easy to access URLs containing non-ASCII characters. At the very least, maybe we can give a proper error message. Something like this would be nice:
"The url is not valid and contains non-ascii character: http://www.libon.it/ricerca/7817940/3499155443/dettaglio/3102314/Onkel-Oswald-und-der-Sudan-Käfer/order/date_desc. This url is redirected from this url: http://www.libon.it/libon/search/isbn/3499155443"
Otherwise users can be confused: they thought they gave urllib an ASCII-only URL (http://www.libon.it/libon/search/isbn/3499155443), so why did they get an encoding error? What do you say, Christian?
|
|||
| msg193352 | Author: Christian Heimes (christian.heimes) * (Python committer) | Date: 2013-07-19 09:23 | |
Something else is going on here. A valid server never returns a URL with non-ASCII chars. Your test server does the right thing, too:
$ LC_ALL=C wget http://www.libon.it/libon/search/isbn/3499155443
--2013-07-19 11:01:54-- http://www.libon.it/libon/search/isbn/3499155443
Resolving www.libon.it (www.libon.it)... 83.103.59.131
Connecting to www.libon.it (www.libon.it)|83.103.59.131|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: http://www.libon.it/ricerca/7818684/3499155443/dettaglio/3102314/Onkel-Oswald-und-der-Sudan-K%C3%A4fer/order/date_desc [following]
Incomplete or invalid multibyte sequence encountered
--2013-07-19 11:01:54-- http://www.libon.it/ricerca/7818684/3499155443/dettaglio/3102314/Onkel-Oswald-und-der-Sudan-K%C3%A4fer/order/date_desc
Reusing existing connection to www.libon.it:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
I have dug through the code, and now I think I know what's going on. The header parsing code unquotes and converts the Location header. The code in the 302 handler doesn't compensate and therefore fails. Here is a patch that corrects the code in the 302 function.
|
|||
| msg239648 | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2015-03-31 00:13 | |
I think this patch needs a test. I left some comments on Rietveld as well. Perhaps there should also be a test to prove that redirects to URLs like /spaced%20path/ do not get mangled. Have a look at the HTTPRedirectHandler.redirect_request() method. Perhaps the code translating spaces to %20 could be merged with the fix for this issue. |
|||
| msg253409 | Author: Michael (Strecke) | Date: 2015-10-24 15:59 | |
The patch issue17214 did fix this issue in my 3.4.2 install on Ubuntu LTS. However, it triggered another bug:
File "/usr/local/lib/python3.4/urllib/request.py", line 646, in http_error_302
path = urlparts.path if urlpaths.path else "/"
NameError: name 'urlpaths' is not defined
This is obviously a typo. I'm not sure whether it has been reported yet (a short Google search didn't find anything), and I don't know how to provoke it independently.
|
|||
| msg253410 | Author: Michael (Strecke) | Date: 2015-10-24 16:08 | |
I should have looked more closely. The typo is part of the patch. It should be corrected there. |
|||
| msg253771 | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2015-10-30 23:50 | |
This bug only applies to Python 3. In Python 2, the non-ASCII bytes are sent through to the redirect target verbatim. I think this would also be the ideal way to handle the problem in 3, but percent-encoding them as proposed also seems good enough, and does not require hacking the HTTPConnection.putrequest() internals.
My patch updates Christian’s patch:
* Tested, so hopefully no typos :)
* Add test cases based on Issue 22248, as well as a URL already including a percent sign
* Process entire URL, not just the path component. A non-ASCII byte could just as easily be in the query component, for example.
* Remove redundant encoding of space character from redirect_request() method.
|
|||
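The percent-encoding approach described in this message can be illustrated with `urllib.parse.quote`. The following is a sketch under assumptions rather than the attached patch: `percent_encode_redirect` is a hypothetical helper, and it assumes the Location value reaches the handler decoded as ISO-8859-1, so that each character maps back to exactly one original byte.

```python
import string
from urllib.parse import quote


def percent_encode_redirect(location):
    """Hypothetical helper: percent-encode non-ASCII bytes in a Location value.

    Assumes the header value was decoded as ISO-8859-1.
    """
    # Treating all ASCII punctuation as "safe" keeps '/', '?', '&' and any
    # existing '%XX' escapes untouched; spaces, control characters and
    # non-ASCII characters are the ones that get quoted.
    return quote(location, encoding="iso-8859-1", safe=string.punctuation)


# UTF-8 bytes for 'ä' as they appear after ISO-8859-1 decoding, in a path
# that also carries a query component.
print(percent_encode_redirect("/dettaglio/K\u00c3\u00a4fer?x=1&y=2"))

# A target that is already percent-encoded passes through unchanged.
print(percent_encode_redirect("/spaced%20path/"))
```

Because the whole string is processed rather than only the path, a non-ASCII byte in the query component is handled the same way.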
| msg265521 | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2016-05-14 11:22 | |
I will look at committing this soon.
|||
| msg265682 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2016-05-16 08:15 | |
New changeset cb09fdef19f5 by Martin Panter in branch '3.5':
Issue #17214: Percent-encode non-ASCII bytes in redirect targets
https://hg.python.org/cpython/rev/cb09fdef19f5
New changeset 841a9a3f3cf6 by Martin Panter in branch 'default':
Issue #14132, Issue #17214: Merge two redirect handling fixes from 3.5
https://hg.python.org/cpython/rev/841a9a3f3cf6
|
|||
| msg265690 | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2016-05-16 09:44 | |
I restored the "redundant" encoding of space, in case someone’s code was relying on this behaviour, and because redirect_request() is a publicly documented method. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:41 | admin | set | github: 61416 |
| 2016-05-16 09:44:51 | martin.panter | set | status: open -> closed; resolution: fixed; messages: + msg265690; stage: commit review -> resolved |
| 2016-05-16 08:15:09 | python-dev | set | nosy: + python-dev; messages: + msg265682 |
| 2016-05-14 11:22:24 | martin.panter | set | stage: patch review -> commit review; messages: + msg265521; versions: - Python 3.4 |
| 2015-10-31 06:01:51 | terry.reedy | set | nosy: - terry.reedy |
| 2015-10-30 23:50:30 | martin.panter | set | files: + issue17214.redirect.v2.patch; stage: test needed -> patch review; messages: + msg253771; versions: + Python 3.5, Python 3.6 |
| 2015-10-24 19:01:53 | berker.peksag | set | nosy: + berker.peksag |
| 2015-10-24 16:08:30 | Strecke | set | messages: + msg253410 |
| 2015-10-24 15:59:47 | Strecke | set | nosy: + Strecke; messages: + msg253409; versions: - Python 3.2, Python 3.3 |
| 2015-06-08 04:48:40 | Uche Ogbuji | set | nosy: + Uche Ogbuji |
| 2015-04-11 11:58:11 | martin.panter | link | issue22248 superseder |
| 2015-03-31 00:13:07 | martin.panter | set | nosy: + martin.panter; messages: + msg239648 |
| 2013-07-19 09:23:28 | christian.heimes | set | files: + issue17214.patch; keywords: + patch; messages: + msg193352 |
| 2013-07-19 04:45:12 | vajrasky | set | messages: + msg193346 |
| 2013-07-18 16:41:24 | LDTech | set | messages: + msg193312 |
| 2013-07-18 15:10:43 | vajrasky | set | messages: + msg193306 |
| 2013-07-18 12:32:36 | christian.heimes | set | nosy: + christian.heimes; messages: + msg193286 |
| 2013-07-18 10:29:06 | vajrasky | set | files: + patch_to_urllib_handle_non_ascii_char_in_url.txt; nosy: + vajrasky; messages: + msg193279 |
| 2013-07-16 16:43:30 | LDTech | set | nosy: + LDTech; messages: + msg193183 |
| 2013-02-22 18:16:09 | terry.reedy | set | type: behavior; components: + Library (Lib), - Unicode; versions: + Python 3.3, Python 3.4; nosy: + terry.reedy, orsenthil; messages: + msg182679; stage: test needed |
| 2013-02-16 12:30:44 | Mi.Zou | set | status: closed -> open; resolution: not a bug -> (no value); messages: + msg182218 |
| 2013-02-16 12:05:56 | Mi.Zou | set | title: urllib.client.HTTPConnection.putrequest encode error -> http.client.HTTPConnection.putrequest encode error |
| 2013-02-16 12:05:08 | Mi.Zou | set | resolution: not a bug |
| 2013-02-16 12:04:22 | Mi.Zou | set | status: open -> closed |
| 2013-02-16 10:52:39 | Mi.Zou | create | |