| Author | mgiuca |
|---|---|
| Recipients | mgiuca |
| Date | 2008-07-06.14:52:06 |
| SpamBayes Score | 4.442693e-05 |
| Marked as misclassified | No |
| Message-id | <1215355930.42.0.79499861143.issue3300@psf.upfronthosting.co.za> |
| In-reply-to |
| Content | |
|---|---|
Three Unicode-related problems with urllib.parse.quote and urllib.parse.unquote in Python 3.0. (Patch attached.)

Firstly, unquote appears not to have been modified from Python 2, where it is designed to output a byte string. In Python 3, it outputs a unicode string, implicitly decoded as ISO-8859-1 (the code points are the same as the bytes). RFC 3986 states that the percent-encoded byte values should be decoded as UTF-8. See http://tools.ietf.org/html/rfc3986 section 2.5.

Current behaviour:

    >>> urllib.parse.unquote("%CE%A3")
    'Î£' (or '\u00ce\u00a3')

Desired behaviour:

    >>> urllib.parse.unquote("%CE%A3")
    'Σ' (or '\u03a3')

Secondly, while quote *has* been modified to encode to UTF-8 before percent-encoding, it does not work correctly for characters in range(128, 256), due to a special case in the code which again treats the code point values as byte values.

Current behaviour:

    >>> urllib.parse.quote('\u00e9')
    '%E9'

Desired behaviour:

    >>> urllib.parse.quote('\u00e9')
    '%C3%A9'

Note that currently, quoting characters less than 256 will use ISO-8859-1, while quoting characters 256 or higher will use UTF-8!

Thirdly, the "safe" argument to quote does not work for characters 256 or higher, since these are excluded from the special case. I thought I would fix this at the same time, but it's really a separate issue.

Current behaviour:

    >>> urllib.parse.quote('Σϰ', safe='Σ')
    '%CE%A3%CF%B0'

Desired behaviour:

    >>> urllib.parse.quote('Σϰ', safe='Σ')
    'Σ%CF%B0'

A patch which fixes all three issues is attached. Note that unquote now needs to handle the case where the UTF-8 sequence is invalid. This is currently handled by "replace" (invalid sequences are replaced by '\ufffd'). I would like to add an optional "errors" argument to unquote, defaulting to "replace", to allow the user to override this behaviour, but I didn't put that in because it would change the interface.

Note I also changed one of the test cases, which had the wrong expected output. (The string literal was manually UTF-8 encoded, designed for Python 2; it is nonsensical when viewed as a Python 3 Unicode string.) All urllib test cases pass.

Patch is for branch /branches/py3k, revision 64752.

Note: The above unquote issue also manifests itself in Python 2 for Unicode strings, but it's hazy as to what the behaviour should be, and changing it would break existing programs, so I'm just patching the Py3k branch.

Commit log:

urllib.parse.unquote: Fixed percent-encoded octets being implicitly decoded as ISO-8859-1; now decode as UTF-8, as per RFC 3986.

urllib.parse.quote: Fixed characters in range(128, 256) being implicitly encoded as ISO-8859-1; now encode as UTF-8. Also fixed characters greater than 256 not responding to "safe", and also not being cached.

Lib/test/test_urllib.py: Updated one test case for unquote which expected the wrong output. The new version of unquote passes the new test case. |
|
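The difference between the current and desired behaviours described in the message can be sketched in a few lines. This is a minimal illustration only, not the attached patch: the helper names (`unquote_latin1`, `unquote_utf8`, `quote_utf8`) are hypothetical, and the set of always-safe characters is an assumption chosen for the example.

```python
import re

# Assumption: this unreserved set (ASCII letters, digits, "_.-") is chosen for
# illustration and is not taken from the attached patch.
ALWAYS_SAFE = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789_.-"
)

# One or more consecutive %XX escapes, so multi-byte sequences stay together.
PERCENT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _run_to_bytes(match):
    """Turn a matched run such as '%CE%A3' into the bytes b'\\xce\\xa3'."""
    return bytes.fromhex(match.group(0).replace("%", ""))


def unquote_latin1(text):
    """Hypothetical helper mimicking the reported behaviour: one code point per byte."""
    return PERCENT_RUN.sub(lambda m: _run_to_bytes(m).decode("iso-8859-1"), text)


def unquote_utf8(text, errors="replace"):
    """Hypothetical helper for the desired behaviour: decode escaped bytes as UTF-8."""
    return PERCENT_RUN.sub(lambda m: _run_to_bytes(m).decode("utf-8", errors), text)


def quote_utf8(text, safe="/"):
    """Hypothetical helper: percent-encode the UTF-8 bytes of every unsafe character."""
    pieces = []
    for ch in text:
        if ch in ALWAYS_SAFE or ch in safe:
            # Safe characters pass through unchanged, including non-ASCII ones.
            pieces.append(ch)
        else:
            pieces.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
    return "".join(pieces)


# Each comparison mirrors one "Current"/"Desired" pair from the message.
print(unquote_latin1("%CE%A3") == "\u00ce\u00a3")
print(unquote_utf8("%CE%A3") == "\u03a3")
print(quote_utf8("\u00e9") == "%C3%A9")
print(quote_utf8("\u03a3\u03f0", safe="\u03a3") == "\u03a3%CF%B0")
```

Passing `errors="strict"` to the hypothetical `unquote_utf8` would raise `UnicodeDecodeError` on an invalid sequence such as `%FF` instead of substituting `'\ufffd'`, which is the kind of override the proposed optional "errors" argument is meant to allow.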
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2008-07-06 14:52:10 | mgiuca | set | spambayes_score: 4.44269e-05 -> 4.442693e-05 recipients: + mgiuca |
| 2008-07-06 14:52:10 | mgiuca | set | spambayes_score: 4.44269e-05 -> 4.44269e-05 messageid: <1215355930.42.0.79499861143.issue3300@psf.upfronthosting.co.za> |
| 2008-07-06 14:52:09 | mgiuca | link | issue3300 messages |
| 2008-07-06 14:52:08 | mgiuca | create | |