Message102265
| Author |
vstinner |
| Recipients |
dangra, ezio.melotti, lemburg, sjmachin, vstinner |
| Date |
2010-04-03 14:43:21 |
| Message-id |
<1270305802.77.0.0073241346145.issue8271@psf.upfronthosting.co.za> |
| In-reply-to |
| Content |
> I also found out that, according to RFC 3629, surrogates
> are considered invalid and they can't be encoded/decoded,
> but the UTF-8 codec actually does it.
Python 2 does, but Python 3 raises an error:
Python 2.7a4+ (trunk:79675, Apr 3 2010, 16:11:36)
>>> u"\uDC80".encode("utf8")
'\xed\xb2\x80'
Python 3.2a0 (py3k:79441, Mar 26 2010, 13:04:55)
>>> "\uDC80".encode("utf8")
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed
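For comparison, here is a minimal Python 3 sketch of the same behaviour. It assumes an interpreter that provides the "surrogatepass" error handler (Python 3.1 or later); the variable names are illustrative only and do not come from the issue.

```python
# Lone low surrogate, the same code point used in the sessions above.
lone = "\udc80"

# With the default (strict) error handler, Python 3 refuses to encode it.
try:
    lone.encode("utf-8")
    strict_raises = False
except UnicodeEncodeError:
    strict_raises = True

# The "surrogatepass" handler emits the same 3-byte sequence as Python 2.
passed = lone.encode("utf-8", "surrogatepass")

# The same handler is required to decode those bytes back to the surrogate.
roundtrip = passed.decode("utf-8", "surrogatepass")

print(strict_raises, passed, roundtrip == lone)
```

Under these assumptions, callers that really need the Python 2 byte sequence have to opt in explicitly; the strict codec still rejects lone surrogates, as RFC 3629 requires.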
Refusing to encode surrogates (in UTF-8) causes a lot of crashes in Python 3, because most functions that call _PyUnicode_AsString() assume that it never fails: see #6687 (and #8195 and many other crashes). It's not a good idea to change this in Python 2.7, because it would require a huge amount of work and we are close to the first beta of 2.7. |
|
History
|
|---|
| Date |
User |
Action |
Args |
| 2010-04-03 14:43:22 | vstinner | set | recipients:
+ vstinner, lemburg, sjmachin, ezio.melotti, dangra |
| 2010-04-03 14:43:22 | vstinner | set | messageid: <1270305802.77.0.0073241346145.issue8271@psf.upfronthosting.co.za> |
| 2010-04-03 14:43:21 | vstinner | link | issue8271 messages |
| 2010-04-03 14:43:21 | vstinner | create |
|