| Author | lemburg |
|---|---|
| Recipients | Rhamphoryncus, ezio.melotti, lemburg |
| Date | 2008-07-12.09:37:04 |
| Message-id | <1215855427.58.0.456732436843.issue3297@psf.upfronthosting.co.za> |
| In-reply-to | |
| Content | |
|---|---|
Adam, I do know what I'm talking about: I was the lead designer of the
Unicode integration you find in Python and implemented most of it.

What you see as repr() of a Unicode object is the result of applying a
codec to the internal representation. Please don't confuse the output of
the codec ("unicode-escape") with the internal representation.

That said, Ezio did uncover a bug and we need to find the cause. It's
likely caused by the fact that the UTF-8 codec does not recombine
surrogates on UCS4 builds. See this comment in the codec implementation:

    case 3:
        if ((s[1] & 0xc0) != 0x80 ||
            (s[2] & 0xc0) != 0x80) {
            errmsg = "invalid data";
            startinpos = s-starts;
            endinpos = startinpos+3;
            goto utf8Error;
        }
        ch = ((s[0] & 0x0f) << 12) + ((s[1] & 0x3f) << 6) + (s[2] & 0x3f);
        if (ch < 0x0800) {
            /* Note: UTF-8 encodings of surrogates are considered
               legal UTF-8 sequences;

               XXX For wide builds (UCS-4) we should probably try
               to recombine the surrogates into a single code
               unit.
            */
            errmsg = "illegal encoding";
            startinpos = s-starts;
            endinpos = startinpos+3;
            goto utf8Error;
        }
        else
            *p++ = (Py_UNICODE)ch;
        break;

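The recombination that the XXX comment alludes to could look roughly like the sketch below. It is not CPython code: the helper name `combine_surrogates` and its signature are hypothetical, and it shows only the standard UTF-16 arithmetic for turning a high/low surrogate pair into one code point, not how the decoder loop would have to be restructured to look ahead at the next 3-byte sequence.

```c
#include <stdint.h>

/* Hypothetical helper (an assumption, not part of CPython).
 * If `hi` is a high (lead) surrogate and `lo` is a low (trail)
 * surrogate, store the combined code point in *out and return 1.
 * Otherwise leave *out untouched and return 0. */
static int combine_surrogates(uint32_t hi, uint32_t lo, uint32_t *out)
{
    if (hi < 0xD800 || hi > 0xDBFF)   /* not a high surrogate */
        return 0;
    if (lo < 0xDC00 || lo > 0xDFFF)   /* not a low surrogate */
        return 0;

    /* Each surrogate carries 10 payload bits; the pair encodes
     * code points starting at U+10000. */
    *out = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    return 1;
}
```

For instance, the pair 0xD834, 0xDD1E combines to U+1D11E. On a wide build, a decoder using something like this would store that single value instead of two separate surrogate code units.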
| History | | | |
|---|---|---|---|
| Date | User | Action | Args |
| 2008-07-12 09:37:07 | lemburg | set | spambayes_score: 0.00495501 -> 0.004955014 recipients: + lemburg, Rhamphoryncus, ezio.melotti |
| 2008-07-12 09:37:07 | lemburg | set | spambayes_score: 0.00495501 -> 0.00495501 messageid: <1215855427.58.0.456732436843.issue3297@psf.upfronthosting.co.za> |
| 2008-07-12 09:37:06 | lemburg | link | issue3297 messages |
| 2008-07-12 09:37:04 | lemburg | create | |