Message 69581 - Python tracker


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

In-reply-to
Author	lemburg
Recipients	Rhamphoryncus, ezio.melotti, lemburg
Date	2008-07-12 09:37:04
Message-id	<1215855427.58.0.456732436843.issue3297@psf.upfronthosting.co.za>

Content

Adam, I do know what I'm talking about: I was the lead designer of the
Unicode integration you find in Python and implemented most of it.
What you see as repr() of a Unicode object is the result of applying a
codec to the internal representation. Please don't confuse the output of
the codec ("unicode-escape") with the internal representation.
That said, Ezio did uncover a bug and we need to find the cause. It's
likely caused by the fact that the UTF-8 codec does not recombine
surrogates on UCS4 builds. See this comment in the codec implementation:
    case 3:
        if ((s[1] & 0xc0) != 0x80 ||
            (s[2] & 0xc0) != 0x80) {
            errmsg = "invalid data";
            startinpos = s-starts;
            endinpos = startinpos+3;
            goto utf8Error;
        }
        ch = ((s[0] & 0x0f) << 12) + ((s[1] & 0x3f) << 6) + (s[2] & 0x3f);
        if (ch < 0x0800) {
            /* Note: UTF-8 encodings of surrogates are considered
               legal UTF-8 sequences;
               XXX For wide builds (UCS-4) we should probably try
               to recombine the surrogates into a single code
               unit.
            */
            errmsg = "illegal encoding";
            startinpos = s-starts;
            endinpos = startinpos+3;
            goto utf8Error;
        }
        else
            *p++ = (Py_UNICODE)ch;
        break;
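
For illustration only, here is a rough Python sketch of what "recombining the surrogates" could mean. It is not the codec's code and not a proposed patch. The helper names decode_three_byte and combine_surrogates are made up for this example. The first mirrors the arithmetic of the "case 3" branch quoted above; the second applies the standard UTF-16 surrogate pair formula.

```python
def decode_three_byte(s: bytes) -> int:
    """Decode one 3-byte UTF-8 sequence, mirroring the 'case 3' branch."""
    # Both trailing bytes must have the form 10xxxxxx.
    if (s[1] & 0xC0) != 0x80 or (s[2] & 0xC0) != 0x80:
        raise ValueError("invalid data")
    ch = ((s[0] & 0x0F) << 12) + ((s[1] & 0x3F) << 6) + (s[2] & 0x3F)
    # Values below 0x0800 fit in fewer bytes, so this is an overlong form.
    if ch < 0x0800:
        raise ValueError("illegal encoding")
    # Like the quoted branch, surrogates (0xD800-0xDFFF) are passed through.
    return ch


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+10000 written as two separately UTF-8-encoded surrogates.
high = decode_three_byte(b"\xed\xa0\x80")   # 0xD800
low = decode_three_byte(b"\xed\xb0\x80")    # 0xDC00
code_point = combine_surrogates(high, low)  # 0x10000
```

Without the combining step, a wide build ends up holding two separate surrogate code units where a single code point was intended.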

History
Date	User	Action	Args
2008-07-12 09:37:07	lemburg	set	spambayes_score: 0.00495501 -> 0.004955014 recipients: + lemburg, Rhamphoryncus, ezio.melotti
2008-07-12 09:37:07	lemburg	set	spambayes_score: 0.00495501 -> 0.00495501 messageid: <1215855427.58.0.456732436843.issue3297@psf.upfronthosting.co.za>
2008-07-12 09:37:06	lemburg	link	issue3297 messages
2008-07-12 09:37:04	lemburg	create
