Message135776
| Author |
lemburg |
| Recipients |
belopolsky, ezio.melotti, georg.brandl, lemburg, moese, phr, vstinner |
| Date |
2011-05-11.13:31:59 |
| SpamBayes Score |
1.3358031e-10 |
| Marked as misclassified |
No |
| Message-id |
<4DCA8FC6.7050902@egenix.com> |
| In-reply-to |
<1305118551.23.0.106349254941.issue2857@psf.upfronthosting.co.za> |
| Content |
Thanks for the patch, Victor.
Some comments on the patch:
* the codec will have to be able to work with lone surrogates
(see the Wikipedia page explaining this detail), which the
UTF-8 codec in Python 3.x no longer does, so this difference
needs another special case
* we should not make the standard UTF-8 codec slower just to
support a variant of UTF-8 which will only get marginal use;
for the decoder, the changes are minimal, so that's fine,
but for the encoder you are changing the most often used
code branch to check for NUL bytes - we need a better solution
for this, even if it means having to use a separate encode_utf8java
function
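To make the two points above concrete, here is a rough pure-Python sketch of what a separate encoder would have to do. It is not the code from the patch: the names encode_modified_utf8 and _three_bytes are made up for illustration, and the rules it follows are the ones described on the Wikipedia page (NUL written as C0 80, supplementary characters written as two three-byte surrogates, lone surrogates passed through).

```python
def _three_bytes(cp: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF as a three-byte sequence."""
    return bytes((0xE0 | (cp >> 12),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)))


def encode_modified_utf8(text: str) -> bytes:
    """Hypothetical stand-alone encoder, kept apart from the UTF-8 fast path."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NUL becomes the overlong two-byte form, so the output
            # never contains a 0x00 byte.
            out += b"\xc0\x80"
        elif cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes((0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)))
        elif cp < 0x10000:
            # This range includes lone surrogates (U+D800..U+DFFF), which
            # are encoded like any other BMP code point.
            out += _three_bytes(cp)
        else:
            # Supplementary characters are split into a UTF-16 surrogate
            # pair and each half is encoded as three bytes.
            cp -= 0x10000
            out += _three_bytes(0xD800 | (cp >> 10))
            out += _three_bytes(0xDC00 | (cp & 0x3FF))
    return bytes(out)
```

Because all of the special cases live in their own function, the regular UTF-8 encoder would not need an extra NUL check in its most often used branch.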
Since the ticket was opened in 2008, the common name of the
codec appears to have changed from "UTF-8 Java" to "Modified UTF-8"
or "MUTF-8" as short alias:
* http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
(change in http://en.wikipedia.org/w/index.php?title=UTF-8&diff=next&oldid=291829304)
* http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
(scroll down to "Modified UTF-8")
* http://developer.android.com/reference/java/io/DataInput.html
(this is for Android)
So I guess we should adapt the name to the now common one
and call it "ModifiedUTF8" in the C API and add these aliases:
"utf-8-modified", "mutf-8" and "modified-utf-8". |
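As a rough illustration of how the three aliases could behave from Python code, here is a sketch that registers them through codecs.register. Everything in it is an assumption made for the example: the helper names are invented, the codec is a compact pure-Python stand-in built on the existing "surrogatepass" error handler rather than the C implementation discussed here, the errors argument is ignored, and the decoder is more lenient than a strict Modified UTF-8 decoder would be.

```python
import codecs


def _mutf8_encode(text, errors="strict"):
    # Split supplementary characters into UTF-16 code units, then encode
    # each unit on its own; lone surrogates pass through unchanged.
    raw = text.encode("utf-16-le", "surrogatepass")
    units = "".join(
        chr(int.from_bytes(raw[i:i + 2], "little"))
        for i in range(0, len(raw), 2)
    )
    data = units.encode("utf-8", "surrogatepass")
    # A 0x00 byte can only come from U+0000; write it as C0 80 instead.
    return data.replace(b"\x00", b"\xc0\x80"), len(text)


def _mutf8_decode(data, errors="strict"):
    data = bytes(data)
    units = data.replace(b"\xc0\x80", b"\x00").decode("utf-8", "surrogatepass")
    # Round-trip through UTF-16 to join surrogate pairs again.
    raw = units.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "surrogatepass"), len(data)


# Stored with underscores; the search function normalizes hyphens itself
# because the lookup machinery's own normalization differs between versions.
_ALIASES = {"utf_8_modified", "mutf_8", "modified_utf_8"}


def _search(name):
    if name.lower().replace("-", "_") in _ALIASES:
        return codecs.CodecInfo(_mutf8_encode, _mutf8_decode,
                                name="modified-utf-8")
    return None


codecs.register(_search)
```

With the search function registered, "a\x00b".encode("mutf-8") and the same call with "utf-8-modified" or "modified-utf-8" all go through the same codec.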
|