
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Add "java modified utf-8" codec
Type: enhancement Stage: test needed
Components: Library (Lib), Unicode Versions: Python 3.3
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: belopolsky, ezio.melotti, georg.brandl, lemburg, loewis, moese, phr, serhiy.storchaka, tchrist, vstinner
Priority: normal Keywords: patch

Created on 2008-05-15 03:08 by phr, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
utf_8_java.patch vstinner, 2011-05-11 12:55
Messages (26)
msg66843 - (view) Author: paul rubin (phr) Date: 2008-05-15 03:08
For object serialization and some other purposes, Java encodes unicode
strings with a modified version of utf-8:
http://en.wikipedia.org/wiki/UTF-8#Java
http://java.sun.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8
It is used in Lucene index files among other places.
It would be useful if Python had a codec for this, maybe called "UTF-8J"
or something like that.
msg66852 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-05-15 09:55
What would you use such a codec for ?
From the references you gave, it is only used internally for Java object
serialization, so wouldn't really be of much use in Python.
msg66854 - (view) Author: paul rubin (phr) Date: 2008-05-15 10:55
Some java applications use it externally. The purpose seems to be to
prevent NUL bytes from appearing inside encoded strings which can
confuse C libraries that expect NUL's to terminate strings. My
immediate application is parsing lucene indexes:
http://lucene.apache.org/java/docs/fileformats.html#Chars 
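The NUL-avoidance point can be illustrated with a short sketch. It only shows the byte layout for U+0000 (the `replace()` call stands in for a real encoder and is an assumption made for illustration, not part of any patch in this issue):

```python
text = "a\x00b"

# Standard UTF-8 encodes U+0000 as a single zero byte.
standard = text.encode("utf-8")

# Modified UTF-8 uses the two-byte overlong form C0 80 instead, so the
# encoded output never contains a zero byte. The replace() below only
# illustrates the byte layout; it is not a complete encoder.
modified = standard.replace(b"\x00", b"\xc0\x80")

print(standard)  # b'a\x00b'
print(modified)  # b'a\xc0\x80b'

# A strict UTF-8 decoder rejects the overlong form.
try:
    modified.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict UTF-8 rejects it:", exc.reason)
```

A C library that treats a zero byte as a string terminator would see the whole of `modified`, but only the first byte of `standard`.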
msg66855 - (view) Author: paul rubin (phr) Date: 2008-05-15 10:59
Also, according to wikipedia, tcl also uses that encoding.
msg66857 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-05-15 11:54
TCL only uses the codec for internal representation. You might want to
interface to TCL at the C level and use the codec there, but is that
really a good reason to include the codec in the Python stdlib ?
Ditto for parsing Lucene indexes.
I think you're better off writing your own codec and registering it with
the Python codec registry at application start-up time.
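A minimal sketch of that approach, using `codecs.register` and `codecs.CodecInfo` from the standard library. The codec name `mutf8_sketch` and the helper functions are hypothetical, and the sketch handles only the U+0000 substitution, not the surrogate-pair encoding of non-BMP characters:

```python
import codecs

NAME = "mutf8_sketch"  # hypothetical name, not a stdlib codec


def _encode(text, errors="strict"):
    # Encode as UTF-8, then swap each zero byte for the overlong C0 80 form.
    data = text.encode("utf-8", errors).replace(b"\x00", b"\xc0\x80")
    return data, len(text)


def _decode(data, errors="strict"):
    # Undo the C0 80 substitution, then decode as ordinary UTF-8.
    raw = bytes(data)
    return raw.replace(b"\xc0\x80", b"\x00").decode("utf-8", errors), len(raw)


def _search(name):
    # The registry passes a lower-cased name; normalize hyphens and
    # spaces here so the lookup works the same on every Python 3 version.
    if name.replace("-", "_").replace(" ", "_") == NAME:
        return codecs.CodecInfo(encode=_encode, decode=_decode, name=NAME)
    return None


# Run once at application start-up.
codecs.register(_search)

print("a\x00b".encode(NAME))         # b'a\xc0\x80b'
print(b"a\xc0\x80b".decode(NAME))    # 'a\x00b' (printed with a raw NUL)
```

Because the helpers forward `errors` to the built-in UTF-8 codec, the usual error handlers ("strict", "replace", "ignore") keep working for malformed input.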
msg66862 - (view) Author: paul rubin (phr) Date: 2008-05-15 14:39
I'm not sure what you mean by "ditto for Lucene indexes". I wasn't
planning to use C code. I was hoping to write Python code to parse
those indexes, then found they use this weird encoding, and Python's
codec set is fairly inclusive already, so this codec sounded like a
reasonably useful addition. It probably shows up other places as well.
 It might even be a reasonable internal representation for Python, which
as I understand it currently can't represent codepoints outside the BMP.
 Also, it is used in Java serialization, which I think of as a somewhat
weird and whacky thing, but it's conceivable that somebody someday might
want to write a Python program that speaks the Java serialization
protocol (I don't have a good sense of whether that's feasible).
Writing an application specific codec with the C API is doable in
principle, but it seems like an awful lot of effort for just one quickie
program. These indexes are very large and so writing the codec in
Python would probably be painfully slow.
msg66866 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-05-15 15:11
Since we also support oddball codecs like UTF-8-SIG, why not this one too?
Given the importance of UTF-8, it seems a good idea to support common
variations.
msg67368 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-05-26 09:36
Ok, if you can write a patch implementing the codec, we'll add it.
Please use the name "utf-8-java" and codec name utf_8_java.py.
msg123484 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-06 18:28
> TCL only uses the codec for internal representation. You might want to
> interface to TCL at the C level and use the codec there, but is that
> really a good reason to include the codec in the Python stdlib ?
I wonder if tkinter should use this encoding.
msg123770 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-12-11 02:52
> I wonder if tkinter should use this encoding.
Tkinter is used to build graphical interfaces. I don't think that users type NUL bytes with their keyboard. But maybe there is a use case?
msg135757 - (view) Author: Moese (moese) Date: 2011-05-11 00:26
I use the hachoir Python package to parse Java .class files and extract the strings from them; support for Java modified UTF-8 would have been nice.
msg135772 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-11 12:55
utf_8_java.patch: Implement "utf-8-java" encoding.
 * It has no alias
 * 'a\0b'.encode('utf-8-java') returns b'a\xc0\x80b'
 * b'a\xc0\x80b'.decode('utf-8-java') returns 'a\x00b'
 * I added some tests to the utf-8 codec (test_invalid, test_null_byte)
 * I added many tests for the utf-8-java codec
 * I chose to copy utf8_code_length as utf8java_code_length instead of adding an if, so as not to slow down the UTF-8 codec
 * Decoder: 2-byte sequences may be *a little bit* slower for UTF-8:
"if ((s[1] & 0xc0) != 0x80)"
 is replaced by
"if ((ch <= 0x007F && (ch != 0x0000 || !java)) || ch > 0x07FF)"
 * Encoder: encoding chars in U+0000-U+007F may be *a little bit* slower for UTF-8: I added a (ch == 0x00 && java) test
For the doc, I just added a "utf-8-java" line to the codec list, but I did not add a paragraph explaining how this codec differs from utf-8. Does anyone have a suggestion?
msg135776 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-05-11 13:31
Thanks for the patch, Victor.
Some comments on the patch:
 * the codec will have to be able to work with lone surrogates
 (see the wikipedia page explaining this detail), which the
 UTF-8 codec in Python 3.x no longer does, so another special
 case is due for this difference
 * we should not make the standard UTF-8 codec slower just to
 support a variant of UTF-8 which will only get marginal use;
 for the decoder, the changes are minimal, so that's fine,
 but for the encoder you are changing the most often used
 code branch to check for NUL bytes - we need a better solution
 for this, even if it means having to use a separate encode_utf8java
 function
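The lone-surrogate point can be checked directly in Python 3. The sketch below relies only on the standard 'surrogatepass' error handler; the variable names are illustrative:

```python
lone = "\ud800"  # a lone high surrogate

# The strict UTF-8 codec in Python 3 refuses to encode it.
try:
    lone.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict UTF-8 refuses:", exc.reason)

# With 'surrogatepass' the surrogate is encoded as a three-byte sequence,
# which is the form a Java-style encoder produces for each surrogate.
encoded = lone.encode("utf-8", "surrogatepass")
print(encoded)  # b'\xed\xa0\x80'

# The same handler is needed to decode the bytes again.
print(encoded.decode("utf-8", "surrogatepass") == lone)  # True
```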
Since the ticket was opened in 2008, the common name of the
codec appears to have changed from "UTF-8 Java" to "Modified UTF-8"
or "MUTF-8" as short alias:
 * http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
 (change in http://en.wikipedia.org/w/index.php?title=UTF-8&diff=next&oldid=291829304)
 * http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
 (scroll down to "Modified UTF-8")
 * http://developer.android.com/reference/java/io/DataInput.html
 (this is for Android)
So I guess we should adapt the name to the now common name
and call it "ModifiedUTF8" in the C API and add these aliases:
"utf-8-modified", "mutf-8" and "modified-utf-8".
msg135796 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-11 19:00
See also issue #1028.
msg135797 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-11 19:47
Benchmark:
a) ./python -m timeit "(b'\xc3\xa9' * 10000).decode('utf-8')"
b) ./python -m timeit "(''.join( map(chr, range(0, 128)) )*1000).encode('utf-8')"
c) ./python -m timeit "f=open('Misc/ACKS', encoding='utf-8'); acks=f.read(); f.close()" "acks.encode('utf-8')"
d) ./python -m timeit "f=open('Misc/ACKS', 'rb'); acks=f.read(); f.close()" "acks.decode('utf-8')"
Original -> patched (smallest value of 3 runs):
a) 85.8 usec -> 83.4 usec (-2.8%)
b) 548 usec -> 688 usec (+25.5%)
c) 132 usec -> 144 usec (+9%)
d) 65.9 usec -> 67.3 usec (+2.1%)
Oh, decoding 2-byte sequences is faster with my patch. Strange :-)
But being 25% slower to encode pure ASCII text is not good news.
msg141938 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-12 02:41
Please do not call this "utf-8-java". It is called "cesu-8" per UTR #26 at:
 http://unicode.org/reports/tr26/
CESU-8 is *not* a valid Unicode Transformation Format and should not be called UTF-8. It is a real pain in the butt, caused by people who misunderstand Unicode mis-encoding UCS-2 into UTF-8, screwing it up. I understand the need to be able to read it, but call it what it is, please.
Despite the talk about Lucene, I note that the Perl port of Lucene uses real UTF-8, not CESU-8.
msg141940 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2011-08-12 05:37
+1 for calling it by the correct name (the docs can of course state that this is equivalent to "Java Modified UTF-8" or however they like to call it).
msg141949 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-08-12 10:26
Tom Christiansen wrote:
> 
> Tom Christiansen <tchrist@perl.com> added the comment:
> 
> Please do not call this "utf-8-java". It is called "cesu-8" per UTR #26 at:
> 
> http://unicode.org/reports/tr26/
> 
> CESU-8 is *not* a valid Unicode Transformation Format and should not be called UTF-8. It is a real pain in the butt, caused by people who misunderstand Unicode mis-encoding UCS-2 into UTF-8, screwing it up. I understand the need to be able to read it, but call it what it is, please.
> 
> Despite the talk about Lucene, I note that the Perl port of Lucene uses real UTF-8, not CESU-8.
CESU-8 is a different encoding than the one we are talking about.
The only difference between UTF-8 and the modified one is the different
encoding for the U+0000 code point to have the output not contain
any NUL bytes.
msg141955 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-08-12 13:35
Corrected the title again. See my comment.
msg141956 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-08-12 13:44
Marc-Andre Lemburg wrote:
> 
> Corrected the title again. See my comment.
Please open a new ticket, if you want to add a CESU-8 codec.
Looking at the relevant use cases, I'm at most +0 on adding the
modified UTF-8 codec. I think such codecs can well live outside
the stdlib on PyPI.
msg141957 - (view) Author: Moese (moese) Date: 2011-08-12 13:57
Python does have other "weird" encodings like bz2 or rot13.
Besides, batteries included :)
msg142017 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-08-13 08:15
> Python does have other "weird" encodings like bz2 or rot13.
No, it no longer has such weird encodings.
msg159130 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-24 11:01
As far as I understand, this codec can be implemented in Python. There is no need to modify the interpreter core.
import re

def decode_cesu8(b):
    # Decode as UTF-8 with surrogates allowed, then join each surrogate pair.
    return re.sub('[\uD800-\uDBFF][\uDC00-\uDFFF]',
                  lambda m: chr(0x10000 + ((ord(m.group()[0]) & 0x3FF) << 10) + (ord(m.group()[1]) & 0x3FF)),
                  b.decode('utf-8', 'surrogatepass'))
def encode_cesu8(s):
    # Split each non-BMP character into a surrogate pair, then encode each half.
    return re.sub('[\U00010000-\U0010FFFF]',
                  lambda m: chr(0xD800 | ((ord(m.group()) - 0x10000) >> 10)) + chr(0xDC00 | (ord(m.group()) & 0x3FF)),
                  s).encode('utf-8', 'surrogatepass')
def decode_mutf8(b):
    return decode_cesu8(b.replace(b'\xC0\x80', b'\x00'))
def encode_mutf8(s):
    return encode_cesu8(s).replace(b'\x00', b'\xC0\x80')
msg159133 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-04-24 11:28
Serhiy: your functions do not constitute a Python codec. For example, there is no support for error handlers in them.
msg159136 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-24 12:07
> Serhiy: your functions do not constitute a Python codec. For example, there is no support for error handlers in them.
Yes, it is not a codec in Python library terminology. It's just a pair
of functions, the COder and DECoder, which is enough for the task of
hacking Java serialized data. I don't think that such a specific task
justifies changing the interpreter core.
However, translators that convert non-BMP characters to surrogate
pairs and back would be useful in the standard library. They are needed
to work with non-standard encodings (CESU-8, MUTF-8, cp65001, some
Tk/IDLE issues). This is a fairly common task.
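Such translators can be sketched in pure Python by round-tripping through UTF-16 with the 'surrogatepass' error handler. The function names below are hypothetical and not part of the standard library:

```python
import struct


def to_surrogate_pairs(s):
    # Encode to UTF-16-LE, then turn every 16-bit code unit back into a
    # single character, so each non-BMP character becomes two surrogates.
    data = s.encode("utf-16-le", "surrogatepass")
    return "".join(chr(unit) for (unit,) in struct.iter_unpack("<H", data))


def from_surrogate_pairs(s):
    # Encoding writes each surrogate as one 16-bit unit; decoding then
    # joins every valid high/low pair into one non-BMP character.
    return s.encode("utf-16-le", "surrogatepass").decode("utf-16-le", "surrogatepass")


smiley = "a\U0001F600"
pairs = to_surrogate_pairs(smiley)
print(len(smiley), len(pairs))                 # 2 3
print(from_surrogate_pairs(pairs) == smiley)   # True
```

This trades speed for brevity; a production version would likely be written in C or avoid the intermediate byte string.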
msg159137 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-04-24 12:26
Ok, I'm closing this entire issue as "won't fix", then. There apparently is a need for functionality like this, but there is apparently also a concern that this is too specialized for the standard library.
As it is possible to implement this as a stand-alone library, I encourage interested users to design a package for PyPI that has this functionality collected for reuse. If the library is then widely used after some time, this issue can be reconsidered.
History
Date                 User              Action  Args
2022-04-11 14:56:34  admin             set     github: 47106
2012-04-24 12:26:37  loewis            set     status: open -> closed; resolution: wont fix; messages: + msg159137
2012-04-24 12:07:37  serhiy.storchaka  set     messages: + msg159136
2012-04-24 11:28:53  loewis            set     nosy: + loewis; messages: + msg159133
2012-04-24 11:01:36  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg159130
2011-08-13 08:15:10  vstinner          set     messages: + msg142017
2011-08-12 13:57:58  moese             set     messages: + msg141957
2011-08-12 13:44:07  lemburg           set     messages: + msg141956
2011-08-12 13:35:02  lemburg           set     title: Add CESU-8 codec ("java modified utf-8") -> Add "java modified utf-8" codec; messages: + msg141955; versions: + Python 3.3, - Python 2.7, Python 3.2
2011-08-12 13:14:43  vstinner          set     title: add codec for java modified utf-8 -> Add CESU-8 codec ("java modified utf-8")
2011-08-12 10:26:32  lemburg           set     messages: + msg141949
2011-08-12 05:37:33  georg.brandl      set     messages: + msg141940
2011-08-12 02:41:16  tchrist           set     nosy: + tchrist; messages: + msg141938
2011-05-11 19:47:54  vstinner          set     messages: + msg135797
2011-05-11 19:00:33  vstinner          set     messages: + msg135796
2011-05-11 13:31:59  lemburg           set     messages: + msg135776
2011-05-11 12:55:39  vstinner          set     files: + utf_8_java.patch; keywords: + patch; messages: + msg135772
2011-05-11 00:26:26  moese             set     nosy: + moese; messages: + msg135757
2010-12-11 02:52:51  vstinner          set     messages: + msg123770
2010-12-06 18:28:50  belopolsky        set     nosy: + belopolsky; messages: + msg123484
2009-05-16 19:39:18  ajaksu2           set     nosy: + vstinner, ezio.melotti; versions: + Python 2.7, Python 3.2, - Python 2.5; priority: normal; components: + Unicode; type: enhancement; stage: test needed
2008-05-26 09:36:39  lemburg           set     messages: + msg67368
2008-05-15 15:11:31  georg.brandl      set     nosy: + georg.brandl; messages: + msg66866
2008-05-15 14:39:30  phr               set     messages: + msg66862
2008-05-15 11:54:18  lemburg           set     messages: + msg66857
2008-05-15 10:59:03  phr               set     messages: + msg66855
2008-05-15 10:56:05  phr               set     messages: + msg66854
2008-05-15 09:55:53  lemburg           set     nosy: + lemburg; messages: + msg66852; title: add coded for java modified utf-8 -> add codec for java modified utf-8
2008-05-15 03:08:38  phr               create
