Issue1066
Created on 2008年06月26日.00:04:03 by pjenvey, last changed 2014年06月23日.17:53:46 by zyasoft.
| Files | |||
|---|---|---|
| File name | Uploaded | Description |
| cjkcodecs-patch-20120907-1335 | yyamano, 2012年09月10日.04:23:37 | |
| shift_jis.patch | zyasoft, 2014年06月12日.21:53:38 | |
| Messages | |||
|---|---|---|---|
| msg3306 (view) | Author: Philip Jenvey (pjenvey) | Date: 2008年06月26日.00:04:02 | |
CPython 2.4 included the CJKCodecs package (http://cjkpython.i18n.org/), which provides codecs for Chinese, Japanese, Korean and related charsets, implemented in C. The lack of these codecs causes these tests to fail:
+ test_codecencodings_cn
+ test_codecencodings_hk
+ test_codecencodings_jp
+ test_codecencodings_kr
+ test_codecencodings_tw
+ test_codecmaps_cn
+ test_codecmaps_hk
+ test_codecmaps_jp
+ test_codecmaps_kr
+ test_codecmaps_tw |
|||
| msg3307 (view) | Author: Philip Jenvey (pjenvey) | Date: 2008年06月26日.00:08:07 | |
CJKCodecs also includes the _multibytecodec module, which affects these tests: test_multibytecodec and test_multibytecodec_support |
|||
| msg3879 (view) | Author: Philip Jenvey (pjenvey) | Date: 2008年12月08日.05:27:14 | |
We should utilize the nio charsets for these.
One gotcha is that they encode to and decode from actual bytes, not chars (as they should), and of course our byte bucket (str) is based on chars. In that case we could probably make the streaming to and from our 'byte bucket' more efficient by faking a ByteBuffer that gives back bytes from, and puts bytes back to, an underlying char array. That'd avoid an extra conversion pass.
The Encoder/Decoder implementations seem to go through the actual ByteBuffer methods, i.e. not through the underlying Buffer arrays directly. That'd allow this hack. A CharsetEncoder can take a ByteBuffer instance to fill into; we'd have to use that for this hack, since Charset.encode returns an entirely new ByteBuffer.
This hack would be kind of lame, but would go away in Jython 3. Or we could just do the extra pass.
Another gotcha: can we still retain our error handling behavior with Java's Charsets? Briefly looking at them, they seem to have fairly similar error handling facilities. |
|||
| msg3880 (view) | Author: Philip Jenvey (pjenvey) | Date: 2008年12月08日.05:58:23 | |
Java supports most of the cjkcodecs but not these:
cp932 (mskanji)
euc_jis_2004 (Japanese)
euc_jisx0213 (Japanese)
hz (Simplified Chinese)
iso2022_jp_1 (iso2022 variants)
iso2022_jp_2
iso2022_jp_2004
iso2022_jp_3
iso2022_jp_ext
shift_jis_2004 (Shiftjis variants)
shift_jisx0213
Determined from: http://java.sun.com/j2se/1.5.0/docs/guide/intl/encoding.doc.html
and:
$ grep getcodec *
big5.py:codec = _codecs_tw.getcodec('big5')
big5hkscs.py:codec = _codecs_hk.getcodec('big5hkscs')
cp932.py:codec = _codecs_jp.getcodec('cp932')
cp949.py:codec = _codecs_kr.getcodec('cp949')
cp950.py:codec = _codecs_tw.getcodec('cp950')
euc_jis_2004.py:codec = _codecs_jp.getcodec('euc_jis_2004')
euc_jisx0213.py:codec = _codecs_jp.getcodec('euc_jisx0213')
euc_jp.py:codec = _codecs_jp.getcodec('euc_jp')
euc_kr.py:codec = _codecs_kr.getcodec('euc_kr')
gb18030.py:codec = _codecs_cn.getcodec('gb18030')
gb2312.py:codec = _codecs_cn.getcodec('gb2312')
gbk.py:codec = _codecs_cn.getcodec('gbk')
hz.py:codec = _codecs_cn.getcodec('hz')
iso2022_jp.py:codec = _codecs_iso2022.getcodec('iso2022_jp')
iso2022_jp_1.py:codec = _codecs_iso2022.getcodec('iso2022_jp_1')
iso2022_jp_2.py:codec = _codecs_iso2022.getcodec('iso2022_jp_2')
iso2022_jp_2004.py:codec = _codecs_iso2022.getcodec('iso2022_jp_2004')
iso2022_jp_3.py:codec = _codecs_iso2022.getcodec('iso2022_jp_3')
iso2022_jp_ext.py:codec = _codecs_iso2022.getcodec('iso2022_jp_ext')
iso2022_kr.py:codec = _codecs_iso2022.getcodec('iso2022_kr')
johab.py:codec = _codecs_kr.getcodec('johab')
shift_jis.py:codec = _codecs_jp.getcodec('shift_jis')
shift_jis_2004.py:codec = _codecs_jp.getcodec('shift_jis_2004')
shift_jisx0213.py:codec = _codecs_jp.getcodec('shift_jisx0213') |
|||
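The availability check described in msg3880 can also be done at runtime rather than by reading documentation. The following is a minimal sketch, assuming a Python 3 interpreter; `missing_codecs` and `CJK_CODECS` are hypothetical names used only for illustration, and the list simply repeats the codec names from the message above.

```python
import codecs

# Codec names copied from the "not supported by Java" list in msg3880.
CJK_CODECS = [
    "cp932", "euc_jis_2004", "euc_jisx0213", "hz",
    "iso2022_jp_1", "iso2022_jp_2", "iso2022_jp_2004",
    "iso2022_jp_3", "iso2022_jp_ext",
    "shift_jis_2004", "shift_jisx0213",
]


def missing_codecs(names):
    """Return the names for which codecs.lookup() raises LookupError."""
    missing = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # The result depends on the interpreter the script runs under.
    print(missing_codecs(CJK_CODECS))
```

Running this under the interpreter in question reports which of the listed codecs it cannot find.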
| msg4243 (view) | Author: Jim Baker (zyasoft) | Date: 2009年03月12日.08:21:29 | |
Deferred to 2.5.1 |
|||
| msg4992 (view) | Author: Charlie Groves (cgroves) | Date: 2009年08月05日.16:50:20 | |
When I looked at this, the nio charsets had similar default error handlers, but there was no way to make custom ones. I think that rules out using these charsets with Python, since codecs picked up the ability to use a user-defined error handling function in 2.3. It has been a couple of years since I looked at this though, so I may be misremembering things. |
|||
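For readers unfamiliar with the user-defined error handlers mentioned in msg4992, here is a minimal sketch of the Python side of that feature, assuming CPython 3 with its bundled shift_jis codec. The handler name `hex_escape` and its output format are hypothetical; only `codecs.register_error` and the `UnicodeDecodeError` attributes are standard.

```python
import codecs


def hex_escape(exc):
    """Hypothetical handler: replace undecodable bytes with <XX> markers."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("<%02X>" % b for b in bad)
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end


codecs.register_error("hex_escape", hex_escape)

# 0xFF is not a valid shift_jis byte, so the handler is called for it.
decoded = b"abc\xffdef".decode("shift_jis", errors="hex_escape")
```

Any codec implementation has to honor this contract: hand the handler an exception carrying start, end and reason, then continue from the position the handler returns.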
| msg5020 (view) | Author: Philip Jenvey (pjenvey) | Date: 2009年08月12日.07:07:21 | |
Actually it seems like we could do callable error handlers via nio's report error action. That would make the encoder/decoder return a CoderResult upon failure, but without resetting its state. So we should be able to create a UnicodeError with its start/end/reason info from that CoderResult and the input Buffer (to pass to our error handler). Then we act upon the handler's result, restarting the encoder/decoder from where it left off if necessary. |
|||
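The control flow proposed in msg5020 (report the error, build a UnicodeError, call the handler, resume) can be sketched in plain Python. This is only an analogue under stated assumptions: it does not use java.nio or CoderResult, it relies on the strict decoder's `UnicodeDecodeError` to report the error position, and it assumes a stateless encoding such as shift_jis. The function name `decode_with_handler` is hypothetical.

```python
def decode_with_handler(data, encoding, handler):
    """Decode bytes strictly, calling handler(exc) on each error and resuming.

    handler receives a UnicodeDecodeError whose start/end refer to the
    whole input and returns (replacement_text, resume_position).
    """
    pieces = []
    pos = 0
    while pos < len(data):
        try:
            pieces.append(data[pos:].decode(encoding))
            break
        except UnicodeDecodeError as exc:
            # exc.start/exc.end are relative to the slice; make them absolute.
            start, end = pos + exc.start, pos + exc.end
            pieces.append(data[pos:start].decode(encoding))
            full_exc = UnicodeDecodeError(encoding, data, start, end, exc.reason)
            replacement, resume = handler(full_exc)
            if resume <= start:
                raise ValueError("handler must move past the error position")
            pieces.append(replacement)
            pos = resume
    return "".join(pieces)
```

Re-decoding slices like this is wasteful; a real implementation would keep the decoder's state and continue from the reported position, which is what the report action is expected to allow.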
| msg5027 (view) | Author: Charlie Groves (cgroves) | Date: 2009年08月15日.20:03:22 | |
Ahh, that does sound workable. Nice! |
|||
| msg6055 (view) | Author: Jim Baker (zyasoft) | Date: 2010年09月09日.05:48:16 | |
Let's see if we can write wrappers of NIO in time for 2.5.2. |
|||
| msg6202 (view) | Author: Jim Baker (zyasoft) | Date: 2010年10月22日.22:20:52 | |
I'm going to try to get this into 2.5.2rc2, so marking high. I think I now know the respective APIs well enough to write a pure Jython version that leverages java.nio, following Phil's suggestion. |
|||
| msg6216 (view) | Author: Jim Baker (zyasoft) | Date: 2010年11月01日.15:25:14 | |
This will not make 2.5.2 unless there's an RC3. I recommend we release it as a separate package on PyPI. Because of how one needs to do the buffering, it's necessary to use Java to manage the loop for reasonable performance. |
|||
| msg6483 (view) | Author: Philip Jenvey (pjenvey) | Date: 2011年04月13日.20:19:58 | |
FYI Yuji Yamano made some good progress on this task during the PyCon '11 sprint. He actually got it to the point that you could begin encoding Asian characters via the codecs module. I have a preliminary patch from him in a pastebin, but I'm sure he'll eventually send us a later version of this patch, and then maybe we can get this in for 2.6. |
|||
| msg7456 (view) | Author: Yuji Yamano (yyamano) | Date: 2012年09月10日.04:23:37 | |
Here is the work-in-progress patch for the svn trunk.
* Some tests don't pass yet.
* There are still some problems, but I don't remember exactly :-<
* Too many debug logs. |
|||
| msg7552 (view) | Author: Jeff Allen (jeff.allen) | Date: 2012年12月27日.14:57:00 | |
These codecs have become standard in Python 2.7, so the updated test_codecs regression test now fails (or acquires skips). Note related issue #2000. I observe that Python 2.7 has given us *codecs* for the missing Asian script encodings, but they depend on built-in modules that I assume Yuji's patch aims to provide. Is anyone competent and willing to review the patch? |
|||
| msg7557 (view) | Author: Yuji Yamano (yyamano) | Date: 2012年12月28日.01:38:12 | |
I'm working on syncing the patch with the latest jython. See https://bitbucket.org/yyamano/jython/src/89bbdf124e6b/?at=issue1066 |
|||
| msg8338 (view) | Author: Jim Baker (zyasoft) | Date: 2014年05月05日.20:18:58 | |
Yuji, what's the status of your branch to provide this functionality? Would it be possible to have this synced against Jython trunk? For such syncing, please note the bitbucket mirror is currently down and has been in that state for the last couple of months; see https://bitbucket.org/site/master/issue/9315/https-bitbucketorg-jython-jython-no-longer, so you will need to sync with hg.python.org/jython |
|||
| msg8339 (view) | Author: Jim Baker (zyasoft) | Date: 2014年05月06日.23:55:27 | |
Targeting beta 4 of 2.7; required for work on https://github.com/html5lib/html5lib-python/pull/150 |
|||
| msg8495 (view) | Author: Jim Baker (zyasoft) | Date: 2014年05月21日.23:02:38 | |
Currently working on this with the assumption we will use CoderResult for error management |
|||
| msg8628 (view) | Author: Jim Baker (zyasoft) | Date: 2014年06月12日.03:50:51 | |
I've started to make good progress, using shift_jis as a representative encoding. About 1/4 of the shift_jis tests now pass in test_codecencodings_jp, which seems to be pretty good considering this is mostly covering various error cases. Given chunking, I suspect we can keep this in Python for now, although we can revisit at a later time. |
|||
| msg8635 (view) | Author: Jim Baker (zyasoft) | Date: 2014年06月12日.21:52:42 | |
Completed patch for shift_jis: all shift_jis tests pass in test_codecencodings_jp, assuming that the following line in test_multibytecodec_support.py is changed so that it no longer uses a surrogate (not supportable in Jython unicode):
unmappedunicode = u'\ufffe'
The next step will be to register all encodings available in Java, ideally without a lot of boilerplate. |
|||
| msg8637 (view) | Author: Jeff Allen (jeff.allen) | Date: 2014年06月13日.07:07:19 | |
Congratulations on the progress. 0xfffe is a codepoint that is not a character (but it's not technically a surrogate). http://www.unicode.org/charts/PDF/UFFF0.pdf Is a unicode object a sequence of code points? Controversial area. |
|||
| msg8640 (view) | Author: Jim Baker (zyasoft) | Date: 2014年06月14日.00:46:28 | |
Fixed in http://hg.python.org/jython/rev/6c718e5e9ae9 to the extent possible by using java.nio.charset.Charset. Here are the codecs not available, more or less what Philip identified in msg3880:
euc_jis_2004
euc_jisx0213
hz
iso2022_jp_1
iso2022_jp_2004
iso2022_jp_3
iso2022_jp_ext
shift_jis_2004
hz could potentially be supported by preprocessing: it's a way of encoding GB2312 as pairs of 7-bit bytes, with escaping provided by ~{...~}. It's possible that ICU4J could help as well.
We also gain other encodings, such as cp1047, as needed by http://bugs.jython.org/issue550200, supporting EBCDIC.
The one remaining issue I see here is that there are a couple of minor corner cases around errors for trailing bytes where the input is not final. It's not clear to me what can really be done in this case, since it seems to be a property of the decoder; at the very least it's something that's picked up by our unit tests, so it's visible. |
|||
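The preprocessing idea for hz in msg8640 can be sketched as follows: strip the `~{` and `~}` escapes, set the high bit on each byte of the enclosed pairs, and hand the result to a GB2312 decoder. This is a minimal sketch in Python 3, not the approach actually committed; `hz_to_gb2312` is a hypothetical name, and the handling of `~~` and `~` followed by newline reflects my reading of the HZ format rather than anything stated in this thread.

```python
def hz_to_gb2312(data):
    """Convert HZ-encoded bytes to plain GB2312 (EUC-CN) bytes."""
    out = bytearray()
    i = 0
    gb_mode = False  # True while inside a ~{ ... ~} section
    while i < len(data):
        pair = data[i:i + 2]
        if not gb_mode and pair == b"~~":
            out.append(0x7E)          # escaped literal tilde
            i += 2
        elif not gb_mode and pair == b"~{":
            gb_mode = True            # switch to two-byte GB mode
            i += 2
        elif not gb_mode and pair == b"~\n":
            i += 2                    # line continuation, emits nothing
        elif gb_mode and pair == b"~}":
            gb_mode = False           # back to ASCII mode
            i += 2
        elif gb_mode:
            if len(pair) < 2:
                raise ValueError("truncated two-byte sequence in GB mode")
            out.append(pair[0] | 0x80)  # restore the high bit of each byte
            out.append(pair[1] | 0x80)
            i += 2
        else:
            out.append(data[i])       # ordinary ASCII byte
            i += 1
    return bytes(out)
```

For example, `hz_to_gb2312(b"~{Dc:C~}").decode("gb2312")` yields the two characters of "你好". Error reporting would need more care than shown here, since positions in the preprocessed bytes no longer match positions in the original input.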
| msg8644 (view) | Author: Jeff Allen (jeff.allen) | Date: 2014年06月14日.22:35:32 | |
I get test failures from test_email and test_email_renamed about the decoding of euc-jp. In a sense this is an improvement, since that bit of the test is skipped if there is no such codec. But now there is ...
======================================================================
FAIL: test_body_encode (email.test.test_email.TestCharset)
----------------------------------------------------------------------
Traceback (most recent call last):
File "D:\hg\jython-int\dist\Lib\email\test\test_email.py", line 2981, in test_body_encode
eq('\x1b$B5FCO;~IW\x1b(B',
File "D:\hg\jython-int\dist\Lib\email\test\test_email.py", line 2981, in test_body_encode
eq('\x1b$B5FCO;~IW\x1b(B',
AssertionError: '\x1b$B5FCO;~IW\x1b(B' != '\x1b$B5FCO;~IW'
Is this the same for you?
|
|||
| msg8805 (view) | Author: Jim Baker (zyasoft) | Date: 2014年06月23日.17:53:46 | |
Jeff, I'm seeing the same issues in test_email, but we will fix separately. Since it's in the regrtest, we see it every time it fails. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2014年06月23日 17:53:46 | zyasoft | set | messages: + msg8805 |
| 2014年06月23日 17:52:16 | zyasoft | set | status: pending -> closed |
| 2014年06月14日 22:35:33 | jeff.allen | set | messages: + msg8644 |
| 2014年06月14日 00:46:29 | zyasoft | set | status: open -> pending resolution: accepted -> fixed messages: + msg8640 |
| 2014年06月13日 07:07:19 | jeff.allen | set | messages: + msg8637 |
| 2014年06月12日 21:53:39 | zyasoft | set | files: + shift_jis.patch |
| 2014年06月12日 21:53:17 | zyasoft | set | files: - shift_jis.patch |
| 2014年06月12日 21:52:42 | zyasoft | set | messages: + msg8635 |
| 2014年06月12日 03:50:52 | zyasoft | set | files: + shift_jis.patch keywords: + patch messages: + msg8628 |
| 2014年05月21日 23:02:38 | zyasoft | set | messages: + msg8495 |
| 2014年05月07日 22:45:14 | jeff.allen | link | issue2123 dependencies |
| 2014年05月06日 23:55:28 | zyasoft | set | assignee: zyasoft resolution: accepted messages: + msg8339 |
| 2014年05月05日 20:18:58 | zyasoft | set | assignee: zyasoft -> (no value) messages: + msg8338 |
| 2013年07月03日 04:10:12 | pjenvey | link | issue2065 dependencies |
| 2013年02月20日 00:28:23 | fwierzbicki | set | versions: + Jython 2.7, - 2.5.1, 2.7a1, 2.7a2 |
| 2012年12月28日 01:38:12 | yyamano | set | messages: + msg7557 |
| 2012年12月27日 14:57:00 | jeff.allen | set | nosy: + jeff.allen messages: + msg7552 components: + Library versions: + 2.7a1, 2.7a2 |
| 2012年09月10日 04:23:38 | yyamano | set | files: + cjkcodecs-patch-20120907-1335 messages: + msg7456 |
| 2011年04月13日 20:19:59 | pjenvey | set | nosy: + yyamano messages: + msg6483 |
| 2010年11月01日 15:25:14 | zyasoft | set | messages: + msg6216 |
| 2010年10月22日 22:20:52 | zyasoft | set | priority: normal -> high messages: + msg6202 |
| 2010年09月09日 05:48:16 | zyasoft | set | priority: low -> normal messages: + msg6055 |
| 2009年08月15日 20:03:22 | cgroves | set | messages: + msg5027 |
| 2009年08月12日 07:07:21 | pjenvey | set | messages: + msg5020 |
| 2009年08月05日 16:50:21 | cgroves | set | nosy: + cgroves messages: + msg4992 |
| 2009年08月05日 14:35:19 | fwierzbicki | set | nosy: + fwierzbicki |
| 2009年03月21日 13:04:14 | zyasoft | set | priority: low |
| 2009年03月12日 08:21:29 | zyasoft | set | messages: + msg4243 versions: + 2.5.1, - 2.5alpha1 |
| 2008年12月08日 05:58:25 | pjenvey | set | messages: + msg3880 |
| 2008年12月08日 05:27:23 | pjenvey | set | messages: + msg3879 |
| 2008年10月26日 18:55:12 | zyasoft | set | assignee: zyasoft |
| 2008年10月14日 17:43:53 | zyasoft | set | title: Need CJKCodecs for CPython 2.4 -> Need CJKCodecs - multibytecodecs |
| 2008年10月14日 17:43:13 | zyasoft | set | nosy: + zyasoft |
| 2008年06月26日 00:08:07 | pjenvey | set | messages: + msg3307 |
| 2008年06月26日 00:04:03 | pjenvey | create | |