Issue 24117: Wrong range checking in GB18030 decoder.

➜

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/68305

classification

Title:	Wrong range checking in GB18030 decoder.
Type:	behavior	Stage:	resolved
Components:	Unicode	Versions:	Python 3.6, Python 3.5, Python 2.7

process

Dependencies:	Superseder:
Status:	closed	Resolution:
Assigned To:	Nosy List:	Artoria2e5, ezio.melotti, hyeshik.chang, lemburg, loewis, malin, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	patch

Created on 2015年05月03日 08:16 by malin, last changed 2022年04月11日 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
forpy27.patch	malin, 2015年05月08日 03:25	for python 2.7 e8783c581928
forpy34.patch	malin, 2015年05月08日 03:27	for python 3.4 c79fd57f86b6	review
forpy35.patch	malin, 2015年05月08日 03:27	for python 3.5 1ea01f2c5b10	review

Messages (14)
msg242457 - (view)	Author: Ma Lin (malin) *	Date: 2015年05月03日 08:16
Hi, There is a small bug in GB18030 decoder. For 4-byte sequence, the legal range is: 0x81-0xFE for the 1st byte 0x30-0x39 for the 2nd byte 0x81-0xFE for the 3rd byte 0x30-0x39 for the 4th byte The current code forgets to check 0xFE of the 1st and 3rd byte. Therefore, there are 8630 illegal 4-byte sequences can be decoded by GB18030 codec, here is an example: # legal sequence 0x81319130 is decoded to U+060A, it's fine. b = bytes([0x81, 0x31, 0x81, 0x30]) uchar = b.decode('gb18030') print(ord(uchar)) # illegal sequence 0x8130FF30 can be decoded to U+060A as well. b = bytes([0x81, 0x30, 0xFF, 0x30]) uchar = b.decode('gb18030') print(ord(uchar))
msg242462 - (view)	Author: Marc-Andre Lemburg (lemburg) * (Python committer)	Date: 2015年05月03日 11:13
Adding Hye-Shik who wrote the codec.
msg242586 - (view)	Author: Ma Lin (malin) *	Date: 2015年05月05日 00:52
I found another bug in hz codec. hz encoding uses 7-bit ASCII to represent Chinese characters, it was popular in USENET networks in the late 1980s and early 1990s. I will do more check and fix them together, then I will invite you to review the patch. u = 'hi~python' b = u.encode('hz') # bug in this step, the right sequence should be b"hi~~python" print(b) # the output is b"hi~python" u = b.decode('hz') # so can't decode, UnicodeDecodeError raised print(u)
msg242744 - (view)	Author: Ma Lin (malin) *	Date: 2015年05月08日 03:25
I examined all Chinese codecs, here are the patches, please review them, feel free to ask me your question. Thanks to Hye-Shik, your framework is very easy to understand :)
msg242745 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2015年05月08日 03:59
Do you have authoritative links that describe these standards?
msg242747 - (view)	Author: Ma Lin (malin) *	Date: 2015年05月08日 05:38
Good question. GB2312: I tested those programming languages one by one. GBK/CP936/GB18030-2000: I gathered data via Internet as much as I can, then compare them to Python3's codecs. I check key points with authoritative source, and verify every appeared conflicts. The data come from ICU, Unicode.org, IBM, Chinese researchers, and data found by Google. I had spent about half-month to do this, not just started from several days ago. I hope those descriptions will help late comers.
msg243467 - (view)	Author: Ma Lin (malin) *	Date: 2015年05月18日 09:07
>> I examined all Chinese codecs I said it above, but I forgot Taiwan and HongKong are using Chinese as well. BIG5 and CP950 are using a wrong convert table, test this: >>> u = b'\xC6\xA1'.decode('big5') >>> hex(ord(u)) '0x30fe' This should not happen, 0xC6A1 is neither in BIG5 nor in CP950. In BIG5-2003 and HKSCS-2008, 0xC6A1 is mapped to U+2460. I only had a look roughly, please check more. I won't check HongKong codec anymore, I suggest check it as well.
msg243468 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2015年05月18日 09:17
> The data come from ICU, Unicode.org, IBM, If you could provide links to the relevant pages/section we can verify that the codecs are indeed incorrect. Also keep in mind that there might people relying on these incorrectness, so we have to be careful while changing them. This is especially true if there are de-facto standards that diverge from the actual standards.
msg243555 - (view)	Author: Ma Lin (malin) *	Date: 2015年05月19日 04:30
This is not a de-facto standard, it should be fixed. I already posted this infomation on a Taiwan Python community, let's wait their inspection.
msg243559 - (view)	Author: Ma Lin (malin) *	Date: 2015年05月19日 06:28
>> If you could provide links to the relevant pages/section we can verify that the codecs are indeed incorrect. Here is CP950, 0xC6A1 is not in it. https://msdn.microsoft.com/zh-cn/goglobal/cc305155 I can provide one link, but there are many variants of BIG5 convert table on the Interenet, so one link doesn't bring persuasion. In this page: https://moztw.org/docs/big5/ Listed many variants of BIG5 tables, I found 0xC6A1<->U+30FE in this table "Unicode 1.1", the description of it is "it's a terrible table, many errors exist, sadlly many foreigners are using it", but IIRC Python's BIG5 codec is not fully same as that table. IMO, the most reliable way is reading a lot of stuff, and verify the key points and conflicts with authoritative source, but this way is very hard for foreigners. Anyway, let's wait Taiwanese and their opinion for whether this should be fixed.
msg257335 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2016年01月02日 08:20
Did you hear anything back from them?
msg257339 - (view)	Author: Ma Lin (malin) *	Date: 2016年01月02日 09:22
I posted in a Taiwanese forum: https://groups.google.com/forum/#!forum/pythontw no reply yet.
msg280812 - (view)	Author: Mingye Wang (Artoria2e5) *	Date: 2016年11月14日 20:25
Just FYI, cp950 0xC6A1 (\uf6b1) is found in current WindowsBestFit: ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
msg291316 - (view)	Author: Ma Lin (malin) *	Date: 2017年04月08日 04:13
I closed this issue, because it involved too many things. 1, for GB18030 decoder bug, see issue29990. 2, for hz encoder bug, see issue30003. 3, for problem in Traditional Chinese codecs, please create a new issue.

History
Date	User	Action	Args
2022年04月11日 14:58:16	admin	set	github: 68305
2017年04月08日 04:13:43	malin	set	messages: + msg291316
2017年04月05日 03:57:36	malin	set	status: open -> closed stage: patch review -> resolved
2016年11月14日 20:25:37	Artoria2e5	set	nosy: + Artoria2e5 messages: + msg280812
2016年01月02日 09:22:31	malin	set	messages: + msg257339
2016年01月02日 08:20:33	ezio.melotti	set	messages: + msg257335 versions: + Python 3.6, - Python 3.4
2015年05月19日 06:28:24	malin	set	messages: + msg243559
2015年05月19日 04:30:13	malin	set	messages: + msg243555
2015年05月18日 09:17:52	ezio.melotti	set	messages: + msg243468
2015年05月18日 09:07:57	malin	set	messages: + msg243467
2015年05月08日 05:38:46	malin	set	messages: + msg242747
2015年05月08日 03:59:52	ezio.melotti	set	messages: + msg242745
2015年05月08日 03:27:50	malin	set	files: + forpy35.patch
2015年05月08日 03:27:17	malin	set	files: + forpy34.patch
2015年05月08日 03:26:01	malin	set	files: - forpy27.patch
2015年05月08日 03:25:58	malin	set	files: - forpy3.patch
2015年05月08日 03:25:37	malin	set	files: + forpy27.patch messages: + msg242744
2015年05月05日 00:52:20	malin	set	messages: + msg242586
2015年05月03日 11:13:00	lemburg	set	nosy: + hyeshik.chang messages: + msg242462
2015年05月03日 08:47:40	serhiy.storchaka	set	nosy: + lemburg, loewis, serhiy.storchaka stage: patch review
2015年05月03日 08:31:41	malin	set	title: A small bug in GB18030 decoder. -> Wrong range checking in GB18030 decoder.
2015年05月03日 08:17:37	malin	set	files: + forpy3.patch type: behavior
2015年05月03日 08:16:59	malin	create

homepage