Message 242457 - Python tracker

Author	malin
Recipients	ezio.melotti, malin, vstinner
Date	2015-05-03.08:16:58
Message-id	<1430641019.11.0.12743813091.issue24117@psf.upfronthosting.co.za>

Content

Hi,
There is a small bug in the GB18030 decoder.
For a 4-byte sequence, the legal ranges are:
0x81-0xFE for the 1st byte
0x30-0x39 for the 2nd byte
0x81-0xFE for the 3rd byte
0x30-0x39 for the 4th byte
The current code does not check the upper bound (0xFE) of the 1st and 3rd bytes.
As a result, 8630 illegal 4-byte sequences can be decoded by the GB18030 codec. Here is an example:
# The legal sequence 0x81318130 is decoded to U+060A, which is fine.
b = bytes([0x81, 0x31, 0x81, 0x30])
uchar = b.decode('gb18030')
print(ord(uchar))
# The illegal sequence 0x8130FF30 is decoded to U+060A as well.
b = bytes([0x81, 0x30, 0xFF, 0x30])
uchar = b.decode('gb18030')
print(ord(uchar))
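
To illustrate why the two sequences collide, here is a minimal sketch. It assumes the decoder turns a 4-byte sequence into a linear offset using the usual GB18030 arithmetic (126 values for the 1st and 3rd bytes, 10 values for the 2nd and 4th). The helper names is_legal_4byte and linear_offset are made up for this illustration and are not taken from the CPython source.

```python
def is_legal_4byte(seq: bytes) -> bool:
    """Check a sequence against the 4-byte ranges listed in this message."""
    if len(seq) != 4:
        return False
    b1, b2, b3, b4 = seq
    return (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39)


def linear_offset(seq: bytes) -> int:
    """Linear offset of a 4-byte sequence, computed with no range checks.

    Assumption: this mirrors the usual GB18030 offset arithmetic; it is
    not copied from the actual codec implementation.
    """
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


legal = bytes([0x81, 0x31, 0x81, 0x30])
illegal = bytes([0x81, 0x30, 0xFF, 0x30])

# Only the first sequence is inside the legal ranges.
print(is_legal_4byte(legal), is_legal_4byte(illegal))
# Without the upper-bound check, both map to the same offset.
print(linear_offset(legal), linear_offset(illegal))
```

Under that assumption, a 3rd byte of 0xFF overflows into the next value of the 2nd byte, so a decoder that skips the 0xFE check cannot tell the two sequences apart.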

History
Date	User	Action	Args
2015-05-03 08:16:59	malin	set	recipients: + malin, vstinner, ezio.melotti
2015-05-03 08:16:59	malin	set	messageid: <1430641019.11.0.12743813091.issue24117@psf.upfronthosting.co.za>
2015-05-03 08:16:59	malin	link	issue24117 messages
2015-05-03 08:16:58	malin	create