Issue 14850: The inconsistency of codecs.charmap_decode

➜

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/59055

classification

Title:	The inconsistency of codecs.charmap_decode
Type:	behavior	Stage:	resolved
Components:	Interpreter Core, Unicode	Versions:	Python 3.2, Python 3.3, Python 3.4, Python 2.7

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	serhiy.storchaka	Nosy List:	doerwalter, eric.araujo, ezio.melotti, lemburg, loewis, python-dev, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	needs review, patch

Created on 2012年05月18日 14:46 by serhiy.storchaka, last changed 2022年04月11日 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
decode_charmap_fffe-3.3.patch	serhiy.storchaka, 2012年10月02日 16:17	Patch for 3.3 and 3.4	review
decode_charmap_fffe-3.2.patch	serhiy.storchaka, 2012年10月02日 16:17	Patch for 3.2	review
decode_charmap_fffe-2.7.patch	serhiy.storchaka, 2012年10月02日 16:17	Patch for 2.7	review

Messages (14)
msg161054 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012年05月18日 14:46
codecs.charmap_decode behaves differently with native and user string as decode table. >>> import codecs >>> print(ascii(codecs.charmap_decode(b'\x00', 'replace', '\uFFFE'))) ('\ufffd', 1) >>> class S(str): pass ... >>> print(ascii(codecs.charmap_decode(b'\x00', 'replace', S('\uFFFE')))) ('\ufffe', 1) It's because charmap decoder (function PyUnicode_DecodeCharmap in Objects/unicodeobject.c) uses different algorithms for exact strings and for other. We need to fix it? If yes, what should return `codecs.charmap_decode(b'\x00', 'replace', {0:'\uFFFE'})`? What should return `codecs.charmap_decode(b'\x00', 'replace', {0:0xFFFE})`?
msg162574 - (view)	Author: Éric Araujo (eric.araujo) * (Python committer)	Date: 2012年06月10日 01:34
What is the use case for passing a string subclass to charmap_decode? Or in other words, how did you stumble upon the bug?
msg162580 - (view)	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2012年06月10日 07:59
U+FFFE is documented as representing an undefined mapping, see http://docs.python.org/dev/c-api/unicode.html?highlight=charmap#PyUnicode_DecodeCharmap So the base string case is correct; the derived string implementation also needs to invoke the error handler.
msg162582 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012年06月10日 10:51
> What is the use case for passing a string subclass to charmap_decode? Or in other words, how did you stumble upon the bug? I stumbled upon it, rewriting the charmap decoder (issue14874). Now charmap decoder processes the two cases -- a more effective case of string table and a general slower case of general mapping. I proposed a more optimized case of 256-character UCS2 string (covers all standard charmap encodings). If processing general strings and maps was consistent, these cases can be merged. A string subclass is just an example that illustrates the inconsistency.
msg162583 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012年06月10日 11:00
> U+FFFE is documented as representing an undefined mapping, Yes, using U+FFFE for representing an undefined mapping in strings is normal, the question was about string subclasses. And if we will correct it for string subclasses, how far we go any further? How about general mapping?
msg162584 - (view)	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2012年06月10日 11:26
>> U+FFFE is documented as representing an undefined mapping, > > Yes, using U+FFFE for representing an undefined mapping in strings is > normal, the question was about string subclasses. What is the question? U+FFFE also represents an undefined mapping in string subclasses. > And if we will correct it for string subclasses, how far we go any > further? This is a single issue, a single bug. If the bug is fixed, it is fixed. No need to go further (unless there is another bug somewhere).
msg162588 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012年06月10日 15:59
> What is the question? U+FFFE also represents an undefined mapping in > string subclasses. What about classes that not subclassed string but ducktyped string by implementing all string method? What about list/tuple/array.array of integers or 1-character strings? What about general mapping? Should any of them have 0xFFFE or '\uFFFE' represent an undefined mapping? > This is a single issue, a single bug. If the bug is fixed, it is fixed. > No need to go further (unless there is another bug somewhere). My question, where is the limit of this bug.
msg162589 - (view)	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2012年06月10日 17:07
> integers or 1-character strings? What about general mapping? Should > any of them have 0xFFFE or '\uFFFE' represent an undefined mapping? The documentation says that the parameter "can be a dictionary mapping byte or a unicode string, which is treated as a lookup table". So anything that supports GetItem with a small integer index can be passed. It then says '... U+FFFE "characters" are treated as "undefined mapping"'. So the answer to your last question is "yes". I hope that the answer to your other questions follows from that (strictly speaking, it's only U+FFFE, not 0xFFFE, that is documented as indicating an undefined mapping; a patch should probably fix that). (I also wonder where the support for LookupError comes from - that appears to be undocumented)
msg162594 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012年06月10日 18:51
> So the answer to your last question is "yes". I hope that the answer to > your other questions follows from that Thank you, this is the answer to all my questions. I've prepared a patch to treat U+FFFE in general mapping as "undefined mapping". > (strictly speaking, it's only > U+FFFE, not 0xFFFE, that is documented as indicating an undefined > mapping; a patch should probably fix that). As both integer 0xXXXX and string '\uXXXX' denote U+XXXX, I do not think it necessary fixes. > (I also wonder where the support for LookupError comes from - that > appears to be undocumented) I believe, this is what is meant by the words "undefined mapping".
msg171813 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012年10月02日 16:17
Patch updated to resolve conflict with issue15379. Added tests. Added patches for 3.2 and 2.7.
msg173357 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012年10月19日 19:10
Does anyone have objections against the idea or the implementation of the patch? Please review.
msg178321 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012年12月27日 20:36
I no one objects I will commit this next year.
msg180017 - (view)	Author: Roundup Robot (python-dev) (Python triager)	Date: 2013年01月15日 13:40
New changeset 33a8ef498b1e by Serhiy Storchaka in branch '2.7': Issue #14850: Now a chamap decoder treates U+FFFE as "undefined mapping" http://hg.python.org/cpython/rev/33a8ef498b1e New changeset 13cd78a2a17b by Serhiy Storchaka in branch '3.2': Issue #14850: Now a chamap decoder treates U+FFFE as "undefined mapping" http://hg.python.org/cpython/rev/13cd78a2a17b New changeset 6ac4f1609847 by Serhiy Storchaka in branch '3.3': Issue #14850: Now a chamap decoder treates U+FFFE as "undefined mapping" http://hg.python.org/cpython/rev/6ac4f1609847 New changeset 03e22cc9407a by Serhiy Storchaka in branch 'default': Issue #14850: Now a chamap decoder treates U+FFFE as "undefined mapping" http://hg.python.org/cpython/rev/03e22cc9407a
msg180019 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2013年01月15日 14:08
Fixed. Thank you for your answers, Martin.

History
Date	User	Action	Args
2022年04月11日 14:57:30	admin	set	github: 59055
2013年01月15日 14:09:39	serhiy.storchaka	set	status: open -> closed resolution: fixed stage: patch review -> resolved
2013年01月15日 14:08:56	serhiy.storchaka	set	messages: + msg180019
2013年01月15日 13:40:02	python-dev	set	nosy: + python-dev messages: + msg180017
2012年12月27日 20:48:27	serhiy.storchaka	set	assignee: serhiy.storchaka
2012年12月27日 20:36:51	serhiy.storchaka	set	messages: + msg178321
2012年10月24日 09:05:58	serhiy.storchaka	set	stage: patch review
2012年10月19日 23:38:00	pitrou	set	nosy: + vstinner
2012年10月19日 19:10:20	serhiy.storchaka	set	messages: + msg173357
2012年10月02日 16:20:25	serhiy.storchaka	set	keywords: + needs review components: + Unicode versions: + Python 3.4
2012年10月02日 16:17:31	serhiy.storchaka	set	files: + decode_charmap_fffe-3.3.patch, decode_charmap_fffe-3.2.patch, decode_charmap_fffe-2.7.patch messages: + msg171813
2012年10月02日 15:05:15	serhiy.storchaka	set	files: - decode_charmap_fffe.patch
2012年06月16日 16:31:55	pitrou	set	nosy: + ezio.melotti
2012年06月10日 18:51:28	serhiy.storchaka	set	files: + decode_charmap_fffe.patch keywords: + patch messages: + msg162594
2012年06月10日 17:07:59	loewis	set	messages: + msg162589
2012年06月10日 15:59:14	serhiy.storchaka	set	messages: + msg162588
2012年06月10日 11:26:39	loewis	set	messages: + msg162584
2012年06月10日 11:00:13	serhiy.storchaka	set	messages: + msg162583
2012年06月10日 10:51:16	serhiy.storchaka	set	messages: + msg162582
2012年06月10日 07:59:24	loewis	set	messages: + msg162580
2012年06月10日 01:34:04	eric.araujo	set	nosy: + eric.araujo messages: + msg162574
2012年05月18日 20:59:10	terry.reedy	set	nosy: + lemburg, doerwalter
2012年05月18日 14:53:23	pitrou	set	nosy: + loewis
2012年05月18日 14:46:35	serhiy.storchaka	create

homepage