Issue 11489: json.dumps not parsable by json.loads (on Linux only)


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/55698

classification

Title:	json.dumps not parsable by json.loads (on Linux only)
Type:	behavior	Stage:	resolved
Components:	Extension Modules, Library (Lib), Unicode, Windows	Versions:	Python 3.3, Python 3.4, Python 2.7

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	serhiy.storchaka	Nosy List:	Arfrever, Brian.Merrell, belopolsky, bob.ippolito, ezio.melotti, merrellb, petri.lehtinen, pitrou, python-dev, rhettinger, serhiy.storchaka, taras.prokopenko, tchrist, vstinner
Priority:	normal	Keywords:	patch

Created on 2011-03-13 23:17 by Brian.Merrell, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
issue11489.diff	ezio.melotti, 2012-10-01 00:46	Failing test (2.7)
json_decode_lone_surrogates_2.patch	serhiy.storchaka, 2013-05-12 19:25	Patch for 3.4
json_decode_lone_surrogates_2-2.7.patch	serhiy.storchaka, 2013-05-12 19:26	Patch for 2.7
json_decode_lone_surrogates_3-3.4.patch	serhiy.storchaka, 2013-11-20 11:47	
test_json_surrogates.patch	serhiy.storchaka, 2013-12-01 11:03	

Messages (21)
msg130779	Author: Brian Merrell (Brian.Merrell)	Date: 2011-03-13 23:17
The following works on Win7x64 Python 2.6.5 and breaks on Ubuntu 10.04x64-2.6.5. This raises three issues:

1) Shouldn't anything generated by json.dumps be parsed by json.loads?
2) It appears this is an invalid unicode character. Shouldn't this be caught by decode("utf8")?
3) Why does Windows raise no issue with this and Linux does?

    import json
    unicode_bytes = '\xed\xa8\x80'
    unicode_string = unicode_bytes.decode("utf8")
    json_encoded = json.dumps({"my_key": unicode_string})
    json.loads(json_encoded)

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
        return _default_decoder.decode(s)
      File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/usr/lib/python2.6/json/decoder.py", line 336, in raw_decode
        obj, end = self._scanner.iterscan(s, **kw).next()
      File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
        rval, next_pos = action(m, context)
      File "/usr/lib/python2.6/json/decoder.py", line 183, in JSONObject
        value, end = iterscan(s, idx=end, context=context).next()
      File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
        rval, next_pos = action(m, context)
      File "/usr/lib/python2.6/json/decoder.py", line 155, in JSONString
        return scanstring(match.string, match.end(), encoding, strict)
    ValueError: Invalid \uXXXX escape: line 1 column 14 (char 14)
msg130846	Author: Alexander Belopolsky (belopolsky) * (Python committer)	Date: 2011-03-14 16:19
> It appears this is an invalid unicode character.
> Shouldn't this be caught by decode("utf8")

It should, and it is in Python 3.x:

    >>> b'\xed\xa8\x80'.decode("utf8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte

Python 2.7 behavior seems to be a bug.

    >>> '\xed\xa8\x80'.decode("utf8")
    u'\uda00'

Note also the following difference:

In 3.x:

    >>> b'\xed\xa8\x80'.decode("utf8", 'replace')
    '��'

In 2.7:

    >>> '\xed\xa8\x80'.decode("utf8", 'replace')
    u'\uda00'

I am not sure this should be fixed in 2.x. Lone surrogates seem to round-trip just fine in 2.x and there is likely to be existing code that relies on this.

> Shouldn't anything generated by json.dumps be parsed by json.loads?

This, on the other hand, should probably be fixed by either rejecting lone surrogates in json.dumps or accepting them in json.loads, or both. The last alternative would be consistent with the common wisdom of being conservative in what you produce but liberal in what you accept.
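For readers on a current interpreter, the Python 3 side of the comparison above can be reproduced with a minimal sketch. It assumes Python 3 and also shows the `surrogatepass` error handler, which is not mentioned in the message and is included here only to illustrate how the lenient 2.x result can be obtained deliberately.

```python
# Bytes that spell the lone surrogate U+DA00 in a UTF-8-like form.
data = b"\xed\xa8\x80"

# Strict decoding (the Python 3 default) rejects the sequence.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Assumption for illustration: 'surrogatepass' lets the surrogate through,
# giving the same code point that Python 2.7 produced silently.
lenient = data.decode("utf-8", errors="surrogatepass")

print(strict_ok, ascii(lenient))
```

The strict path is what makes the original report impossible to reproduce byte-for-byte on Python 3: the lone surrogate never gets into a `str` unless the caller opts in.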
msg130862	Author: Brian Merrell (Brian.Merrell)	Date: 2011-03-14 17:31
> I am not sure this should be fixed in 2.x. Lone surrogates seem to
> round-trip just fine in 2.x and there likely to be existing code that
> relies on this.

I generally agree, but am then at a loss as to how to detect and deal with lone surrogates (e.g. "ignore", "replace", etc.) in 2.x when interacting with services/libraries (such as Python's own json.loads) that take a stricter view.

>> Shouldn't anything generated by json.dumps be parsed by json.loads?

> This on the other hand should probably be fixed by either rejecting
> lone surrogates in json.dumps or accepting them in json.loads or both.
> The last alternative would be consistent with the common wisdom of
> being conservative in what you produce but liberal in what you accept.

We seem to be in the worst of both worlds right now, as I've generated and stored a lot of JSON that cannot be read back in. Could the JSON library simply leverage Python's Unicode interpreter instead of performing its own validation? We could pass it "ignore", "replace", etc. Regardless, I think we certainly need to remove the strict JSON loads() validation, especially when it isn't enforced by dumps().
msg130889	Author: Raymond Hettinger (rhettinger) * (Python committer)	Date: 2011-03-14 20:09
> We seem to be in the worst of both worlds right now
> as I've generated and stored a lot of json that can
> not be read back in

This is unfortunate. The dumps() should have never worked in the first place.

I don't think that loads() should be changed to accommodate the dumps() error though. JSON is UTF-8 by definition and it is a useful feature that invalid UTF-8 won't load.

To fix the data you've already created (data that other compliant JSON readers wouldn't be able to parse), I think you need to reprocess those files to make them valid:

    bs.decode('utf-8', errors='ignore').encode('utf-8')

Then we need to fix dumps so that it doesn't silently create invalid JSON.

> This on the other hand should probably be
> fixed by either rejecting lone surrogates
> in json.dumps or accepting them in json.loads or both.

Rejection is the right way to go. For the most part, it is never helpful to create invalid JSON files that other readers can't and shouldn't read.
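One possible way to reprocess already-stored documents is sketched below. It is not the method proposed in this thread: it assumes Python 3 with a `json.loads` that accepts lone surrogates (the behaviour committed later in this issue), and `scrub_surrogates` is a hypothetical helper name. It works on the decoded strings rather than on the raw bytes, because the stored JSON contains ASCII `\uXXXX` escapes rather than invalid UTF-8 bytes.

```python
import json
import re

# Matches any code point in the surrogate range U+D800..U+DFFF.
_SURROGATE = re.compile("[\ud800-\udfff]")


def scrub_surrogates(value, replacement="\ufffd"):
    """Hypothetical helper: recursively replace surrogate code points in strings."""
    if isinstance(value, str):
        return _SURROGATE.sub(replacement, value)
    if isinstance(value, list):
        return [scrub_surrogates(item, replacement) for item in value]
    if isinstance(value, dict):
        return {
            scrub_surrogates(key, replacement): scrub_surrogates(item, replacement)
            for key, item in value.items()
        }
    return value  # numbers, booleans and None pass through unchanged


# A stored document containing an escaped lone surrogate.
stored = '{"my_key": "a\\uda00b"}'
cleaned = scrub_surrogates(json.loads(stored))
rewritten = json.dumps(cleaned)
print(rewritten)
```

Replacing with U+FFFD mirrors the "replace" error policy; passing `replacement=""` would mirror "ignore" instead.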
msg130891	Author: Brian (merrellb)	Date: 2011-03-14 20:21
On Mon, Mar 14, 2011 at 4:09 PM, Raymond Hettinger <report@bugs.python.org> wrote:

> Raymond Hettinger <rhettinger@users.sourceforge.net> added the comment:
>
> > We seem to be in the worst of both worlds right now
> > as I've generated and stored a lot of json that can
> > not be read back in
>
> This is unfortunate. The dumps() should have never worked in the first
> place.
>
> I don't think that loads() should be changed to accommodate the dumps()
> error though. JSON is UTF-8 by definition and it is a useful feature that
> invalid UTF-8 won't load.

I may be wrong, but it appeared that json actually encoded the data as the string "\uda00" (i.e. 6 bytes), which is slightly different from the UTF-8 encoding of the JSON itself. Not sure if this is relevant, but it seems less severe than actually invalid UTF-8 coding in the bytes.

Unfortunately I don't believe this does anything on Python 2.x, as only Python 3.x encode/decode flags this as invalid.
msg133662	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2011-04-13 12:37
print(repr(json.loads(json.dumps({u"my_key": u'\uda00'}))['my_key'])):

- displays u'\uda00' in Python 2.7, 3.2 and 3.3
- raises a ValueError('Invalid \uXXXX escape: ...') on loads() in Python 2.6
- raises a ValueError('Unpaired high surrogate: ...') on loads() in Python 3.1

The json version changed in Python 2.7: see issue #4136. See also this important change in simplejson:
http://code.google.com/p/simplejson/source/detail?r=113

We only fix security bugs in Python 2.6, not bugs. I don't think that this issue is a security bug in Python 2.6. We might change Python 3.1 behaviour.
msg144646	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2011-09-29 22:22
RFC 4627 doesn't say much about lone surrogates:

    A string is a sequence of zero or more Unicode characters [UNICODE].
    [...]
    All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

    Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A though F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".
    [...]
    To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

Raymond> JSON is UTF-8 by definition and it is a useful feature that invalid UTF-8 won't load.

Even if the input strings are not encodable in UTF-8 because they contain lone surrogates, they can still be converted to an \uXXXX escape, and the resulting JSON document will be valid UTF-8. AFAIK json always uses \uXXXX, so it doesn't produce invalid UTF-8 documents.

While decoding, both json.loads('"\xed\xa0\x80"') and json.loads('"\ud800"') result in u'\ud800', but the first is not a valid UTF-8 document because it contains an invalid UTF-8 byte sequence that represents a lone surrogate, whereas the second one contains only ASCII bytes and is therefore valid. Python 2.7 should probably reject '"\xed\xa0\x80"', but since its UTF-8 codec is somewhat permissive already, I'm not sure it makes much sense changing the behavior now.

Python 3 doesn't have this problem because it works only with unicode strings, so you can't pass invalid UTF-8 byte sequences. OTOH the Unicode standard says that lone surrogates shouldn't be passed around, so we might decide to replace them with the replacement char U+FFFD, raise an error, or even provide a way to decide what should be done with them (something like the errors argument of codecs).
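The point that an escaped lone surrogate still yields a valid, ASCII-only document, and the RFC's G clef example, can be checked with a short sketch. It assumes Python 3 and the default `ensure_ascii=True`; the variable names are illustrative only.

```python
import json

# A lone surrogate is written as a six-character \uXXXX escape,
# so the resulting document contains only ASCII bytes.
doc = json.dumps("\ud800")
doc_bytes = doc.encode("ascii")  # would raise if any non-ASCII character remained

# The RFC 4627 example: an escaped surrogate pair decodes to one non-BMP character.
clef = json.loads('"\\uD834\\uDD1E"')

# Going the other way, a non-BMP character is emitted as an escaped pair.
clef_doc = json.dumps("\U0001d11e")

print(doc, ascii(clef), clef_doc)
```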
msg169263	Author: Petri Lehtinen (petri.lehtinen) * (Python committer)	Date: 2012-08-28 10:03
Bear in mind that Douglas Crockford thinks a JSON document is valid even if it contains unpaired surrogates:
http://tech.groups.yahoo.com/group/json/message/1603
http://tech.groups.yahoo.com/group/json/message/1583

It's Unicode that considers unpaired surrogates invalid, not UTF-8 by itself.
msg169283	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012-08-28 13:58
> It's Unicode that considers unpaired surrogates invalid, not UTF-8 by itself.

It's UTF-8 too. See RFC 3629:

    The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters. When encoding in UTF-8 from UTF-16 data, it is necessary to first decode the UTF-16 data to obtain character numbers, which are then encoded in UTF-8 as described above.
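Python 3's own UTF-8 codec follows the RFC 3629 rule quoted above, which a minimal sketch can show. The `surrogatepass` error handler used for contrast is an addition for illustration and is not discussed in the thread.

```python
lone = "\ud800"  # a str holding a single lone high surrogate

# The strict UTF-8 encoder refuses code points in U+D800..U+DFFF.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Assumption for illustration: 'surrogatepass' emits the three-byte form
# that RFC 3629 prohibits, i.e. the bytes Python 2's codec accepted.
passed_through = lone.encode("utf-8", errors="surrogatepass")

print(strict_ok, passed_through)
```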
msg171684	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2012-10-01 00:46
Attached failing test.
msg174484	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012-11-01 22:01
About the patch: I think "with" is unnecessary here. A one-line self.assertRaises(UnicodeEncodeError, self.dumps, ch) looks better to me.
msg188867	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2013-05-10 19:25
I forgot about this issue and opened a new one, issue17906. There is a patch for it. Simplejson has accepted it in https://github.com/simplejson/simplejson/issues/62.

RFC 4627 does not make exceptions for the range 0xD800-0xDFFF (unescaped = %x20-21 / %x23-5B / %x5D-10FFFF), and the decoder must accept lone surrogates, both escaped and unescaped. Non-BMP characters may be represented as an escaped surrogate pair, so an escaped surrogate pair may be decoded as a non-BMP character, while an unescaped surrogate pair shouldn't.
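The four decoder cases described above (lone or paired, escaped or unescaped) can be exercised with a short sketch. It assumes a Python 3 interpreter that includes the decoder change committed for this issue; the variable names are illustrative.

```python
import json

# Escaped lone surrogate: accepted and returned as a single surrogate code point.
lone_escaped = json.loads('"\\ud834"')

# Unescaped lone surrogate already present in the input str: also accepted.
lone_raw = json.loads('"\ud834"')

# Escaped surrogate pair: combined into one non-BMP character.
pair_escaped = json.loads('"\\ud834\\udd20"')

# Unescaped surrogate pair: left as two separate code points, not combined.
pair_raw = json.loads('"\ud834\udd20"')

print(ascii(lone_escaped), ascii(lone_raw), ascii(pair_escaped), ascii(pair_raw))
```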
msg189055	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2013-05-12 19:25
Here are updated patches from issue17906. I updated the tests, fixed a bug reported by Bob Ippolito in msg188857, and fixed an inconsistency noted by Ezio Melotti on Rietveld (the Python implementation now raises the same exception as the C implementation on an illegal hexadecimal escape).
msg200071	Author: Taras Prokopenko (taras.prokopenko)	Date: 2013-10-16 19:05
You should use the ensure_ascii=False option to json.dumps, i.e.

    >>> import json
    >>> unicode_bytes = '\xed\xa8\x80'
    >>> unicode_string = unicode_bytes.decode("utf8")
    >>> json_encoded = json.dumps(unicode_string, ensure_ascii=False)
    >>> json.loads(json_encoded), unicode_string
    (u'\uda00', u'\uda00')
    >>> cmp(json.loads(json_encoded), unicode_string)
    0
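A rough Python 3 counterpart of the session above is sketched here, under the assumption of an interpreter that includes the fix for this issue. It also shows a trade-off the message does not mention: with `ensure_ascii=False` the surrogate is written unescaped, so the document round-trips in memory but cannot be encoded as strict UTF-8.

```python
import json

text = "\uda00"

# Default behaviour: the surrogate is written as an ASCII \uXXXX escape.
escaped = json.dumps(text)

# With ensure_ascii=False the surrogate code point is emitted as-is.
raw = json.dumps(text, ensure_ascii=False)

# Both forms load back to the original string.
round_trip_escaped = json.loads(escaped)
round_trip_raw = json.loads(raw)

# Only the escaped form can be stored as strict UTF-8.
try:
    raw.encode("utf-8")
    raw_encodable = True
except UnicodeEncodeError:
    raw_encodable = False

print(escaped, raw_encodable)
```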
msg203470	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2013-11-20 11:47
If there are no objections I'll commit this patch soon.
msg204516	Author: Roundup Robot (python-dev) (Python triager)	Date: 2013-11-26 19:33
New changeset c85305a54e6d by Serhiy Storchaka in branch '2.7':
Issue #11489: JSON decoder now accepts lone surrogates.
http://hg.python.org/cpython/rev/c85305a54e6d

New changeset 8abbdbe86c01 by Serhiy Storchaka in branch '3.3':
Issue #11489: JSON decoder now accepts lone surrogates.
http://hg.python.org/cpython/rev/8abbdbe86c01

New changeset 5f7326ed850f by Serhiy Storchaka in branch 'default':
Issue #11489: JSON decoder now accepts lone surrogates.
http://hg.python.org/cpython/rev/5f7326ed850f
msg204882	Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) * (Python triager)	Date: 2013-12-01 05:13
New tests fail on the 2.7 branch, at least with Python configured with --enable-unicode=ucs4 (which is the default in Gentoo):

======================================================================
FAIL: test_surrogates (json.tests.test_scanstring.TestCScanstring)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/tmp/portage/dev-lang/python-2.7.7_pre20131201/work/python-2.7.7_pre20131201/Lib/json/tests/test_scanstring.py", line 107, in test_surrogates
    assertScan(u'"z\\ud834\udd20x12345"', u'z\ud834\udd20x12345')
  File "/var/tmp/portage/dev-lang/python-2.7.7_pre20131201/work/python-2.7.7_pre20131201/Lib/json/tests/test_scanstring.py", line 97, in assertScan
    (expect, len(given)))
AssertionError: Tuples differ: (u'z\ud834\udd20x12345', 16) != (u'z\U0001d120x12345', 16)

First differing element 0:
z\ud834\udd20x12345
z\U0001d120x12345

- (u'z\ud834\udd20x12345', 16)
+ (u'z\U0001d120x12345', 16)

======================================================================
FAIL: test_surrogates (json.tests.test_scanstring.TestPyScanstring)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/tmp/portage/dev-lang/python-2.7.7_pre20131201/work/python-2.7.7_pre20131201/Lib/json/tests/test_scanstring.py", line 107, in test_surrogates
    assertScan(u'"z\\ud834\udd20x12345"', u'z\ud834\udd20x12345')
  File "/var/tmp/portage/dev-lang/python-2.7.7_pre20131201/work/python-2.7.7_pre20131201/Lib/json/tests/test_scanstring.py", line 97, in assertScan
    (expect, len(given)))
AssertionError: Tuples differ: (u'z\ud834\udd20x12345', 16) != (u'z\U0001d120x12345', 16)

First differing element 0:
z\ud834\udd20x12345
z\U0001d120x12345

- (u'z\ud834\udd20x12345', 16)
+ (u'z\U0001d120x12345', 16)

----------------------------------------------------------------------
msg204883	Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) * (Python triager)	Date: 2013-12-01 05:16
... when code is loaded from .pyc files (i.e. when `make test` runs tests the second time).
msg204909	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2013-12-01 11:03
Thank you Arfrever. Does this patch fix the test?
msg204927	Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) * (Python triager)	Date: 2013-12-01 14:40
test_json_surrogates.patch fixes these tests.
msg204936	Author: Roundup Robot (python-dev) (Python triager)	Date: 2013-12-01 15:31
New changeset 02d186e3af09 by Serhiy Storchaka in branch '2.7':
Fixed JSON tests on wide build when ran from *.pyc files (issue #11489).
http://hg.python.org/cpython/rev/02d186e3af09

History
Date	User	Action	Args
2022-04-11 14:57:14	admin	set	github: 55698
2013-12-01 15:42:22	serhiy.storchaka	set	status: open -> closed resolution: fixed stage: resolved
2013-12-01 15:31:24	python-dev	set	messages: + msg204936
2013-12-01 14:40:50	Arfrever	set	messages: + msg204927
2013-12-01 11:03:16	serhiy.storchaka	set	files: + test_json_surrogates.patch messages: + msg204909
2013-12-01 05:16:20	Arfrever	set	messages: + msg204883
2013-12-01 05:13:44	Arfrever	set	status: closed -> open nosy: + Arfrever messages: + msg204882 resolution: fixed -> (no value) stage: resolved -> (no value)
2013-11-26 19:40:18	serhiy.storchaka	set	status: open -> closed resolution: fixed stage: patch review -> resolved
2013-11-26 19:33:43	python-dev	set	nosy: + python-dev messages: + msg204516
2013-11-20 11:47:04	serhiy.storchaka	set	files: + json_decode_lone_surrogates_3-3.4.patch assignee: serhiy.storchaka messages: + msg203470
2013-10-16 19:05:33	taras.prokopenko	set	nosy: + taras.prokopenko messages: + msg200071
2013-05-21 05:36:50	rhettinger	set	assignee: rhettinger -> (no value)
2013-05-12 19:27:13	serhiy.storchaka	set	components: + Extension Modules stage: needs patch -> patch review
2013-05-12 19:26:33	serhiy.storchaka	set	files: + json_decode_lone_surrogates_2-2.7.patch
2013-05-12 19:25:41	serhiy.storchaka	set	files: + json_decode_lone_surrogates_2.patch messages: + msg189055
2013-05-10 19:25:35	serhiy.storchaka	link	issue17906 superseder
2013-05-10 19:25:21	serhiy.storchaka	set	nosy: + bob.ippolito messages: + msg188867 versions: + Python 3.3, Python 3.4
2012-11-01 22:01:26	serhiy.storchaka	set	messages: + msg174484 stage: needs patch
2012-10-01 00:46:24	ezio.melotti	set	files: + issue11489.diff keywords: + patch messages: + msg171684
2012-08-28 13:58:29	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg169283
2012-08-28 10:03:58	petri.lehtinen	set	nosy: + petri.lehtinen messages: + msg169263
2011-10-09 23:24:10	rhettinger	set	priority: high -> normal assignee: rhettinger
2011-10-09 23:20:43	ezio.melotti	set	nosy: + pitrou, tchrist versions: - Python 2.6
2011-09-29 22:22:22	ezio.melotti	set	messages: + msg144646
2011-04-13 12:37:44	vstinner	set	messages: + msg133662
2011-04-13 08:34:13	ezio.melotti	set	nosy: + ezio.melotti
2011-04-13 08:30:49	ezio.melotti	set	files: - unnamed
2011-03-14 20:21:05	merrellb	set	files: + unnamed messages: + msg130891 nosy: + merrellb
2011-03-14 20:09:35	rhettinger	set	priority: normal -> high nosy: + rhettinger messages: + msg130889
2011-03-14 17:31:41	Brian.Merrell	set	nosy: belopolsky, vstinner, Brian.Merrell messages: + msg130862
2011-03-14 16:19:06	belopolsky	set	nosy: + vstinner, belopolsky messages: + msg130846 versions: + Python 2.7
2011-03-13 23:17:19	Brian.Merrell	create
