This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: JSON should accept lone surrogates
Type: behavior Stage: patch review
Components: Extension Modules, Library (Lib), Unicode Versions: Python 3.3, Python 3.4, Python 2.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder: json.dumps not parsable by json.loads (on Linux only)
View: 11489
Assigned To: serhiy.storchaka Nosy List: bob.ippolito, ezio.melotti, pitrou, rhettinger, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2013-05-04 14:38 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
json_decode_lone_surrogates.patch serhiy.storchaka, 2013-05-05 11:45 Patch for 3.3 and 3.4
json_decode_lone_surrogates-2.7.patch serhiy.storchaka, 2013-05-05 11:45 Patch for 2.7
Messages (7)
msg188364 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-04 14:38
Inspired by simplejson issue [1], which also applies to the standard json module. The JSON parser in 3.3+ and in wide builds of 3.2 and earlier raises an error on invalid strings (i.e. those with an unpaired surrogate), while narrow builds and some third-party parsers accept them. Wide builds are right: such JSON data is invalid. However, it would be good to optionally be more permissive with input data. Otherwise it is not easy to process such invalid data.
I propose to add an "error" parameter to the JSON decoder and encoder with the same meaning as in string decoding/encoding. "strict" is the default, and "surrogatepass" corresponds to narrow builds (and non-strict third-party parsers).
[1] https://github.com/simplejson/simplejson/issues/62 
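For readers unfamiliar with the string error handlers the proposal refers to, here is a small sketch of how "strict" and "surrogatepass" differ in str.encode and bytes.decode. It only illustrates the existing codec behavior; it is not the proposed json parameter, and the helper name encode_with is made up for this example.

```python
# Illustration only: the codec error handlers the proposal alludes to.
# `encode_with` is a hypothetical helper, not part of the json module.
lone = '\ud8e9'  # a lone (unpaired) high surrogate


def encode_with(errors):
    """Encode the lone surrogate as UTF-8 with the given error handler."""
    try:
        return lone.encode('utf-8', errors)
    except UnicodeEncodeError:
        return None


strict_result = encode_with('strict')             # strict refuses surrogates
permissive_result = encode_with('surrogatepass')  # passes them through

# Decoding the permissive bytes needs the same handler to round-trip.
roundtrip = permissive_result.decode('utf-8', 'surrogatepass')
```

With "strict" the encode call fails, while "surrogatepass" produces bytes that round-trip back to the same lone surrogate.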
msg188374 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-05-04 15:59
I wonder if json should simply be less strict by default. If you pass the raw unescaped character, the json module accepts it:
>>> json.loads('{"a": "\ud8e9"}')
{'a': '\ud8e9'}
It's only if you pass the escaped representation that json rejects it:
>>> json.loads('{"a": "\\ud8e9"}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/default/Lib/json/__init__.py", line 316, in loads
    return _default_decoder.decode(s)
  File "/home/antoine/cpython/default/Lib/json/decoder.py", line 344, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/antoine/cpython/default/Lib/json/decoder.py", line 360, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Unpaired high surrogate: line 1 column 9 (char 8)
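The two inputs above differ only in whether the surrogate reaches the parser as a raw character or as a six-character \uXXXX escape. A hedged sketch for comparing them on whatever interpreter is at hand follows; the helper try_loads is made up for this example, and the outcome for the escaped form depends on the Python version (interpreters that include the later fix are expected to accept it).

```python
import json

# The Python literal '\ud8e9' puts the surrogate character itself in the
# JSON text; '\\ud8e9' puts the six characters \ud8e9 in the JSON text.
RAW = '{"a": "\ud8e9"}'
ESCAPED = '{"a": "\\ud8e9"}'


def try_loads(text):
    """Return the parsed object, or the ValueError raised by json.loads."""
    try:
        return json.loads(text)
    except ValueError as exc:
        return exc


raw_result = try_loads(RAW)
escaped_result = try_loads(ESCAPED)
```

Use ascii(raw_result) rather than print(raw_result) when inspecting the values, since writing a lone surrogate to a UTF-8 terminal can itself raise an encoding error.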
msg188375 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-05-04 16:01
See also #11489.
msg188437 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-05 11:45
After investigating the problem more deeply, I see that a new parameter is not needed. RFC 4627 does not make exceptions for the range 0xD800-0xDFFF, so the decoder must accept lone surrogates, both escaped and unescaped. Non-BMP characters may be represented as an escaped surrogate pair, so an escaped surrogate pair may be decoded as a non-BMP character, while an unescaped surrogate pair should not be.
Here is a patch with which the JSON decoder accepts encoded lone surrogates. It also fixes a bug where the Python implementation decodes "\\ud834\\u0079x" as "\U0001d179".
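The pairing rule described above can be sketched as follows. This is a minimal illustration, not the attached patch: decode_u_escapes is a hypothetical function that handles only \uXXXX escapes (it ignores other JSON escapes such as an escaped backslash), combines a high surrogate with an immediately following low surrogate, and otherwise keeps the lone surrogate as is.

```python
import re

# Hypothetical helper for illustration; handles only \uXXXX escapes.
_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def decode_u_escapes(text):
    """Decode \\uXXXX escapes, pairing surrogates only when both halves fit."""
    out = []
    pos = 0
    while pos < len(text):
        match = _U_ESCAPE.match(text, pos)
        if match is None:
            out.append(text[pos])
            pos += 1
            continue
        code = int(match.group(1), 16)
        pos = match.end()
        if 0xD800 <= code <= 0xDBFF:
            # High surrogate: combine only if a low surrogate escape follows.
            follow = _U_ESCAPE.match(text, pos)
            if follow is not None:
                low = int(follow.group(1), 16)
                if 0xDC00 <= low <= 0xDFFF:
                    code = 0x10000 + (((code - 0xD800) << 10) | (low - 0xDC00))
                    pos = follow.end()
        out.append(chr(code))
    return ''.join(out)


paired = decode_u_escapes('\\ud834\\udd79')      # one non-BMP character
not_paired = decode_u_escapes('\\ud834\\u0079x')  # lone surrogate, 'y', 'x'
```

The second call is the input from the bug mentioned above: because \u0079 is not a low surrogate, the high surrogate stays on its own and the following escape is decoded separately.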
msg188857 - (view) Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2013-05-10 18:08
The patch that I wrote for simplejson is here (it differs a bit from serhiy's patch): https://github.com/simplejson/simplejson/commit/35816bfe2d0ddeb5ddcc68239683cbb35b7e3ff2
Along the way I discovered another bug in the pure-Python scanstring: int(s, 16) will parse '0xNN', whereas json expects only strings of the form 'NNNN' to work. I fixed that along with this issue by explicitly checking for x or X.
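To make the int(s, 16) pitfall concrete: with base 16, int accepts an optional '0x' prefix, so a four-character string such as '0x41' parses even though it is not four hex digits. The sketch below uses a stricter check than the "x or X" test described above (it validates every character); parse_hex4 is a hypothetical helper, not the simplejson or stdlib code.

```python
import string

_HEX_DIGITS = frozenset(string.hexdigits)

# int() with base 16 tolerates a '0x' prefix, which is the pitfall here.
lenient = int('0x41', 16)


def parse_hex4(digits):
    """Parse exactly four hex digits, rejecting prefixes such as '0x'."""
    if len(digits) != 4 or not all(ch in _HEX_DIGITS for ch in digits):
        raise ValueError('invalid \\uXXXX escape: %r' % (digits,))
    return int(digits, 16)


strict = parse_hex4('0041')
```

Calling parse_hex4('0x41') raises ValueError, whereas int('0x41', 16) quietly returns 65.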
msg188868 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-10 19:25
I forgot about issue11489. After reclassification, this issue is its duplicate.
msg189056 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-12 19:28
Updated patch submitted in issue11489.
History
Date User Action Args
2022-04-11 14:57:45 admin set github: 62106
2013-05-12 19:28:20 serhiy.storchaka set status: open -> closed
messages: + msg189056
2013-05-10 19:25:35 serhiy.storchaka set superseder: json.dumps not parsable by json.loads (on Linux only)
resolution: duplicate
messages: + msg188868
2013-05-10 18:08:10 bob.ippolito set messages: + msg188857
2013-05-05 11:45:58 serhiy.storchaka set files: + json_decode_lone_surrogates-2.7.patch
2013-05-05 11:45:06 serhiy.storchaka set files: + json_decode_lone_surrogates.patch
title: Add a string error handler to JSON encoder/decoder -> JSON should accept lone surrogates
keywords: + patch
type: enhancement -> behavior
versions: + Python 2.7, Python 3.3
messages: + msg188437
stage: needs patch -> patch review
2013-05-04 16:01:18 ezio.melotti set messages: + msg188375
2013-05-04 15:59:32 pitrou set messages: + msg188374
2013-05-04 14:38:28 serhiy.storchaka create
