Issue 19837: Wire protocol encoding for the JSON module


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/64036

classification

Title:	Wire protocol encoding for the JSON module
Type:	enhancement	Stage:
Components:	Library (Lib)	Versions:	Python 3.5

process

Dependencies:	Superseder:
Status:	open	Resolution:
Assigned To:	Nosy List:	Clay Gerrard, barry, chrism, cvrebert, eric.araujo, ezio.melotti, gregory.p.smith, jleedev, kdwyer, martin.panter, ncoghlan, pitrou, serhiy.storchaka, socketpair, terry.reedy, vstinner
Priority:	normal	Keywords:

Created on 2013-11-30 02:30 by ncoghlan, last changed 2022-04-11 14:57 by admin.

Messages (26)
msg204764	Author: Alyssa Coghlan (ncoghlan) * (Python committer)	Date: 2013-11-30 02:30
In the Python 3 transition, we had to make a choice regarding whether we treated the JSON module as a text transform (with load[s] reading Unicode code points and dump[s] producing them), or as a text encoding (with load[s] reading binary sequences and dump[s] producing them).

To minimise the changes to the module API, the decision was made to treat it as a text transform, with the text encoding handled externally.

This API design decision doesn't appear to have worked out that well in the web development context, since JSON is typically encountered as a UTF-8 encoded wire protocol, not as already decoded text.

It also makes the module inconsistent with most of the other modules that offer "dumps" APIs, as those are specifically about wire protocols (Python 3.4):

>>> import json, marshal, pickle, plistlib, xmlrpc.client
>>> json.dumps('hello')
'"hello"'
>>> marshal.dumps('hello')
b'\xda\x05hello'
>>> pickle.dumps('hello')
b'\x80\x03X\x05\x00\x00\x00helloq\x00.'
>>> plistlib.dumps('hello')
b'<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">\n<plist version="1.0">\n<string>hello</string>\n</plist>\n'

The only module with a dumps function that (like the json module) returns a string is the XML-RPC client module:

>>> xmlrpc.client.dumps(('hello',))
'<params>\n<param>\n<value><string>hello</string></value>\n</param>\n</params>\n'

And that's nonsensical, since that XML-RPC API accepts an encoding argument, which it now silently ignores:

>>> xmlrpc.client.dumps(('hello',), encoding='utf-8')
'<params>\n<param>\n<value><string>hello</string></value>\n</param>\n</params>\n'
>>> xmlrpc.client.dumps(('hello',), encoding='utf-16')
'<params>\n<param>\n<value><string>hello</string></value>\n</param>\n</params>\n'

I now believe that an "encoding" parameter should have been added to the json.dump API in the Py3k transition (defaulting to UTF-8), allowing all of the dump/load APIs in the standard library to be consistently about converting to and from a binary wire protocol.

Unfortunately, I don't have a solution to offer at this point (since backwards compatibility concerns rule out the simple solution of just changing the return type). I just wanted to get it on record as a problem (and internal inconsistency within the standard library for dump/load protocols) with the current API.
msg204765	Author: Alyssa Coghlan (ncoghlan) * (Python committer)	Date: 2013-11-30 02:35
The other simple solution would be to add <name>b variants of the affected APIs. That's a bit ugly though, especially since it still has the problem of making it difficult to write correct cross-version code (although that problem is likely to exist regardless).
msg204799	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2013-11-30 11:07
Still, JSON itself is not a wire protocol; HTTP is. http://www.json.org states that "JSON is a text format" and the grammar description talks about "UNICODE characters", not bytes. The ECMA spec states that "JSON text is a sequence of Unicode code points".

RFC 4627 is a bit more affirmative, though, and says that "JSON text SHALL be encoded in Unicode [sic]. The default encoding is UTF-8".

Related issues:
- issue #10976: json.loads() raises TypeError on bytes object
- issue #17909 (+ patch!): autodetecting JSON encoding

> The other simple solution would be to add <name>b variants of the affected APIs.

"dumpb" is not very pretty and can easily be misread as "dumb" :-)
"dump_bytes" looks better to me.
msg204805	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2013-11-30 11:59
I propose closing this issue as a duplicate of issue10976.
msg204811	Author: Alyssa Coghlan (ncoghlan) * (Python committer)	Date: 2013-11-30 14:08
Not sure yet if we should merge the two issues, although they're the serialisation and deserialisation sides of the same problem.

Haskell seems to have gone with the approach of a separate "jsonb" API for the case where you want the wire protocol behaviour; such a solution may work for us as well.
msg204864	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2013-12-01 00:24
I'm -1 for a new module doing almost the same thing. Let's add distinct APIs in the existing json module.
msg204873	Author: Alyssa Coghlan (ncoghlan) * (Python committer)	Date: 2013-12-01 01:55
The problem with adding new APIs with different names to the JSON module is that it breaks symmetry with other wire protocols. The quartet of module level load, loads, dump and dumps functions has become a de facto standard API for wire protocols.

If it wasn't for that API convention, the status quo would be substantially less annoying (and confusing) than it currently is.

The advantage of a separate "jsonb" module is that it becomes easy to say "json is the text transform that dumps and loads from a Unicode string, jsonb is the wire protocol that dumps and loads a UTF encoded byte sequence".

Backporting as simplejsonb would also work in a straightforward fashion (since one PyPI package can include multiple top level Python modules).

The same approach would also extend to fixing the xmlrpc module to handle the encoding step properly (if anyone was so inclined).
msg204904	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2013-12-01 10:36
> The problem with adding new APIs with different names to the JSON
> module is that it breaks symmetry with other wire protocols. The
> quartet of module level load, loads, dump and dumps functions has
> become a de facto standard API for wire protocols.

Breaking symmetry is terribly less silly than having a second module doing almost the same thing, though.

> The advantage of a separate "jsonb" module is that it becomes easy to
> say "json is the text transform that dumps and loads from a Unicode
> string, jsonb is the wire protocol that dumps and loads a UTF encoded
> byte sequence".

This is a terribly lousy design.
msg204939	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2013-12-01 15:55
I agree that adding a new module is a very bad idea. I think that reviving the encoding parameter is the least wrong way. json.dumps() should return bytes when the encoding argument is specified and str otherwise. json.dump() should write binary data when the encoding argument is specified and text otherwise. This is not a perfect design, but it has precedents in the XML modules.
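A minimal sketch of the behaviour proposed in this message, written as a standalone wrapper. The name `dumps_with_encoding` is hypothetical; the Python 3 `json.dumps()` itself has no `encoding` parameter, so this only illustrates the "bytes when an encoding is given, str otherwise" rule.

```python
import json


def dumps_with_encoding(obj, encoding=None, **kwargs):
    """Hypothetical wrapper: return bytes if an encoding is given, else str."""
    text = json.dumps(obj, **kwargs)
    if encoding is None:
        # No encoding requested: behave exactly like json.dumps().
        return text
    # Encoding requested: produce the wire representation.
    return text.encode(encoding)


as_text = dumps_with_encoding({"a": 1})
as_bytes = dumps_with_encoding({"a": 1}, encoding="utf-8")
print(type(as_text).__name__, type(as_bytes).__name__)
```

The return type here depends on whether the argument is present, which is the design trade-off the rest of the thread debates.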
msg204960	Author: Alyssa Coghlan (ncoghlan) * (Python committer)	Date: 2013-12-01 21:03
Changing return type based on argument values is still a bad idea in general.

It also makes it hard to plug the API in to generic code that is designed to work with any dump/load based serialisation protocol.

MvL suggested a json.bytes submodule (rather than a separate top level module) in the other issue and that sounds reasonable to me, especially since json is already implemented as a package.
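To illustrate the "generic code" concern, here is a small sketch of a helper that accepts any module exposing `dumps()` and writes the result to a binary stream. The helper names `write_payload` and `try_write` are made up for this example; the point is only that `pickle.dumps()` returns bytes while `json.dumps()` returns str.

```python
import io
import json
import pickle


def write_payload(serializer, obj, stream):
    """Generic helper: works with any module exposing dumps()."""
    stream.write(serializer.dumps(obj))


def try_write(serializer, obj):
    """Return True if the serialised payload can be written to a binary stream."""
    try:
        write_payload(serializer, obj, io.BytesIO())
    except TypeError:
        # BytesIO.write() rejects str, which is what json.dumps() returns.
        return False
    return True


print("pickle:", try_write(pickle, {"a": 1}))
print("json:", try_write(json, {"a": 1}))
```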
msg204963	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2013-12-01 21:21
> MvL suggested a json.bytes submodule (rather than a separate top level
> module) in the other issue and that sounds reasonable to me, especially
> since json is already implemented as a package.

I don't really find it reasonable to add a phantom module entirely for the purpose of exposing an API more similar to the Python 2 one. I don't think this design pattern has already been used.

If we add a json_bytes method, it will be simple enough for folks to add the appropriate rules in their compat module (and/or for six to expose it).
msg204976	Author: Alyssa Coghlan (ncoghlan) * (Python committer)	Date: 2013-12-01 23:09
The parallel API would have to be:

json.dump_bytes
json.dumps_bytes
json.load_bytes
json.loads_bytes

That is hardly an improvement over:

json.bytes.dump
json.bytes.dumps
json.bytes.load
json.bytes.loads

It doesn't need to be documented as a completely separate module, it can just be a subsection in the json module docs with a reference to the relevant RFC.

The confusion is inherent in the way the RFC was written, this is just an expedient way to resolve that: the json module implements the standard, the bytes submodule implements the RFC.

"Namespaces are a honking great idea; let's do more of those"
msg204978	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2013-12-01 23:19
> The parallel API would have to be:
>
> json.dump_bytes
> json.dumps_bytes
> json.load_bytes
> json.loads_bytes

No, only one function dump_bytes() is needed, and it would return a bytes object ("dumps" meaning "dump string", already). loads() can be polymorphic without creating a new function.

I don't think the functions taking file objects are used often enough to warrant a second API to deal with binary files.

> It doesn't need to be documented as a completely separate module, it can
> just be a subsection in the json module docs with a reference to the
> relevant RFC.

It's still completely weird and unusual.

> "Namespaces are a honking great idea; let's do more of those"

And also "flat is better than nested". Especially when you're proposing that one API be at level N, and the other, closely related API be at level N+1.
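A rough sketch of the two pieces described in this message: a single `dump_bytes()` that returns bytes, and a `loads()` that accepts either str or bytes. Both functions below are hypothetical standalone helpers (`loads_polymorphic` is an invented name), and the UTF-8 default is an assumption made for the example.

```python
import json


def dump_bytes(obj, **kwargs):
    """Hypothetical helper: serialise to JSON text, then encode as UTF-8."""
    return json.dumps(obj, **kwargs).encode("utf-8")


def loads_polymorphic(data, encoding="utf-8"):
    """Hypothetical helper: accept str or bytes-like input."""
    if isinstance(data, (bytes, bytearray)):
        # Binary input is assumed to be encoded text; decode it first.
        data = data.decode(encoding)
    return json.loads(data)


payload = dump_bytes({"k": "é"})
print(loads_polymorphic(payload))
print(loads_polymorphic("[1, 2]"))
```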
msg205023	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2013-12-02 16:19
> Changing return type based on argument values is still a bad idea in
> general.

However load() and loads() do this. ;)

> It also makes it hard to plug the API in to generic code that is designed
> to work with any dump/load based serialisation protocol.

For dumps() it will be simple -- `lambda x: json.dumps(x, encoding='utf-8')`. For loads() it will be even simpler -- loads() will accept both strings and bytes.

Note that dumps() with the encoding parameter will be more 2.x compatible than the current implementation. This will help in writing compatible code.
msg205415	Author: Terry J. Reedy (terry.reedy) * (Python committer)	Date: 2013-12-07 00:08
> Changing return type based on argument values is still a bad idea in general.

I understand the proposal to be changing the return based on argument presence. It strikes me as a convenient abbreviation for making a separate encoding call and definitely (specifically?) less bad than a separate module or separate functions.
msg205416	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2013-12-07 00:11
To give another data point: returning a different type based on argument value is also what the open() function does, more or less.

(That said, I would slightly favour a separate dump_bytes(), myself.)
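The open() comparison can be seen directly: the mode argument decides whether a text or a binary file object comes back. This is a small illustration using a temporary file; the file name is arbitrary.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.json")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"a": 1}')

    # Text mode: a text file object whose read() returns str.
    with open(path, "r", encoding="utf-8") as text_file:
        text_type = type(text_file)
        text_data = text_file.read()

    # Binary mode: a buffered binary object whose read() returns bytes.
    with open(path, "rb") as binary_file:
        binary_type = type(binary_file)
        binary_data = binary_file.read()

print(text_type.__name__, type(text_data).__name__)
print(binary_type.__name__, type(binary_data).__name__)
```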
msg205530	Author: Gregory P. Smith (gregory.p.smith) * (Python committer)	Date: 2013-12-08 08:55
Upstream simplejson (of which json is an earlier snapshot) has an encoding parameter on its dump and dumps methods. Let's NOT break compatibility with that API. Our users use these modules interchangeably today, upgrading from stdlib json to simplejson when they need more features or speed without having to change their code.

simplejson's dumps(encoding=) parameter tells the module what encoding to decode bytes objects found within the data structure as (whereas Python 3.3's builtin json module, being older, doesn't even support that use case and raises a TypeError when bytes are encountered within the structure being serialized).

http://simplejson.readthedocs.org/en/latest/

A json.dump_bytes() function implemented as:

    def dump_bytes(*args, **kwargs):
        return dumps(*args, **kwargs).encode('utf-8')

makes some sense, but it is really trivial for anyone to write that .encode(...) themselves.

A dump_bytes_to_file method that acts like dump() and calls .encode('utf-8') on all str's before passing them to the write call is also doable, but it seems easier to just let people use an existing io wrapper to do that for them as they already are.

As for load/loads, it is easy to allow that to accept bytes as input and assume it comes utf-8 encoded. simplejson already does this. json does not.
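The "existing io wrapper" route mentioned in this message could look roughly like the sketch below, which wraps a binary stream in `io.TextIOWrapper` so that `json.dump()` can write to it. The helper name `dump_to_binary` is invented for the example; detaching the wrapper afterwards is one possible way to leave the underlying stream open for the caller.

```python
import io
import json


def dump_to_binary(obj, binary_stream, **kwargs):
    """Hypothetical helper: write obj as UTF-8 encoded JSON to a binary stream."""
    # newline="" disables newline translation so the bytes match the JSON text.
    wrapper = io.TextIOWrapper(binary_stream, encoding="utf-8", newline="")
    try:
        json.dump(obj, wrapper, **kwargs)
        wrapper.flush()
    finally:
        # Detach so closing or discarding the wrapper does not close binary_stream.
        wrapper.detach()


buf = io.BytesIO()
dump_to_binary({"name": "café"}, buf, ensure_ascii=False)
payload = buf.getvalue()
print(payload)
```

With this approach the encoding lives on the file-like object, so the JSON encoder itself never needs to know about bytes.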
msg205531	Author: Gregory P. Smith (gregory.p.smith) * (Python committer)	Date: 2013-12-08 09:00
So why not put a dump_bytes into upstream simplejson first, then pull in a modern simplejson?

There might be some default flag values pertaining to new features that need changing for stdlib backwards compatible behavior but otherwise I expect it's a good idea.
msg271700	Author: Марк Коренберг (socketpair) *	Date: 2016-07-30 18:12
One of the problems is that decoding JSON is an FSM whose input is one symbol rather than one byte. AFAIK, Python still does not have an FSM for decoding a UTF-8 sequence, so iterative decoding of JSON will require more changes than expected.
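For the byte-to-symbol step specifically, the standard library's `codecs` module offers incremental decoders, which may cover part of this concern. The sketch below feeds a UTF-8 payload in two chunks that split a multi-byte character; `decode_chunks` is a hypothetical helper name. It does not address incremental JSON parsing itself: the `json` module still expects a complete document.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Yield decoded text for each chunk, buffering incomplete byte sequences."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises an error if a byte sequence is left incomplete.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


raw = '{"name": "café"}'.encode("utf-8")
split_at = raw.index(b"\xc3") + 1  # split in the middle of the two-byte "é"
chunks = [raw[:split_at], raw[split_at:]]
decoded = "".join(decode_chunks(chunks))
print(decoded)
```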
msg271701	Author: Марк Коренберг (socketpair) *	Date: 2016-07-30 18:32
In real life, I can confirm that porting from Python 2 to Python 3 is almost automatic except for JSON-related fixes.
msg271775	Author: Alyssa Coghlan (ncoghlan) * (Python committer)	Date: 2016-08-01 08:17
I'm currently migrating a project that predates requests, and ended up needing to replace several "json.loads" calls with a "_load_json" helper that is just an alias for json.loads in Python 2, and defined as this in Python 3:

    def _load_json(data):
        return json.loads(data.decode())

To get that case to "just work", all I would have needed is for json.loads to accept bytes input, and assume it is UTF-8 encoded, the same way simplejson does. Since there aren't any type ambiguities associated with that, I think it would make sense for us to go ahead and implement at least that much for Python 3.6.

By contrast, if I'd been doing encoding, I don't think there's anything the Python 3 standard library could have changed on its own to make things just work - I would have needed to change my code somehow.

However, a new "dump_bytes" API could still be beneficial on that front as long as it was also added to simplejson: code that needed to run in the common Python 2/3 subset could use "simplejson.dump_bytes", while 3.6+ only code could just use the standard library version. Having dump_bytes() next to dumps() in the documentation would also provide a better hook for explaining the difference between JSON-as-text-encoding (with "str" output) and JSON-as-wire-encoding (with "bytes" output after encoding the str representation as UTF-8).

In both cases, I think it would make sense to leave the non-UTF-8 support to simplejson and have the stdlib version be UTF-8 only.
msg271776	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2016-08-01 08:25
Does dump_bytes() return bytes (similar to dumps()) or write to a binary stream (similar to dump())?
msg271778	Author: Alyssa Coghlan (ncoghlan) * (Python committer)	Date: 2016-08-01 10:24
dump_bytes() would be a binary counterpart to dumps().

The dump() case is already handled more gracefully, as the implicit encoding to UTF-8 can live on the file-like object, rather than needing to be handled by the JSON encoder.

I'm still not 100% sure on its utility though - it's only "json.loads assuming binary input is UTF-8 encoded text would be way more helpful than the current behaviour" that I'm confident about. If the assumption is wrong, you'll likely fail JSON deserialisation anyway, and when it's right, the common subset of Python 2 & 3 has been expanded in a useful way.

So perhaps we should split the question into two issues? A new one for accepting binary data as an input to json.loads, and make this one purely about whether or not to offer a combined serialise-and-encode operation for the wire protocol use case?
msg272726	Author: Alyssa Coghlan (ncoghlan) * (Python committer)	Date: 2016-08-15 07:21
After hitting this problem again in another nominally single-source compatible Python 2/3 project, I created #27765 to specifically cover accepting UTF-8 encoded bytes in json.loads().
msg275617	Author: Alyssa Coghlan (ncoghlan) * (Python committer)	Date: 2016-09-10 10:25
For 3.6, the decoding case has been handled via Serhiy's autodetection patch in issue 17909.
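On Python 3.6 and later, the effect of that change can be checked with a short example: `json.loads()` accepts bytes directly and detects UTF-8, UTF-16 or UTF-32 input. The payload contents here are arbitrary sample data.

```python
import json

document = '{"greeting": "héllo"}'

utf8_payload = document.encode("utf-8")
utf16_payload = document.encode("utf-16")  # includes a byte order mark

# No explicit .decode() call is needed on Python 3.6+.
from_utf8 = json.loads(utf8_payload)
from_utf16 = json.loads(utf16_payload)

print(from_utf8)
print(from_utf16)
```

The encoding side is unchanged: `json.dumps()` still returns str, so producing bytes still takes an explicit `.encode()` call.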
msg289203	Author: Clay Gerrard (Clay Gerrard)	Date: 2017-03-08 05:59
And for the encoding case?

Can you just add the encoding argument back to json.dumps? Have it default to None because of backwards compatibility in python3 and continue to return strings by default...

... and then everyone that ever wants to serialize an object to json because they want to put it on a wire or w/e will hopefully someday learn when you call json.dumps you always set encoding='utf-8' and it will always return utf-8 encoded bytes (which is the same thing it would have done py2 regardless)?

Is it confusing for the py3 encoding argument to mean something different than py2? Probably? The encoding argument in py2 was there to tell the Encoder how to decode keys and values whose strings were actually utf-8 encoded bytes. But w/e py3 doesn't have that problem - so py3 can unambiguously hijack dumps' encoding param to mean bytes!

Then, sure, maybe the fact I can write:

    sock.send(json.dumps(obj, encoding='utf-8'))

... in either language is just a happy coincidence - but it'd be useful nevertheless.

Or I could be wrong. I've not been thinking about this for 3 years. But I have bumped into this a couple of times in the years since starting to dream of python 3.2^H4^H5^H6^H7 support - but until then I do seem to frequently forget json.dumps(obj).decode('utf-8') so maybe my suggestion isn't really any better!?

History
Date	User	Action	Args
2022-04-11 14:57:54	admin	set	github: 64036
2017-03-08 05:59:28	Clay Gerrard	set	nosy: + Clay Gerrard messages: + msg289203
2016-09-10 10:25:28	ncoghlan	set	messages: + msg275617
2016-08-15 07:21:41	ncoghlan	set	messages: + msg272726
2016-08-01 10:24:05	ncoghlan	set	messages: + msg271778
2016-08-01 08:25:29	serhiy.storchaka	set	messages: + msg271776
2016-08-01 08:17:33	ncoghlan	set	messages: + msg271775
2016-07-30 18:32:41	socketpair	set	messages: + msg271701
2016-07-30 18:12:50	socketpair	set	nosy: + socketpair messages: + msg271700
2016-07-29 21:45:31	kdwyer	set	nosy: + kdwyer
2014-10-25 01:03:22	martin.panter	set	nosy: + martin.panter
2014-05-15 07:26:10	vstinner	set	nosy: + vstinner
2014-03-29 01:36:51	cvrebert	set	nosy: + cvrebert
2014-03-04 12:46:40	jleedev	set	nosy: + jleedev
2014-02-15 14:33:14	ezio.melotti	set	nosy: + ezio.melotti type: enhancement
2013-12-08 09:00:11	gregory.p.smith	set	messages: + msg205531
2013-12-08 08:55:30	gregory.p.smith	set	nosy: + gregory.p.smith messages: + msg205530
2013-12-07 00:11:27	pitrou	set	messages: + msg205416
2013-12-07 00:08:44	terry.reedy	set	nosy: + terry.reedy messages: + msg205415
2013-12-06 17:46:00	eric.araujo	set	nosy: + eric.araujo
2013-12-02 16:19:03	serhiy.storchaka	set	messages: + msg205023
2013-12-01 23:19:29	pitrou	set	messages: + msg204978
2013-12-01 23:09:55	ncoghlan	set	messages: + msg204976
2013-12-01 21:21:32	pitrou	set	messages: + msg204963
2013-12-01 21:03:44	ncoghlan	set	messages: + msg204960
2013-12-01 15:55:34	serhiy.storchaka	set	messages: + msg204939
2013-12-01 10:36:17	pitrou	set	messages: + msg204904
2013-12-01 01:55:02	ncoghlan	set	messages: + msg204873
2013-12-01 00:24:14	pitrou	set	messages: + msg204864
2013-11-30 15:22:30	barry	set	nosy: + barry
2013-11-30 14:08:10	ncoghlan	set	messages: + msg204811
2013-11-30 11:59:43	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg204805
2013-11-30 11:07:56	pitrou	set	nosy: + pitrou messages: + msg204799
2013-11-30 02:35:33	ncoghlan	set	messages: + msg204765
2013-11-30 02:30:45	ncoghlan	create
