Created on 2013-05-05 13:10 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| json_detect_encoding_2.patch | serhiy.storchaka, 2014-05-15 12:38 | |
| json_detect_encoding_3.patch | serhiy.storchaka, 2016-06-22 16:57 | |
| Pull Requests | | |
|---|---|---|
| URL | Status | Linked |
| PR 7366 | merged | Anthony Sottile, 2018-06-03 23:24 |
| PR 7474 | merged | miss-islington, 2018-06-07 09:58 |
| PR 7475 | merged | miss-islington, 2018-06-07 09:59 |
| Messages (13) | |||
|---|---|---|---|
| msg188442 | Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) | Date: 2013-05-05 13:10 | |
RFC 4627 specifies a method to determine the encoding (one of UTF-8, UTF-16(BE|LE) or UTF-32(BE|LE)) of encoded JSON text. The proposed preliminary patch (it doesn't include the documentation yet) allows the load() and loads() functions to accept bytes data when it is encoded with a standard Unicode encoding. Data with a BOM is also accepted (this isn't specified in RFC 4627, but is widely used). There is only one case where the method can misfire: the serialized string "\x00..." encoded in UTF-16LE may be erroneously detected as encoded in UTF-32LE. This case violates two rules of RFC 4627: a string was serialized instead of an object or an array, and the control character U+0000 was not escaped. Standard-conforming encoded JSON is always detected correctly. This patch requires the "surrogatepass" error handler for utf-16/32 (see issue12892 and issue13916). |
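
The null-byte heuristic from RFC 4627 (section 3) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the attached patch: the function name `sketch_detect_encoding` is hypothetical, and the BOM checks reflect the behaviour described in this message rather than anything the RFC specifies.

```python
import codecs


def sketch_detect_encoding(data: bytes) -> str:
    """Guess the Unicode encoding of JSON bytes (hypothetical helper)."""
    # BOMs first. UTF-32 must be tested before UTF-16, because the
    # UTF-32-LE BOM (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
    if data.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"

    # RFC 4627 assumes the first two characters are ASCII, so the
    # positions of the zero bytes among the first four bytes identify
    # the encoding.
    if len(data) >= 4:
        nulls = tuple(byte == 0 for byte in data[:4])
        if nulls == (True, True, True, False):    # 00 00 00 xx
            return "utf-32-be"
        if nulls == (True, False, True, False):   # 00 xx 00 xx
            return "utf-16-be"
        if nulls == (False, True, True, True):    # xx 00 00 00
            return "utf-32-le"
        if nulls == (False, True, False, True):   # xx 00 xx 00
            return "utf-16-le"
    return "utf-8"                                # xx xx xx xx


document = '{"a": 1}'
guesses = {
    enc: sketch_detect_encoding(document.encode(enc))
    for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")
}

# The misfire described above: a bare string starting with an unescaped
# U+0000, encoded as UTF-16-LE, begins with 22 00 00 00.
misfire = sketch_detect_encoding('"\x00"'.encode("utf-16-le"))
```

With this sketch, every entry in `guesses` maps an encoding to itself, while `misfire` comes out as `"utf-32-le"` even though the data is UTF-16-LE.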
|||
| msg218608 | Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) | Date: 2014-05-15 12:38 | |
All dependencies for this issue are resolved now. Here is an updated patch, synchronized with tip. |
|||
| msg218616 | Author: Chris Rebert (cvrebert) | Date: 2014-05-15 16:07 | |
You'll need to also update the "Character Encodings" subsection of the json docs. |
|||
| msg218640 | Author: Akira Li (akira) | Date: 2014-05-16 02:39 | |
Neither the JSON standard (ECMA-404) [1] nor the new JSON RFC 7159 [2] mentions encoding detection.

[1] http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf
[2] https://tools.ietf.org/html/rfc7159#section-8.1

From the RFC:

> JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32). Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error. |
|||
| msg218641 | Author: Chris Rebert (cvrebert) | Date: 2014-05-16 04:20 | |
I agree that the state of encoding detection in the new RFC seems unclear, given that the old RFC prefaced the part about the encoding detection with:

> Since the first two characters of a JSON text will always be ASCII
> characters

But in the new RFC:

> Appendix A. Changes from RFC 4627
> [...]
> o Changed the definition of "JSON text" so that it can be any JSON
> value, removing the constraint that it be an object or array.

Thus, "ಠ_ಠ", whose 2nd character is decidedly non-ASCII, is now a valid JSON text (i.e. standalone JSON document).

There seems to have been a thread about encoding detection in the RFC 7159 working group, but I don't have the time to read through it all:

> Re: [Json] JSON: remove gap between Ecma-404 and IETF draft
> http://www.ietf.org/mail-archive/web/json/current/msg01936.html

It eventually leads to a dedicated sub-thread:

> [Json] Encoding detection (Was: Re: JSON: remove gap between Ecma-404 and IETF draft)
> http://www.ietf.org/mail-archive/web/json/current/msg01959.html |
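
To make the concern concrete, the short sketch below prints the zero-byte pattern of the standalone string "ಠ_ಠ" in UTF-16-LE and compares it with the four-byte patterns listed in RFC 4627. The helper `null_pattern` and the table name `RFC4627_PATTERNS` are hypothetical names introduced only for this illustration.

```python
def null_pattern(data: bytes) -> tuple:
    """Return, for each of the first four bytes, whether it is zero."""
    return tuple(byte == 0 for byte in data[:4])


# Zero-byte patterns from RFC 4627, section 3 (True means a 00 byte).
RFC4627_PATTERNS = {
    (True, True, True, False): "utf-32-be",
    (True, False, True, False): "utf-16-be",
    (False, True, True, True): "utf-32-le",
    (False, True, False, True): "utf-16-le",
}

standalone = '"ಠ_ಠ"'  # a bare JSON string, valid under RFC 7159
encoded = standalone.encode("utf-16-le")

pattern = null_pattern(encoded)               # bytes are 22 00 A0 0C
matches_rfc4627 = pattern in RFC4627_PATTERNS
```

Here the second character (U+0CA0) contributes two non-zero bytes, so the pattern is `xx 00 xx xx`, which matches none of the UTF-16 or UTF-32 rows, and a detector that follows the RFC 4627 table literally would fall through to UTF-8.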
|||
| msg230053 | Author: Martin Panter (martin.panter) (Python committer) | Date: 2014-10-27 01:06 | |
If you adjusted the detect_encoding() logic according to Pete Cordell’s table at the bottom of <http://www.ietf.org/mail-archive/web/json/current/msg01959.html>, it might work for standalone strings. However, since the RFC encourages UTF-8 for best interoperability, I wonder if any of this autodetection is necessary. It might be simpler to just assume UTF-8, or use the "utf-8-sig" codec. Or are there real cases where detecting UTF-16 or -32 would be useful? |
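
The simpler alternative raised in this message could look something like the sketch below: decode with the "utf-8-sig" codec, which accepts UTF-8 with or without a leading BOM, and hand the resulting text to `json.loads()`. The helper name `loads_utf8_only` is hypothetical.

```python
import codecs
import json


def loads_utf8_only(data: bytes):
    """Parse JSON bytes, assuming UTF-8 and tolerating a leading BOM."""
    # "utf-8-sig" strips a UTF-8 BOM if one is present and otherwise
    # behaves like plain "utf-8" when decoding.
    return json.loads(data.decode("utf-8-sig"))


without_bom = loads_utf8_only(b'{"a": 1}')
with_bom = loads_utf8_only(codecs.BOM_UTF8 + b'{"a": 1}')
```

The trade-off is that UTF-16 or UTF-32 input is rejected with a `UnicodeDecodeError` (or misparsed) instead of being detected.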
|||
| msg273908 | Author: Stéphane Wirtel (matrixise) (Python committer) | Date: 2016-08-30 10:36 | |
Hi Serhiy, I have reviewed your patch; it seems to be OK. |
|||
| msg275611 | Author: Alyssa Coghlan (ncoghlan) (Python committer) | Date: 2016-09-10 10:07 | |
Having hit the json.loads() problem recently when porting a project to Python 3, I'm keen to see this land for 3.6. Accordingly, assigning to myself to review and merge Serhiy's patch - if it proves necessary, we can tweak the details of the encoding detection during beta. |
|||
| msg275612 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2016-09-10 10:16 | |
New changeset e9e1bf9ec2ac by Nick Coghlan in branch 'default': Issue #17909: Accept binary input in json.loads https://hg.python.org/cpython/rev/e9e1bf9ec2ac |
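
As a rough illustration of what this changeset enables, the snippet below passes encoded bytes straight to `json.loads()`. It assumes Python 3.6 or later; the sample payload is made up for the example.

```python
import json

payload = {"face": "ಠ_ಠ", "count": 1}
text = json.dumps(payload, ensure_ascii=False)

# Encode the same document several ways and parse the raw bytes directly.
# "utf-16" and "utf-32" prepend a BOM; the -le/-be variants do not.
encodings = ("utf-8", "utf-16", "utf-32", "utf-16-le", "utf-32-be")
results = {enc: json.loads(text.encode(enc)) for enc in encodings}
```

Each value in `results` should equal `payload`, without the caller having to decode the bytes first.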
|||
| msg275614 | Author: Alyssa Coghlan (ncoghlan) (Python committer) | Date: 2016-09-10 10:18 | |
Thanks for tackling this, Serhiy! I removed issue 13916 from the dependency list, as while that's a reasonable suggestion, I don't think this fix is conditional on that change. |
|||
| msg318918 | Author: Inada Naoki (methane) (Python committer) | Date: 2018-06-07 09:58 | |
New changeset bb6366bd7570ff3b74bc66095540bea78f31504e by INADA Naoki (Anthony Sottile) in branch 'master': bpo-17909: Document that json.load can accept a binary IO (GH-7366) https://github.com/python/cpython/commit/bb6366bd7570ff3b74bc66095540bea78f31504e |
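
The documented behaviour can be tried with a short sketch like the one below, which uses `io.BytesIO` as a stand-in for a file opened in binary mode (`"rb"`). It assumes Python 3.6 or later; the sample content is made up for the example.

```python
import io
import json

# A binary stream holding UTF-8 encoded JSON, standing in for
# open(path, "rb") on a real file.
binary_file = io.BytesIO('{"greeting": "héllo"}'.encode("utf-8"))

# json.load() reads the bytes from the stream and parses them.
loaded = json.load(binary_file)
```

After this runs, `loaded` holds the parsed dictionary, with the non-ASCII character decoded correctly.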
|||
| msg318920 | Author: miss-islington (miss-islington) | Date: 2018-06-07 10:17 | |
New changeset f38ace61a39e64f5fde6f8f402e258177bdf7ff4 by Miss Islington (bot) in branch '3.7': bpo-17909: Document that json.load can accept a binary IO (GH-7366) https://github.com/python/cpython/commit/f38ace61a39e64f5fde6f8f402e258177bdf7ff4 |
|||
| msg318922 | Author: miss-islington (miss-islington) | Date: 2018-06-07 10:21 | |
New changeset 21f2553482c3d6ec8beb8bfa0f1fb5d23c6a4c2f by Miss Islington (bot) in branch '3.6': bpo-17909: Document that json.load can accept a binary IO (GH-7366) https://github.com/python/cpython/commit/21f2553482c3d6ec8beb8bfa0f1fb5d23c6a4c2f |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:45 | admin | set | github: 62109 |
| 2018-06-07 10:21:22 | miss-islington | set | messages: + msg318922 |
| 2018-06-07 10:17:20 | miss-islington | set | nosy: + miss-islington; messages: + msg318920 |
| 2018-06-07 09:59:23 | miss-islington | set | pull_requests: + pull_request7098 |
| 2018-06-07 09:58:24 | miss-islington | set | pull_requests: + pull_request7097 |
| 2018-06-07 09:58:17 | methane | set | nosy: + methane; messages: + msg318918 |
| 2018-06-03 23:24:31 | Anthony Sottile | set | pull_requests: + pull_request6992 |
| 2016-09-10 10:24:12 | ncoghlan | set | status: open -> closed; stage: commit review -> resolved |
| 2016-09-10 10:23:39 | ncoghlan | link | issue22555 dependencies |
| 2016-09-10 10:21:47 | ncoghlan | link | issue10976 superseder |
| 2016-09-10 10:18:20 | ncoghlan | set | resolution: fixed; dependencies: - disallow the "surrogatepass" handler for non utf-* encodings; messages: + msg275614 |
| 2016-09-10 10:16:45 | python-dev | set | nosy: + python-dev; messages: + msg275612 |
| 2016-09-10 10:07:43 | ncoghlan | set | assignee: serhiy.storchaka -> ncoghlan; messages: + msg275611 |
| 2016-08-30 10:38:52 | matrixise | set | stage: patch review -> commit review |
| 2016-08-30 10:36:24 | matrixise | set | nosy: + matrixise; messages: + msg273908 |
| 2016-06-22 16:57:38 | serhiy.storchaka | set | files: + json_detect_encoding_3.patch; versions: + Python 3.6, - Python 3.5 |
| 2016-05-03 19:41:56 | gsnedders | set | nosy: + gsnedders |
| 2015-03-28 03:21:56 | berker.peksag | set | nosy: + berker.peksag |
| 2014-10-27 01:06:14 | martin.panter | set | messages: + msg230053 |
| 2014-10-25 01:14:10 | martin.panter | set | nosy: + martin.panter |
| 2014-05-16 04:20:11 | cvrebert | set | messages: + msg218641 |
| 2014-05-16 02:39:43 | akira | set | nosy: + akira; messages: + msg218640 |
| 2014-05-15 16:07:28 | cvrebert | set | messages: + msg218616 |
| 2014-05-15 12:39:43 | serhiy.storchaka | set | files: - json_detect_encoding.patch |
| 2014-05-15 12:38:58 | serhiy.storchaka | set | files: + json_detect_encoding_2.patch; messages: + msg218608 |
| 2014-05-15 07:26:13 | vstinner | set | nosy: + vstinner |
| 2014-03-29 01:40:25 | cvrebert | set | nosy: + cvrebert |
| 2014-03-04 12:42:50 | jleedev | set | nosy: + jleedev |
| 2013-12-02 03:18:12 | Julian | set | nosy: + Julian |
| 2013-12-01 00:08:35 | pitrou | set | versions: + Python 3.5, - Python 3.4 |
| 2013-11-30 11:07:13 | pitrou | set | nosy: + ncoghlan |
| 2013-08-10 14:31:38 | serhiy.storchaka | set | stage: patch review |
| 2013-05-05 13:11:10 | serhiy.storchaka | set | dependencies: + UTF-16 and UTF-32 codecs should reject (lone) surrogates, disallow the "surrogatepass" handler for non utf-* encodings |
| 2013-05-05 13:10:30 | serhiy.storchaka | create | |