
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: [doc] untokenize returns a string if no encoding token is recognized
Type: behavior Stage: test needed
Components: Documentation Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: terry.reedy Nosy List: eric.snow, iritkatriel, kurazu, terry.reedy
Priority: normal Keywords: patch

Created on 2012-10-14 05:49 by eric.snow, last changed 2022-04-11 14:57 by admin.

Files
File name         Uploaded
bug16223.patch    kurazu, 2013-07-07 10:24
bug16223_2.patch  kurazu, 2013-07-07 11:24
Messages (6)
msg172850 - Author: Eric Snow (eric.snow) (Python committer) Date: 2012-10-14 05:49
If you pass an iterable of tokens and none of them is an ENCODING token, tokenize.untokenize() returns a string. This is contrary to what the docs say [1]:
 It returns bytes, encoded using the ENCODING token, which is the
 first token sequence output by tokenize().
Either the docs should be clarified or untokenize() fixed. My vote is to fix it. It could check that the first token is an ENCODING token and raise an exception. Alternatively, it could fall back to using 'utf-8' by default.
[1] http://docs.python.org/py3k/library/tokenize.html#tokenize.untokenize 
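On a current CPython the reported behavior can be reproduced directly; a minimal sketch (the sample source b"x = 1\n" is illustrative, not from the report):
>>> import io, tokenize
>>> tokens = list(tokenize.tokenize(io.BytesIO(b"x = 1\n").readline))
>>> type(tokenize.untokenize(tokens))      # ENCODING token present: bytes
<class 'bytes'>
>>> type(tokenize.untokenize(tokens[1:]))  # ENCODING token dropped: str
<class 'str'>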
msg192454 - Author: Tomasz Maćkowiak (kurazu) Date: 2013-07-06 15:40
untokenize also has some other problems, especially when it uses compat mode: it will skip the first significant token if no ENCODING token is present in the input.
For example, for input like this (code simplified):
>>> tokens = tokenize(b"1 + 2")
>>> untokenize(tokens[1:])
'+2 '
It also doesn't adhere to another documentation item:
"The iterable must return sequences with at least two elements. [...] Any additional sequence elements are ignored."
In the current implementation, sequences can be either 2 or 5 elements long, and in the 5-element variant the last 3 elements are not ignored but are used to reconstruct the source code with its original whitespace.
I'm trying to prepare a patch for those issues.
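The 2-tuple versus 5-tuple difference is observable on a current CPython, where compat mode regenerates spacing from token types instead of using the recorded positions; a minimal sketch (the input b"1 +    2\n" is illustrative, chosen to exaggerate the whitespace; the exact compat-mode spacing may differ between versions):
>>> import io, tokenize
>>> tokens = list(tokenize.tokenize(io.BytesIO(b"1 +    2\n").readline))
>>> tokenize.untokenize(tokens)                              # 5-tuples: recorded positions honored
b'1 +    2\n'
>>> tokenize.untokenize((t.type, t.string) for t in tokens)  # 2-tuples: compat mode regenerates spacing
b'1 +2 \n'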
msg192531 - Author: Tomasz Maćkowiak (kurazu) Date: 2013-07-07 10:24
Attached is a patch for untokenize, its tests and docs, and some minor PEP 8 improvements.
The patch should fix the unicode output and the handling of some corner cases in untokenize.
msg192543 - Author: Tomasz Maćkowiak (kurazu) Date: 2013-07-07 11:24
Attached is a corrected patch ('^' and '$' added to the regexps in the tests).
msg211476 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2014-02-18 04:27
The no-encoding issue was mentioned in #12691, but needed to be opened as a separate issue, which is this one. The doc, as opposed to the docstring, says "Converts tokens back into Python source code". Python 3.3 source code is defined in the reference manual as a sequence of unicode chars. The doc also says "The reconstructed script is returned as a single string." In 3.x, that also means unicode, not bytes. On the other hand, tokenize does not currently accept actual Python code (unicode) but only encoded code. I think that should change, but that is a different issue (literally).
For this issue, I think the doc and docstring should change to match current behavior: output a string unless the tokens (which contain unicode strings, not bytes) start with a non-empty ENCODING token. Changing the behavior would break code that relies on the code and doc (as opposed to the docstring).
Since tokenize will only put out ENCODING as the first token, I would be inclined to ignore ENCODING thereafter, but that might be seen as an impermissible change in behavior.
--
The dropped-token issue is the subject of #8478, with patch1. It was mentioned again in #12691, among several other issues, and is the subject again of duplicate issue #16224 (now closed), with patch2.
The actual issue is that the first token of iterator input gets dropped, but not that of lists. The fix is reported in #8478, so the dropped token is not part of this issue.
msg408038 - Author: Irit Katriel (iritkatriel) (Python committer) Date: 2021-12-08 16:39
The doc has been updated by now:
"It returns bytes, encoded using the ENCODING token, which is the first token sequence output by tokenize(). If there is no encoding token in the input, it returns a str instead."
https://docs.python.org/3/library/tokenize.html#tokenize.untokenize
The docstring doesn't say this though.
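A quick way to see the gap is to print the docstring and compare it with the online doc; a trivial check, assuming a standard CPython install (output omitted, as the text varies by version):
import tokenize
print(tokenize.untokenize.__doc__)  # as of this report, no mention of the str fallback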
History
Date                 User         Action  Args
2022-04-11 14:57:37  admin        set     github: 60427
2021-12-08 16:39:39  iritkatriel  set     title: untokenize returns a string if no encoding token is recognized -> [doc] untokenize returns a string if no encoding token is recognized; nosy: + iritkatriel; messages: + msg408038; versions: + Python 3.9, Python 3.10, Python 3.11, - Python 2.7, Python 3.3, Python 3.4; components: + Documentation
2014-02-18 04:27:36  terry.reedy  set     versions: - Python 3.2; nosy: + terry.reedy; messages: + msg211476; assignee: eric.snow -> terry.reedy
2013-07-07 11:24:44  kurazu       set     files: + bug16223_2.patch; messages: + msg192543
2013-07-07 10:24:47  kurazu       set     files: + bug16223.patch; keywords: + patch
2013-07-07 10:24:08  kurazu       set     messages: + msg192531
2013-07-06 15:40:24  kurazu       set     nosy: + kurazu; messages: + msg192454
2013-06-25 05:27:50  eric.snow    set     assignee: eric.snow
2012-10-14 05:49:11  eric.snow    create
