Created on 2010-09-28 13:17 by meador.inge, last changed 2022-04-11 14:57 by admin.

Files

| File name | Uploaded | Description |
|---|---|---|
| issue9969.patch | meador.inge, 2011-09-05 02:11 | Patch against tip (3.3.0a0) |

Messages (11)

msg117516 | Author: Meador Inge (meador.inge), Python committer | Date: 2010-09-28 13:17

Currently with 'py3k' only 'bytes' objects are accepted for tokenization:
>>> import io
>>> import tokenize
>>> tokenize.tokenize(io.StringIO("1+1").readline)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/minge/Code/python/py3k/Lib/tokenize.py", line 360, in tokenize
encoding, consumed = detect_encoding(readline)
File "/Users/minge/Code/python/py3k/Lib/tokenize.py", line 316, in detect_encoding
if first.startswith(BOM_UTF8):
TypeError: Can't convert 'bytes' object to str implicitly
>>> tokenize.tokenize(io.BytesIO(b"1+1").readline)
<generator object _tokenize at 0x1007566e0>
In a discussion on python-dev (http://www.mail-archive.com/python-dev@python.org/msg52107.html) it was generally considered to be a good idea to add support for tokenizing 'str' objects as well.

msg117523 | Author: Michael Foord (michael.foord), Python committer | Date: 2010-09-28 14:04

Note from Nick Coghlan from the python-dev discussion:

A very quick scan of _tokenize suggests it is designed to support detect_encoding returning None to indicate that the line iterator will return already decoded lines. This is confirmed by the fact that the standard library uses it that way (via generate_tokens).

An API that accepts a string, wraps a StringIO around it, then calls _tokenize with an encoding of None would appear to be the answer here. A feature request on the tracker is the best way to make that happen.

msg117554 | Author: Alyssa Coghlan (ncoghlan), Python committer | Date: 2010-09-28 21:54

Possible approach (untested):

def get_tokens(source):
    if hasattr(source, "encode"):
        # Already decoded, so bypass encoding detection
        return _tokenize(io.StringIO(source).readline, None)
    # Otherwise attempt to detect the correct encoding
    return tokenize(io.BytesIO(source).readline)
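
For reference, the duck-typed check in this sketch works because in Python 3 only str has an encode method:

>>> hasattr("1+1", "encode")   # str: already decoded text
True
>>> hasattr(b"1+1", "encode")  # bytes: no encode method, needs decoding
False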

msg117571 | Author: STINNER Victor (vstinner), Python committer | Date: 2010-09-29 01:06

See also issue #4626, which introduced the PyCF_IGNORE_COOKIE and PyPARSE_IGNORE_COOKIE flags to support unicode strings in the builtin compile() function.
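
For comparison, compile() already accepts an ordinary decoded str directly, which is roughly the convenience this issue asks tokenize to gain (a minimal illustration):

>>> code = compile("1 + 1", "<string>", "eval")  # str source accepted as-is
>>> eval(code)
2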

msg117652 | Author: Alyssa Coghlan (ncoghlan), Python committer | Date: 2010-09-29 20:46

As per Antoine's comment on #9873, requiring a real string via isinstance(source, str) to trigger the string IO version is likely to be cleaner than attempting to duck-type this. Strings are an area where we make so many assumptions about the way their internals work that duck-typing generally isn't all that effective.
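
A minimal sketch of that isinstance-based dispatch (tokenize_any is a hypothetical name, not the patch's API; it leans on the undocumented generate_tokens mentioned later in this thread):

import io
import tokenize

def tokenize_any(source):
    # Hypothetical helper, not an actual stdlib API.
    if isinstance(source, str):
        # Already decoded: skip BOM/coding-cookie detection entirely.
        return tokenize.generate_tokens(io.StringIO(source).readline)
    # bytes: let tokenize() detect the encoding itself.
    return tokenize.tokenize(io.BytesIO(source).readline)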

msg121712 | Author: Abhay Saxena (ark3) | Date: 2010-11-20 18:43

If the goal is tokenize(...) accepting a text I/O readline, we already have the (undocumented) generate_tokens(readline).
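
A quick demonstration of that existing function; it takes a readline callable returning str lines, so no encoding detection happens:

import io
import tokenize

# Prints one TokenInfo per token in the already-decoded source.
for tok in tokenize.generate_tokens(io.StringIO("1+1").readline):
    print(tok)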

msg121843 | Author: Alyssa Coghlan (ncoghlan), Python committer | Date: 2010-11-21 02:54

The idea is to bring the API up a level, and also to take care of wrapping the file-like object around the source string/byte sequence.

msg143506 | Author: Meador Inge (meador.inge), Python committer | Date: 2011-09-05 02:11

Attached is a first cut at a patch.

msg252299 | Author: Martin Panter (martin.panter), Python committer | Date: 2015-10-05 03:27

I left some comments. Also, it would be nice to use the new function in the documentation example, which currently suggests tunnelling through UTF-8 but not adding an encoding comment. And see the patch for Issue 12486, which highlights a couple of other places that would benefit from this function.

msg252303 | Author: Martin Panter (martin.panter), Python committer | Date: 2015-10-05 05:14

Actually, maybe Issue 12486 is good enough to fix this too. With the patch proposed there, tokenize_basestring("source") would just be equivalent to tokenize(StringIO("source").readline).

msg316983 | Author: Thomas Kluyver (takluyver) | Date: 2018-05-17 20:48

I've opened a PR for issue #12486, which would make the existing but undocumented 'generate_tokens' function public: https://github.com/python/cpython/pull/6957

I agree that it would be good to design a nicer API for this, but the perfect shouldn't be the enemy of the good.

History

| Date | User | Action | Args |
|---|---|---|---|
| 2022-04-11 14:57:07 | admin | set | github: 54178 |
| 2018-05-17 20:48:30 | takluyver | set | messages: + msg316983 |
| 2015-10-05 05:14:30 | martin.panter | set | messages: + msg252303 |
| 2015-10-05 03:27:26 | martin.panter | set | nosy: + martin.panter; messages: + msg252299 |
| 2012-10-15 13:28:08 | serhiy.storchaka | set | versions: + Python 3.4, - Python 3.2, Python 3.3 |
| 2011-09-05 02:11:52 | meador.inge | set | files: + issue9969.patch; keywords: + patch; messages: + msg143506; stage: needs patch -> patch review |
| 2011-05-31 18:16:30 | takluyver | set | nosy: + takluyver |
| 2010-11-21 02:54:03 | ncoghlan | set | messages: + msg121843 |
| 2010-11-20 18:43:03 | ark3 | set | nosy: + ark3; messages: + msg121712 |
| 2010-09-29 20:46:41 | ncoghlan | set | messages: + msg117652 |
| 2010-09-29 01:06:53 | vstinner | set | nosy: + vstinner; messages: + msg117571 |
| 2010-09-28 21:54:16 | ncoghlan | set | nosy: + ncoghlan; messages: + msg117554 |
| 2010-09-28 14:04:57 | michael.foord | set | messages: + msg117523 |
| 2010-09-28 13:34:51 | michael.foord | set | nosy: + michael.foord |
| 2010-09-28 13:18:28 | meador.inge | set | components: + Library (Lib) |
| 2010-09-28 13:17:17 | meador.inge | create | |