Message 313168 - Python tracker

In-reply-to
Author	steve
Recipients	ezio.melotti, steve, vstinner
Date	2018-03-02.23:32:49
Message-id	<1520033569.84.0.467229070634.issue32987@psf.upfronthosting.co.za>

Content

Here is an example involving the Unicode character MIDDLE DOT (·, U+00B7). The line
ab·cd = 7
is valid Python 3 code and is accepted by the CPython interpreter. However, tokenize.py rejects it: it reports the middle dot as an ERRORTOKEN instead of treating it as part of the identifier. Here is an example you can run to see that:
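 import tokenize
 from io import BytesIO
 
 # tokenize.tokenize() expects a readline callable that returns bytes.
 test = 'ab·cd = 7'.encode('utf-8')
 
 # Print every token; the middle dot is reported as an error token.
 for tok in tokenize.tokenize(BytesIO(test).readline):
     print(tok)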
For reference, the official definition of identifiers is: 
https://docs.python.org/3.6/reference/lexical_analysis.html#identifiers
and details about MIDDLE DOT are at
https://www.unicode.org/Public/10.0.0/ucd/PropList.txt
MIDDLE DOT has the "Other_ID_Continue" property, which the identifier definition above includes in id_continue, so I think the interpreter is behaving correctly (i.e. consistent with the documented spec), while tokenize.py is wrong.
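
As a rough cross-check of the claim about the interpreter, the sketch below asks CPython itself whether MIDDLE DOT may continue an identifier. It only uses str.isidentifier(), unicodedata and exec(). The variable names are made up for illustration, and it says nothing about what tokenize.py does on any particular Python version.

```python
import unicodedata

MIDDLE_DOT = "\u00b7"  # the character from the report

# Basic facts about the character from the Unicode database.
name = unicodedata.name(MIDDLE_DOT)
category = unicodedata.category(MIDDLE_DOT)

# It may continue an identifier, but it may not start one.
can_continue = ("a" + MIDDLE_DOT).isidentifier()
can_start = MIDDLE_DOT.isidentifier()

# The compiler accepts the line from the report and binds the name.
namespace = {}
exec("ab\u00b7cd = 7", namespace)
value = namespace["ab\u00b7cd"]

print(name, category, can_continue, can_start, value)
```

If can_continue is true while tokenize.py still flags the character, the two disagree in the way described above.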

History
Date	User	Action	Args
2018-03-02 23:32:49	steve	set	recipients: + steve, vstinner, ezio.melotti
2018-03-02 23:32:49	steve	set	messageid: <1520033569.84.0.467229070634.issue32987@psf.upfronthosting.co.za>
2018-03-02 23:32:49	steve	link	issue32987 messages
2018-03-02 23:32:49	steve	create