This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.
Created on 2018-03-02 23:32 by steve, last changed 2022-04-11 14:58 by admin. This issue is now closed.
| Messages (6) | |||
|---|---|---|---|
| msg313168 - (view) | Author: Steve B (steve) | Date: 2018-03-02 23:32 | |
Here is an example involving the Unicode character MIDDLE DOT (·, U+00B7). The line

```
ab·cd = 7
```

is valid Python 3 code and is happily accepted by the CPython interpreter. However, tokenize.py does not like it: it reports the middle dot as an error token. Here is an example you can run to see that:

```python
import tokenize
from io import BytesIO

test = 'ab·cd = 7'.encode('utf-8')
x = tokenize.tokenize(BytesIO(test).readline)
for i in x:
    print(i)
```
For reference, the official definition of identifiers is:
https://docs.python.org/3.6/reference/lexical_analysis.html#identifiers
and details about MIDDLE DOT are at
https://www.unicode.org/Public/10.0.0/ucd/PropList.txt
MIDDLE DOT has the "Other_ID_Continue" property, so I think the interpreter is behaving correctly (i.e. consistent with the documented spec), while tokenize.py is wrong.
|
|||
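The interpreter-side half of the report is easy to check directly: both `str.isidentifier()` and `compile()` accept the name, which is what the lexical spec implies. A minimal stdlib-only sketch:

```python
# CPython's compiler accepts MIDDLE DOT (U+00B7) inside an identifier,
# since U+00B7 carries the Other_ID_Continue property.
name = 'ab\u00b7cd'
print(name.isidentifier())   # True

namespace = {}
exec(compile(name + ' = 7', '<test>', 'exec'), namespace)
print(namespace[name])       # 7 -- the name was bound normally
```

So the discrepancy reported here is entirely on the tokenize.py side; the compiler and `str.isidentifier()` agree with the documented identifier grammar.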
| msg313496 - (view) | Author: Terry J. Reedy (terry.reedy) * (Python committer) | Date: 2018-03-09 20:19 | |
I verified on Win10 with 3.5 (which cannot be patched) and 3.7.0b2 that ab·cd is accepted as a name and that tokenize fails as described. |
|||
| msg313792 - (view) | Author: Cheryl Sabella (cheryl.sabella) * (Python committer) | Date: 2018-03-13 22:57 | |
I believe this may be a duplicate of issue 12486. |
|||
| msg313797 - (view) | Author: Terry J. Reedy (terry.reedy) * (Python committer) | Date: 2018-03-14 01:08 | |
I think the issues are slightly different. #12486 is about the awkwardness of the API. This is about a false error after jumping through the hoops, which I think Steve B did correctly.

Following the link, the Other_ID_Continue chars are:

```
00B7       ; Other_ID_Continue # Po      MIDDLE DOT
0387       ; Other_ID_Continue # Po      GREEK ANO TELEIA
1369..1371 ; Other_ID_Continue # No  [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA       ; Other_ID_Continue # No      NEW TAI LUE THAM DIGIT ONE
# Total code points: 12
```

The 2 Po chars fail; the 2 No entries work. After looking at the tokenize module, I believe the problem is that the re for Name is r'\w+', and the Po chars are not seen as \w word characters:

```python
>>> r = re.compile(r'\w+', re.U)
>>> re.match(r, 'ab\u0387cd')
<re.Match object; span=(0, 2), match='ab'>
```

I don't know if the bug is a too-narrow definition of \w in the re module ("most characters that can be part of a word in any language, as well as numbers and the underscore") or of Name in the tokenize module. Before patching anything, I would like to know if the 2 Po Other chars are the only 2 not matched by \w. Unless someone has done so already, at least a sample of chars from each category included in the definition of 'identifier' should be tested. |
|||
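Terry's observation can be reproduced without any third-party modules: the two Po characters stop a `\w+` match, while an Ethiopic digit (category No, hence numeric and alphanumeric) does not. A small stdlib-only sketch:

```python
import re
import unicodedata

# The Po characters with Other_ID_Continue are not matched by \w ...
for ch in ('\u00b7', '\u0387'):   # MIDDLE DOT, GREEK ANO TELEIA
    m = re.match(r'\w+', 'ab' + ch + 'cd')
    print(unicodedata.name(ch), '->', repr(m.group()))  # match stops at 'ab'

# ... while the No characters are, because \w covers numeric characters.
m = re.match(r'\w+', 'ab\u1369cd')  # ETHIOPIC DIGIT ONE
print(repr(m.group()))              # the whole name matches
```

This confirms that a `Name = r'\w+'` pattern silently splits identifiers exactly at the Po code points.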
| msg313814 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2018-03-14 08:02 | |
This issue and issue12486 don't have anything in common except that both are related to the tokenize module. There are two bugs: a too narrow definition of \w in the re module (see issue12731 and issue1693050) and a too narrow definition of Name in the tokenize module.

```python
>>> allchars = list(map(chr, range(0x110000)))
>>> start = [c for c in allchars if c.isidentifier()]
>>> cont = [c for c in allchars if ('a'+c).isidentifier()]
>>> import re, regex, unicodedata
>>> for c in regex.findall(r'\W', ''.join(start)):
...     print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
...
'℘' U+2118 SCRIPT CAPITAL P
'℮' U+212E ESTIMATED SYMBOL
>>> for c in regex.findall(r'\W', ''.join(cont)):
...     print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
...
'·' U+00B7 MIDDLE DOT
'·' U+0387 GREEK ANO TELEIA
'፩' U+1369 ETHIOPIC DIGIT ONE
'፪' U+136A ETHIOPIC DIGIT TWO
'፫' U+136B ETHIOPIC DIGIT THREE
'፬' U+136C ETHIOPIC DIGIT FOUR
'፭' U+136D ETHIOPIC DIGIT FIVE
'፮' U+136E ETHIOPIC DIGIT SIX
'፯' U+136F ETHIOPIC DIGIT SEVEN
'፰' U+1370 ETHIOPIC DIGIT EIGHT
'፱' U+1371 ETHIOPIC DIGIT NINE
'᧚' U+19DA NEW TAI LUE THAM DIGIT ONE
'℘' U+2118 SCRIPT CAPITAL P
'℮' U+212E ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(start)):
...     print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
...
'ᢅ' U+1885 MONGOLIAN LETTER ALI GALI BALUDA
'ᢆ' U+1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
'℘' U+2118 SCRIPT CAPITAL P
'℮' U+212E ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(cont)):
...     print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
...
'·' U+00B7 MIDDLE DOT
'̀' U+0300 COMBINING GRAVE ACCENT
'́' U+0301 COMBINING ACUTE ACCENT
'̂' U+0302 COMBINING CIRCUMFLEX ACCENT
'̃' U+0303 COMBINING TILDE
... [total 2177 characters]
```

The second bug can be solved by adding 14 more characters to the pattern for Name:

```python
Name = r'[\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]+'
```

or

```python
Name = r'[\w\u2118\u212e][\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'
```

But first the issue with \w should be resolved (if we don't want to add 2177 characters). The other solution is implementing property support in re (issue12734). |
|||
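Serhiy's enumeration relies on the third-party `regex` module; the `re`-only part of it can be reproduced with the stdlib alone. A sketch (the exact count of gap characters depends on the Unicode version the interpreter ships, so only the two Po characters from this report are spot-checked):

```python
import re

# Find code points that may *continue* an identifier yet are not
# matched by re's \w -- i.e. the characters a Name = r'\w+' pattern
# in tokenize would break on.
gaps = [c for c in map(chr, range(0x110000))
        if ('a' + c).isidentifier() and not re.fullmatch(r'\w', c)]

print(len(gaps))                            # varies with Unicode version
print('\u00b7' in gaps, '\u0387' in gaps)   # True True
```

The same loop with `c.isidentifier()` instead of `('a' + c).isidentifier()` yields the identifier-*start* gaps, matching the shorter lists in Serhiy's session.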
| msg313852 - (view) | Author: Terry J. Reedy (terry.reedy) * (Python committer) | Date: 2018-03-15 00:58 | |
#24194 is about tokenize failing, including on middle dot. There is another tokenize name issue, already closed. I referenced Serhiy's analysis there and on the two \w issues, and closed one of them. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
|---|---|---|---|
| 2022-04-11 14:58:58 | admin | set | github: 77168 |
| 2018-03-15 00:58:29 | terry.reedy | set | status: open -> closed; superseder: Make tokenize recognize Other_ID_Start and Other_ID_Continue chars; messages: + msg313852; resolution: duplicate; stage: needs patch -> resolved |
| 2018-03-14 08:02:11 | serhiy.storchaka | set | messages: + msg313814 |
| 2018-03-14 01:08:57 | terry.reedy | set | nosy: + serhiy.storchaka; messages: + msg313797 |
| 2018-03-13 22:57:54 | cheryl.sabella | set | nosy: + cheryl.sabella; messages: + msg313792 |
| 2018-03-09 20:19:50 | terry.reedy | set | versions: + Python 3.7, Python 3.8; nosy: + terry.reedy; messages: + msg313496; stage: needs patch |
| 2018-03-02 23:32:49 | steve | create | |