
classification
Title: tokenize.py parses unicode identifiers incorrectly
Type: behavior
Stage: resolved
Components: Library (Lib), Unicode
Versions: Python 3.6, Python 3.7, Python 3.8

process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: Make tokenize recognize Other_ID_Start and Other_ID_Continue chars (issue 24194)
Assigned To:
Nosy List: cheryl.sabella, ezio.melotti, serhiy.storchaka, steve, terry.reedy, vstinner
Priority: normal
Keywords:

Created on 2018-03-02 23:32 by steve, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (6)
msg313168 - Author: Steve B (steve) Date: 2018-03-02 23:32
Here is an example involving the Unicode character MIDDLE DOT (·). The line
ab·cd = 7
is valid Python 3 code and is happily accepted by the CPython interpreter. However, tokenize.py does not like it: it reports the middle dot as an error token. Here is an example you can run to see that:
 import tokenize
 from io import BytesIO
 
 test = 'ab·cd = 7'.encode('utf-8')
 
 x = tokenize.tokenize(BytesIO(test).readline)
 for i in x: print(i)
For reference, the official definition of identifiers is: 
https://docs.python.org/3.6/reference/lexical_analysis.html#identifiers
and details about MIDDLE DOT are at
https://www.unicode.org/Public/10.0.0/ucd/PropList.txt
MIDDLE DOT has the "Other_ID_Continue" property, so I think the interpreter is behaving correctly (i.e. consistent with the documented spec), while tokenize.py is wrong.
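A quick REPL check makes the property concrete: str.isidentifier follows the reference definition above, so MIDDLE DOT cannot start an identifier but may continue one (added illustration):
>>> '\u00b7'.isidentifier()     # MIDDLE DOT alone cannot start an identifier
False
>>> 'ab\u00b7cd'.isidentifier() # but it can continue one
True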
msg313496 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-03-09 20:19
I verified on Win10 with 3.5 (which cannot be patched) and 3.7.0b2 that ab·cd is accepted as a name and that tokenize fails as described.
msg313792 - Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2018-03-13 22:57
I believe this may be a duplicate of issue 12486.
msg313797 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-03-14 01:08
I think the issues are slightly different. #12486 is about the awkwardness of the API. This is about a false error after jumping through the hoops, which I think Steve B did correctly.
Following the link, the Other_ID_Continue chars are:
00B7       ; Other_ID_Continue # Po      MIDDLE DOT
0387       ; Other_ID_Continue # Po      GREEK ANO TELEIA
1369..1371 ; Other_ID_Continue # No  [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA       ; Other_ID_Continue # No      NEW TAI LUE THAM DIGIT ONE
# Total code points: 12
The two Po chars fail, the two No chars work. After looking at the tokenize module, I believe the problem is that the regular expression for Name is r'\w+', and the Po chars are not matched as \w word characters.
>>> r = re.compile(r'\w+', re.U) 
>>> re.match(r, 'ab\u0387cd')
<re.Match object; span=(0, 2), match='ab'>
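For contrast, a companion check in the same session (added illustration: ETHIOPIC DIGIT ONE, category No, is matched by \w, so the whole name is consumed):
>>> re.match(r, 'ab\u1369cd')
<re.Match object; span=(0, 5), match='ab፩cd'>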
I don't know if the bug is a too-narrow definition of \w in the re module ("most characters that can be part of a word in any language, as well as numbers and the underscore") or of Name in the tokenize module.
Before patching anything, I would like to know if the 2 Po Other chars are the only 2 not matched by \w. Unless someone has done so already, at least a sample of chars from each category included in the definition of 'identifier' should be tested.
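A minimal sketch of such an exhaustive check (an added illustration only; it is essentially what the next message carries out with both re and regex):
 import re

 # Find every code point that is valid in identifier-continuation position
 # but is not matched by re's \w; these are the names tokenize would split.
 missed = [c for c in map(chr, range(0x110000))
           if ('a' + c).isidentifier() and not re.fullmatch(r'\w', c)]
 print(len(missed))  # Serhiy reports 2177 such chars below (Unicode-version dependent)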
msg313814 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-03-14 08:02
This issue and issue12486 don't have anything in common except that both are related to the tokenize module.
There are two bugs: a too-narrow definition of \w in the re module (see issue12731 and issue1693050) and a too-narrow definition of Name in the tokenize module.
>>> allchars = list(map(chr, range(0x110000)))
>>> start = [c for c in allchars if c.isidentifier()]
>>> cont = [c for c in allchars if ('a'+c).isidentifier()]
>>> import re, regex, unicodedata
>>> for c in regex.findall(r'\W', ''.join(start)): print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'℘' U+2118 SCRIPT CAPITAL P
'℮' U+212E ESTIMATED SYMBOL
>>> for c in regex.findall(r'\W', ''.join(cont)): print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'·' U+00B7 MIDDLE DOT
'·' U+0387 GREEK ANO TELEIA
'፩' U+1369 ETHIOPIC DIGIT ONE
'፪' U+136A ETHIOPIC DIGIT TWO
'፫' U+136B ETHIOPIC DIGIT THREE
'፬' U+136C ETHIOPIC DIGIT FOUR
'፭' U+136D ETHIOPIC DIGIT FIVE
'፮' U+136E ETHIOPIC DIGIT SIX
'፯' U+136F ETHIOPIC DIGIT SEVEN
'፰' U+1370 ETHIOPIC DIGIT EIGHT
'፱' U+1371 ETHIOPIC DIGIT NINE
'᧚' U+19DA NEW TAI LUE THAM DIGIT ONE
'℘' U+2118 SCRIPT CAPITAL P
'℮' U+212E ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(start)): print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'ᢅ' U+1885 MONGOLIAN LETTER ALI GALI BALUDA
'ᢆ' U+1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
'℘' U+2118 SCRIPT CAPITAL P
'℮' U+212E ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(cont)): print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'·' U+00B7 MIDDLE DOT
'̀' U+0300 COMBINING GRAVE ACCENT
'́' U+0301 COMBINING ACUTE ACCENT
'̂' U+0302 COMBINING CIRCUMFLEX ACCENT
'̃' U+0303 COMBINING TILDE
...
[total 2177 characters]
The second bug can be solved by adding the 14 missing characters to the pattern for Name:
 Name = r'[\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]+'
or
 Name = r'[\w\u2118\u212e][\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'
But first the issue with \w should be resolved (if we don't want to add 2177 characters).
The other solution is implementing property support in re (issue12734).
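To make the two options concrete, here is a small sketch (an added illustration, not part of the original message; the property-based spelling uses the third-party regex module, which already supports \p{...}):
 import re
 import regex

 # Second proposed pattern: the first character must be \w or an
 # Other_ID_Start char; Other_ID_Continue chars are allowed afterwards.
 Name = r'[\w\u2118\u212e][\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'
 print(re.fullmatch(Name, 'ab\xb7cd'))  # matches: MIDDLE DOT may continue a name
 print(re.fullmatch(Name, '\xb7ab'))    # None: MIDDLE DOT may not start one

 # With property support (what issue12734 asks of re), the identifier
 # definition can be written directly. '_' is XID_Continue but not
 # XID_Start, hence the explicit underscore in the first class.
 PName = r'[_\p{XID_Start}]\p{XID_Continue}*'
 print(regex.fullmatch(PName, 'ab\xb7cd'))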
msg313852 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-03-15 00:58
#24194 is about tokenize failing, including on middle dot. There is another tokenize name issue, already closed. I referenced Serhiy's analysis there and on the two \w issues, and closed one of them.
History
Date User Action Args
2022-04-11 14:58:58  admin  set  github: 77168
2018-03-15 00:58:29  terry.reedy  set  status: open -> closed
    superseder: Make tokenize recognize Other_ID_Start and Other_ID_Continue chars
    messages: + msg313852
    resolution: duplicate
    stage: needs patch -> resolved
2018-03-14 08:02:11  serhiy.storchaka  set  messages: + msg313814
2018-03-14 01:08:57  terry.reedy  set  nosy: + serhiy.storchaka
    messages: + msg313797
2018-03-13 22:57:54  cheryl.sabella  set  nosy: + cheryl.sabella
    messages: + msg313792
2018-03-09 20:19:50  terry.reedy  set  versions: + Python 3.7, Python 3.8
    nosy: + terry.reedy
    messages: + msg313496
    stage: needs patch
2018-03-02 23:32:49  steve  create
