
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Make tokenize recognize Other_ID_Start and Other_ID_Continue chars
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: open Resolution:
Dependencies: 12731 12734 Superseder:
Assigned To: terry.reedy Nosy List: Joshua.Landau, iritkatriel, meador.inge, terry.reedy
Priority: normal Keywords: patch

Created on 2015-05-14 13:00 by Joshua.Landau, last changed 2022-04-11 14:58 by admin.

Files
File name Uploaded Description Edit
issue24194-v0.patch meador.inge, 2016-05-11 01:45 review
Messages (5)
msg243188 - Author: Joshua Landau (Joshua.Landau) * Date: 2015-05-14 13:00
This is valid:
 ℘· = 1
 print(℘·)
 #>>> 1
But this gives an error token:
 from io import BytesIO
 from tokenize import tokenize
 stream = BytesIO("℘·".encode("utf-8"))
 print(*tokenize(stream.readline), sep="\n")
 #>>> TokenInfo(type=56 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
 #>>> TokenInfo(type=53 (ERRORTOKEN), string='℘', start=(1, 0), end=(1, 1), line='℘·')
 #>>> TokenInfo(type=53 (ERRORTOKEN), string='·', start=(1, 1), end=(1, 2), line='℘·')
 #>>> TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
This is a continuation of http://bugs.python.org/issue9712. I'm not able to reopen the issue, so I thought I should report it anew.
It is tokenize that is wrong - Other_ID_Start and Other_ID_Continue are documented to be valid:
https://docs.python.org/3.5/reference/lexical_analysis.html#identifiers 
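The mismatch can be demonstrated directly: the compiler accepts these characters in identifiers, while re's \w (on which the pure-Python tokenizer's Name pattern is built in the affected versions) matches neither of them. A minimal sketch of the contrast:

```python
import re

# U+2118 (SCRIPT CAPITAL P) is Other_ID_Start and U+00B7 (MIDDLE DOT)
# is Other_ID_Continue, so the language accepts them in identifiers:
print("℘·".isidentifier())   # True

ns = {}
exec("℘· = 1", ns)           # the CPython compiler accepts the name
print(ns["℘·"])              # 1

# ...but re's \w does not match either character, which is why the
# tokenize module falls through to ERRORTOKEN for them:
print(re.match(r"\w", "℘"))  # None
print(re.match(r"\w", "·"))  # None
```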
msg265286 - Author: Meador Inge (meador.inge) * (Python committer) Date: 2016-05-11 01:45
Attached is a first cut patch for this. (CC'd haypo as a unicode expert).
msg313851 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-03-15 00:55
I closed #1693050 as a duplicate of #12731 (the \w issue). I left #9712 closed and closed #32987, and marked both as duplicates of this.
In msg313814 of the latter, Serhiy indicates which start and continue identifier characters are currently matched by \W for re and regex. He gives there a fix for this that he says requires the \w issue to be fixed. It is similar to the posted patch. He says that without \w fixed, another 2000+ characters would need to be added. Perhaps the v0 patch needs more tests (I don't know).
He also says that re support for properties, #12734, would make things even better.
Three of the characters in the patch are too obscure for Firefox on Windows and print as boxes. Some others I do not recognize, and I could not type any of them. I thought we had a policy of using \u or \U escapes even in tests to avoid such problems. (I notice that there are already non-ASCII chars in the context.)
msg410718 - Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2022-01-16 19:57
Reproduced on 3.11.
msg410731 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2022-01-16 23:10
Updated doc link, which appears to be the same:
https://docs.python.org/3.11/reference/lexical_analysis.html#identifiers
Updated property list linked from the above:
https://www.unicode.org/Public/14.0.0/ucd/PropList.txt
Relevant content for this issue:
1885..1886 ; Other_ID_Start # Mn [2] MONGOLIAN LETTER ALI GALI BALUDA..MONGOLIAN LETTER ALI GALI THREE BALUDA
2118 ; Other_ID_Start # Sm SCRIPT CAPITAL P
212E ; Other_ID_Start # So ESTIMATED SYMBOL
309B..309C ; Other_ID_Start # Sk [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
# Total code points: 6
00B7 ; Other_ID_Continue # Po MIDDLE DOT
0387 ; Other_ID_Continue # Po GREEK ANO TELEIA
1369..1371 ; Other_ID_Continue # No [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA ; Other_ID_Continue # No NEW TAI LUE THAM DIGIT ONE
# Total code points: 12
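The PropList entries above can be expanded and checked against str.isidentifier() directly. A sketch (note that the XID_Start/XID_Continue derivation excludes a few Other_ID_Start characters after NFKC closure, e.g. U+309B and U+309C, so not every listed start character is necessarily accepted, and results for U+1885/U+1886 depend on the Unicode version a given Python was built with):

```python
# Codepoints transcribed from the Unicode 14.0.0 PropList excerpt above.
OTHER_ID_START = [0x1885, 0x1886, 0x2118, 0x212E, 0x309B, 0x309C]
OTHER_ID_CONTINUE = [0x00B7, 0x0387, *range(0x1369, 0x1372), 0x19DA]

for cp in OTHER_ID_START:
    print(f"U+{cp:04X} start:    {chr(cp).isidentifier()}")
for cp in OTHER_ID_CONTINUE:
    # Continue characters are only valid after a start character.
    print(f"U+{cp:04X} continue: {('a' + chr(cp)).isidentifier()}")
```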
Codepoints of the '℘·' opening example:
U+2118 Other_ID_Start Sm SCRIPT CAPITAL P
U+00B7 Other_ID_Continue Po MIDDLE DOT
Except for the two Mongolian start characters, Meador's patch hardcodes the 'Other' characters, thereby adding them without waiting for re to be fixed. While this will miss new additions without manual updates, it is better than missing everything for however many years. I will make a PR with the additions and look at the new tests.
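The hardcoding approach can be sketched as follows (a hypothetical illustration, not the literal patch: the pure-Python tokenizer's Name pattern is simply r'\w+', and the exact character set and regex shape in issue24194-v0.patch may differ):

```python
import re

# Hypothetically splice the Other_ID_Start / Other_ID_Continue
# codepoints into a \w-based Name pattern.
extra_start = "\u1885\u1886\u2118\u212E\u309B\u309C"
extra_cont = extra_start + "\u00B7\u0387\u1369-\u1371\u19DA"
Name = rf"[\w{extra_start}][\w{extra_cont}]*"

# '℘·' now matches as a single name instead of failing on both chars.
print(re.fullmatch(Name, "℘·"))
```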
History
Date User Action Args
2022-04-11 14:58:16  admin  set  github: 68382
2022-01-17 11:25:14  vstinner  set  nosy: - vstinner
2022-01-16 23:14:48  terry.reedy  set  assignee: meador.inge -> terry.reedy
2022-01-16 23:13:45  terry.reedy  set  title: tokenize fails on some Other_ID_Start or Other_ID_Continue -> Make tokenize recognize Other_ID_Start and Other_ID_Continue chars
2022-01-16 23:10:56  terry.reedy  set  messages: + msg410731
2022-01-16 19:57:17  iritkatriel  set  nosy: + iritkatriel
                                       messages: + msg410718
                                       versions: + Python 3.9, Python 3.10, Python 3.11, - Python 3.6, Python 3.7, Python 3.8
2018-03-15 00:58:29  terry.reedy  link  issue32987 superseder
2018-03-15 00:55:40  terry.reedy  set  nosy: + terry.reedy
                                       title: tokenize yield an ERRORTOKEN if an identifier uses Other_ID_Start or Other_ID_Continue -> tokenize fails on some Other_ID_Start or Other_ID_Continue
                                       messages: + msg313851
                                       versions: + Python 3.7, Python 3.8, - Python 3.5
2018-03-15 00:32:39  terry.reedy  link  issue1693050 superseder
2018-03-15 00:12:21  terry.reedy  link  issue9712 superseder
2016-05-11 01:45:33  meador.inge  set  files: + issue24194-v0.patch
                                       assignee: meador.inge
                                       keywords: + patch
                                       nosy: + vstinner, meador.inge
                                       messages: + msg265286
                                       stage: needs patch -> patch review
2016-04-25 06:08:17  serhiy.storchaka  set  dependencies: + python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a, Request for property support in Python re lib
2016-04-25 06:05:43  serhiy.storchaka  set  stage: needs patch
                                            versions: + Python 3.5, Python 3.6, - Python 3.4
2016-04-25 06:05:00  serhiy.storchaka  link  issue26843 superseder
2015-05-14 13:00:27  Joshua.Landau  create
