Message 313814 - Python tracker


In-reply-to
Author	serhiy.storchaka
Recipients	cheryl.sabella, ezio.melotti, serhiy.storchaka, steve, terry.reedy, vstinner
Date	2018-03-14.08:02:10
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1521014531.52.0.467229070634.issue32987@psf.upfronthosting.co.za>

Content

This issue and issue12486 have nothing in common except that both are related to the tokenize module.
There are two bugs: an overly narrow definition of \w in the re module (see issue12731 and issue1693050) and an overly narrow definition of Name in the tokenize module.
>>> allchars = list(map(chr, range(0x110000)))
>>> start = [c for c in allchars if c.isidentifier()]
>>> cont = [c for c in allchars if ('a'+c).isidentifier()]
>>> import re, regex, unicodedata
>>> for c in regex.findall(r'\W', ''.join(start)): print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'℘' U+2118 SCRIPT CAPITAL P
'℮' U+212E ESTIMATED SYMBOL
>>> for c in regex.findall(r'\W', ''.join(cont)): print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'·' U+00B7 MIDDLE DOT
'·' U+0387 GREEK ANO TELEIA
'፩' U+1369 ETHIOPIC DIGIT ONE
'፪' U+136A ETHIOPIC DIGIT TWO
'፫' U+136B ETHIOPIC DIGIT THREE
'፬' U+136C ETHIOPIC DIGIT FOUR
'፭' U+136D ETHIOPIC DIGIT FIVE
'፮' U+136E ETHIOPIC DIGIT SIX
'፯' U+136F ETHIOPIC DIGIT SEVEN
'፰' U+1370 ETHIOPIC DIGIT EIGHT
'፱' U+1371 ETHIOPIC DIGIT NINE
'᧚' U+19DA NEW TAI LUE THAM DIGIT ONE
'℘' U+2118 SCRIPT CAPITAL P
'℮' U+212E ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(start)): print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'ᢅ' U+1885 MONGOLIAN LETTER ALI GALI BALUDA
'ᢆ' U+1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
'℘' U+2118 SCRIPT CAPITAL P
'℮' U+212E ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(cont)): print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'·' U+00B7 MIDDLE DOT
'̀' U+0300 COMBINING GRAVE ACCENT
'́' U+0301 COMBINING ACUTE ACCENT
'̂' U+0302 COMBINING CIRCUMFLEX ACCENT
'̃' U+0303 COMBINING TILDE
...
[total 2177 characters]
The second bug can be solved by adding 14 more characters to the pattern for Name:
 Name = r'[\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]+'
or
 Name = r'[\w\u2118\u212e][\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'
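As a quick sanity check, the sketch below compares the two candidate patterns against str.isidentifier() on a few of the characters listed above. The names NAME_LOOSE, NAME_STRICT, matches and samples are illustrative only and are not part of tokenize. Note that the first form accepts continuation-only characters such as U+00B7 in the first position, while the second form does not.

```python
import re

# The two candidate patterns from this message (the constant names are illustrative).
NAME_LOOSE = r'[\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]+'
NAME_STRICT = r'[\w\u2118\u212e][\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'

def matches(pattern, s):
    """Return True if the whole string matches the pattern."""
    return re.fullmatch(pattern, s) is not None

# Valid identifiers taken from the lists above, plus a lone MIDDLE DOT,
# which may continue an identifier but may not start one.
samples = ['\u2118', '\u212e', 'a\xb7', 'a\u0387', 'a\u1369', 'a\u19da', '\xb7']
for s in samples:
    print('%-12a isidentifier=%-5s loose=%-5s strict=%s' % (
        s, s.isidentifier(), matches(NAME_LOOSE, s), matches(NAME_STRICT, s)))
```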
But the issue with \w should be resolved first (unless we want to add all 2177 characters to the pattern).
Another solution is to implement property support in re (issue12734).
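To gauge how large an explicit character class would be without a fix for \w, a sketch like the following collapses the identifier characters that re's \w rejects into contiguous ranges. The helper names missing_from_w and to_ranges are hypothetical, and the counts depend on the Unicode database and re implementation of the running Python version, so they may differ from the 2177 quoted above.

```python
import re
import sys

def missing_from_w():
    """Code points that may appear in an identifier but that re's word class rejects."""
    not_word = re.compile(r'\W')
    # Every valid start character is also a valid continuation character,
    # so testing 'a' + c covers both positions.
    return [cp for cp in range(sys.maxunicode + 1)
            if ('a' + chr(cp)).isidentifier() and not_word.fullmatch(chr(cp))]

def to_ranges(cps):
    """Collapse a sorted list of code points into inclusive (first, last) ranges."""
    ranges = []
    for cp in cps:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges

missing = missing_from_w()
ranges = to_ranges(missing)
print(len(missing), 'characters in', len(ranges), 'ranges')
```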

History
Date	User	Action	Args
2018-03-14 08:02:11	serhiy.storchaka	set	recipients: + serhiy.storchaka, terry.reedy, vstinner, ezio.melotti, cheryl.sabella, steve
2018-03-14 08:02:11	serhiy.storchaka	set	messageid: <1521014531.52.0.467229070634.issue32987@psf.upfronthosting.co.za>
2018-03-14 08:02:11	serhiy.storchaka	link	issue32987 messages
2018-03-14 08:02:10	serhiy.storchaka	create
