Message190326
| Author |
vstinner |
| Recipients |
BreamoreBoy, ezio.melotti, l0nwlf, lemburg, loewis, mrabarnett, nathanlmiles, rsc, terry.reedy, timehorse, vstinner |
| Date |
2013-05-29.20:33:57 |
| Message-id |
<1369859638.44.0.439395953588.issue1693050@psf.upfronthosting.co.za> |
| In-reply-to |
| Content |
Let's look at Modules/_sre.c:
#define SRE_UNI_IS_ALNUM(ch) Py_UNICODE_ISALNUM(ch)
#define SRE_UNI_IS_WORD(ch) (SRE_UNI_IS_ALNUM(ch) || (ch) == '_')
>>> [ch.isalpha() for ch in '\u0939\u093f\u0928\u094d\u0926\u0940']
[True, False, True, False, True, False]
>>> import unicodedata
>>> [unicodedata.category(ch) for ch in '\u0939\u093f\u0928\u094d\u0926\u0940']
['Lo', 'Mc', 'Lo', 'Mn', 'Lo', 'Mc']
So the match ends at U+093F: its category is "spacing combining mark" (Mc), which belongs to the Mark category, whereas the re module expects an alphanumeric character there.
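To make this concrete, here is a rough pure-Python imitation of the two macros above. It is only a sketch: sre_is_word and word_prefix are hypothetical helper names, and I am assuming str.isalnum() classifies characters the same way Py_UNICODE_ISALNUM does.

```python
def sre_is_word(ch):
    # Mirrors SRE_UNI_IS_WORD: alphanumeric or underscore.
    # Assumption: str.isalnum() matches Py_UNICODE_ISALNUM.
    return ch.isalnum() or ch == '_'


def word_prefix(text):
    # Longest leading run of "word" characters, i.e. roughly what
    # a \w+ match anchored at the start of text would consume.
    end = 0
    for ch in text:
        if not sre_is_word(ch):
            break
        end += 1
    return text[:end]


hindi = '\u0939\u093f\u0928\u094d\u0926\u0940'
print(ascii(word_prefix(hindi)))  # '\u0939'
```

Only the first letter is consumed, because the vowel sign U+093F that follows it is not alphanumeric.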
msg76557:
"""
Unicode TR#18 defines \w as a shorthand for
\p{alpha}
\p{gc=Mark}
\p{digit}
\p{gc=Connector_Punctuation}
"""
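The four classes quoted above can be approximated with unicodedata. This is a sketch, not a proposed patch: is_tr18_word and tr18_word_prefix are hypothetical names, str.isalpha() only approximates \p{alpha} (the Alphabetic property is broader than the letter categories), and \p{digit} is assumed to mean gc=Nd.

```python
import unicodedata


def is_tr18_word(ch):
    # Approximation of the \w definition quoted from TR#18.
    cat = unicodedata.category(ch)
    return (
        ch.isalpha()            # stand-in for \p{alpha} (approximation)
        or cat.startswith('M')  # \p{gc=Mark}: Mn, Mc, Me
        or cat == 'Nd'          # \p{digit}, assumed to be gc=Nd
        or cat == 'Pc'          # \p{gc=Connector_Punctuation}, includes '_'
    )


def tr18_word_prefix(text):
    # Longest leading run of characters accepted by is_tr18_word.
    end = 0
    for ch in text:
        if not is_tr18_word(ch):
            break
        end += 1
    return text[:end]


hindi = '\u0939\u093f\u0928\u094d\u0926\u0940'
print(tr18_word_prefix(hindi) == hindi)  # True
```

With the Mark category accepted, the whole word is consumed instead of stopping at U+093F.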
So if we want to respect this standard, the re module needs to be modified to accept other Unicode categories. |
|
History
|
|---|
| Date |
User |
Action |
Args |
| 2013-05-29 20:33:58 | vstinner | set | recipients:
+ vstinner, lemburg, loewis, terry.reedy, nathanlmiles, rsc, timehorse, ezio.melotti, mrabarnett, l0nwlf, BreamoreBoy |
| 2013-05-29 20:33:58 | vstinner | set | messageid: <1369859638.44.0.439395953588.issue1693050@psf.upfronthosting.co.za> |
| 2013-05-29 20:33:58 | vstinner | link | issue1693050 messages |
| 2013-05-29 20:33:58 | vstinner | create |
|