
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: \w not helpful for non-Roman scripts
Type:
Stage: resolved
Components: Regular Expressions
Versions: Python 3.3, Python 3.4, Python 2.7
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: Make tokenize recognize Other_ID_Start and Other_ID_Continue chars (view: 24194)
Assigned To:
Nosy List: ezio.melotti, l0nwlf, lemburg, loewis, mrabarnett, nathanlmiles, rsc, terry.reedy, timehorse, vstinner
Priority: normal
Keywords:

Created on 2007-04-02 15:27 by nathanlmiles, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (15)
msg31688 - (view) Author: nlmiles (nathanlmiles) Date: 2007-04-02 15:27
When I try to use r"\w+(?u)" to find words in a Unicode Devanagari text, bad things happen: words get chopped into small pieces. I think this is likely because vowel signs such as U+093E are not considered to match \w.
I think that if you wish \w to be useful for Indic scripts, \w will need to be expanded to include the Unicode character categories Mc, Mn, and Me.
I am using Python 2.4.4 on Windows XP SP2.
I ran the following script to see the characters which I think ought to match \w but don't:
import re
import unicodedata
text = ""
for i in range(0x901,0x939): text += unichr(i)
for i in range(0x93c,0x93d): text += unichr(i)
for i in range(0x93e,0x94d): text += unichr(i)
for i in range(0x950,0x954): text += unichr(i)
for i in range(0x958,0x963): text += unichr(i)
 
parts = re.findall("\W(?u)", text)
for ch in parts:
 print "%04x" % ord(ch), unicodedata.category(ch)
The odd character here is U+0904. Its categorization seems to imply that you are using the Unicode 3.0 database, but perhaps later versions of Python are using the current 5.0 database.
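A Python 3 port of the script above (`unichr` becomes `chr`, `print` becomes a function); the exact output depends on the Unicode database shipped with your interpreter:

```python
import re
import unicodedata

# Python 3 port of the script above: collect the Devanagari codepoints
# that the author expected \w to match.
text = ""
for lo, hi in [(0x901, 0x939), (0x93C, 0x93D), (0x93E, 0x94D),
               (0x950, 0x954), (0x958, 0x963)]:
    text += "".join(chr(i) for i in range(lo, hi))

# Report every codepoint in the sample that \w does NOT match,
# together with its Unicode general category.
for ch in re.findall(r"\W", text):
    print("%04x %s" % (ord(ch), unicodedata.category(ch)))
```

The non-matching characters reported this way fall in the mark categories (Mn, Mc) that the message argues should count as word characters.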
msg31689 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2007-04-02 15:38
Python 2.4 is using Unicode 3.2. Python 2.5 ships with Unicode 4.1.
We're likely to ship Unicode 5.x with Python 2.6 or a later release.
Regarding the char classes: I don't think Mc, Mn and Me should be considered parts of a word. Those are marks which usually separate words.
msg76556 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2008-11-28 21:14
Vowel 'marks' are condensed vowel characters; they are very much part of words and do not separate words. Python 3 properly includes Mn and Mc as identifier characters.
http://docs.python.org/dev/3.0/reference/lexical_analysis.html#identifiers-and-keywords
For instance, the word 'hindi' has 3 consonants 'h', 'n', 'd', 2 vowels
'i' and 'ii' (long i) following 'h' and 'd', and a null vowel (virama)
after 'n'. [The null vowel is needed because no vowel mark indicates the
default vowel short a. So without it, the word would be hinadii.]
The difference between the devanagari vowel characters, used at the
beginning of words, and the vowel marks, used thereafter, is purely
graphical and not phonological. In short, in the sanskrit family,
word = syllable+
syllable = vowel | consonant + vowel mark
From a comp.lang.python post asking why re does not see hindi as a word:
हिन्दी
 ह DEVANAGARI LETTER HA (Lo)
 ि DEVANAGARI VOWEL SIGN I (Mc)
 न DEVANAGARI LETTER NA (Lo)
 ् DEVANAGARI SIGN VIRAMA (Mn)
 द DEVANAGARI LETTER DA (Lo)
 ी DEVANAGARI VOWEL SIGN II (Mc)
str.isalpha() and possibly other Unicode methods need fixing also:
>>> 'हिन्दी'.isalpha()#2.x and 3.0
False
msg76557 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-11-28 21:33
Unicode TR#18 defines \w as a shorthand for
\p{alpha}
\p{gc=Mark}
\p{digit}
\p{gc=Connector_Punctuation}
which would include all marks. We should recursively check whether we
follow the recommendation (e.g. \p{alpha} refers to all characters having
the Alphabetic derived core property, which is Lu+Ll+Lt+Lm+Lo+Nl +
Other_Alphabetic, where Other_Alphabetic is a selected list of
additional characters, all from Mn/Mc).
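The TR#18 shorthand above can be approximated with the standard `unicodedata` module. This is a sketch, not an exact implementation: `str.isalpha` covers only the L* general categories, not the full Alphabetic derived property that \p{alpha} denotes:

```python
import unicodedata

def is_word_char(ch):
    # Approximates the TR#18 \w shorthand:
    # alpha | gc=Mark | digit | gc=Connector_Punctuation.
    cat = unicodedata.category(ch)
    return (ch.isalpha()             # L* categories (approximates \p{alpha})
            or cat.startswith('M')   # Mn, Mc, Me -- the marks
            or cat == 'Nd'           # \p{digit}
            or cat == 'Pc')          # connector punctuation, e.g. '_'

# Every codepoint of the Devanagari word from msg76556 then qualifies:
print(all(is_word_char(ch) for ch in 'हिन्दी'))  # True
```

Under this definition the whole word matches, which is the behaviour the reporter asked for.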
msg81221 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2009-02-05 19:51
In issue #2636 I'm using the following:
Alpha is Ll, Lo, Lt, Lu.
Digit is Nd.
Word is Ll, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc.
These are what are specified at
http://www.regular-expressions.info/posixbrackets.html 
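As a sketch (not the regex module's actual implementation), the Word category list above can be checked per codepoint with `unicodedata`:

```python
import unicodedata

# Word categories as listed above for the regex module.
WORD_CATEGORIES = {'Ll', 'Lo', 'Lt', 'Lu', 'Mc', 'Me', 'Mn',
                   'Nd', 'Nl', 'No', 'Pc'}

def is_word(ch):
    return unicodedata.category(ch) in WORD_CATEGORIES

# All six codepoints of the Devanagari word count as word characters here:
print([is_word(ch) for ch in 'हिन्दी'])
# [True, True, True, True, True, True]
```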
msg190075 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2013-05-26 11:02
Am I correct in saying that this must stay open because it targets the re module, but, as noted in msg81221, is fixed in the new regex module?
msg190100 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2013-05-26 16:56
I had to check what re does in Python 3.3:
>>> print(len(re.match(r'\w+', 'हिन्दी').group()))
1
Regex does this:
>>> print(len(regex.match(r'\w+', 'हिन्दी').group()))
6
msg190219 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2013-05-28 15:22
Matthew, I think that is considered a single word in Sanskrit or Thai, so Python 3.x is correct. In this case you've written the Sanskrit word for Hindi.
msg190226 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2013-05-28 16:51
I'm not sure what you're saying.
The re module in Python 3.3 matches only the first codepoint, treating the second codepoint as not part of a word, whereas the regex module matches all 6 codepoints, treating them all as part of a single word.
msg190268 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2013-05-29 03:34
Maybe you could show us the byte-for-byte hex of the string you're testing so we can examine whether it's really a code point intended as a word boundary or just a code point beginning a new character.
msg190322 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2013-05-29 17:31
You could've obtained it from msg76556 or msg190100:
>>> print(ascii('हिन्दी'))
'\u0939\u093f\u0928\u094d\u0926\u0940'
>>> import re, regex
>>> print(ascii(re.match(r"\w+", '\u0939\u093f\u0928\u094d\u0926\u0940').group()))
'\u0939'
>>> print(ascii(regex.match(r"\w+", '\u0939\u093f\u0928\u094d\u0926\u0940').group()))
'\u0939\u093f\u0928\u094d\u0926\u0940'
msg190323 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2013-05-29 18:23
Thanks Matthew, and sorry to put you through more work; I just wanted to verify exactly which Unicode code points (UTF-16, I take it) were being used, to check whether the Unicode standard expects them to be treated as separate words or as single letters within a word. Sanskrit uses an alphabet, not ideographs, so each symbol is considered a letter. So I believe your implementation is correct and yes, you are right, re is at fault. There are just accenting characters and letters in that sequence, so they should be interpreted as a single word of 6 letters, as you determined, and not just the first letter. Mind you, I misinterpreted msg190100: I thought you were using findall, in which case the answer should be 1; but as the length of the extracted match, yes, 6, I totally agree. Sorry for the misunderstanding. http://www.unicode.org/charts/PDF/U0900.pdf contains the Devanagari code chart used for Hindi.
msg190324 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2013-05-29 18:46
UTF-16 has nothing to do with it, that's just an encoding (a pair of them actually, UTF-16LE and UTF-16BE).
And I don't know why you thought I was using findall in msg190100 when the examples were using match! :-)
msg190326 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-05-29 20:33
Let see Modules/_sre.c:
#define SRE_UNI_IS_ALNUM(ch) Py_UNICODE_ISALNUM(ch)
#define SRE_UNI_IS_WORD(ch) (SRE_UNI_IS_ALNUM(ch) || (ch) == '_')
>>> [ch.isalpha() for ch in '\u0939\u093f\u0928\u094d\u0926\u0940']
[True, False, True, False, True, False]
>>> import unicodedata
>>> [unicodedata.category(ch) for ch in '\u0939\u093f\u0928\u094d\u0926\u0940']
['Lo', 'Mc', 'Lo', 'Mn', 'Lo', 'Mc']
So the matching ends at U+093F because its category is "spacing combining" (Mc), which is part of the Mark category, whereas the re module expects an alphanumeric character.
msg76557:
"""
Unicode TR#18 defines \w as a shorthand for
\p{alpha}
\p{gc=Mark}
\p{digit}
\p{gc=Connector_Punctuation}
"""
So if we want to respect this standard, the re module needs to be modified to accept other Unicode categories.
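The per-codepoint analysis above can be reproduced directly on Python 3:

```python
import unicodedata

word = '\u0939\u093f\u0928\u094d\u0926\u0940'  # हिन्दी

# str.isalpha() is True only for the L* categories, so the marks fail:
print([ch.isalpha() for ch in word])
# [True, False, True, False, True, False]

print([unicodedata.category(ch) for ch in word])
# ['Lo', 'Mc', 'Lo', 'Mn', 'Lo', 'Mc']
```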
msg313849 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-03-15 00:32
Whatever I may have said before, I favor supporting the Unicode standard for \w, which is related to the standard for identifiers.
This is one of 2 issues about \w being defined too narrowly. I am somewhat arbitrarily closing this as a duplicate of #12731 (fewer digits ;-).
There are 3 issues about tokenize.tokenize failing on valid identifiers, defined as \w sequences whose first char is an identifier itself (and therefore a start char). In msg313814 of #32987, Serhiy indicates which start and continue identifier characters are matched by \W for re and regex. I am leaving #24194 open as the tokenizer name issue.
History
Date User Action Args
2022-04-11 14:56:23 admin set github: 44795
2018-03-15 00:32:39 terry.reedy set status: open -> closed
superseder: Make tokenize recognize Other_ID_Start and Other_ID_Continue chars
messages: + msg313849
resolution: duplicate
stage: resolved
2014-02-03 17:08:44 BreamoreBoy set nosy: - BreamoreBoy
2013-05-29 20:33:58 vstinner set messages: + msg190326
2013-05-29 18:46:46 mrabarnett set messages: + msg190324
2013-05-29 18:23:52 timehorse set messages: + msg190323
2013-05-29 17:31:08 mrabarnett set messages: + msg190322
2013-05-29 03:34:40 timehorse set messages: + msg190268
2013-05-28 16:51:51 mrabarnett set messages: + msg190226
2013-05-28 15:22:33 timehorse set messages: + msg190219
2013-05-26 21:09:23 terry.reedy set versions: + Python 3.3, Python 3.4, - Python 3.1
2013-05-26 16:56:19 mrabarnett set messages: + msg190100
2013-05-26 11:02:14 BreamoreBoy set nosy: + BreamoreBoy
messages: + msg190075
2010-03-31 01:29:17 l0nwlf set nosy: + l0nwlf
2010-03-05 15:37:50 vstinner set nosy: + vstinner
2009-05-12 14:41:55 ezio.melotti set nosy: + ezio.melotti
2009-02-05 19:51:20 mrabarnett set nosy: + mrabarnett
messages: + msg81221
2008-11-28 21:33:40 loewis set nosy: + loewis
messages: + msg76557
2008-11-28 21:14:55 terry.reedy set nosy: + terry.reedy
messages: + msg76556
versions: + Python 3.1
2008-09-28 19:20:16 timehorse set nosy: + timehorse
versions: + Python 2.7, - Python 2.4
2008-04-24 21:07:01 rsc set nosy: + rsc
2007-04-02 15:27:11 nathanlmiles create
