Issue 6561: Regex '\d' should not match unicode category 'No'.


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/50810

classification

Title:	Regex '\d' should not match unicode category 'No'.
Type:	behavior	Stage:	resolved
Components:	Extension Modules	Versions:	Python 2.7

process

Dependencies:	Superseder:
Status:	closed	Resolution:
Assigned To:	Nosy List:	eric.smith, ezio.melotti, lemburg, mark.dickinson, pitrou, r.david.murray
Priority:	normal	Keywords:	needs review, patch

Created on 2009-07-24 10:48 by mark.dickinson, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded
issue6561.patch	mark.dickinson, 2009-07-24 16:36

Messages (8)
msg90878	Author: Mark Dickinson (mark.dickinson) * (Python committer)	Date: 2009-07-24 10:47
In Python 3, or in Python 2 with the re.UNICODE flag, it appears that the regex r'\d' matches all unicode characters with category either 'Nd' (Number, Decimal Digit) or 'No' (Number, Other), but not characters in category 'Nl' (Number, Letter):

Python 3.2a0 (py3k:74188, Jul 23 2009, 16:01:29)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> import unicodedata
>>> x = '\u2781'
>>> unicodedata.category(x)
'No'
>>> unicodedata.name(x)
'DINGBAT CIRCLED SANS-SERIF DIGIT TWO'
>>> re.match(r'\d', '\u2781')
<_sre.SRE_Match object at 0x3d5d08>

I believe (but am not 100% sure) that r'\d' should only match characters in category 'Nd'. To back up this belief:

(1) int and float currently accept characters in category 'Nd' but not 'No'; it would seem useful for '\d' to match those characters that are accepted by int, so that e.g., something matched with '\d+' could be directly passed to int. (This came up in a #python-dev discussion about whether the Decimal type should accept other unicode digits; that's a separate issue, though.)

(2) In Perl 5.10 (and possibly some earlier versions too), '\d' matches only characters in category 'Nd'.

(3) Unicode Technical Standard #18 ("Unicode Regular Expressions") at http://unicode.org/unicode/reports/tr18/ recommends that '\d' should correspond to \p{gc=Decimal_Number}.

Marc-André, do you have any opinion on this?

It's probably slightly dangerous to change this in 2.6 or 3.1; I'm proposing that '\d' should be modified to accept only characters of category 'Nd' in 2.7 and 3.2.

(Thanks Ezio Melotti for finding all the references above and doing Perl testing!)
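The relationship described in point (1) can be checked with a short script. This is only a sketch, and it assumes a Python 3 release that already includes the change proposed in this issue; on the 3.2a0 build shown above the 'No' character would still match. The helper name digit_report and the sample characters other than '\u2781' are illustrative choices, not part of the report.

```python
import re
import unicodedata


def digit_report(ch):
    """Return (category, matched by \\d, accepted by int) for one character.

    Hypothetical helper for illustration only; it is not part of the patch.
    """
    category = unicodedata.category(ch)
    matched = re.match(r"\d", ch) is not None
    try:
        int(ch)
        accepted = True
    except ValueError:
        accepted = False
    return category, matched, accepted


# Sample characters, one per Unicode number category discussed above.
samples = [
    "\u0663",  # ARABIC-INDIC DIGIT THREE (Nd)
    "\u2781",  # DINGBAT CIRCLED SANS-SERIF DIGIT TWO (No), from the report
    "\u2160",  # ROMAN NUMERAL ONE (Nl)
]

for ch in samples:
    print(unicodedata.name(ch), digit_report(ch))

# With \d restricted to Nd, a run matched by \d+ can be handed to int().
text = "\u0661\u0662\u0663"  # Arabic-Indic digits one, two, three
if re.fullmatch(r"\d+", text):
    print(int(text))
```

With the fix in place, the second and third elements of each tuple should agree for every sample, which is the property the proposal is after.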
msg90885	Author: Mark Dickinson (mark.dickinson) * (Python committer)	Date: 2009-07-24 14:51
Patch against py3k.
msg90888	Author: Mark Dickinson (mark.dickinson) * (Python committer)	Date: 2009-07-24 16:36
New patch; same as before, but includes a clarification to the documentation.
msg90927	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2009-07-25 17:23
This sounds reasonable to me.
msg90929	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2009-07-25 18:01
This seems to me quite redundant:

+   Matches any Unicode decimal digit; more specifically, matches
+   any character in Unicode category [Nd] (Number, Decimal Digit).
+   This includes ``[0-9]``, and also many other digit characters.

I suggest something like:

Matches the decimal digits ``[0-9]`` and all the characters that belong to the Unicode category Nd (Number, Decimal Digit).

Two more minor details: instead of '\d', I'd use '^\d$', and instead of
self.assertEqual(re.match('\d', x), None)
I'd use
self.assertIsNone(re.match('\d', x)).
msg90971	Author: R. David Murray (r.david.murray) * (Python committer)	Date: 2009-07-27 02:23
It may be redundant, but it is also more technically accurate. I'm -0 on your proposed rephrasing, and trust Mark to make the right decision :)
msg91012	Author: Mark Dickinson (mark.dickinson) * (Python committer)	Date: 2009-07-28 17:23
[ezio.melotti]
> I suggest something like:
> Matches the decimal digits ``[0-9]`` and all the characters that belong
> to the Unicode category Nd (Number, Decimal Digit).

Hmm. I don't like this because it suggests (to me) that the characters [0-9] don't belong to category [Nd]. I agree the previous version was clunky, though. I've shortened it some; if anyone else wants to work on the wording please feel free. It might be nice to annotate each of these character classes (\w, \s) with the Unicode character categories that they correspond to.

> Two more minor details: instead of '\d', I'd use '^\d$' and instead of
> self.assertEqual(re.match('\d', x), None)
> self.assertIsNone(re.match('\d', x)).

Thanks. Changes applied.

Committed to py3k, r74237. Leaving open for backport to trunk.
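For readers who want to see the shape of the test style agreed on here (an anchored '^\d$' pattern checked with assertIsNone), a minimal sketch follows. It is not the test from issue6561.patch: the class name DigitClassTest and the sample characters other than '\u2781' are assumptions made for illustration, and it presumes a Python version that includes this fix.

```python
import re
import unittest


class DigitClassTest(unittest.TestCase):
    """Illustrative tests only; names and samples are not from the patch."""

    def test_category_no_is_not_matched(self):
        # U+2781 comes from the report; U+00B2 (SUPERSCRIPT TWO) is another
        # character in category 'No', added here as an example.
        for ch in ("\u2781", "\u00b2"):
            self.assertIsNone(re.match(r"^\d$", ch))

    def test_category_nd_is_matched(self):
        # ASCII '7' and U+0663 (ARABIC-INDIC DIGIT THREE) are both 'Nd'.
        for ch in ("7", "\u0663"):
            self.assertIsNotNone(re.match(r"^\d$", ch))
```

Saved to a file, this can be run with python -m unittest. Anchoring the pattern makes the intent explicit (the whole string is a single digit), and assertIsNone gives a clearer failure message than comparing against None with assertEqual.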
msg91018	Author: Mark Dickinson (mark.dickinson) * (Python committer)	Date: 2009-07-28 21:24
Backported to trunk in r74240.

History
Date	User	Action	Args
2022-04-11 14:56:51	admin	set	github: 50810
2009-07-28 21:24:48	mark.dickinson	set	status: open -> closed messages: + msg91018
2009-07-28 17:23:36	mark.dickinson	set	stage: patch review -> resolved messages: + msg91012 versions: - Python 3.2
2009-07-27 02:23:07	r.david.murray	set	nosy: + r.david.murray messages: + msg90971
2009-07-25 18:01:50	ezio.melotti	set	priority: normal keywords: + needs review messages: + msg90929 stage: test needed -> patch review
2009-07-25 17:23:37	pitrou	set	nosy: + pitrou messages: + msg90927
2009-07-24 16:36:43	mark.dickinson	set	files: - issue6561.patch
2009-07-24 16:36:30	mark.dickinson	set	files: + issue6561.patch messages: + msg90888
2009-07-24 14:51:50	mark.dickinson	set	files: + issue6561.patch keywords: + patch messages: + msg90885
2009-07-24 11:58:04	eric.smith	set	nosy: + eric.smith
2009-07-24 10:48:00	mark.dickinson	create
