Created on 2009-07-24 10:48 by mark.dickinson, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| issue6561.patch | mark.dickinson, 2009-07-24 16:36 | |
| Messages (8) | |||
|---|---|---|---|
| msg90878 | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2009-07-24 10:47 | |
In Python 3, or in Python 2 with the re.UNICODE flag, it appears that
the regex r'\d' matches all unicode characters with category either 'Nd'
(Number, Decimal Digit) or 'No' (Number, Other), but not characters in
category 'Nl' (Number, Letter):
Python 3.2a0 (py3k:74188, Jul 23 2009, 16:01:29)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> import unicodedata
>>> x = '\u2781'
>>> unicodedata.category(x)
'No'
>>> unicodedata.name(x)
'DINGBAT CIRCLED SANS-SERIF DIGIT TWO'
>>> re.match(r'\d', '\u2781')
<_sre.SRE_Match object at 0x3d5d08>
I believe (but am not 100% sure) that r'\d' should only match characters
in category 'Nd'. To back up this belief:
(1) int and float currently accept characters in category 'Nd' but not
'No'; it would seem useful for '\d' to match those characters that are
accepted by int, so that e.g., something matched with '\d+' could be
directly passed to int. (This came up in a #python-dev discussion
about whether the Decimal type should accept other unicode digits;
that's a separate issue, though.)
(2) In Perl 5.10 (and possibly some earlier versions too), '\d' matches
only characters in category 'Nd'.
(3) Unicode Technical Standard #18 ("Unicode Regular Expressions") at
http://unicode.org/unicode/reports/tr18/ recommends that '\d' should
correspond to \p{gc=Decimal_Number}.
Marc-André, do you have any opinion on this?
It's probably slightly dangerous to change this in 2.6 or 3.1; I'm
proposing that '\d' should be modified to accept only characters of
category 'Nd' in 2.7 and 3.2.
(Thanks Ezio Melotti for finding all the references above and doing Perl
testing!)
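
Point (1) above can be checked directly on an interpreter that already includes the proposed change. The following is a minimal sketch, not part of the original report: only U+2781 comes from the session above, while the other sample characters and the helper name `describe` are assumptions made here for illustration.

```python
import re
import unicodedata

# Sample characters; only U+2781 comes from the report above, the others
# are illustrative picks from categories Nd and Nl.
SAMPLES = ["2", "\u0662", "\u2781", "\u2161"]


def describe(ch):
    """Return (category, matched by \\d, accepted by int) for one character."""
    category = unicodedata.category(ch)
    matches_digit = re.fullmatch(r"\d", ch) is not None
    try:
        int(ch)
        int_accepts = True
    except ValueError:
        int_accepts = False
    return category, matches_digit, int_accepts


for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), *describe(ch))
```

With the change in place, the two 'Nd' characters are both matched by `\d` and accepted by `int`, while the 'No' and 'Nl' characters are neither, which is the consistency the report argues for.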
|
|||
| msg90885 | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2009-07-24 14:51 | |
Patch against py3k. |
|||
| msg90888 | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2009-07-24 16:36 | |
New patch; same as before, but includes clarification to the documentation. |
|||
| msg90927 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2009-07-25 17:23 | |
This sounds reasonable to me. |
|||
| msg90929 | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2009-07-25 18:01 | |
This seems to me quite redundant:
+ Matches any Unicode decimal digit; more specifically, matches
+ any character in Unicode category [Nd] (Number, Decimal Digit).
+ This includes ``[0-9]``, and also many other digit characters.
I suggest something like:
Matches the decimal digits ``[0-9]`` and all the characters that belong
to the Unicode category Nd (Number, Decimal Digit).
Two more minor details: instead of '\d', I'd use '^\d$', and instead of
self.assertEqual(re.match('\d', x), None)
I'd use
self.assertIsNone(re.match('\d', x)).
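
The two suggestions above could be combined in a test along the following lines. This is a hedged sketch and not the test from the patch: the class name, method names, and sample characters are hypothetical. Raw string literals are used because a plain '\d' triggers an invalid-escape warning on recent Python versions.

```python
import re
import unittest


class DecimalDigitClassTest(unittest.TestCase):
    """Hypothetical test case; the name and sample characters are illustrative."""

    def test_nd_characters_match(self):
        # ASCII and Arabic-Indic digits are both in category Nd.
        for ch in ["0", "9", "\u0660", "\u0669"]:
            self.assertIsNotNone(re.match(r"^\d$", ch))

    def test_no_and_nl_characters_do_not_match(self):
        # U+2781 is in category No, U+2161 is in category Nl.
        for ch in ["\u2781", "\u2161"]:
            self.assertIsNone(re.match(r"^\d$", ch))
```

Saved to a file, this can be run with `python -m unittest <file>`. `assertIsNone` reports the unexpected match object on failure, which tends to be easier to read than an `assertEqual(..., None)` comparison.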
|
|||
| msg90971 | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2009-07-27 02:23 | |
It may be redundant, but it is also more technically accurate. I'm -0 on your proposed rephrasing, and trust Mark to make the right decision :) |
|||
| msg91012 | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2009-07-28 17:23 | |
[ezio.melotti]
> I suggest something like:
> Matches the decimal digits ``[0-9]`` and all the characters that belong
> to the Unicode category Nd (Number, Decimal Digit).
Hmm. I don't like this because it suggests (to me) that the characters
[0-9] don't belong to category [Nd]. I agree the previous version was
clunky, though. I've shortened it some; if anyone else wants to work on
the wording please feel free. It might be nice to annotate each of these
character classes (\w, \s) with the Unicode character categories that they
correspond to.
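
One way to gather the data for such an annotation is to tally, for each character class, the general categories of the code points it matches. The script below is only a sketch under that assumption: the helper name `categories_matched` is made up here, and the output depends on the Unicode database shipped with the interpreter, so no expected output is shown.

```python
import re
import sys
import unicodedata
from collections import Counter

# Character classes to inspect, keyed by their display name.
CLASSES = {
    "\\d": re.compile(r"\d"),
    "\\w": re.compile(r"\w"),
    "\\s": re.compile(r"\s"),
}


def categories_matched(pattern):
    """Count, per Unicode general category, the code points a pattern matches."""
    counts = Counter()
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        if pattern.fullmatch(ch):
            counts[unicodedata.category(ch)] += 1
    return counts


for name, pattern in CLASSES.items():
    print(name, dict(categories_matched(pattern)))
```

The scan visits every code point, so it takes a moment to run; it is meant for exploration rather than for inclusion in a test suite.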
> Two more minor details: instead of '\d', I'd use '^\d$' and instead of
> self.assertEqual(re.match('\d', x), None)
> self.assertIsNone(re.match('\d', x)).
Thanks. Changes applied.
Committed to py3k, r74237. Leaving open for backport to trunk.
|
|||
| msg91018 | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2009-07-28 21:24 | |
Backported to trunk in r74240. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:51 | admin | set | github: 50810 |
| 2009-07-28 21:24:48 | mark.dickinson | set | status: open -> closed; messages: + msg91018 |
| 2009-07-28 17:23:36 | mark.dickinson | set | stage: patch review -> resolved; messages: + msg91012; versions: - Python 3.2 |
| 2009-07-27 02:23:07 | r.david.murray | set | nosy: + r.david.murray; messages: + msg90971 |
| 2009-07-25 18:01:50 | ezio.melotti | set | priority: normal; keywords: + needs review; messages: + msg90929; stage: test needed -> patch review |
| 2009-07-25 17:23:37 | pitrou | set | nosy: + pitrou; messages: + msg90927 |
| 2009-07-24 16:36:43 | mark.dickinson | set | files: - issue6561.patch |
| 2009-07-24 16:36:30 | mark.dickinson | set | files: + issue6561.patch; messages: + msg90888 |
| 2009-07-24 14:51:50 | mark.dickinson | set | files: + issue6561.patch; keywords: + patch; messages: + msg90885 |
| 2009-07-24 11:58:04 | eric.smith | set | nosy: + eric.smith |
| 2009-07-24 10:48:00 | mark.dickinson | create | |