Created on 2008-05-12 08:44 by sven.siegmund, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Files | |
|---|---|
| File name | Uploaded |
| reunicode.patch | pitrou, 2008-06-29 01:19 |
| reunicode2.patch | pitrou, 2008-06-29 20:21 |
| reunicode3.patch | pitrou, 2008-06-29 20:36 |
| reunicode4.patch | pitrou, 2008-07-05 21:09 |
| reunicode5.patch | pitrou, 2008-07-28 16:39 |
| Messages (24) | |||
|---|---|---|---|
| msg66715 | Author: Sven Siegmund (sven.siegmund) | Date: 2008-05-12 08:43 |
re cannot ignore case for non-ASCII Latin characters:
Python 3.0a5 (py3k:62932M, May 9 2008, 16:23:11) [MSC v.1500 32 bit
(Intel)] on win32
>>> 'Á'.lower() == 'á' and 'á'.upper() == 'Á'
True
>>> import re
>>> rx = re.compile('Á', re.IGNORECASE)
>>> rx.match('á') # should match but won't
>>> rx.match('Á') # will match
<_sre.SRE_Match object at 0x014B08A8>
>>> rx = re.compile('á', re.IGNORECASE)
>>> rx.match('Á') # should match but won't
>>> rx.match('á') # will match
<_sre.SRE_Match object at 0x014B08A8>
|
|||
| msg66727 | Author: Guido van Rossum (gvanrossum) (Python committer) | Date: 2008-05-12 14:44 |
Try adding re.LOCALE to the flags. I'm not sure why that is needed but it seems to fix this issue. I still think this is a legitimate bug though. |
|||
| msg67622 | Author: Manuel Kaufmann (humitos) | Date: 2008-06-02 00:23 |
I have the same error with the re.LOCALE flag...
[humitos] [~]$ python3.0
Python 3.0a5+ (py3k:63855, Jun 1 2008, 13:05:09)
[GCC 4.1.3 20080114 (prerelease) (Debian 4.1.2-19)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> rx = re.compile('á', re.LOCALE | re.IGNORECASE)
>>> rx.match('Á')
>>> rx.match('á')
<_sre.SRE_Match object at 0x2b955e204d30>
>>> rx = re.compile('Á', re.IGNORECASE | re.LOCALE)
>>> rx.match('Á')
<_sre.SRE_Match object at 0x2b955e204e00>
>>> rx.match('á')
>>> 'Á'.lower() == 'á' and 'á'.upper() == 'Á'
True
>>>
|
|||
| msg68901 | Author: Antoine Pitrou (pitrou) (Python committer) | Date: 2008-06-28 19:40 |
Same here, re.LOCALE doesn't circumvent the problem. |
|||
| msg68905 | Author: Antoine Pitrou (pitrou) (Python committer) | Date: 2008-06-28 20:27 |
Uh, actually, it works if you specify re.UNICODE. If you don't, the
getlower() function in _sre.c falls back to the plain ASCII algorithm.
>>> pat = re.compile('Á', re.IGNORECASE | re.UNICODE)
>>> pat.match('á')
<_sre.SRE_Match object at 0xb7c66c28>
>>> pat.match('Á')
<_sre.SRE_Match object at 0xb7c66cd0>
I wonder if re.UNICODE shouldn't be the default in Py3k, at least when
the pattern is a string and not a bytes object. There may also be a
re.ASCII flag for those cases where people want to fall back to the old
behaviour.
|
|||
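The default proposed in the message above can be checked on a current Python 3, where str patterns imply Unicode matching and `re.ASCII` opts back into ASCII-only rules. This is a minimal sketch, assuming a release that includes the fix from this issue; the variable names are illustrative only.

```python
import re

# str pattern, no explicit re.UNICODE: Unicode case folding applies by default.
pat = re.compile('\u00c1', re.IGNORECASE)        # 'Á'
print(bool(pat.match('\u00e1')))                 # 'á' -> True

# re.ASCII limits case-insensitive matching to the ASCII letters,
# which reproduces the behaviour originally reported here.
ascii_pat = re.compile('\u00c1', re.IGNORECASE | re.ASCII)
print(bool(ascii_pat.match('\u00e1')))           # False
print(bool(ascii_pat.match('\u00c1')))           # True (exact literal match)
```

The characters are written as escapes so the example does not depend on the source file's encoding.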
| msg68920 | Author: Guido van Rossum (gvanrossum) (Python committer) | Date: 2008-06-28 22:19 |
Sounds like re.UNICODE should be on by default when the pattern is a str instance. Also (per mailing list discussion) we should probably only allow matching bytes when the pattern is bytes, and matching str when the pattern is str. Finally, is there a use case of re.LOCALE any more? I'm thinking not. |
|||
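The str/bytes separation suggested above is how current Python 3 releases behave. The sketch below illustrates it; `try_match` is a hypothetical helper written for this example and is not part of the `re` module.

```python
import re

str_pat = re.compile('abc')
bytes_pat = re.compile(b'abc')

def try_match(pattern, subject):
    """Return True/False for a match attempt, or the TypeError message."""
    try:
        return pattern.match(subject) is not None
    except TypeError as exc:
        return f"TypeError: {exc}"

print(try_match(str_pat, 'abc'))      # True
print(try_match(bytes_pat, b'abc'))   # True
print(try_match(str_pat, b'abc'))     # a TypeError message
print(try_match(bytes_pat, 'abc'))    # a TypeError message
```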
| msg68922 | Author: Antoine Pitrou (pitrou) (Python committer) | Date: 2008-06-28 22:35 |
On Saturday 28 June 2008 at 22:20 +0000, Guido van Rossum wrote:
> Finally, is there a use case of re.LOCALE any more? I'm thinking not.
It's used for locale-specific case matching in the non-unicode case. But
it looks to me like a bad practice and we could probably remove it.
'C'
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE)
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE | re.LOCALE)
>>> locale.setlocale(locale.LC_CTYPE, 'fr_FR.ISO-8859-1')
'fr_FR.ISO-8859-1'
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE)
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE | re.LOCALE)
<_sre.SRE_Match object at 0xb7b9ac28>
|
|||
| msg68932 | Author: Antoine Pitrou (pitrou) (Python committer) | Date: 2008-06-29 01:15 |
Here is a preliminary patch which doesn't remove re.LOCALE, but adds TypeErrors for type-mismatched matching (a str pattern against bytes, or vice versa), a ValueError when specifying re.UNICODE with a bytes pattern, and implies re.UNICODE for unicode patterns. The test suite runs fine after a few fixes. It also includes the patch for #3231 ("re.compile fails with some bytes patterns"). |
|||
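Two of the behaviours described in the patch summary above can be observed on a current Python 3: a str pattern carries the `re.UNICODE` flag implicitly, and combining `re.UNICODE` with a bytes pattern is rejected. A small sketch, assuming a release that contains this change; the name `error` is illustrative.

```python
import re

# A str pattern has re.UNICODE set even though it was not requested.
print(bool(re.compile('abc').flags & re.UNICODE))   # True

# Asking for re.UNICODE on a bytes pattern raises ValueError.
try:
    re.compile(b'abc', re.UNICODE)
except ValueError as exc:
    error = str(exc)
else:
    error = None
print(error)
```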
| msg68966 | Author: Antoine Pitrou (pitrou) (Python committer) | Date: 2008-06-29 20:21 |
This new patch also introduces re.ASCII as discussed on the mailing-list. |
|||
| msg68967 | Author: Antoine Pitrou (pitrou) (Python committer) | Date: 2008-06-29 20:36 |
Improved patch which also detects incompatibilities for "(?u)". |
|||
| msg69298 | Author: Antoine Pitrou (pitrou) (Python committer) | Date: 2008-07-05 21:09 |
This new patch adds re.ASCII in all sensitive places I could find in the stdlib (except lib2to3 which, as far as I understand, is maintained in a separate branch, and even has its own copy of tokenize.py...). Also, I didn't get an answer to the following question on the ML: should an inline flag "(?a)" be introduced to mirror the existing "(?u)", so as to set the ASCII flag from inside a pattern string? |
|||
| msg69301 | Author: Antoine Pitrou (pitrou) (Python committer) | Date: 2008-07-05 21:30 |
http://codereview.appspot.com/2439 |
|||
| msg70354 | Author: Antoine Pitrou (pitrou) (Python committer) | Date: 2008-07-28 16:39 |
Final patch adding the (?a) inline flag (equivalent to re.ASCII). Please review: http://codereview.appspot.com/2439 |
|||
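The `(?a)` inline flag mentioned above can be compared with passing `re.ASCII` explicitly. A brief sketch on a current Python 3; the sample text and variable names are made up for illustration.

```python
import re

text = 'caf\u00e9'                      # 'café' with a precomposed é

inline = re.compile(r'(?a)\w+')         # ASCII mode set inside the pattern
flagged = re.compile(r'\w+', re.ASCII)  # ASCII mode set via the flags argument
default = re.compile(r'\w+')            # Unicode mode (default for str patterns)

print(inline.match(text).group())       # caf
print(flagged.match(text).group())      # caf
print(default.match(text).group())      # café
print(bool(inline.flags & re.ASCII))    # True
```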
| msg70370 | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) | Date: 2008-07-28 20:41 |
Are all those re.ASCII flags mandatory, or are they here just for theoretical correctness? For example, the output of "gcc -dumpversion" is certainly plain ASCII. I don't mind that \d also matches some exotic digit - it just won't happen. |
|||
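To make the "exotic digit" point concrete: for str patterns, `\d` matches any Unicode decimal digit unless `re.ASCII` is given. The sketch below uses Arabic-Indic digits as an arbitrary example; it is not taken from the patch.

```python
import re

arabic_indic = '\u0664\u0662'   # ARABIC-INDIC DIGIT FOUR, ARABIC-INDIC DIGIT TWO

unicode_digits = re.compile(r'\d+')
ascii_digits = re.compile(r'\d+', re.ASCII)

print(bool(unicode_digits.fullmatch(arabic_indic)))  # True
print(bool(ascii_digits.fullmatch(arabic_indic)))    # False
print(bool(ascii_digits.fullmatch('41')))            # True
```

For input that is known to be plain ASCII, such as a compiler version string, both patterns accept the same text, which is the situation the question above describes.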
| msg70371 | Author: Antoine Pitrou (pitrou) (Python committer) | Date: 2008-07-28 20:49 |
On Monday 28 July 2008 at 20:41 +0000, Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
>
> Are all those re.ASCII flags mandatory, or are they here just for
> theoretical correctness?
For theoretical correctness. I just don't want to analyze each case individually and I'm probably not competent for many of them. |
|||
| msg70780 | Author: Antoine Pitrou (pitrou) (Python committer) | Date: 2008-08-06 10:29 |
If nobody (except Amaury :-)) has anything to say about the current patch, should it be committed? |
|||
| msg70787 | Author: Guido van Rossum (gvanrossum) (Python committer) | Date: 2008-08-06 16:34 |
Let's make sure the release manager is OK with this. |
|||
| msg71186 | Author: Antoine Pitrou (pitrou) (Python committer) | Date: 2008-08-15 21:31 |
Barry? |
|||
| msg71413 | Author: Barry A. Warsaw (barry) (Python committer) | Date: 2008-08-19 12:57 |
I haven't looked at the specific patch, but based on the description of the behavior, I'm +1 on committing this before beta 3. I'm fine with leaving the re.ASCII flags in there -- it will be a marker to indicate perhaps the code needs a closer examination (eventually). |
|||
| msg71414 | Author: Barry A. Warsaw (barry) (Python committer) | Date: 2008-08-19 12:58 |
Make sure of course that the documentation is updated and a NEWS file entry is added. |
|||
| msg71455 | Author: Antoine Pitrou (pitrou) (Python committer) | Date: 2008-08-19 17:59 |
Fixed in r65860. Someone should check the docs though (at least try to generate them, and review my changes a bit since English isn't my mother tongue). |
|||
| msg71516 | Author: Mark Summerfield (mark) | Date: 2008-08-20 07:36 |
On 2008-08-19, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> Fixed in r65860. Someone should check the docs though (at least try to
> generate them, and review my changes a bit since English isn't my mother
> tongue).
I've revised the ASCII and LOCALE-related texts in re.rst in r65903. |
|||
| msg71517 | Author: Mark Summerfield (mark) | Date: 2008-08-20 07:40 |
On 2008-08-19, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> Fixed in r65860. Someone should check the docs though (at least try to
> generate them, and review my changes a bit since English isn't my mother
> tongue).
And two more (tiny) fixes in r65904; that's my lot :-) |
|||
| msg71519 | Author: Antoine Pitrou (pitrou) (Python committer) | Date: 2008-08-20 08:49 |
Thanks a lot Mark! |
|||
| History | | | |
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:34 | admin | set | github: 47083 |
| 2009-02-13 14:02:50 | ezio.melotti | set | nosy: + ezio.melotti |
| 2009-02-13 13:52:31 | ocean-city | link | issue5239 dependencies |
| 2009-02-13 12:34:49 | ocean-city | link | issue5240 dependencies |
| 2008-08-20 08:49:38 | pitrou | set | messages: + msg71519 |
| 2008-08-20 07:40:55 | mark | set | messages: + msg71517 |
| 2008-08-20 07:36:30 | mark | set | messages: + msg71516 |
| 2008-08-19 17:59:29 | pitrou | set | status: open -> closed; resolution: accepted -> fixed; messages: + msg71455 |
| 2008-08-19 12:58:07 | barry | set | messages: + msg71414 |
| 2008-08-19 12:57:41 | barry | set | resolution: accepted; messages: + msg71413 |
| 2008-08-15 21:31:11 | pitrou | set | messages: + msg71186 |
| 2008-08-06 16:34:33 | gvanrossum | set | nosy: + barry; messages: + msg70787 |
| 2008-08-06 10:29:16 | pitrou | set | messages: + msg70780 |
| 2008-07-28 20:49:16 | pitrou | set | messages: + msg70371 |
| 2008-07-28 20:41:56 | amaury.forgeotdarc | set | nosy: + amaury.forgeotdarc; messages: + msg70370 |
| 2008-07-28 16:39:31 | pitrou | set | files: + reunicode5.patch; messages: + msg70354 |
| 2008-07-24 15:07:53 | pitrou | set | priority: critical; assignee: pitrou |
| 2008-07-24 12:39:00 | mark | set | nosy: + mark |
| 2008-07-05 21:30:04 | pitrou | set | messages: + msg69301 |
| 2008-07-05 21:10:11 | pitrou | set | files: + reunicode4.patch; messages: + msg69298 |
| 2008-06-29 20:36:38 | pitrou | set | files: + reunicode3.patch; messages: + msg68967 |
| 2008-06-29 20:21:07 | pitrou | set | files: + reunicode2.patch; messages: + msg68966 |
| 2008-06-29 01:19:44 | pitrou | set | files: + reunicode.patch |
| 2008-06-29 01:19:17 | pitrou | set | files: - reunicode.patch |
| 2008-06-29 01:15:28 | pitrou | set | files: + reunicode.patch; keywords: + patch; messages: + msg68932 |
| 2008-06-28 22:35:39 | pitrou | set | messages: + msg68922 |
| 2008-06-28 22:19:03 | gvanrossum | set | messages: + msg68920 |
| 2008-06-28 20:27:24 | pitrou | set | messages: + msg68905 |
| 2008-06-28 19:40:35 | pitrou | set | nosy: + pitrou; messages: + msg68901 |
| 2008-06-02 00:23:02 | humitos | set | nosy: + humitos; messages: + msg67622 |
| 2008-05-12 14:44:03 | gvanrossum | set | nosy: + gvanrossum; messages: + msg66727 |
| 2008-05-12 08:44:03 | sven.siegmund | create | |