Created on 2011-04-28 17:02 by RobM, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Messages (6) | |||
|---|---|---|---|
| msg134700 | Author: Robert Meerman (RobM) | Date: 2011-04-28 17:02 | |
Regular expressions that are written to match literal underscores ("_", ASCII
ordinal 95) and specify `re.IGNORECASE` during compilation do not consistently
match underscores: it seems some occurrences are matched, but others are not.
The following session log shows the problem:
Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> subject = "[Conclave-Mendoi]_ef_-_a_tale_of_memories_00-12_H264"
>>> print subject.encode("base64") # In case my environment encoding is to blame
W0NvbmNsYXZlLU1lbmRvaV1fZWZfLV9hX3RhbGVfb2ZfbWVtb3JpZXNfMDAtMTJfSDI2NA==
>>> re.sub("_", "X", subject) # No flags, does what I expect
'[Conclave-Mendoi]XefX-XaXtaleXofXmemoriesX00-12XH264'
>>>
>>> re.sub("_", "X", subject, re.IGNORECASE) # Misses some matches
'[Conclave-Mendoi]XefX-_a_tale_of_memories_00-12_H264'
>>>
>>> re.sub("_", "X", subject, re.IGNORECASE | re.LOCALE) # Misses fewer matches
'[Conclave-Mendoi]XefX-XaXtaleXofXmemories_00-12_H264'
>>>
>>> re.sub("_", "X", subject, re.IGNORECASE | re.LOCALE | re.UNICODE) # Works OK
'[Conclave-Mendoi]XefX-XaXtaleXofXmemoriesX00-12XH264'
>>>
>>> re.sub("_", "X", subject, re.IGNORECASE | re.UNICODE) # Works OK
'[Conclave-Mendoi]XefX-XaXtaleXofXmemoriesX00-12XH264'
>>>
>>> type(subject) # Don't think this is a unicode string
<type 'str'>
>>>
Since my `subject` variable is of type `str` and only contains ASCII characters
I do not believe that the `re.UNICODE` flag should be required.
|
|||
| msg134716 | Author: Matthew Barnett (mrabarnett) * (Python triager) | Date: 2011-04-28 19:54 | |
help(re.sub) says:
sub(pattern, repl, string, count=0)
and re.IGNORECASE has a value of 2.
Therefore this:
re.sub("_", "X", subject, re.IGNORECASE)
is telling it to replace at most 2 occurrences of "_".
|
|||
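A minimal sketch of the explanation above, written for Python 3 rather than the Python 2.6 session in the report. It assumes only the standard `re` module. The flag's integer value is passed as `count` by keyword here, which mimics what the positional call in the report ends up doing; the names `as_count`, `mistaken`, and `intended` are illustrative only.

```python
import re

subject = "[Conclave-Mendoi]_ef_-_a_tale_of_memories_00-12_H264"

# The flag is a small integer. Passed as the fourth positional argument,
# it lands in the `count` parameter, not in `flags`.
as_count = int(re.IGNORECASE)  # 2

# Equivalent to the mistaken call: replace at most 2 occurrences.
mistaken = re.sub("_", "X", subject, count=as_count)

# Passing the flag by keyword applies it as a flag; count stays 0 (no limit).
intended = re.sub("_", "X", subject, flags=re.IGNORECASE)

print(mistaken)  # [Conclave-Mendoi]XefX-_a_tale_of_memories_00-12_H264
print(intended)  # [Conclave-Mendoi]XefX-XaXtaleXofXmemoriesX00-12XH264
```

The same reading accounts for the other results in the report: `re.IGNORECASE | re.LOCALE` has the value 6, so six underscores are replaced, and any combination that includes `re.UNICODE` (value 32) exceeds the eight underscores in the string.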
| msg134717 | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2011-04-28 20:49 | |
Closing as invalid. I wonder if it would be better to have count as a keyword-only argument though, since this problem seems to come up pretty often and it's not easy to debug. |
|||
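The keyword-only idea can be sketched as a thin wrapper. This is only an illustration of the suggestion, not how `re.sub` is defined; `sub_kwonly` is a hypothetical name, and the bare `*` in the signature is what forces `count` and `flags` to be passed by keyword.

```python
import re

def sub_kwonly(pattern, repl, string, *, count=0, flags=0):
    """Hypothetical wrapper: count and flags are accepted by keyword only."""
    return re.sub(pattern, repl, string, count=count, flags=flags)

# The mistaken positional call now fails immediately instead of
# silently treating the flag as a replacement limit.
try:
    sub_kwonly("_", "X", "a_b_c", re.IGNORECASE)
    rejected = False
except TypeError:
    rejected = True

result = sub_kwonly("_", "X", "a_b_c", flags=re.IGNORECASE)
```

With this signature the error surfaces at the call site, at the cost of breaking any existing code that passes `count` positionally.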
| msg134723 | Author: Matthew Barnett (mrabarnett) * (Python triager) | Date: 2011-04-28 22:21 | |
I don't know how much code that might break. It might not be that much; I can't remember when I last used re.sub without the default count. |
|||
| msg134752 | Author: Robert Meerman (RobM) | Date: 2011-04-29 11:53 | |
Oh, that's embarrassing. :-) Could a type-check be used to alert the user to their mistake? I suppose that would require re.IGNORECASE (et al) to be of some new type (presumably sub-classed from Integer). (Thanks for the quick response, and sorry to waste your time) |
|||
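A rough sketch of the type-check idea, assuming Python 3.6 or later, where the flag constants are members of the `re.RegexFlag` enum (an `int` subclass) rather than plain integers as in the Python 2.6 session above. `checked_sub` is a hypothetical wrapper for illustration, not part of the `re` module.

```python
import re

def checked_sub(pattern, repl, string, count=0, flags=0):
    """Hypothetical wrapper that rejects a regex flag passed as count."""
    if isinstance(count, re.RegexFlag):
        raise TypeError("count looks like a regex flag; pass it as flags=...")
    return re.sub(pattern, repl, string, count=count, flags=flags)

# A flag in the count position is caught by the type check.
try:
    checked_sub("_", "X", "a_b_c", re.IGNORECASE)
    caught = False
except TypeError:
    caught = True

# A plain integer count still works as before.
limited = checked_sub("_", "X", "a_b_c", 1)
```

Unlike a keyword-only signature, this keeps positional `count` working for plain integers and only rejects values that are recognisably flags.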
| msg134831 | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2011-04-30 02:24 | |
See also #11957. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:16 | admin | set | github: 56156 |
| 2014-10-29 16:15:11 | vstinner | set | superseder: re.sub confusion between count and flags args; resolution: not a bug -> duplicate |
| 2011-04-30 02:24:10 | ezio.melotti | set | messages: + msg134831 |
| 2011-04-29 11:53:32 | RobM | set | messages: + msg134752 |
| 2011-04-28 22:21:58 | mrabarnett | set | messages: + msg134723 |
| 2011-04-28 20:49:19 | ezio.melotti | set | status: open -> closed; resolution: not a bug; messages: + msg134717; stage: resolved |
| 2011-04-28 19:54:48 | mrabarnett | set | nosy: + mrabarnett; messages: + msg134716 |
| 2011-04-28 17:02:55 | RobM | create | |