Groups in regular expressions don't repeat as expected

Wed Apr 20 16:34:16 EDT 2011

On 4/20/2011 12:23 PM, Neil Cerutti wrote:
> On 2011-04-20, John Nagle <nagle at animats.com> wrote:
>> Here's something that surprised me about Python regular expressions.
>>
>> >>> krex = re.compile(r"^([a-z])+$")
>> >>> s = "abcdef"
>> >>> ms = krex.match(s)
>> >>> ms.groups()
>> ('f',)
>>
>> The parentheses indicate a capturing group within the
>> regular expression, and the "+" indicates that the
>> group can appear one or more times. The regular
>> expression matches that way. But instead of returning
>> a captured group for each character, it returns only the
>> last one.
>>
>> The documentation in fact says that, at
>>
>> http://docs.python.org/library/re.html
>>
>> "If a group is contained in a part of the pattern that matched multiple
>> times, the last match is returned."
>>
>> That's kind of lame, though. I'd expect that there would be some way
>> to retrieve all matches.
>
> .findall
>
 Findall does something a bit different. It returns a list of
matches of the entire pattern, not repeats of groups within
the pattern.
 Consider a regular expression for matching domain names:
 >>> import re
 >>> kre = re.compile(r'^([a-zA-Z0-9\-]+)(?:\.([a-zA-Z0-9\-]+))+$')
 >>> s = 'www.example.com'
 >>> ms = kre.match(s)
 >>> ms.groups()
('www', 'com')
 >>> msall = kre.findall(s)
 >>> msall
[('www', 'com')]
This is just a simple example. But it illustrates an unnecessary
limitation. The matcher can do the repeated matching; you just can't
get the results out.
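
One possible workaround with the standard re module, offered as a
rough sketch rather than a real fix: validate the whole string with
the anchored pattern, then run a second, simpler pattern over it to
collect every repeat. The names DOMAIN_RE, LABEL_RE and
domain_labels are made up for this illustration.

```python
import re

# Same domain-name pattern as in the example above; it validates the
# whole string but only keeps the last repeat of the second group.
DOMAIN_RE = re.compile(r'^([a-zA-Z0-9\-]+)(?:\.([a-zA-Z0-9\-]+))+$')

# Hypothetical helper pattern: matches one label at a time.
LABEL_RE = re.compile(r'[a-zA-Z0-9\-]+')


def domain_labels(s):
    """Return every label in s, or None if s does not match DOMAIN_RE."""
    if DOMAIN_RE.match(s) is None:
        return None
    # Second pass: findall on the sub-pattern yields each repeat.
    return LABEL_RE.findall(s)


labels = domain_labels('www.example.com')
print(labels)
```

This makes two passes over the string and duplicates part of the
pattern, which is exactly the overhead the limitation forces on you.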
				John Nagle