Using a Regular Expression
There are several methods which are commonly used with regular
expressions. The most common first step is to compile the RE definition
string to make a Pattern object. The resulting
Pattern object can then be used to match or search candidate strings. A
successful match returns a Match object with
details of the matching substring.
The re module provides the
compile function.
re.compile(expr) → Pattern
Create a Pattern object from an RE
string. The Pattern is used for all subsequent searching or
matching operations. A Pattern has several methods, including
match and search.
Generally, raw string notation (r"pattern") is
used to write an RE. This reduces the number of \ characters required.
Without the raw notation, each \ in the string would
have to be escaped by another \, making it
\\. This rapidly gets cumbersome. There are some
other options available for re.compile; see the
Python Library Reference, section 4.2, for more
information.
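As a small illustration of why raw notation helps, the following sketch writes the same RE twice, once as a raw string and once with doubled backslashes. The names raw_pat and escaped_pat are ours, chosen only for this example.

```python
import re

# The same RE written two ways; both variable names are illustrative only.
raw_pat = re.compile(r"(\d+)\.(\d+)")        # raw string: each \ is written once
escaped_pat = re.compile("(\\d+)\\.(\\d+)")  # ordinary string: each \ is doubled

# Both literals produce the identical pattern text.
same_text = raw_pat.pattern == escaped_pat.pattern
print(same_text)
```

The two Pattern objects behave identically; the raw form is simply easier to read and to type correctly.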
The following methods are part of a compiled
Pattern. We'll use the name pat to refer to some
Pattern object created by the re.compile function.
pat.match(string) → Match
Match the candidate string against the compiled regular
expression, pat.
Matching means that the regular expression must match
starting at the beginning of the candidate
string; it need not extend to the end of the string.
A Match object is returned if there
is a match; otherwise None is returned.
pat.search(string) → Match
Search a candidate string for the compiled regular
expression, pat.
Searching means that the regular expression must be found
somewhere in the candidate string. A Match
object is returned if the pattern is found; otherwise
None is returned.
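The difference between match and search can be seen in a short sketch. The pattern and the candidate string here are made up for illustration; they are not part of the example developed later in this section.

```python
import re

# An illustrative pattern: one or more digits.
digits_pat = re.compile(r"\d+")
candidate = "room 101"

# match succeeds only if the pattern is found at the start of the string.
at_start = digits_pat.match(candidate)

# search scans forward and succeeds if the pattern occurs anywhere.
anywhere = digits_pat.search(candidate)

print(at_start is None)
print(anywhere is not None)
```

Because the candidate begins with a letter, match returns None, while search locates the digits further along.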
If search or match finds
the pattern in the candidate string, a Match
object is created to describe the part of the candidate string which
matched. The following methods are part of a
Match object. We'll use the name match to refer to some
Match object created by a successful search or
match operation.
match.group(number) → string
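Retrieve the string that matched a particular () grouping in
the regular expression. Group zero is the entire string that
matched. Group 1 is the material that matched the first set of
()'s. If several group numbers are given, a tuple of strings is
returned, one string for each requested group.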
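A brief sketch of how group numbers line up with the ()'s follows. The pattern pair_pat and its sample input are hypothetical, invented only to show the numbering.

```python
import re

# Illustrative pattern with two () groups: a name, "=", and a number.
pair_pat = re.compile(r"(\w+)=(\d+)")
pair_match = pair_pat.search("set width=80 now")

whole = pair_match.group(0)    # everything that matched
name = pair_match.group(1)     # material from the first set of ()'s
value = pair_match.group(2)    # material from the second set of ()'s
both = pair_match.group(1, 2)  # several numbers at once give a tuple

print(whole)
print(name)
print(value)
print(both)
```

Note that every group is a string, even when it holds only digits; any conversion to a number is a separate step.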
Here's a more complete example.
>>> import re
>>> rawin= "20:07:13.2"
>>> hms_pat= re.compile( r'(\d+):(\d+):(\d+\.?\d*)' )
>>> hms_match= hms_pat.match( rawin )
>>> print hms_match.group( 0, 1, 2, 3 )
('20:07:13.2', '20', '07', '13.2')
>>> h,m,s= map( float, hms_match.group(1,2,3) )
>>> seconds= ((h*60)+m)*60+s
>>> print h, m, s, "=", seconds
20.0 7.0 13.2 = 72433.2
This sequence decodes a complex input value into fields and then
computes a single result. The import statement
incorporates the re module. The
rawin variable is sample input, perhaps from a file,
perhaps from raw_input. The
hms_pat variable is the compiled regular expression
object which matches three numbers separated by :'s. The first two
numbers use "(\d+)"; the third uses "(\d+\.?\d*)" to allow an optional
fractional part.
The digit-sequence RE's are surrounded by ()'s so that the material
that matched is returned as a group. This will lead to four groups:
group 0 is everything that matched; groups 1, 2, and 3 are successive
digit strings. The hms_match variable is a
Match object that indicates success or failure in
matching. If hms_match is None, no
match occurred. Otherwise, the hms_match.group
method will reveal the individually matched input items.
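Since a failed match yields None rather than a Match object, it is usually worth testing the result before calling group. The sketch below shows one way to do this; the function describe is a hypothetical helper, not part of the re module.

```python
import re

hms_pat = re.compile(r'(\d+):(\d+):(\d+\.?\d*)')

def describe(text):
    """Hypothetical helper: report what hms_pat found in text."""
    hms_match = hms_pat.match(text)
    if hms_match is None:
        # No match occurred, so there are no groups to examine.
        return "no match"
    # group(1, 2, 3) is a tuple of three strings, one per () group.
    return "h=%s m=%s s=%s" % hms_match.group(1, 2, 3)

print(describe("20:07:13.2"))
print(describe("not a time"))
```

Calling group on None would raise an AttributeError, which is why the test comes first.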
The statement that sets h,
m, and s does three things. First,
it uses hms_match.group to create a tuple of
requested items. Each item in the tuple will be a string, so the
map function is then used to apply the built-in
float function to each string to create a
sequence of three numbers. Finally, this statement relies on the
multiple-assignment feature to set all three variables at once. The
last statement computes
seconds as the number of seconds past
midnight for the given time stamp.
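To make the three things done by that one statement easier to see, here is a sketch that performs each step separately. The intermediate names fields and numbers are ours, introduced only for this illustration.

```python
import re

hms_match = re.compile(r'(\d+):(\d+):(\d+\.?\d*)').match("20:07:13.2")

# Step 1: group returns a tuple of the requested strings.
fields = hms_match.group(1, 2, 3)

# Step 2: convert each string to a float. list() makes the result a
# concrete list whether map returns a list or an iterator.
numbers = list(map(float, fields))

# Step 3: multiple assignment unpacks the three numbers.
h, m, s = numbers

# Seconds past midnight, computed as in the example above.
seconds = ((h * 60) + m) * 60 + s
print(seconds)
```

Written this way, each intermediate value can be inspected on its own, which can help when a pattern does not capture what was expected.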