Programmer's Python Data - Simple Regular Expressions

Written by Mike James

Monday, 09 December 2024

If you don’t want greedy quantifiers, the solution is to use "lazy" quantifiers, which are formed by following any of the standard quantifiers with a question mark, ?. To see this in action, change the previous regular expression to read:

ex2 = re.compile(r"<div>.*?</div>")

With this change in place, the result of matching to:

r"<div>hello</div>world</div>"

is just the first pair of <div> tags and their content, that is <div>hello</div>.
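The difference can be checked with a short sketch. The greedy pattern here, r"<div>.*</div>", is an assumption about what the previous expression looked like, and the variable names are purely illustrative:

```python
import re

text = "<div>hello</div>world</div>"

# Assumed greedy form of the earlier expression: .* takes as much as it can
# while still allowing a final </div> to match.
greedy = re.compile(r"<div>.*</div>")

# Lazy form from the text: .*? stops at the first </div> that completes a match.
lazy = re.compile(r"<div>.*?</div>")

greedy_match = greedy.search(text).group()
lazy_match = lazy.search(text).group()

print(greedy_match)  # <div>hello</div>world</div>
print(lazy_match)    # <div>hello</div>
```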

All of the quantifiers, including ?, have a lazy version and you can write ?? to mean a lazy "zero or one" occurrence.
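As an illustration, the following sketch compares each greedy quantifier with its lazy version on the same string. The digit patterns and variable names are chosen only for the example:

```python
import re

digits = "12345"

# Greedy quantifiers take as many digits as they can at the start of the string.
star_greedy = re.match(r"\d*", digits).group()       # "12345"
plus_greedy = re.match(r"\d+", digits).group()       # "12345"
opt_greedy = re.match(r"\d?", digits).group()        # "1"
range_greedy = re.match(r"\d{2,4}", digits).group()  # "1234"

# Their lazy versions take as few as still give a match.
star_lazy = re.match(r"\d*?", digits).group()        # "" (zero digits is enough)
plus_lazy = re.match(r"\d+?", digits).group()        # "1"
opt_lazy = re.match(r"\d??", digits).group()         # "" (zero is preferred)
range_lazy = re.match(r"\d{2,4}?", digits).group()   # "12"
```

Note that the lazy forms only look this short because nothing follows them in the pattern. If more of the expression has to match afterwards, a lazy quantifier extends one repeat at a time until the rest succeeds.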

The distinction between greedy and lazy quantifiers is perhaps the biggest reason for a reasonably well-tested regular expression to go wrong when used against a wider range of example strings. Always remember that a standard greedy quantifier will match as many times as possible while still allowing the regular expression to match, and its lazy version will match as few times as possible while still allowing the regular expression to match.

Grouping and Alternation

Regular strings often have alternative forms. For example, the ISBN prefix could be simply ISBN: or it could be ISBN-13: or any of many other reasonable variations. You can specify an either/or situation using the vertical bar |, the alternation operator, as in x|y which will match an x or a y.

For example, r"ISBN:|ISBN-13:" matches either ISBN: or ISBN-13:.

This is easy enough but what about:

r"ISBN:|ISBN-13:\s*\d"

At first glance this seems to match either ISBN: or ISBN-13: followed by any number of white space characters and a single digit, but it doesn’t. The | operator has the lowest priority, so the alternatives are everything to its left and everything to its right, i.e. either ISBN: or ISBN-13:\s*\d. To match the white space and digit in both forms of the ISBN prefix we would have to write:

r"ISBN:\s*\d|ISBN-13:\s*\d"

Clearly, having to repeat everything that is common to both sides of the alternation operator is going to make things difficult, and this is where grouping comes in. Anything grouped between parentheses is treated as a single unit, a subexpression, and grouping has a higher priority than the alternation operator. So, for example:

r"(ISBN:|ISBN-13:)\s*\d"

matches either form of the ISBN prefix followed by any number of white space characters and a single digit because the parentheses limit the range of the alternation operator to the subexpressions to its left and right within the parentheses.
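A quick way to see the effect of precedence is to run the three expressions against the same input. This is a minimal sketch and the sample strings are made up for illustration:

```python
import re

ungrouped = re.compile(r"ISBN:|ISBN-13:\s*\d")
repeated = re.compile(r"ISBN:\s*\d|ISBN-13:\s*\d")
grouped = re.compile(r"(ISBN:|ISBN-13:)\s*\d")

sample = "ISBN: 9"

# The ungrouped form stops at the prefix: its left alternative is just "ISBN:".
ungrouped_match = ungrouped.search(sample).group()   # "ISBN:"

# Repeating the common part, or grouping the alternatives, includes the digit.
repeated_match = repeated.search(sample).group()     # "ISBN: 9"
grouped_match = grouped.search(sample).group()       # "ISBN: 9"
grouped_13 = grouped.search("ISBN-13: 9").group()    # "ISBN-13: 9"

# With no digit present, only the ungrouped form still finds a match.
ungrouped_no_digit = ungrouped.search("ISBN: none")  # match object for "ISBN:"
grouped_no_digit = grouped.search("ISBN: none")      # None
```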

A similar ordering issue applies to the alternation operator, which tries its alternatives from left to right and accepts the first one that matches. For example, suppose you try to match the previous ungrouped expression, but without the colon, r"ISBN|ISBN-13". In this case the first pattern, ISBN, will match even if the string is ISBN-13. It doesn’t matter that the second expression is a "better" match. No amount of grouping will help with this problem because the shorter alternative is tried first and succeeds. The solution is to swap the order of the subexpressions so that the longer comes first, or to include something that always marks the end of the target string. In this case, for example, if we add the colon then the ISBN: subexpression cannot possibly match the ISBN-13: string.
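The ordering behavior, and both suggested fixes, can be tried out with a sketch like the following. The variable names are illustrative only:

```python
import re

text = "ISBN-13"

# The left alternative wins as soon as it matches, even though it is shorter.
short_first = re.search(r"ISBN|ISBN-13", text).group()            # "ISBN"

# Grouping does not change the order in which alternatives are tried.
grouped_short_first = re.search(r"(ISBN|ISBN-13)", text).group()  # "ISBN"

# Fix 1: put the longer alternative first.
long_first = re.search(r"ISBN-13|ISBN", text).group()             # "ISBN-13"

# Fix 2: include a terminator, here the colon, so that the short alternative
# cannot match the long form.
with_colon = re.search(r"ISBN:|ISBN-13:", "ISBN-13:").group()     # "ISBN-13:"
```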

Groups can also be repeated. For example, (ab)* matches any number of repeats of ab, including none.
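A short sketch shows how a repeated group differs from a quantifier applied to a single character. The test strings are invented for the example:

```python
import re

repeated_group = re.compile(r"(ab)*")

# The quantifier applies to the whole group, so complete "ab" units are repeated.
three_repeats = repeated_group.match("ababab").group()  # "ababab"
two_repeats = repeated_group.match("abab!").group()     # "abab"
zero_repeats = repeated_group.match("xyz").group()      # "" (zero repeats match)

# Without the parentheses, * applies only to the b.
group_on_abbb = re.fullmatch(r"(ab)*", "abbb")          # None
plain_on_abbb = re.fullmatch(r"ab*", "abbb").group()    # "abbb"
```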

In chapter but not in this extract

Capture Groups
Backward References
Advanced Capture
String Manipulation
Using Regular Expressions

Summary

Python’s regular expressions are best compiled for efficiency. Compiling returns a regular expression object which has methods that use the expression.
If you don’t want to compile the expression you can use the alternative regular expression functions, but these aren’t as capable as the equivalent methods.
The methods and functions return a match object, or None if there is no match, which has methods that allow you to find out about the nature of the match.
Regular expressions only become useful when you start to use pattern matching.
You can also use anchors to specify where a match is allowed to happen.
To make patterns easier to write you can use a quantifier symbol to specify the allowable number of repeats.
By default quantifiers are greedy and will always attempt to find the longest match.
You can make a quantifier lazy by adding a ?.
The alternation operator, |, can specify a match to one of two possible patterns.
Grouping can be used to override the precedence of the regular expression operators.
Grouping also leads to the idea of a capture group in which a group matches part of the string. If you don’t want a group to be a capture group you can start it with (?: instead of (.
You can refer to a capture group by number or you can assign and use a name.
Capture groups are useful for determining which parts of an expression matched and for backward references.
A backward reference lets you match against the results of previous matches.
Assertions are advanced expressions which modify what is captured.
You can also use regular expressions to modify strings and to split strings.
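
Several of these points can be seen together in one small sketch. The pattern, the sample line and the ISBN number in it are hypothetical, chosen only to show the calls involved:

```python
import re

# Hypothetical pattern: a non-capturing group for the prefix alternatives and
# a capture group for the digits that follow.
isbn = re.compile(r"(?:ISBN:|ISBN-13:)\s*(\d+)")

line = "Ref ISBN-13: 1234567890123 in stock"  # made-up number

# A method of the compiled object returns a match object or None.
match = isbn.search(line)
whole = match.group()     # the entire matched text
number = match.group(1)   # the text captured by the first capture group

missing = isbn.search("no prefix here")  # None, as there is no match

# The module-level function does the same job without a compile step.
same = re.search(r"(?:ISBN:|ISBN-13:)\s*(\d+)", line).group()

# Regular expressions can also modify and split strings.
cleaned = re.sub(r"\s+", " ", "too    many   spaces")
parts = re.split(r"\s*,\s*", "a , b,c")
```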

Programmer's Python: Everything is Data is now available as a print book from Amazon.


Last Updated ( Monday, 09 December 2024 )