Programmer's Python Data - Simple Regular Expressions
Written by Mike James
Monday, 09 December 2024

If you don’t want a quantifier to be greedy, the solution is to use a "lazy" quantifier, which is formed by following any of the standard quantifiers with a question mark, ?. To see this in action, change the previous regular expression to read:

ex2 = re.compile(r"<div>.*?</div>")

With this change in place, the result of matching to:

r"<div>hello</div>world</div>"

is just the first pair of <div> tags and their content, that is <div>hello</div>.
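The difference can be checked with a short sketch. It assumes that the "previous regular expression" was the greedy form r"<div>.*</div>"; the variable names are illustrative only.

```python
import re

text = "<div>hello</div>world</div>"

# Assumed greedy original: .* grabs as much as it can while still matching
greedy = re.compile(r"<div>.*</div>")
# Lazy version: .*? grabs as little as it can while still matching
lazy = re.compile(r"<div>.*?</div>")

greedy_result = greedy.search(text).group()
lazy_result = lazy.search(text).group()

print(greedy_result)  # <div>hello</div>world</div>
print(lazy_result)    # <div>hello</div>
```

The greedy pattern runs on to the last </div> in the string, while the lazy one stops at the first.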

All of the quantifiers, including ?, have a lazy version and you can write ?? to mean a lazy "zero or one" occurrence.
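As a minimal sketch of the lazy "zero or one" quantifier, the following compares u? and u?? using an example word chosen purely for illustration:

```python
import re

# Greedy "zero or one": takes the optional u when it is there
greedy_opt = re.search(r"colou?", "colour").group()

# Lazy "zero or one": prefers to match zero occurrences
lazy_opt = re.search(r"colou??", "colour").group()

# The lazy form still takes the u if the rest of the pattern needs it
lazy_forced = re.search(r"colou??r", "colour").group()

print(greedy_opt)   # colou
print(lazy_opt)     # colo
print(lazy_forced)  # colour
```

The last case shows that lazy means "as few as possible while still matching", not "never match".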

The distinction between greedy and lazy quantifiers is perhaps the biggest reason for a reasonably well-tested regular expression to go wrong when used against a wider range of example strings. Always remember that a standard greedy quantifier will match as many times as possible while still allowing the regular expression to match, and its lazy version will match as few times as possible while still allowing the regular expression to match.

Grouping and Alternation

Regular strings often have alternative forms. For example, the ISBN prefix could be simply ISBN: or it could be ISBN-13: or any of many other reasonable variations. You can specify an either/or situation using the vertical bar |, the alternation operator, as in x|y which will match an x or a y.

For example, r"ISBN:|ISBN-13:" matches either ISBN: or ISBN-13:.

This is easy enough but what about:

r"ISBN:|ISBN-13:\s*\d"

At first glance this seems to match either ISBN: or ISBN-13: followed by any number of white space characters and a single digit, but it doesn’t. The | operator has the lowest priority and the alternatives are everything to its left and everything to its right, i.e. either ISBN: or ISBN-13:\s*\d. To match the white space and digit in both forms of the ISBN prefix we would have to write:

r"ISBN:\s*\d|ISBN-13:\s*\d"

Clearly having to repeat everything that is in common on either side of the alternation operator is going to make things difficult and this is where grouping comes in. Anything grouped between parentheses is treated as a single unit, a subexpression, and grouping has a higher priority than the alternation operator. So, for example:

r"(ISBN:|ISBN-13:)\s*\d"

matches either form of the ISBN prefix followed by any number of white space characters and a single digit because the parentheses limit the range of the alternation operator to the subexpressions to its left and right within the parentheses.
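The effect of precedence can be seen by trying the ungrouped and grouped patterns on the same strings. This is only a sketch and the sample strings are made up for the purpose:

```python
import re

ungrouped = re.compile(r"ISBN:|ISBN-13:\s*\d")
grouped = re.compile(r"(ISBN:|ISBN-13:)\s*\d")

# Ungrouped: the left alternative is just ISBN:, so the digit is ignored
ungrouped_short = ungrouped.search("ISBN: 9").group()
ungrouped_long = ungrouped.search("ISBN-13: 9").group()

# Grouped: both prefixes must be followed by white space and a digit
grouped_short = grouped.search("ISBN: 9").group()
grouped_long = grouped.search("ISBN-13: 9").group()

# With no digit present only the ungrouped pattern still matches
ungrouped_no_digit = ungrouped.search("ISBN: ")
grouped_no_digit = grouped.search("ISBN: ")

print(ungrouped_short)  # ISBN:
print(ungrouped_long)   # ISBN-13: 9
print(grouped_short)    # ISBN: 9
print(grouped_long)     # ISBN-13: 9
print(ungrouped_no_digit is not None, grouped_no_digit is None)  # True True
```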

A similar issue to greedy/lazy matching applies to the alternation operator, because the alternatives are tried in order from left to right and the first one that succeeds is used. For example, suppose you try to match the previous ungrouped expression, but without the colon, r"ISBN|ISBN-13". In this case the first pattern, ISBN, will match even if the string is ISBN-13. It doesn’t matter that the second expression is a "better" match. No amount of grouping will help with this problem because the shorter match will be tried and succeed first. The solution is to swap the order of the subexpressions so that the longer comes first, or to include something that always marks the end of the target string. In this case, for example, if we add the colon then the ISBN: subexpression cannot possibly match the ISBN-13: string.
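A short sketch of the ordering problem and the two suggested fixes, using illustrative variable names:

```python
import re

target = "ISBN-13"

# Shorter alternative first: it succeeds and the longer one is never tried
short_first = re.search(r"ISBN|ISBN-13", target).group()

# Fix 1: put the longer alternative first
long_first = re.search(r"ISBN-13|ISBN", target).group()

# Fix 2: include the colon that marks the end of the prefix
with_colon = re.search(r"ISBN:|ISBN-13:", "ISBN-13:").group()

print(short_first)  # ISBN
print(long_first)   # ISBN-13
print(with_colon)   # ISBN-13:
```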

Groups can also be repeated. For example, (ab)* matches any number of repeats of ab, including none at all.
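As a quick sketch, re.fullmatch can be used to check that the quantifier applies to the whole group rather than to the last character alone:

```python
import re

# The group (ab) is the unit being repeated
repeated = re.fullmatch(r"(ab)*", "ababab")
empty = re.fullmatch(r"(ab)*", "")       # zero repeats is allowed
partial = re.fullmatch(r"(ab)*", "aba")  # trailing a is not a full ab

# Without the group, * applies only to the b
no_group = re.fullmatch(r"ab*", "abbb")
no_group_fail = re.fullmatch(r"ab*", "abab")

print(repeated is not None)   # True
print(empty is not None)      # True
print(partial is None)        # True
print(no_group is not None)   # True
print(no_group_fail is None)  # True
```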

In chapter but not in this extract

  • Capture Groups
  • Backward References
  • Advanced Capture
  • String Manipulation
  • Using Regular Expressions

Summary

  • Python’s regular expressions are best compiled for efficiency. Compiling returns a regular expression object which has methods that use the expression.

  • If you don’t want to compile the expression you can use the alternative regular expression functions, but these aren’t as capable as the equivalent methods.

  • The methods and functions return a match object, or None if there is no match, which has methods that allow you to find out about the nature of the match.

  • Regular expressions only become useful when you start to use pattern matching.

  • You can also use anchors to specify where a match is allowed to happen.

  • To make patterns easier to write you can use a quantifier symbol to specify the allowable number of repeats.

  • By default quantifiers are greedy and will always attempt to find the longest match.

  • You can make a quantifier lazy by adding a ?.

  • The alternation operator, |, can specify a match to one of two possible patterns.

  • Grouping can be used to override the precedence of the regular expression operators.

  • Grouping also leads to the idea of a capture group in which a group matches part of the string. If you don’t want a group to be a capture group you can start it with (?: instead of a plain opening parenthesis.

  • You can refer to a capture group by number or you can assign and use a name.

  • Capture groups are useful for determining which parts of an expression matched and for backward references.

  • A backward reference lets you match against the results of previous matches.

  • Assertions are advanced expressions which test for a match without consuming any of the string and so modify what is captured.

  • You can also use regular expressions to modify strings and to split strings.

