My regex pattern contains two repeated elements: \d{2}\. and <p>(.*)</p>. I want to get rid of this repetition and wondered whether there is a way to loop in Python's regular expression implementation.
Note: I am not asking for help parsing an XML file. There are many great tutorials, how-tos and libraries for that. I am looking for a means of implementing repetition in regex patterns.
My code:
import re
pattern = r'''
<menu>
<day>\w{2} (\d{2}\.\d{2})\.</day>
<description>
<p>(.*)</p>
<p>(.*)</p>
<p>(.*)</p>
</description>
'''
my_example_string = '''
<menu>
<day>Mi 03.04.</day>
<description>
<p>Knoblauchcremesuppe</p>
<p>Rindsbraten "Esterhazy" (Gemüserahmsauce)</p>
<p>mit Hörnchen und Salat</p>
</description>
</menu>
'''
re.findall(pattern, my_example_string, re.MULTILINE)
- Parsing XML with regex is usually wrong, what are you really trying to accomplish? – konijn, Apr 1, 2013 at 14:30
- The XML is malformed, which prevents the use of lxml and XPath. I can easily retrieve the desired data, but I want to find a way to avoid these repetitions in regex patterns in general. – Philipp, Apr 1, 2013 at 14:37
1 Answer
Firstly, just for anyone who might read this: DO NOT take this as an excuse to parse your XML with regular expressions. It is generally a really, really bad idea! In this case the XML is malformed, so it's the best we can do.
The looping constructs in regular expressions are * and {4}, which you are already using. But this is Python, so you can construct your regular expression using Python code:
expression = r"""
<menu>
<day>\w{2} (\d{2}\.\d{2})\.</day>
<description>
"""
# Append one capture group per <p> line; the trailing \n keeps each on its own line.
for _ in range(3):
    expression += "<p>(.*)</p>\n"
expression += """</description>
</menu>
"""
- What about expression += "<p>(.*)</p>\n" * 3 ? – Gareth Rees, Apr 1, 2013 at 18:40
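Putting the pieces of this thread together, here is a minimal sketch of building the pattern programmatically and checking it against the example string from the question. The helper name build_pattern and its n_items parameter are hypothetical and only serve the illustration; the pattern text and the sample data are taken from the question.

```python
import re


def build_pattern(n_items):
    """Return the menu pattern with n_items <p>(.*)</p> groups (hypothetical helper)."""
    # Raw strings: \n, \w and \d are passed through to the regex engine untouched.
    head = r"<menu>\n<day>\w{2} (\d{2}\.\d{2})\.</day>\n<description>\n"
    # String repetition stands in for the "loop" the question asks about.
    body = r"<p>(.*)</p>\n" * n_items
    tail = r"</description>\n</menu>"
    return head + body + tail


my_example_string = '''
<menu>
<day>Mi 03.04.</day>
<description>
<p>Knoblauchcremesuppe</p>
<p>Rindsbraten "Esterhazy" (Gemüserahmsauce)</p>
<p>mit Hörnchen und Salat</p>
</description>
</menu>
'''

# Each match is a tuple: the date followed by one entry per <p> line.
matches = re.findall(build_pattern(3), my_example_string)
print(matches)
```

A quantified group such as (?:<p>(.*)</p>\n){3} would match the same text, but Python's re module keeps only the last capture of a repeated group, which is why the group is expanded in Python code here rather than inside the pattern.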