Either or case in Python and Regex

Question 1

I have a small module that gets the lemma of a word and its plural form. It then searches through sentences looking for a sentence that contains both words (singular or plural) in either order. I have it working but I was wondering if there is a more elegant way to build this expression.

Note: Python2

import re

words = (("cell",), ("wolf", "wolves"))
string1 = "(?:" + "|".join(words[0]) + ")"
string2 = "(?:" + "|".join(words[1]) + ")"
pat = ".+".join((string1, string2)) + "|" + ".+".join((string2, string1))
# pat is now: "(?:cell).+(?:wolf|wolves)|(?:wolf|wolves).+(?:cell)"

Then the search:

pat = re.compile(pat)
for sentence in sentences:
    if len(pat.findall(sentence)) != 0:
        print sentence+'\n'

Alternatively, would this be a good solution?

words = (("cell",), ("wolf", "wolves"))
for sentence in sentences:
    sentence = sentence.lower()
    if any(word in sentence for word in words[0]) and any(word in sentence for word in words[1]):
        print sentence
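
To compare the two approaches above side by side, here is a minimal sketch that wraps each one in a function. The function names (`build_pattern`, `matches_regex`, `matches_substring`) and the sample sentences are made up for illustration, and the code is written so that it also runs on Python 3.

```python
import re

words = (("cell",), ("wolf", "wolves"))


def build_pattern(first, second):
    """Compile a regex matching any form of `first` and of `second`, in either order."""
    a = "(?:" + "|".join(first) + ")"
    b = "(?:" + "|".join(second) + ")"
    return re.compile(a + ".+" + b + "|" + b + ".+" + a)


def matches_regex(sentence, pattern):
    # search() stops at the first hit, which is enough for a yes/no test
    return pattern.search(sentence) is not None


def matches_substring(sentence, first, second):
    lowered = sentence.lower()
    return (any(w in lowered for w in first)
            and any(w in lowered for w in second))


# Hypothetical sample data, not from the original post
sentences = [
    "the wolf has a cell",
    "a cell was taken from two wolves",
    "the wolf howled",
    "The Wolf ate a Cell",
]

pattern = build_pattern(*words)
for s in sentences:
    print(s, matches_regex(s, pattern), matches_substring(s, *words))
```

The two are not quite equivalent as written: the substring version lowercases the sentence first, so it accepts "The Wolf ate a Cell", while the regex is case-sensitive. Passing `re.IGNORECASE` to `re.compile` would be one way to align them.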

Janne Karila · Answer 1 · 2013-12-08 20:02:14Z

You could use findall with a pattern like (cell)|(wolf|wolves) and check if every group was matched:

import re

words = (("cell",), ("wolf", "wolves"))
pat = "|".join("({0})".format("|".join(forms)) for forms in words)
regex = re.compile(pat)
for sentence in sentences:
    matches = regex.findall(sentence)
    # Guard against no matches at all: all() of an empty zip is True.
    if matches and all(any(groupmatches) for groupmatches in zip(*matches)):
        print sentence
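
A self-contained version of this check might look like the sketch below. The function name `contains_all` and the sample sentences are hypothetical. The `bool(matches)` guard is an addition in this sketch: when `findall` returns an empty list, `zip(*matches)` yields nothing and `all()` of an empty iterable is `True`, so a sentence containing neither word would otherwise pass.

```python
import re

words = (("cell",), ("wolf", "wolves"))

# One capturing group per word: "(cell)|(wolf|wolves)"
pat = "|".join("({0})".format("|".join(forms)) for forms in words)
regex = re.compile(pat)


def contains_all(sentence):
    """Return True if every group matched somewhere in the sentence."""
    matches = regex.findall(sentence)  # list of (group1, group2) tuples
    # zip(*matches) regroups the tuples into one column per group;
    # a group that did not take part in a match appears as "" (falsy).
    return bool(matches) and all(any(col) for col in zip(*matches))


# Hypothetical sample data, not from the original post
for s in ("a cell was taken from two wolves",
          "the wolf howled",
          "nothing relevant here"):
    print(s, contains_all(s))
```

Unlike the either-order pattern, this version needs no `.+` between the words, so word order never has to be spelled out in the regex.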

A step further than me. Seems good to me. – eyquem, commented Dec 8, 2013 at 20:17

eyquem · Answer 2 · 2013-12-08 19:48:52Z

You may find this way of writing it easier to read:

words = (('cell',), ('wolf', 'wolves'))
string1 = "|".join(words[0]).join(('(?:', ')'))
print string1
string2 = "|".join(words[1]).join(('(?:', ')'))
print string2
pat = "|".join((
    ".+".join((string1, string2)),
    ".+".join((string2, string1)),
))
print pat

I would also advise using '.+?' instead of plain '.+'. It can save the regex engine some work as it scans the string: the lazy quantifier stops as soon as it encounters the pattern that follows it, instead of consuming the rest of the string and then backtracking.

Another advantage is that this way of writing extends easily to several singular/plural pairs.
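
As a rough sketch of how this could be packaged for reuse, the helper below builds the either-order pattern for any two groups of word forms, with the lazy `.+?` as the default gap. The name `either_order`, its `gap` parameter, and the second word pair are assumptions made for illustration; they are not part of the answer.

```python
import re


def either_order(forms_a, forms_b, gap=".+?"):
    """Build a pattern matching any form in forms_a and any in forms_b, in either order."""
    a = "|".join(forms_a).join(("(?:", ")"))
    b = "|".join(forms_b).join(("(?:", ")"))
    return "|".join((gap.join((a, b)), gap.join((b, a))))


pat = either_order(("cell",), ("wolf", "wolves"))
print(pat)  # (?:cell).+?(?:wolf|wolves)|(?:wolf|wolves).+?(?:cell)

# Hypothetical second pair, to show the helper is not tied to one set of words
other = re.compile(either_order(("mouse", "mice"), ("cat", "cats")))
print(other.search("the cats chased three mice") is not None)  # True
```

The words are joined into the pattern verbatim, so forms containing regex metacharacters would need `re.escape` first.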

Silly question, but isn't ".+?" the same thing as ".*"? – SylvainD, commented Dec 8, 2013 at 19:50

@Josay No. See docs.python.org/2/library/re.html#regular-expression-syntax: .+ is greedy, .+? is non-greedy. Given the string '..cell....wolf.......', the regex engine running the pattern (?:cell).+(?:wolf|wolves) will match cell, and then .+ will consume all the remaining characters, dots and wolf included, up to the end of the string. There it finds nothing left for (?:wolf|wolves) to match, so it backtracks, giving characters back until that part of the pattern can match. – eyquem, commented Dec 8, 2013 at 20:06

So the pattern (?:cell).+(wolf\d|wolves) will match 'wolf2' in ',,cell,,wolf1,,,wolf2,,,' while (?:cell).+?(wolf\d|wolves) will match 'wolf1'. – eyquem, commented Dec 8, 2013 at 20:09
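
The behaviour described in these comments can be checked directly. The sketch below reuses the exact test string from the last comment, then adds two small cases (chosen here for illustration) showing how `.+?` differs from `.*`: the former is lazy and needs at least one character, the latter is greedy and may match nothing.

```python
import re

text = ",,cell,,wolf1,,,wolf2,,,"

greedy = re.search(r"(?:cell).+(wolf\d|wolves)", text)
lazy = re.search(r"(?:cell).+?(wolf\d|wolves)", text)

print(greedy.group(1))  # wolf2: .+ runs to the end, then backtracks to the last candidate
print(lazy.group(1))    # wolf1: .+? grows one character at a time and stops at the first

# ".*" is greedy and may match zero characters
print(re.match(r"a.*", "abc").group())       # abc
print(re.match(r"a.*b", "ab") is not None)   # True

# ".+?" is lazy and requires at least one character
print(re.match(r"a.+?", "abc").group())      # ab
print(re.match(r"a.+?b", "ab") is not None)  # False
```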
