3
\$\begingroup\$

I have a small module that gets the lemma of a word and its plural form. It then searches through sentences looking for a sentence that contains both words (singular or plural) in either order. I have it working but I was wondering if there is a more elegant way to build this expression.

Note: Python2

import re

words = (("cell",), ("wolf", "wolves"))
string1 = "(?:" + "|".join(words[0]) + ")"
string2 = "(?:" + "|".join(words[1]) + ")"
pat = ".+".join((string1, string2)) + "|" + ".+".join((string2, string1))
# pat is now: "(?:cell).+(?:wolf|wolves)|(?:wolf|wolves).+(?:cell)"

Then the search:

pat = re.compile(pat)
for sentence in sentences:
    if len(pat.findall(sentence)) != 0:
        print sentence+'\n'

Alternatively, would this be a good solution?

words = (("cell",), ("wolf", "wolves"))
for sentence in sentences:
    sentence = sentence.lower()
    if any(word in sentence for word in words[0]) and any(word in sentence for word in words[1]):
        print sentence
asked Dec 8, 2013 at 19:08
\$\endgroup\$

2 Answers

3
\$\begingroup\$

You could use findall with a pattern like (cell)|(wolf|wolves) and check if every group was matched:

words = (("cell",), ("wolf","wolves"))
pat = "|".join(("({0})".format("|".join(forms)) for forms in words))
regex = re.compile(pat)
for sentence in sentences:
    matches = regex.findall(sentence)
    # Guard against no matches at all: all() over an empty zip() is True.
    if matches and all(any(groupmatches) for groupmatches in zip(*matches)):
        print sentence
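
For reference, here is a minimal self-contained sketch of the same idea. The helper name `contains_all_groups` and the sample sentences are made up for illustration, and `re.escape` is an extra precaution that the original does not use. It is written so that it runs on both Python 2 and Python 3.

```python
import re

def contains_all_groups(sentence, words):
    """Return True if at least one form from every word group occurs in sentence."""
    # One capturing group per word group, e.g. "(cell)|(wolf|wolves)".
    pattern = "|".join(
        "({0})".format("|".join(re.escape(form) for form in forms))
        for forms in words
    )
    # With several capturing groups, findall returns one tuple per match,
    # holding "" for every group that did not take part in that match.
    matches = re.findall(pattern, sentence)
    # zip(*matches) regroups the results by capturing group; an empty list
    # of matches has to be rejected explicitly, since all() of nothing is True.
    return bool(matches) and all(any(column) for column in zip(*matches))

words = (("cell",), ("wolf", "wolves"))
sentences = [
    "the wolves circled the cell",
    "a cell divides",
    "nothing relevant here",
]
matching = [s for s in sentences if contains_all_groups(s, words)]
for s in matching:
    print(s)
```

Only the first sample sentence contains a form from both groups, so it is the only one printed.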
answered Dec 8, 2013 at 20:02
\$\endgroup\$
1
  • \$\begingroup\$ A step further than me. Seems good to me. \$\endgroup\$ Commented Dec 8, 2013 at 20:17
1
\$\begingroup\$

You may find this way of writing it easier to read:

words = (('cell',), ('wolf','wolves'))
string1 = "|".join(words[0]).join(('(?:',')'))
print string1
string2 = "|".join(words[1]).join(('(?:',')'))
print string2
pat = "|".join((
    ".+".join((string1, string2)),
    ".+".join((string2, string1)),
))
print pat

I would also advise using '.+?' instead of just '.+'. It can save the regex engine some work as it runs through the analysed string: the lazy form stops as soon as it encounters the pattern that follows, instead of first consuming the rest of the string and then backtracking.

Another advantage of writing it this way is that it can easily be extended when there are several noun/plural pairs.
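
One possible way to do that extension is sketched below. It is only an illustration: the helper `either_order_pattern` and the third word pair are invented for the example, and `itertools.permutations` is my own choice for enumerating the orderings.

```python
import re
from itertools import permutations

def either_order_pattern(words):
    """Build a pattern matching all word groups, in any order."""
    # One non-capturing alternation per noun, e.g. "(?:wolf|wolves)".
    groups = ["(?:{0})".format("|".join(forms)) for forms in words]
    # One alternative per possible ordering of the groups.
    return "|".join(".+?".join(order) for order in permutations(groups))

two = either_order_pattern((("cell",), ("wolf", "wolves")))
print(two)

# Hypothetical third pair, added only to show the extension.
words = (("cell",), ("wolf", "wolves"), ("moon", "moons"))
pat = re.compile(either_order_pattern(words))
found = pat.search("moons rose while the wolf left its cell") is not None
missing = pat.search("the wolf left its cell") is not None
```

Note that the number of alternatives grows factorially with the number of groups (2 groups give 2 orderings, 3 give 6, 4 give 24), so this only stays practical for a handful of nouns.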

answered Dec 8, 2013 at 19:48
\$\endgroup\$
3
  • \$\begingroup\$ Silly question but isn't ".+?" the same thing as ".*" ? \$\endgroup\$ Commented Dec 8, 2013 at 19:50
  • 1
    \$\begingroup\$ @Josay No. See this link: (docs.python.org/2/library/re.html#regular-expression-syntax). .+ is greedy, .+? is non-greedy. This means that when '..cell....wolf.......' is analysed, the regex engine, using the pattern (?:cell).+(?:wolf|wolves), will match cell and then .+ will match all the subsequent characters, dots and wolf included, up to the end of the string; only there will it realize that nothing is left to match (?:wolf|wolves). So it will backtrack and search again in order to find such a pattern. \$\endgroup\$ Commented Dec 8, 2013 at 20:06
  • 1
    \$\begingroup\$ Then pattern (?:cell).+(wolf\d|wolves) will match 'wolf2' in ',,cell,,wolf1,,,wolf2,,,' while (?:cell).+?(wolf\d|wolves) will match 'wolf1' \$\endgroup\$ Commented Dec 8, 2013 at 20:09
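
The difference described in these comments can be checked directly. This is a small sketch using the same sample string and patterns as the comment above; the variable names are arbitrary.

```python
import re

text = ",,cell,,wolf1,,,wolf2,,,"

# Greedy: .+ first runs to the end of the string, then backtracks,
# so the group captures the LAST candidate.
greedy = re.search(r"(?:cell).+(wolf\d|wolves)", text)

# Lazy: .+? grows one character at a time,
# so the group captures the FIRST candidate.
lazy = re.search(r"(?:cell).+?(wolf\d|wolves)", text)

print(greedy.group(1))  # wolf2
print(lazy.group(1))    # wolf1
```

For the original task, which only asks whether a sentence matches at all, both forms give the same yes/no answer; they differ in what is captured and in how much backtracking the engine may have to do.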
