I have a small module that gets the lemma of a word and its plural form. It then searches through sentences looking for a sentence that contains both words (singular or plural) in either order. I have it working but I was wondering if there is a more elegant way to build this expression.
Note: Python2
import re
words = (("cell",), ("wolf", "wolves"))
string1 = "(?:"+"|".join(words[0])+")"
string2 = "(?:"+"|".join(words[1])+")"
pat = ".+".join((string1, string2)) +"|"+ ".+".join((string2, string1))
# Pat output: "(?:cell).+(?:wolf|wolves)|(?:wolf|wolves).+(?:cell)"
Then the search:
pat = re.compile(pat)
for sentence in sentences:
    if len(pat.findall(sentence)) != 0:
        print sentence+'\n'
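For reference, the two snippets above can be combined into one self-contained sketch. The helper name `build_pattern` and the sample sentences are made up for illustration. `re.escape` is added on the assumption that the words should be matched literally, and `search` stands in for `findall` on the assumption that only the existence of a match matters.

```python
import re

def build_pattern(forms_a, forms_b):
    """Compile a regex matching any form of A and any form of B, in either order."""
    group_a = "(?:" + "|".join(re.escape(w) for w in forms_a) + ")"
    group_b = "(?:" + "|".join(re.escape(w) for w in forms_b) + ")"
    return re.compile(
        ".+".join((group_a, group_b)) + "|" + ".+".join((group_b, group_a))
    )

words = (("cell",), ("wolf", "wolves"))
pat = build_pattern(*words)

# Hypothetical sample data, not taken from the original post.
sentences = [
    "the wolf has a cell",
    "a cell from two wolves",
    "only a wolf here",
    "only a cell here",
]

# search() stops at the first match, which is enough for a yes/no test.
matching = [s for s in sentences if pat.search(s)]
for sentence in matching:
    print(sentence)
```

Only the first two sample sentences contain both a form of "cell" and a form of "wolf", so only those are printed.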
Alternatively, would this be a good solution?
words = (("cell",), ("wolf", "wolves"))
for sentence in sentences:
    sentence = sentence.lower()
    if any(word in sentence for word in words[0]) and any(word in sentence for word in words[1]):
        print sentence
2 Answers
You could use findall with a pattern like (cell)|(wolf|wolves) and check if every group was matched:
import re
words = (("cell",), ("wolf","wolves"))
pat = "|".join(("({0})".format("|".join(forms)) for forms in words))
regex = re.compile(pat)
for sentence in sentences:
    matches = regex.findall(sentence)
    # zip(*[]) is empty and all() of nothing is True, so skip sentences with no match at all
    if matches and all(any(groupmatches) for groupmatches in zip(*matches)):
        print sentence
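To make the group check concrete, here is a small runnable sketch of the same idea. The function name `contains_all_groups` is hypothetical, and the explicit test on `matches` is an addition: without it, a sentence with no match at all would pass, because `all()` of an empty sequence is `True`.

```python
import re

words = (("cell",), ("wolf", "wolves"))
pat = "|".join("({0})".format("|".join(forms)) for forms in words)
regex = re.compile(pat)  # pattern: (cell)|(wolf|wolves)

def contains_all_groups(sentence):
    """Return True if every capture group matched somewhere in the sentence."""
    matches = regex.findall(sentence)
    # findall returns one tuple per match, with one entry per group;
    # groups that did not take part in a given match yield ''.
    # zip(*matches) regroups the entries per capture group.
    return bool(matches) and all(any(column) for column in zip(*matches))

print(regex.findall("a wolf in a cell"))
print(contains_all_groups("a wolf in a cell"))
print(contains_all_groups("only wolves"))
print(contains_all_groups("nothing here"))
```

For "a wolf in a cell", `findall` returns `[('', 'wolf'), ('cell', '')]`: each group is non-empty in at least one match, so the sentence qualifies.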
A step further than me. Seems good to me. – eyquem, Dec 8, 2013 at 20:17
You may find this way of writing it easier to read:
words = (('cell',), ('wolf','wolves'))
string1 = "|".join(words[0]).join(('(?:',')'))
print string1
string2 = "|".join(words[1]).join(('(?:',')'))
print string2
pat = "|".join((
    ".+".join((string1, string2)),
    ".+".join((string2, string1))
))
print pat
My advice is also to use '.+?' instead of just '.+'. It can save the regex engine some work as it runs through the analysed string: the non-greedy quantifier stops expanding as soon as the pattern that follows it matches, instead of first consuming the rest of the string and then backtracking.
Another advantage is that it can easily be extended when there are several singular/plural pairs.
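One possible way to extend it to more than two word groups is sketched below. It is only an illustration: the helper name `build_any_order_pattern`, the third word group and the use of `itertools.permutations` and `re.escape` are assumptions, not part of the code above. Note that the number of alternatives grows factorially with the number of groups, so this is only reasonable for a handful of them.

```python
import re
from itertools import permutations

def build_any_order_pattern(word_groups, gap=".+?"):
    """Build a pattern matching one form from every group, in any order."""
    # One non-capturing alternation per group, e.g. (?:wolf|wolves)
    groups = [
        "|".join(re.escape(w) for w in forms).join(("(?:", ")"))
        for forms in word_groups
    ]
    # One alternative per possible ordering of the groups
    return "|".join(gap.join(order) for order in permutations(groups))

# Hypothetical third group added for illustration
words = (("cell",), ("wolf", "wolves"), ("moon", "moons"))
pat = re.compile(build_any_order_pattern(words))

print(bool(pat.search("the moon, a wolf and a cell")))
print(bool(pat.search("the moon and a cell")))
```

With only the two original groups, the helper produces the same pattern as the snippet above, with '.+?' as the separator.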
Silly question but isn't ".+?" the same thing as ".*" ? – SylvainD, Dec 8, 2013 at 19:50
@Josay No. See this link: (docs.python.org/2/library/re.html#regular-expression-syntax). .+ is greedy, .+? is non-greedy. It means that when '..cell....wolf.......' is analysed, the regex engine, with the pattern (?:cell).+(?:wolf|wolves), will match cell, and then .+ will match all the subsequent characters, dots and wolf included, until the end of the string; there it will realize that nothing is left for (?:wolf|wolves) to match. So it will move backward and search again in order to find such a pattern. – eyquem, Dec 8, 2013 at 20:06
Then the pattern (?:cell).+(wolf\d|wolves) will match 'wolf2' in ',,cell,,wolf1,,,wolf2,,,' while (?:cell).+?(wolf\d|wolves) will match 'wolf1'. – eyquem, Dec 8, 2013 at 20:09
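The difference discussed in these comments can be checked with a short script. The first part reuses the string and patterns from the comment above; the second part, with the made-up string 'cellwolf', illustrates why '.+?' is not the same as '.*': it still requires at least one character between the two words.

```python
import re

text = ",,cell,,wolf1,,,wolf2,,,"

# Greedy: .+ runs to the end of the string, then backtracks to the LAST candidate.
greedy = re.search(r"(?:cell).+(wolf\d|wolves)", text)
# Non-greedy: .+? grows one character at a time and stops at the FIRST candidate.
lazy = re.search(r"(?:cell).+?(wolf\d|wolves)", text)

print(greedy.group(1))  # wolf2
print(lazy.group(1))    # wolf1

# '.*' accepts zero characters between the words, '.+?' needs at least one.
star = re.search(r"cell.*wolf", "cellwolf")
lazy_plus = re.search(r"cell.+?wolf", "cellwolf")
print(star is not None)       # True
print(lazy_plus is not None)  # False
```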