Elegant way to replace substring in a regex with optional groups in Python?

Question 1

Given a string taken from the following set:

strings = [
 "The sky is blue and I like it",
 "The tree is green and I love it",
 "A lemon is yellow"
]

I would like to construct a function that replaces the subject, the color, and the optional verb in this string with other values.

All strings match a certain regex pattern, as follows:

regex = r"(?:The|A) (?P<subject>\w+) is (?P<color>\w+)(?: and I (?P<verb>\w+) it)?"

The expected output of such function would look like this:

repl("The sea is blue", "moon", "white", "hate")
# => "The moon is white"

Here is the solution I came up with (I can't use .replace() because there are edge cases, for example if the string contains the subject twice):

import re

def repl(sentence, subject, color, verb):
    m = re.match(regex, sentence)
    s = sentence
    # Keep everything outside the "subject" and "color" spans, swap in the new words.
    new_string = s[:m.start("subject")] + subject + s[m.end("subject"):m.start("color")] + color
    if m.group("verb") is None:
        # The optional verb clause is absent: keep the rest of the string as-is.
        new_string += s[m.end("color"):]
    else:
        new_string += s[m.end("color"):m.start("verb")] + verb + s[m.end("verb"):]
    return new_string

Do you think there is a more straightforward way to implement this?

Question 2

Do you have to use a regex? If not, split(" ") the string into words, replace words 1, 3, and possibly 6, then " ".join(...) it back into a sentence.
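
A minimal sketch of that split/join idea, assuming the words are separated by exactly one space (the helper name repl_by_split is made up for illustration):

```python
def repl_by_split(sentence, subject, color, verb=None):
    # Assumption: single spaces only, so the word positions are fixed.
    words = sentence.split(" ")
    words[1] = subject            # word 1 is the subject
    words[3] = color              # word 3 is the color
    if len(words) > 6 and verb is not None:
        words[6] = verb           # word 6 is the optional verb
    return " ".join(words)


print(repl_by_split("The sky is blue and I like it", "moon", "white", "hate"))
# The moon is white and I hate it
print(repl_by_split("A lemon is yellow", "moon", "white", "hate"))
# A moon is white
```

This only holds while the sentence layout stays this rigid; any extra whitespace shifts the indices.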

Question 3

What do you mean by 'string contains subject twice'? That doesn't seem like it would match your regex.

Question 4

@AJNeufeld This is not possible; the sentences are actually more dynamic than the examples here and may contain an arbitrary number of spaces.

Question 5

@Reinderien For example, repl("The meloon is orange", "orange", "great", "like") or simply repl("A letter is A", "letter", "B", "fail")
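
To make the edge case concrete, here is a small sketch of what a naive str.replace() does with those two inputs (the variable names are only for illustration):

```python
# Case 1: the old color "A" also appears as the article.
sentence = "A letter is A"
replace_all = sentence.replace("A", "B")
replace_first = sentence.replace("A", "B", 1)
print(replace_all)    # B letter is B
print(replace_first)  # B letter is A

# Case 2: the new subject equals the old color, so a second replace
# also rewrites the word that was just inserted.
chained = "The meloon is orange".replace("meloon", "orange").replace("orange", "great")
print(chained)        # The great is great
```

Neither result is the intended "A letter is B" or "The orange is great", which is why the replacement has to be driven by match positions rather than by word values.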

Question 6

import re

# Groups capture the text to KEEP; the words to replace are left ungrouped.
regex = re.compile(
    r'(The|A) '
    r'\w+'
    r'( is )'
    r'\w+'
    r'(?:'
    r'( and I )'
    r'\w+'
    r'( it)'
    r')?'
)

def repl(sentence, subject, colour, verb=None):
    m = regex.match(sentence)
    new = m.expand(rf'\1 {subject}\2{colour}')
    if m[3]:
        # The optional " and I <verb> it" clause is present.
        new += m.expand(rf'\3{verb}\4')
    return new

def test():
    assert repl('The sky is blue and I like it', 'bathroom', 'smelly', 'distrust') == \
        'The bathroom is smelly and I distrust it'
    assert repl('The tree is green and I love it', 'pinata', 'angry', 'fear') == \
        'The pinata is angry and I fear it'
    assert repl('A lemon is yellow', 'population', 'dumbfounded') == \
        'A population is dumbfounded'

Essentially, invert the sections of the regex around which you put groups; they're the things you want to save.
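
As a quick illustration of what "inverted" groups capture, here is a sketch that prints the groups for a long and a short sentence. The pattern is the same one written on a single line, and the names long_match and short_match are only for illustration:

```python
import re

# Same pattern as in the answer: groups wrap the text to keep.
regex = re.compile(r'(The|A) \w+( is )\w+(?:( and I )\w+( it))?')

long_match = regex.match('The sky is blue and I like it')
short_match = regex.match('A lemon is yellow')

print(long_match.groups())   # ('The', ' is ', ' and I ', ' it')
print(short_match.groups())  # ('A', ' is ', None, None)

# Match.expand() fills a template with the captured (kept) pieces.
print(long_match.expand(r'\1 moon\2white\3hate\4'))
# The moon is white and I hate it
```

Groups 3 and 4 are None when the optional clause is missing, which is what the `if m[3]` check relies on.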

Question 7

I did not know about expand(); this seems very useful. Thanks!

Question 8

You might want to experiment with NLTK, a leading platform for building Python programs to work with human language data.

You could import it, tag the words (NOUN, ADJ, ...), and replace words in the original sentence according to their tags:

import nltk
from collections import defaultdict
from nltk.tag import pos_tag, map_tag

def simple_tags(words):
    # See https://stackoverflow.com/a/5793083/6419007
    return [(word, map_tag('en-ptb', 'universal', tag)) for (word, tag) in nltk.pos_tag(words)]

def repl(sentence, *new_words):
    # Group the replacement words by their part-of-speech tag.
    new_words_by_tag = defaultdict(list)
    for new_word, tag in simple_tags(new_words):
        new_words_by_tag[tag].append(new_word)
    new_sentence = []
    for word, tag in simple_tags(nltk.word_tokenize(sentence)):
        # Swap in the next replacement that has the same tag, if any.
        possible_replacements = new_words_by_tag.get(tag)
        if possible_replacements:
            new_sentence.append(possible_replacements.pop(0))
        else:
            new_sentence.append(word)
    return ' '.join(new_sentence)

repl("The sea is blue", "moon", "white", "hate")
# 'The moon is white'
repl("The sea is blue", "yellow", "elephant")
# 'The elephant is yellow'

This version is brittle though, because some verbs appear to be nouns or vice-versa.

I guess someone with more NLTK experience could find a more robust way to replace the words.

Question 9

Here is a solution using the original regex, instead of the inverted one suggested by Reinderien.

Your difficulty comes from manually building up the result from the spans of the original string. If you maintain a list of the starting points (the start of the string and the end of every group) and a list of the ending points (the start of every group and the end of the string), you can use these to retrieve the parts of the original string you want to keep:

start = [0] + [m.end(i+1) for i in range(m.lastindex)]
end = [m.start(i+1) for i in range(m.lastindex)] + [None]
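
To see what these two lists hold, here is a small sketch using the regex from the question and one sample sentence (the list name kept is only for illustration):

```python
import re

regex = r"(?:The|A) (?P<subject>\w+) is (?P<color>\w+)(?: and I (?P<verb>\w+) it)?"
sentence = "The sky is blue and I like it"
m = re.match(regex, sentence)

start = [0] + [m.end(i+1) for i in range(m.lastindex)]
end = [m.start(i+1) for i in range(m.lastindex)] + [None]
kept = [sentence[s:e] for s, e in zip(start, end)]

print(start)  # [0, 7, 15, 26]
print(end)    # [4, 11, 22, None]
print(kept)   # ['The ', ' is ', ' and I ', ' it']
```

For a sentence without the optional clause, m.lastindex is 2 rather than 3, so the unmatched verb group is never indexed.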

We can glue these parts together with a placeholder into which we will substitute the desired values:

fmt = "{}".join(sentence[s:e] for s, e in zip(start, end))

Using "{}" as the joiner creates a string like The {} is {} and I {} it, which makes a perfect .format() string for substituting the desired replacements:

def repl(sentence, subject, color, verb=None):
    m = re.match(regex, sentence)
    start = [0] + [m.end(i+1) for i in range(m.lastindex)]
    end = [m.start(i+1) for i in range(m.lastindex)] + [None]
    fmt = "{}".join(sentence[s:e] for s, e in zip(start, end))
    # Extra positional arguments (e.g. verb when the clause is absent) are ignored.
    return fmt.format(subject, color, verb)
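
One caveat that may matter for more dynamic input: the kept text becomes part of a format string, so literal braces in the sentence would be misread as format fields. Here is a sketch of a variant that joins the pieces directly instead. The helper name replace_groups and its keyword-argument interface are assumptions for illustration, not part of the approach above:

```python
import re

regex = re.compile(
    r"(?:The|A) (?P<subject>\w+) is (?P<color>\w+)(?: and I (?P<verb>\w+) it)?"
)

def replace_groups(sentence, **replacements):
    m = regex.match(sentence)
    # Collect (span, new_value) for every named group that actually matched,
    # ordered by position in the sentence.
    spans = sorted(
        (m.span(name), value)
        for name, value in replacements.items()
        if m.group(name) is not None
    )
    pieces, pos = [], 0
    for (start, end), value in spans:
        pieces += [sentence[pos:start], value]  # kept text, then the replacement
        pos = end
    pieces.append(sentence[pos:])               # whatever follows the last group
    return "".join(pieces)


print(replace_groups("The sky is blue and I like it {really}",
                     subject="moon", color="white", verb="hate"))
# The moon is white and I hate it {really}
print(replace_groups("A letter is A", subject="letter", color="B", verb="fail"))
# A letter is B
```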

If you don't mind being a little cryptic, we can even shorten this to a 3-line function:

def repl(sentence, subject, color, verb=None):
    m = re.match(regex, sentence)
    idx = [0] + [pos for i in range(m.lastindex) for pos in m.span(i+1)] + [None]
    return "{}".join(sentence[s:e] for s, e in zip(*[iter(idx)]*2)).format(subject, color, verb)
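
The cryptic part is zip(*[iter(idx)]*2), which walks a flat list two items at a time. A short sketch of what it produces, using the index values for "The sky is blue and I like it" (the names pairs and pairs_sliced are only for illustration):

```python
# Flat list: 0, then the (start, end) of each group, then None.
idx = [0, 4, 7, 11, 15, 22, 26, None]

# [iter(idx)] * 2 holds the SAME iterator twice, so zip() pulls
# consecutive items into each pair.
pairs = list(zip(*[iter(idx)] * 2))
print(pairs)  # [(0, 4), (7, 11), (15, 22), (26, None)]

# An equivalent and arguably more readable spelling using slices.
pairs_sliced = list(zip(idx[::2], idx[1::2]))
print(pairs_sliced == pairs)  # True
```

Each pair is a slice of the original sentence to keep, which is the same information as the separate start and end lists in the longer version.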

score 12 · Accepted Answer · 2019-03-29 14:34:32Z

