Given a string and a list of substrings that should be replaced with placeholders, e.g.
import re
from copy import copy
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
The first goal is to replace the substrings from phrases in the original_text with indexed placeholders, e.g.
text = copy(original_text)
backplacement = {}
for i, phrase in enumerate(phrases):
backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
[out]:
Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen
Then there'll be some functions to manipulate the text with the placeholders, e.g.
cleaned_text = func('Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen')
print(cleaned_text)
that outputs:
MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2
The last step is to reverse the replacement and put the original phrases back, i.e.
' '.join([backplacement[tok] if tok in backplacement else tok for tok in cleaned_text.split()])
[out]:
"'s_morgen ik 's-Hertogenbosch depository_financial_institution"
The questions are:
- If the list of substrings in phrases is huge, the first replacement and the final backplacement take a very long time.
Is there a way to do the replacement/backplacement with a regex?
- Using the re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text) regex substitution isn't very helpful, especially if a phrase matches a substring inside a word rather than the full word.
E.g.
phrases = ["org", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
we get an awkward output:
Something, 's mMWEPHRASE0en, ik MWEPHRASE1 im das MWEPHRASE2 gehen
I've tried using r"\b{}\b".format(phrase), but that didn't work for the phrases with punctuation, i.e.
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
text = re.sub(r"\b{}\b".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
[out]:
Something, 's morgen, ik 's-Hertogenbosch im das MWEPHRASE2 gehen
Is there some way to denote the word boundary for the phrases in the re.sub regex pattern?
4 Answers
Instead of using re.sub you can split it!
def do_something_with_str(string):
    # do something with string here.
    # for example let's wrap the string with "@" symbol if it's not empty
    return f"@{string}" if string else string

def get_replaced_list(string, words):
    result = [(string, True), ]
    # we take each word we want to replace
    for w in words:
        new_result = []
        # Getting each chunk in the old result
        for r in result:
            # Now we split every string in results using our word.
            split_list = list((x, True) for x in r[0].split(w)) if r[1] else list([r, ])
            # If we replace successfully - add all the strings
            if len(split_list) > 1:
                # This one would be for [text, replaced, text, replaced...]
                sub_result = []
                ws = [(w, False), ] * (len(split_list) - 1)
                for x, replaced in zip(split_list, ws):
                    sub_result.append(x)
                    sub_result.append(replaced)
                sub_result.append(split_list[-1])
                # Add to new result
                new_result.extend(sub_result)
            # If not - just add it to results
            else:
                new_result.extend(split_list)
        result = new_result
    return result

if __name__ == '__main__':
    initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
    words_to_replace = ('a', 'c')
    replaced_list = get_replaced_list(initial_string, words_to_replace)
    modified_list = [(do_something_with_str(x[0]), True) if x[1] else x for x in replaced_list]
    final_string = ''.join([x[0] for x in modified_list])
Here are the variable values from the example above:
initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
words_to_replace = ('a', 'c')
replaced_list = [('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True)]
modified_list = [('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True)]
final_string = 'ac@bbc@bbcac@bbc@bbcac@bbc@bbca'
As you can see, the lists contain tuples. Each tuple holds two values: a string and a boolean that says whether it is plain text or a replaced value (True for text).
After you get the replaced list, you can modify it as in the example, checking whether it's a text value (if x[1] == True).
Hope that helps!
P.S. String formatting like f"some string here {some_variable_here}" requires Python 3.6
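For the placeholder problem from the question, the same split-based idea could look roughly like this; a sketch only, where the placeholders dict and the final join are mine rather than part of the answer's code. Note it has the same substring-matching caveat as the original re.sub approach, since str.split knows nothing about word boundaries.
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

# map each phrase to its indexed placeholder
placeholders = {p: "MWEPHRASE{}".format(i) for i, p in enumerate(phrases)}

# keep the text chunks, swap every matched phrase for its placeholder
replaced_list = get_replaced_list(original_text, phrases)
with_placeholders = ''.join(chunk if is_text else placeholders[chunk]
                            for chunk, is_text in replaced_list)
print(with_placeholders)
# Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen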
I think there are two keys to using regular expressions for this task:
Use custom boundaries, capture them, and substitute them back, along with the phrase.
Use a function to process the substitution matches, in both directions.
Below is an implementation that uses this approach. I tweaked your text slightly to repeat one of the phrases.
import re
from copy import copy
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen 's morgen"
text = copy(original_text)
#
# The phrases of interest
#
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
#
# Create the mapping dictionaries
#
phrase_to_mwe = {}
mwe_to_phrase = {}
#
# Build the mappings
#
for i, phrase in enumerate(phrases):
    mwephrase = "MWEPHRASE{}".format(i)
    mwe_to_phrase[mwephrase] = phrase.replace(' ', '_')
    phrase_to_mwe[phrase] = mwephrase
#
# Regex match handlers
#
def handle_forward(match):
    b1 = match.group(1)
    phrase = match.group(2)
    b2 = match.group(3)
    return b1 + phrase_to_mwe[phrase] + b2

def handle_backward(match):
    return mwe_to_phrase[match.group(1)]
#
# The forward regex will look like:
#
# (^|[ ])('s morgen|'s-Hertogenbosch|depository financial institution)([, ]|$)
#
# which captures three components:
#
# (1) Front boundary
# (2) Phrase
# (3) Back boundary
#
# Anchors allow matching at the beginning and end of the text. Additional boundary characters can be
# added as necessary, e.g. to allow semicolons after a phrase, we could update the back boundary to:
#
# ([,; ]|$)
#
regex_forward = re.compile(r'(^|[ ])(' + '|'.join(phrases) + r')([, ]|$)')
regex_backward = re.compile(r'(MWEPHRASE\d+)')
#
# Pretend we cleaned the text in the middle
#
cleaned = 'MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2 MWEPHRASE0'
#
# Do the translations
#
text1 = regex_forward .sub(handle_forward, text)
text2 = regex_backward.sub(handle_backward, cleaned)
print('original: {}'.format(original_text))
print('text1 : {}'.format(text1))
print('text2 : {}'.format(text2))
Running this generates:
original: Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen 's morgen
text1 : Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen MWEPHRASE0
text2 : 's_morgen ik 's-Hertogenbosch depository_financial_institution 's_morgen
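One caveat, not part of the answer above: '|'.join(phrases) drops the phrases into the pattern verbatim, so if a phrase ever contains regex metacharacters (., +, (, ...), it would be safer to escape each alternative. A small defensive variant, assuming the same phrases list:
# escape each phrase so any regex metacharacters are treated literally
regex_forward = re.compile(r'(^|[ ])('
                           + '|'.join(re.escape(p) for p in phrases)
                           + r')([, ]|$)')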
Here's a strategy you could use:
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
# need this module for the reduce function
import functools as fn
#convert phrases into a dictionary of numbered placeholders (tokens)
tokens = { kw:"MWEPHRASE%s"%i for i,kw in enumerate(phrases) }
#replace embedded phrases with their respective token
tokenized = fn.reduce(lambda s,kw: tokens[kw].join(s.split(kw)), phrases, original_text)
#Apply text cleaning logic on the tokenized text
#This assumes the placeholders are left untouched,
#(although it's ok to move them around)
cleaned_text = cleanUpfunction(tokenized)
#reverse the token dictionary (to map numbered placeholders back to original phrases)
unTokens = {v:k for k,v in tokens.items() }
#rebuild the text, replacing each token (placeholder) with its original phrase
final_text = fn.reduce(lambda s,tk: unTokens[tk].join(s.split(tk)), unTokens, cleaned_text)
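For reference, assuming cleanUpfunction returns the cleaned string from the question ('MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2'), the intermediate and final values come out as follows (note that this variant restores the phrases with their original spaces rather than underscores):
print(tokenized)
# Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen
print(final_text)
# 's morgen ik 's-Hertogenbosch depository financial institution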
What you are looking for is called "multi-string search" or "multi-pattern search". The most common solutions are the Aho-Corasick and Rabin-Karp algorithms. If you want to implement it yourself, go with Rabin-Karp, since it's easier to grasp. Otherwise, you'll find some libraries. Here's a solution with the library https://pypi.python.org/pypi/py_aho_corasick.
Let
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
And, for testing purposes:
def clean(text):
    """A simple stub"""
    assert text == 'Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen'
    return "MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2"
Now, you have to define two automata: one for the outward journey and one for the return. An automaton is defined by a list of (key, value) pairs:
from py_aho_corasick import py_aho_corasick

fore_automaton = py_aho_corasick.Automaton([(phrase, "MWEPHRASE{}".format(i)) for i, phrase in enumerate(phrases)])
back_automaton = py_aho_corasick.Automaton([("MWEPHRASE{}".format(i), phrase.replace(' ', '_')) for i, phrase in enumerate(phrases)])
The automaton will scan the text and return a list of matches. A match is a triplet (position, key, value). With a little work on matches, you'll be able to replace keys by values:
def process(automaton, text):
    """Returns a new text, with keys of the automaton replaced by values"""
    matches = automaton.get_keywords_found(text.lower())  # text.lower() because the automaton of py_aho_corasick stores keys in lowercase
    bk_value_eks = [(i, v, i + len(k)) for i, k, v in matches]  # (begin of key, value, end of key)
    chunks = [bk_value_ek1[1] + text[bk_value_ek1[2]:bk_value_ek2[0]]
              for bk_value_ek1, bk_value_ek2 in zip([(-1, "", 0)] + bk_value_eks, bk_value_eks + [(len(text), "", -1)])
              if bk_value_ek1[2] <= bk_value_ek2[0]]  # see below
    return "".join(chunks)
A brief explanation of the chunks comprehension above.
I zip the matches with themselves almost as usual: zip(arr, arr[1:]) outputs (arr[0], arr[1]), (arr[1], arr[2]), ... so that every match is considered together with its successor. Here I placed two sentinels to handle the beginning and the end of the matches.
- For the normal case, I just output the value (bk_value_ek1[1]) and the text between the end of the current key and the beginning of the next key (text[bk_value_ek1[2]:bk_value_ek2[0]]).
- The begin sentinel has an empty value and its key ends at position 0, thus the first chunk will be "" + text[0:begin of key1], that is, the text before the first key.
- Similarly, the end sentinel also has an empty value and its key begins at the end of the text, thus the last chunk will be: value of the last match + text[end of the last key:len(text)]. (A small standalone sketch of this sentinel/zip pattern follows this list.)
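Here is a tiny standalone illustration of that sentinel/zip pattern, using a made-up single match for text = "abcdef" (key "cd" replaced by value "X"):
text = "abcdef"
bk_value_eks = [(2, "X", 4)]                       # (begin of key, value, end of key)
pairs = zip([(-1, "", 0)] + bk_value_eks,          # begin sentinel
            bk_value_eks + [(len(text), "", -1)])  # end sentinel
chunks = [v1 + text[e1:b2] for (b1, v1, e1), (b2, v2, e2) in pairs if e1 <= b2]
print("".join(chunks))  # abXef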
What happens when keys overlap? Take an example: text="abcdef", phrases={"bcd":"1", "cde":"2"}. You have two matches: (1, "bcd", "1") and (2, "cde", "2").
Let's go: bk_value_eks = [(1, "1", 4), (2, "2", 5)]. Thus, without the if bk_value_ek1[2] <= bk_value_ek2[0] guard, the text would be rebuilt as text[:1]+"1"+text[4:2]+"2"+text[5:],
that is "a"+"1"+""+"2"+"f" = "a12f". The guard skips the overlapping pair, so only one of the two overlapping matches makes it into the output.
Now, take a look at the result:
print(process(back_automaton, clean(process(fore_automaton, original_text))))
# "'s_morgen ik 's-Hertogenbosch depository_financial_institution"
You don't have to define a new process function for the return, just give it the back_automaton and it will do the job.
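The answer recommends Rabin-Karp if you want to implement the multi-string search yourself. As a rough illustration of that idea (this is not part of any library; the function name, parameters and the verification step are my own), patterns can be grouped by length and scanned with one rolling hash per length:
from collections import defaultdict

def rabin_karp_multi(text, patterns, base=257, mod=(1 << 61) - 1):
    """Return (position, pattern) pairs for every occurrence of any pattern."""
    by_len = defaultdict(dict)                     # length -> {hash: [patterns with that hash]}
    for p in patterns:
        h = 0
        for ch in p:
            h = (h * base + ord(ch)) % mod
        by_len[len(p)].setdefault(h, []).append(p)
    matches = []
    for length, table in by_len.items():
        if not length or length > len(text):
            continue
        weight = pow(base, length - 1, mod)        # weight of the leftmost character
        h = 0
        for ch in text[:length]:                   # hash of the first window
            h = (h * base + ord(ch)) % mod
        for i in range(len(text) - length + 1):
            if h in table and text[i:i + length] in table[h]:  # verify: hashes can collide
                matches.append((i, text[i:i + length]))
            if i + length < len(text):             # roll the window one character to the right
                h = ((h - ord(text[i]) * weight) * base + ord(text[i + length])) % mod
    return sorted(matches)

print(rabin_karp_multi(original_text, phrases))
# [(11, "'s morgen"), (25, "'s-Hertogenbosch"), (49, 'depository financial institution')]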
Comments on the question:
- Everything that is not in phrases is removed, except for ik. Why is that?
- "Then there'll be some functions to manipulate the text with the placeholders." So, you have a function to work on the text after adding the placeholders. And that function must do a split on whitespace or something. So, now you have an array where you manipulate all the elements except the placeholders, then you want to join the array into a string, then substitute the placeholders back using the real words. Is that correct?
- ((?:(?!phrase1|phrase2|phrase3)[\S\s])+)|(phrase1|phrase2|phrase3), where capture group 1 is a non-phrase string part and capture group 2 is a phrase.
- r"(?<!\w){}(?!\w)".format(phrase). Since some of your keywords start with non-word chars, you cannot use \b. Could you please provide some more logic that you need to implement? It looks like you might need to pass a callback/lambda as the second argument to re.sub to replace each match just once.
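A small sketch of the lookaround idea from the last comment, assuming the phrases from the question: (?<!\w) and (?!\w) forbid a word character directly before/after the match, which works even though the phrases start with an apostrophe, and re.escape keeps any punctuation inside the phrase literal.
import re

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

for i, phrase in enumerate(phrases):
    text = re.sub(r"(?<!\w){}(?!\w)".format(re.escape(phrase)), "MWEPHRASE{}".format(i), text)

print(text)
# Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen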