Python: replace string, matched from a list

Question 1

Trying to match and mark character based n-grams. The string

txt = "how does this work"

is to be matched with n-grams from the list

ngrams = ["ow ", "his", "s w"]

and marked with <> – however, only if there is no preceding opened quote. The output i am seeking for this string is h<ow >does t<his w>ork (notice the double match in the 2^-nd part, but within just 1 pair of expected quotes).

The for loop i’ve tried for this doesn’t, however, produce the wanted output at all:

switch = False
for i in txt:
 if i in "".join(ngrams) and switch == False:
 txt = txt.replace(i, "<" + i)
 switch = True
 if i not in "".join(ngrams) and switch == True:
 txt = txt.replace(i, ">" + i)
 switch = False
print(txt)

Any help would be greatly appreciated.

Question 2

What does it produce instead ?

Question 3

What should happen if you have ngrams = ['his', 's wo', 'wor']. Do you expect to get how does t<his wor>k?

Question 4

How to handle multiple matches? E.g. ngrams = ['ab'], txt = 'abominable abs'. Do you expect <ab>ominable abs, <ab>ominable <ab>s or abominable <ab>s?

Question 5

How to handle ngram element that is not present in txt?

Question 6

@ Mad Physicist, 1) aye – 2) # II – 3) in no way; the list is actually a filter for the string, changing only string parts present in both

Question 7

This solution uses the str.find method to find all copies of an ngram within the txt string, saving the indices of each copy to the indices set so we can easily handle overlapping matches.

We then copy txt, char by char to the result list, inserting angle brackets where required. This strategy is more efficient than inserting the angle brackets using multiple .replace call because each .replace call needs to rebuild the whole string.

I've extended your data slightly to illustrate that my code handles multiple copies of an ngram.

txt = "how does this work now chisolm"
ngrams = ["ow ", "his", "s w"]
print(txt)
print(ngrams)
# Search for all copies of each ngram in txt
# saving the indices where the ngrams occur
indices = set()
for s in ngrams:
 slen = len(s)
 lo = 0
 while True:
 i = txt.find(s, lo)
 if i == -1:
 break
 lo = i + slen 
 print(s, i)
 indices.update(range(i, lo-1))
print(indices)
# Copy the txt to result, inserting angle brackets
# to show matches
switch = True
result = []
for i, u in enumerate(txt):
 if switch:
 if i in indices:
 result.append('<')
 switch = False
 result.append(u)
 else:
 result.append(u)
 if i not in indices:
 result.append('>')
 switch = True
print(''.join(result))

output

how does this work now chisolm
['ow ', 'his', 's w']
ow 1
ow 20
his 10
his 24
s w 12
{1, 2, 10, 11, 12, 13, 20, 21, 24, 25}
h<ow >does t<his w>ork n<ow >c<his>olm

If you want adjacent groups to be merged, we can easily do that using the str.replace method. But to make that work properly we need to pre-process the original data, converting all runs of whitespace to single spaces. A simple way to do that is to split the data and re-join it.

txt = "how does this\nwork now chisolm hisow"
ngrams = ["ow", "his", "work"]
#Convert all whitespace to single spaces
txt = ' '.join(txt.split())
print(txt)
print(ngrams)
# Search for all copies of each ngram in txt
# saving the indices where the ngrams occur
indices = set()
for s in ngrams:
 slen = len(s)
 lo = 0
 while True:
 i = txt.find(s, lo)
 if i == -1:
 break
 lo = i + slen 
 print(s, i)
 indices.update(range(i, lo-1))
print(indices)
# Copy the txt to result, inserting angle brackets
# to show matches
switch = True
result = []
for i, u in enumerate(txt):
 if switch:
 if i in indices:
 result.append('<')
 switch = False
 result.append(u)
 else:
 result.append(u)
 if i not in indices:
 result.append('>')
 switch = True
# Convert the list to a single string
output = ''.join(result)
# Merge adjacent groups
output = output.replace('> <', ' ').replace('><', '')
print(output)

output

how does this work now chisolm hisow
['ow', 'his', 'work']
ow 1
ow 20
ow 34
his 10
his 24
his 31
work 14
{32, 1, 34, 10, 11, 14, 15, 16, 20, 24, 25, 31}
h<ow> does t<his work> n<ow> c<his>olm <hisow>

Question 8

Thank you for this solution. If ngrams don’t have spaces, for instance ngrams = ["ow", "his", "work"], currently this produces h<ow> does t<his> <work> n<ow> c<his>olm. Could this be modified to ignore spaces and combine enclosed words if space (or linebreak) is separating them, like h<ow> does t<his work> n<ow> c<his>olm?

Question 9

Outstanding, this post-processing approach works great. Thank you!

Question 10

This should work:

txt = "how does this work"
ngrams = ["ow ", "his", "s w"]
# first find where letters match ngrams
L = len(txt)
match = [False]*L
for ng in ngrams:
 l = len(ng)
 for i in range(L-l):
 if txt[i:i+l] == ng:
 for j in range(l):
 match[i+j] = True
# then sandwich matches with quotes 
out = []
switch = False
for i in range(L):
 if not switch and match[i]:
 out.append('<')
 switch = True
 if switch and not match[i]:
 out.append('>')
 switch = False
 out.append(txt[i])
print "".join(out)

Question 11

output is h<ow d>oes t<his wo>rk and OP wants h<ow >does t<his w>ork

Question 12

@SandeepLade ok fixed it

Question 13

Man, this looks terrifying – 3 if clauses and 4 for loops to surround my matches with quotes... And i was expecting this part of the project to be the easier one. Thank you, i will check it out tomorrow. Will try to comprehend this.

Question 14

Note that using higher level functions like replace hide away loops that are still happening under the hood... And this is just "a" solution that works, by no mean have I spent any effort to find "the best" solution...

Question 15

I guess that’s what Python usually does overall. Would you mind adding a version with replace?

Question 16

Here's a method with only one for loop. I timed it and it's about as fast as the other answers to this question. I think it's a bit more clear, although that might be because I wrote it.

I iterate over the index of the first character in the n-gram, then if it matches, I use a bunch of if-else clauses to see whether I should add a < or > in this situation. I add to the end of the string output from the original txt, so I'm not really inserting in the middle of a string.

txt = "how does this work"
ngrams = set(["ow ", "his", "s w"])
n = 3
prev = -n
output = ''
shift = 0
open = False
for i in xrange(len(txt) - n + 1):
 ngram = txt[i:i + n]
 if ngram in ngrams:
 if i - prev > n:
 if open:
 output += txt[prev:prev + n] + '>' + txt[prev + n:i] + '<'
 elif not open:
 if prev > 0:
 output += txt[prev + n:i] + '<'
 else:
 output += txt[:i] + '<'
 open = True
 else:
 output += txt[prev:i]
 prev = i
if open:
 output += txt[prev:prev + n] + '>' + txt[prev + n:]
print output

PM 2Ring 55.6k6 gold badges96 silver badges203 bronze badges · Accepted Answer · 2017-11-06 05:05:43Z

This solution uses the str.find method to find all copies of an ngram within the txt string, saving the indices of each copy to the indices set so we can easily handle overlapping matches.

We then copy txt, char by char to the result list, inserting angle brackets where required. This strategy is more efficient than inserting the angle brackets using multiple .replace call because each .replace call needs to rebuild the whole string.

I've extended your data slightly to illustrate that my code handles multiple copies of an ngram.

txt = "how does this work now chisolm"
ngrams = ["ow ", "his", "s w"]
print(txt)
print(ngrams)
# Search for all copies of each ngram in txt
# saving the indices where the ngrams occur
indices = set()
for s in ngrams:
 slen = len(s)
 lo = 0
 while True:
 i = txt.find(s, lo)
 if i == -1:
 break
 lo = i + slen 
 print(s, i)
 indices.update(range(i, lo-1))
print(indices)
# Copy the txt to result, inserting angle brackets
# to show matches
switch = True
result = []
for i, u in enumerate(txt):
 if switch:
 if i in indices:
 result.append('<')
 switch = False
 result.append(u)
 else:
 result.append(u)
 if i not in indices:
 result.append('>')
 switch = True
print(''.join(result))

output

how does this work now chisolm
['ow ', 'his', 's w']
ow 1
ow 20
his 10
his 24
s w 12
{1, 2, 10, 11, 12, 13, 20, 21, 24, 25}
h<ow >does t<his w>ork n<ow >c<his>olm

If you want adjacent groups to be merged, we can easily do that using the str.replace method. But to make that work properly we need to pre-process the original data, converting all runs of whitespace to single spaces. A simple way to do that is to split the data and re-join it.

txt = "how does this\nwork now chisolm hisow"
ngrams = ["ow", "his", "work"]
#Convert all whitespace to single spaces
txt = ' '.join(txt.split())
print(txt)
print(ngrams)
# Search for all copies of each ngram in txt
# saving the indices where the ngrams occur
indices = set()
for s in ngrams:
 slen = len(s)
 lo = 0
 while True:
 i = txt.find(s, lo)
 if i == -1:
 break
 lo = i + slen 
 print(s, i)
 indices.update(range(i, lo-1))
print(indices)
# Copy the txt to result, inserting angle brackets
# to show matches
switch = True
result = []
for i, u in enumerate(txt):
 if switch:
 if i in indices:
 result.append('<')
 switch = False
 result.append(u)
 else:
 result.append(u)
 if i not in indices:
 result.append('>')
 switch = True
# Convert the list to a single string
output = ''.join(result)
# Merge adjacent groups
output = output.replace('> <', ' ').replace('><', '')
print(output)

output

how does this work now chisolm hisow
['ow', 'his', 'work']
ow 1
ow 20
ow 34
his 10
his 24
his 31
work 14
{32, 1, 34, 10, 11, 14, 15, 16, 20, 24, 25, 31}
h<ow> does t<his work> n<ow> c<his>olm <hisow>

Thank you for this solution. If ngrams don’t have spaces, for instance ngrams = ["ow", "his", "work"], currently this produces h<ow> does t<his> <work> n<ow> c<his>olm. Could this be modified to ignore spaces and combine enclosed words if space (or linebreak) is separating them, like h<ow> does t<his work> n<ow> c<his>olm?
Outstanding, this post-processing approach works great. Thank you!

CollectivesTM on Stack Overflow

Python: replace string, matched from a list

3 Answers 3

2 Comments

6 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

CollectivesTM on Stack Overflow

3 Answers 3

2 Comments

6 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related