I want to insert a small str into a larger one in the most Pythonic way. Maybe I missed a useful str method or function? The function that inserts the small str into the larger one is insert_tag.
I tried something with reduce, but it didn't work and was too complex.
Code:
from random import choice, randrange
from typing import List


def insert_tag(text: str, offsets: List[int], tags: List[str]) -> str:
    text: str = text
    for offset, tag in zip(offsets, tags):
        # text = f'{text[:offset]}{tag}{text[offset:]}' # solution 1
        text = text[:offset] + tag + text[offset:] # solution 2
    return text


if __name__ == '__main__':
    tag_nbr = 30

    # Text example
    base_text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do ' \
                'eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut ' \
                'enim ad minim veniam, quis nostrud exercitation ullamco laboris ' \
                'nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor ' \
                'in reprehenderit in voluptate velit esse cillum dolore eu fugiat ' \
                'nulla pariatur. Excepteur sint occaecat cupidatat non proident, ' \
                'sunt in culpa qui officia deserunt mollit anim id est laborum.' \
        .replace(',', '').replace('.', '')

    # Random list of tag only for testing purpose
    tag_li: List[str] = [f'<{choice(base_text.split())}>' for _ in range(tag_nbr)]

    # Ordered list of random offset only for testing purpose
    offset_li: List[int] = [randrange(len(base_text)) for _ in range(tag_nbr)]
    offset_li.sort(reverse=True)

    # Debug
    print(tag_li, offset_li, sep='\n')

    # Review
    new_text: str = insert_tag(base_text, offset_li, tag_li)
    # End Review

    # Debug
    print(new_text)
Output:
"Lorem ips<consectetur>u<Lorem>m dolo<Lorem>r sit amet consecte<Lorem>tur <nostrud>adipiscing eli<aute>t sed d<anim>o eiusmod tempo<voluptate>r <exercitation>incid<anim>i<non>dunt ut l<quis>abor<do>e et dol<reprehenderit>ore magna aliqua Ut enim ad minim veniam quis nostrud exercitatio<est>n ullamco laboris nisi ut aliquip ex<nulla> ea <consectetur>commodo consequat Duis aute irure dolor in repreh<deserunt>enderit in voluptate vel<sint>it esse cillum dolore eu f<qui>ugiat nulla pariatur <magna>Excepteur sin<ex>t occaecat cup<labore>i<ut>datat non proident sun<est>t<in> in cul<fugiat>pa qui officia<enim> deserunt mollit anim i<in><anim>d est laborum"
PS: The rest of the code is not my primary concern, but if it can be improved, tell me, thanks.
3 Answers
Besides your approach, there are a few other alternatives to solve this problem: (1) collect segments of the input text (sliced using the values in offsets) and the tags into a list, then call ''.join(<list>) once at the end to generate the output; (2) split text into segments, create a format string using '{}'.join(segments), then apply the format string to tags to generate the output.
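For concreteness, here is a minimal sketch of both alternatives; the function names are mine and both assume offsets is sorted in ascending order, which is not how the question's test data is ordered:

from typing import List

def insert_tags_join(text: str, offsets: List[int], tags: List[str]) -> str:
    # Approach (1): collect the text segments and the tags into a list,
    # then join everything once at the end.
    parts, prev = [], 0
    for offset, tag in zip(offsets, tags):
        parts.append(text[prev:offset])
        parts.append(tag)
        prev = offset
    parts.append(text[prev:])
    return ''.join(parts)

def insert_tags_format(text: str, offsets: List[int], tags: List[str]) -> str:
    # Approach (2): split the text into segments, build a format string with
    # '{}' placeholders between them, then fill the placeholders with the tags.
    # (Note: this breaks if the text itself contains '{' or '}' characters.)
    segments = [text[start:stop]
                for start, stop in zip([0] + offsets, offsets + [len(text)])]
    return '{}'.join(segments).format(*tags)

With ascending offsets, both sketches return 'axbyc' for ('abc', [1, 2], ['x', 'y']).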
Be aware that your approach consumes more memory and has a theoretical time complexity of \$O(n^2)\$ to join \$n\$ strings, because memory reallocation is needed for each intermediate concatenated string. It could be more efficient in some cases for small \$n\$, but will be less efficient as \$n\$ grows. Therefore, this is not recommended by PEP 8:
- Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).
For example, do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn't present at all in implementations that don't use refcounting. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.
EDIT: Here is my version of the performance measurements of various methods, using the input format mentioned in the comments under this answer, where offsets and tags are 'merged' into a single list of tuples. It can be seen that CPython's optimization of the + concatenation operation does take effect: the performance of the best concatenation-based method is on par with several str.join-based methods when the text length (which equals the tag count in these experiments) is less than 600, but it eventually loses out due to its quadratic complexity. However, as PEP 8 suggests, similar performance cannot be expected from other Python interpreters.
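For clarity, this is roughly what that merged input looks like (the exact layout used in the benchmark code is not shown in this post, so treat the names below as illustrative):

from typing import List, Tuple

offsets = [2, 7, 11]
tags = ['<a>', '<b>', '<c>']

# One (offset, tag) tuple per insertion.
merged: List[Tuple[int, str]] = list(zip(offsets, tags))
# merged == [(2, '<a>'), (7, '<b>'), (11, '<c>')]

# Splitting it back into separate sequences, e.g. to reuse a function that
# takes offsets and tags as separate arguments:
offsets_again, tags_again = map(list, zip(*merged))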
Performance measurements (mean execution time in us; text length equals the tag count):

Method                      150      300      450      600      750      900
tag_insert_no_op          19.02    35.58    52.44    70.42    87.27   104.33
tag_insert_concat1        52.14   109.22   171.93   238.59   302.76   369.56
tag_insert_concat2        66.05   146.94   247.86   359.57   475.49   590.77
tag_insert_concat3        63.04   139.67   238.23   342.51   453.53   599.11
tag_insert_join1          65.18   127.26   194.70   256.11   311.22   371.54
tag_insert_join1_local    51.53    99.95   147.47   197.71   245.54   290.79
tag_insert_join1_extend   50.36    95.91   143.63   192.76   238.07   281.91
tag_insert_join1_yield    55.01   103.39   155.92   210.74   256.53   306.94
tag_insert_join1_yield2   52.82   101.89   155.55   208.33   255.97   305.05
tag_insert_join2          61.59   118.39   175.99   234.68   287.70   341.65
tag_insert_join3          60.82   120.61   177.78   238.28   286.85   343.49
tag_insert_join3_iter     59.65   117.34   172.51   235.55   282.96   339.15
tag_insert_format         64.98   124.01   184.88   245.21   302.68   362.70
tag_insert_string_io      54.08   108.49   161.08   214.71   264.93   315.87
- The offsets are still valid since I do offset_li.sort(reverse=True). (Dorian Turba, Sep 10, 2019 at 14:37)
- For some measurements of time efficiency (memory efficiency ignored), refer to codereview.stackexchange.com/a/228822/25834 (Reinderien, Sep 11, 2019 at 2:27)
- @DorianTurba I have added my implementation of several methods with performance measurements. (GZ0, Sep 11, 2019 at 10:01)
- I accept your answer because the solutions you provide achieve the memory improvement from Scnerd's answer and are more understandable. Since "Pythonic" is my primary concern, your answer suits the question better. (Dorian Turba, Sep 11, 2019 at 12:22)
The core code (I'd prefer the f-string if you can guarantee Python 3.6+, else the addition or "{}{}{}".format(...) if you need to support older Python versions) looks fine. However, you overwrite text on each loop iteration but continue to use the originally provided offsets, which could be very confusing since the offsets become incorrect after the first tag insertion. The code doesn't crash, but I'd argue it doesn't "work", certainly not as I'd intuitively expect:
In [1]: insert_tag('abc', [1, 2], ['x', 'y'])
Out[1]: 'axyyxbc'
I'd expect this to produce 'axbyc'. If that output looks correct to you, ignore the rest of this answer.
It would be better to break the original string apart first, then perform the insertions. As is typical in Python, we want to avoid manual looping if possible, and as is typical for string generation, we want to create as few new strings as possible. I'd try something like the following:
In [2]: def insert_tags(text: str, offsets: [int], tags: [str]):
   ...:     offsets = [0] + list(offsets)
   ...:     chunks = [(text[prev_offset:cur_offset], tag)
   ...:               for prev_offset, cur_offset, tag in zip(offsets[:-1], offsets[1:], tags)]
   ...:     chunks.append([text[offsets[-1]:]])
   ...:     return ''.join(piece for chunk_texts in chunks for piece in chunk_texts)
In [3]: insert_tags('abc', [1, 2], ['x', 'y'])
Out[3]: 'axbyc'
Other stylistic considerations:
- Your function supports the insertion of multiple tags, so it should probably be called "insert_tags" (plural) for clarity.
- Type annotations are generally good (though recall that they lock you in to newer versions of Python 3.x), but getting them right can be a nuisance. E.g., in this case I'd suggest using Sequence instead of List, since you really just need to be able to loop over the offsets and tags (e.g., they could also be tuples, sets, or even dicts with interesting keys). I also find them to often be more clutter than useful, so consider whether adding type annotations really makes this code clearer, or if something simpler, like a succinct docstring, might be more helpful. (A sketch of the Sequence-based signature follows below.)
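As a rough sketch of that suggestion (signature only; the body is elided, and this exact code does not appear in the answer):

from typing import Sequence

def insert_tags(text: str, offsets: Sequence[int], tags: Sequence[str]) -> str:
    """Insert each tag into `text` at its matching offset (offsets ascending).

    Sequence accepts lists, tuples and other sequence types, so callers are
    not forced to convert their data to a List just to satisfy the annotation.
    """
    ...  # body unchanged from whichever implementation is chosen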
- The offsets are still valid since I do offset_li.sort(reverse=True); I updated the output to show it. (Dorian Turba, Sep 10, 2019 at 14:38)
- That should either be built into the function (something like offsets, tags = zip(*sorted(zip(offsets, tags), reverse=True))) or made crystal clear in the function's documentation. The function shouldn't fail horribly like I show, because the user expects the function to behave the same regardless of the order of the offsets. (scnerd, Sep 10, 2019 at 14:41)
- To answer the question in your answer: with insert_tag('abc', [1, 2], ['x', 'y']) you will get 'axybc'. (Dorian Turba, Sep 10, 2019 at 14:45)
- The fact is that in the real program, tag and offset are linked together because they are in the same object, so the def of insert_tag will look like this: def insert_tag(tags: List[Tuple[int, str]]). (Dorian Turba, Sep 10, 2019 at 14:48)
- That updated type annotation is still overly specific. You don't need the tags to be a list, you just need to be able to iterate over them. As I mentioned, making good type annotations can be helpful, but more difficult than it seems like it should be. If you require a List, everyone using your code will have IDE/lint warnings thrown at them unless they cast their objects to lists, which shouldn't be necessary and may, in some cases, be very badly advised. (scnerd, Sep 10, 2019 at 15:00)
Ignoring for now the various gotchas that @scnerd describes about offsets, and making some broad assumptions about the monotonicity and order of the input, let's look at the performance of various implementations. Happily, your implementations are close to the most efficient given these options; only the shown "naive" method is faster on my machine:
#!/usr/bin/env python3
from io import StringIO
from timeit import timeit


# Assume that 'offsets' decreases monotonically
def soln1(text: str, offsets: tuple, tags: tuple) -> str:
    for offset, tag in zip(offsets, tags):
        text = f'{text[:offset]}{tag}{text[offset:]}'
    return text


def soln2(text: str, offsets: tuple, tags: tuple) -> str:
    for offset, tag in zip(offsets, tags):
        text = text[:offset] + tag + text[offset:]
    return text


def gen_join(text: str, offsets: tuple, tags: tuple) -> str:
    offsets += (0,)
    return ''.join(
        text[offsets[i+1]:offsets[i]] + tag
        for i, tag in reversed(tuple(enumerate(tags)))
    ) + text[offsets[0]:]


def naive(text: str, offsets: tuple, tags: tuple) -> str:
    output = text[:offsets[-1]]
    for i in range(len(tags)-1, -1, -1):
        output += tags[i] + text[offsets[i]:offsets[i-1]]
    return output + text[offsets[0]:]


def strio(text: str, offsets: tuple, tags: tuple) -> str:
    output = StringIO()
    output.write(text[:offsets[-1]])
    for i in range(len(tags)-1, -1, -1):
        output.write(tags[i])
        output.write(text[offsets[i]:offsets[i-1]])
    output.write(text[offsets[0]:])
    return output.getvalue()


def ranges(text: str, offsets: tuple, tags: tuple) -> str:
    final_len = len(text) + sum(len(t) for t in tags)
    output = [None]*final_len
    offsets += (0,)
    begin_text = 0
    for i in range(len(tags)-1, -1, -1):
        o1, o2 = offsets[i+1], offsets[i]
        end_text = begin_text + o2 - o1
        output[begin_text: end_text] = text[o1: o2]
        tag = tags[i]
        end_tag = end_text + len(tag)
        output[end_text: end_tag] = tag
        begin_text = end_tag
    output[begin_text:] = text[offsets[0]:]
    return ''.join(output)


def insertion(text: str, offsets: tuple, tags: tuple) -> str:
    output = []
    offsets = (1+len(tags),) + offsets
    for i in range(len(tags)):
        output.insert(0, text[offsets[i+1]: offsets[i]])
        output.insert(0, tags[i])
    output.insert(0, text[:offsets[-1]])
    return ''.join(output)


def test(fun):
    actual = fun('abcde', (4, 3, 2, 1), ('D', 'C', 'B', 'A'))
    expected = 'aAbBcCdDe'
    assert expected == actual


def main():
    funs = (soln1, soln2, gen_join, naive, strio, ranges, insertion)
    for fun in funs:
        test(fun)

    N = 5_000  # number of averaging repetitions
    L = 150    # input length
    text = 'abcde' * (L//5)
    offsets = tuple(range(L-1, 0, -1))
    tags = text[-1:0:-1].upper()

    print(f'{"Name":10} {"time/call, us":15}')
    for fun in funs:
        def call():
            return fun(text, offsets, tags)
        dur = timeit(stmt=call, number=N)
        print(f'{fun.__name__:10} {dur/N*1e6:.1f}')


main()
This yields:
Name time/call, us
soln1 134.2
soln2 133.2
gen_join 289.7
naive 116.8
strio 159.6
ranges 513.9
insertion 204.7
- One problem with the tests is that they assume offsets is in descending order, which adds unnecessary overhead to all the str.join implementations here. (GZ0, Sep 11, 2019 at 3:25)
- @GZ0 The alternatives (probably adding a sort() somewhere, to be more robust) are probably going to incur more cost than assuming descending order. Iterating in reverse order is efficient, especially with indexes. Only one of the methods calls reversed, and it's difficult to say whether that's the limiting factor in the speed of that method. (Reinderien, Sep 11, 2019 at 3:28)
- Meanwhile, the slowness of insertion is due to all the insertions happening at the beginning of the list. The implementation would not be so awkward if not for the descending order. (GZ0, Sep 11, 2019 at 3:50)
- I have added my version of the performance tests to my post. I tried my best to optimize each method to achieve its best performance. For some methods, several variations are implemented for comparison. (GZ0, Sep 11, 2019 at 10:15)
- @GZ0 I'm glad that you did some performance analysis and updated your answer. Note that the tuple call is not redundant because reversed cannot accept an iterator. (Reinderien, Sep 11, 2019 at 12:33)
Do not add an improved version of the code after receiving an answer. Including revised versions of the code makes the question confusing, especially if someone later reviews the newer code.
I didn't see that, thank you.