I want to insert a small str into a larger one in the most Pythonic way. Maybe I missed a useful str method or function? The function that inserts the small str into the larger one is insert_tag.
I tried something with reduce, but it didn't work and was too complex.
Code:
from random import choice, randrange
from typing import List


def insert_tag(text: str, offsets: List[int], tags: List[str]) -> str:
    text: str = text
    for offset, tag in zip(offsets, tags):
        # text = f'{text[:offset]}{tag}{text[offset:]}' # solution 1
        text = text[:offset] + tag + text[offset:] # solution 2
    return text


if __name__ == '__main__':
    tag_nbr = 30

    # Text example
    base_text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do ' \
                'eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut ' \
                'enim ad minim veniam, quis nostrud exercitation ullamco laboris ' \
                'nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor ' \
                'in reprehenderit in voluptate velit esse cillum dolore eu fugiat ' \
                'nulla pariatur. Excepteur sint occaecat cupidatat non proident, ' \
                'sunt in culpa qui officia deserunt mollit anim id est laborum.' \
        .replace(',', '').replace('.', '')

    # Random list of tag only for testing purpose
    tag_li: List[str] = [f'<{choice(base_text.split())}>' for _ in range(tag_nbr)]

    # Ordered list of random offset only for testing purpose
    offset_li: List[int] = [randrange(len(base_text)) for _ in range(tag_nbr)]
    offset_li.sort(reverse=True)

    # Debug
    print(tag_li, offset_li, sep='\n')

    # Review
    new_text: str = insert_tag(base_text, offset_li, tag_li)
    # End Review

    # Debug
    print(new_text)
Output:
"Lorem ips<consectetur>u<Lorem>m dolo<Lorem>r sit amet consecte<Lorem>tur <nostrud>adipiscing eli<aute>t sed d<anim>o eiusmod tempo<voluptate>r <exercitation>incid<anim>i<non>dunt ut l<quis>abor<do>e et dol<reprehenderit>ore magna aliqua Ut enim ad minim veniam quis nostrud exercitatio<est>n ullamco laboris nisi ut aliquip ex<nulla> ea <consectetur>commodo consequat Duis aute irure dolor in repreh<deserunt>enderit in voluptate vel<sint>it esse cillum dolore eu f<qui>ugiat nulla pariatur <magna>Excepteur sin<ex>t occaecat cup<labore>i<ut>datat non proident sun<est>t<in> in cul<fugiat>pa qui officia<enim> deserunt mollit anim i<in><anim>d est laborum"
PS: The rest of the code is not my primary concern, but if it can be improved, tell me, thanks.
3 Answers
Besides your approach, there are a few other alternatives to solve this problem: (1) collect segments of the input text (sliced using the values in offsets) and the tags into a list, then call ''.join(<list>) once at the end to generate the output; (2) split text into segments, create a format string using '{}'.join(segments), then apply the format string to tags to generate the output.
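For concreteness, here is a minimal sketch of both alternatives; the function names are mine and both assume offsets is sorted in ascending order, which is not how the question's test data is ordered:

from typing import List

def insert_tags_join(text: str, offsets: List[int], tags: List[str]) -> str:
    # Approach (1): collect the text segments and the tags into a list,
    # then join everything once at the end.
    parts, prev = [], 0
    for offset, tag in zip(offsets, tags):
        parts.append(text[prev:offset])
        parts.append(tag)
        prev = offset
    parts.append(text[prev:])
    return ''.join(parts)

def insert_tags_format(text: str, offsets: List[int], tags: List[str]) -> str:
    # Approach (2): split the text into segments, build a format string with
    # '{}' placeholders between them, then fill the placeholders with the tags.
    # (Note: this breaks if the text itself contains '{' or '}' characters.)
    segments = [text[start:stop]
                for start, stop in zip([0] + offsets, offsets + [len(text)])]
    return '{}'.join(segments).format(*tags)

With ascending offsets, both sketches return 'axbyc' for ('abc', [1, 2], ['x', 'y']).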
Be aware that your approach consumes more memory and has a theoretical time complexity of \$O(n^2)\$ to join \$n\$ strings, because memory reallocation is needed for each intermediate concatenated string. It could be more efficient in some cases for small \$n\$, but will be less efficient as \$n\$ grows. Therefore, this is not recommended by PEP 8:
- Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).
For example, do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn't present at all in implementations that don't use refcounting. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.
EDIT: Here is my version of the performance measurements of various methods, using the input format mentioned in the comments under this answer, where offsets and tags are 'merged' into a single list of tuples. It can be seen that CPython's optimization of the + concatenation operation does take effect: the performance of the best concatenation-based method is on par with several str.join-based methods when the text length (which equals the tag count in these experiments) is less than 600, but it eventually loses out due to its quadratic complexity. However, as PEP 8 suggests, similar performance cannot be expected from other Python interpreters.
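For clarity, this is roughly what that merged input looks like (the exact layout used in the benchmark code is not shown in this post, so treat the names below as illustrative):

from typing import List, Tuple

offsets = [2, 7, 11]
tags = ['<a>', '<b>', '<c>']

# One (offset, tag) tuple per insertion.
merged: List[Tuple[int, str]] = list(zip(offsets, tags))
# merged == [(2, '<a>'), (7, '<b>'), (11, '<c>')]

# Splitting it back into separate sequences, e.g. to reuse a function that
# takes offsets and tags as separate arguments:
offsets_again, tags_again = map(list, zip(*merged))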
Performance measurements (mean execution time in us; text length equals the tag count):

Method                      150      300      450      600      750      900
tag_insert_no_op          19.02    35.58    52.44    70.42    87.27   104.33
tag_insert_concat1        52.14   109.22   171.93   238.59   302.76   369.56
tag_insert_concat2        66.05   146.94   247.86   359.57   475.49   590.77
tag_insert_concat3        63.04   139.67   238.23   342.51   453.53   599.11
tag_insert_join1          65.18   127.26   194.70   256.11   311.22   371.54
tag_insert_join1_local    51.53    99.95   147.47   197.71   245.54   290.79
tag_insert_join1_extend   50.36    95.91   143.63   192.76   238.07   281.91
tag_insert_join1_yield    55.01   103.39   155.92   210.74   256.53   306.94
tag_insert_join1_yield2   52.82   101.89   155.55   208.33   255.97   305.05
tag_insert_join2          61.59   118.39   175.99   234.68   287.70   341.65
tag_insert_join3          60.82   120.61   177.78   238.28   286.85   343.49
tag_insert_join3_iter     59.65   117.34   172.51   235.55   282.96   339.15
tag_insert_format         64.98   124.01   184.88   245.21   302.68   362.70
tag_insert_string_io      54.08   108.49   161.08   214.71   264.93   315.87
- The offsets are still valid since I do offset_li.sort(reverse=True). (Dorian Turba, Sep 10, 2019 at 14:37)
- For some measurements of time efficiency (memory efficiency ignored), refer to codereview.stackexchange.com/a/228822/25834 (Reinderien, Sep 11, 2019 at 2:27)
- @DorianTurba I have added my implementation of several methods with performance measurements. (GZ0, Sep 11, 2019 at 10:01)
- I accept your answer because the solutions you provide achieve the memory improvement from Scnerd's answer and are more understandable. Since "Pythonic" is my primary concern, your answer suits the question better. (Dorian Turba, Sep 11, 2019 at 12:22)
The core code (I'd prefer the f-string if you can guarantee Python 3.6+, else the addition or "{}{}{}".format(...) if you need to support older Python versions) looks fine. However, you overwrite text on each loop iteration but continue to use the originally provided offsets, which could be very confusing since the offsets become incorrect after the first tag insertion. The code doesn't crash, but I'd argue it doesn't "work", certainly not as I'd intuitively expect:
In [1]: insert_tag('abc', [1, 2], ['x', 'y'])
Out[1]: 'axyyxbc'
I'd expect this to produce 'axbyc'. If that output looks correct to you, ignore the rest of this answer.
It would be better to break the original string apart first, then perform the insertions. As is typical in Python, we want to avoid manual looping if possible, and as is typical for string generation, we want to create as few new strings as possible. I'd try something like the following:
In [2]: def insert_tags(text: str, offsets: [int], tags: [str]):
   ...:     offsets = [0] + list(offsets)
   ...:     chunks = [(text[prev_offset:cur_offset], tag)
   ...:               for prev_offset, cur_offset, tag in zip(offsets[:-1], offsets[1:], tags)]
   ...:     chunks.append([text[offsets[-1]:]])
   ...:     return ''.join(piece for chunk_texts in chunks for piece in chunk_texts)
In [3]: insert_tags('abc', [1, 2], ['x', 'y'])
Out[3]: 'axbyc'
Other stylistic considerations:
- Your function supports the insertion of multiple tags, so it should probably be called "insert_tags" (plural) for clarity.
- Type annotations are generally good (though recall that they lock you in to newer versions of Python 3.x), but getting them right can be a nuisance. E.g., in this case I'd suggest using Sequence instead of List, since you really just need to be able to loop over the offsets and tags (e.g., they could also be tuples, sets, or even dicts with interesting keys). I also find them to often be more clutter than useful, so consider whether adding type annotations really makes this code clearer, or if something simpler, like a succinct docstring, might be more helpful. (A sketch of the Sequence-based signature follows below.)
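As a rough sketch of that suggestion (signature only; the body is elided, and this exact code does not appear in the answer):

from typing import Sequence

def insert_tags(text: str, offsets: Sequence[int], tags: Sequence[str]) -> str:
    """Insert each tag into `text` at its matching offset (offsets ascending).

    Sequence accepts lists, tuples and other sequence types, so callers are
    not forced to convert their data to a List just to satisfy the annotation.
    """
    ...  # body unchanged from whichever implementation is chosen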
- The offsets are still valid since I do offset_li.sort(reverse=True); I updated the output to show it. (Dorian Turba, Sep 10, 2019 at 14:38)
- That should either be built into the function (something like offsets, tags = zip(*sorted(zip(offsets, tags), reverse=True))) or made crystal clear in the function's documentation. The function shouldn't fail horribly like I show, because the user expects the function to behave the same regardless of the order of the offsets. (scnerd, Sep 10, 2019 at 14:41)
- To answer the question in your answer: with insert_tag('abc', [1, 2], ['x', 'y']) you will get 'axybc'. (Dorian Turba, Sep 10, 2019 at 14:45)
- The fact is that in the real program, tag and offset are linked together because they are in the same object, so the def of insert_tag will look like this: def insert_tag(tags: List[Tuple[int, str]]). (Dorian Turba, Sep 10, 2019 at 14:48)
- That updated type annotation is still overly specific. You don't need the tags to be a list, you just need to be able to iterate over them. As I mentioned, making good type annotations can be helpful, but more difficult than it seems like it should be. If you require a List, everyone using your code will have IDE/lint warnings thrown at them unless they cast their objects to lists, which shouldn't be necessary and may, in some cases, be very badly advised. (scnerd, Sep 10, 2019 at 15:00)
Ignoring for now the various gotchas that @scnerd describes about offsets, and making some broad assumptions about the monotonicity and order of the input, let's look at the performance of various implementations. Happily, your implementations are close to the most efficient given these options; only the shown "naive" method is faster on my machine:
#!/usr/bin/env python3
from io import StringIO
from timeit import timeit


# Assume that 'offsets' decreases monotonically
def soln1(text: str, offsets: tuple, tags: tuple) -> str:
    for offset, tag in zip(offsets, tags):
        text = f'{text[:offset]}{tag}{text[offset:]}'
    return text


def soln2(text: str, offsets: tuple, tags: tuple) -> str:
    for offset, tag in zip(offsets, tags):
        text = text[:offset] + tag + text[offset:]
    return text


def gen_join(text: str, offsets: tuple, tags: tuple) -> str:
    offsets += (0,)
    return ''.join(
        text[offsets[i+1]:offsets[i]] + tag
        for i, tag in reversed(tuple(enumerate(tags)))
    ) + text[offsets[0]:]


def naive(text: str, offsets: tuple, tags: tuple) -> str:
    output = text[:offsets[-1]]
    for i in range(len(tags)-1, -1, -1):
        output += tags[i] + text[offsets[i]:offsets[i-1]]
    return output + text[offsets[0]:]


def strio(text: str, offsets: tuple, tags: tuple) -> str:
    output = StringIO()
    output.write(text[:offsets[-1]])
    for i in range(len(tags)-1, -1, -1):
        output.write(tags[i])
        output.write(text[offsets[i]:offsets[i-1]])
    output.write(text[offsets[0]:])
    return output.getvalue()


def ranges(text: str, offsets: tuple, tags: tuple) -> str:
    final_len = len(text) + sum(len(t) for t in tags)
    output = [None]*final_len
    offsets += (0,)
    begin_text = 0
    for i in range(len(tags)-1, -1, -1):
        o1, o2 = offsets[i+1], offsets[i]
        end_text = begin_text + o2 - o1
        output[begin_text: end_text] = text[o1: o2]
        tag = tags[i]
        end_tag = end_text + len(tag)
        output[end_text: end_tag] = tag
        begin_text = end_tag
    output[begin_text:] = text[offsets[0]:]
    return ''.join(output)


def insertion(text: str, offsets: tuple, tags: tuple) -> str:
    output = []
    offsets = (1+len(tags),) + offsets
    for i in range(len(tags)):
        output.insert(0, text[offsets[i+1]: offsets[i]])
        output.insert(0, tags[i])
    output.insert(0, text[:offsets[-1]])
    return ''.join(output)


def test(fun):
    actual = fun('abcde', (4, 3, 2, 1), ('D', 'C', 'B', 'A'))
    expected = 'aAbBcCdDe'
    assert expected == actual


def main():
    funs = (soln1, soln2, gen_join, naive, strio, ranges, insertion)
    for fun in funs:
        test(fun)

    N = 5_000  # number of averaging repetitions
    L = 150    # input length
    text = 'abcde' * (L//5)
    offsets = tuple(range(L-1, 0, -1))
    tags = text[-1:0:-1].upper()

    print(f'{"Name":10} {"time/call, us":15}')
    for fun in funs:
        def call():
            return fun(text, offsets, tags)
        dur = timeit(stmt=call, number=N)
        print(f'{fun.__name__:10} {dur/N*1e6:.1f}')


main()
This yields:
Name time/call, us
soln1 134.2
soln2 133.2
gen_join 289.7
naive 116.8
strio 159.6
ranges 513.9
insertion 204.7
- One problem with the tests is that they assume offsets is in descending order, which adds unnecessary overhead to all the str.join implementations here. (GZ0, Sep 11, 2019 at 3:25)
- @GZ0 The alternatives (probably adding a sort() somewhere, to be more robust) are probably going to incur more cost than assuming descending order. Iterating in reverse order is efficient, especially with indexes. Only one of the methods calls reversed, and it's difficult to say whether that's the limiting factor in the speed of that method. (Reinderien, Sep 11, 2019 at 3:28)
- Meanwhile, the slowness of insertion is due to all the insertions happening at the beginning of the list. The implementation would not be so awkward if not for the descending order. (GZ0, Sep 11, 2019 at 3:50)
- I have added my version of the performance tests to my post. I tried my best to optimize each method to achieve its best performance. For some methods, several variations are implemented for comparison. (GZ0, Sep 11, 2019 at 10:15)
- @GZ0 I'm glad that you did some performance analysis and updated your answer. Note that the tuple call is not redundant because reversed cannot accept an iterator. (Reinderien, Sep 11, 2019 at 12:33)
Do not add an improved version of the code after receiving an answer. Including revised versions of the code makes the question confusing, especially if someone later reviews the newer code.
I didn't see that, thank you.