I'm trying to solve the USACO problem Censoring (Bronze), the first problem from the 2015 February contest. My solution passes some test cases but times out on test cases 7-15. I expected it to run with O(n) time complexity because of the `while` loop, which should work, but it didn't.
Problem:
Farmer John has purchased a subscription to Good Hooveskeeping magazine for his cows, so they have plenty of material to read while waiting around in the barn during milking sessions. Unfortunately, the latest issue contains a rather inappropriate article on how to cook the perfect steak, which FJ would rather his cows not see (clearly, the magazine is in need of better editorial oversight).
FJ has taken all of the text from the magazine to create the string S of length at most 10^6 characters. From this, he would like to remove occurrences of a substring T of length <= 100 characters to censor the inappropriate content. To do this, Farmer John finds the first occurrence of T in S and deletes it. He then repeats the process again, deleting the first occurrence of T again, continuing until there are no more occurrences of T in S. Note that the deletion of one occurrence might create a new occurrence of T that didn't exist before.
Please help FJ determine the final contents of S after censoring is complete.
INPUT FORMAT: (file censor.in) The first line will contain S. The second line will contain T. The length of T will be at most that of S, and all characters of S and T will be lower-case alphabet characters (in the range a..z).
OUTPUT FORMAT: (file censor.out) The string S after all deletions are complete. It is guaranteed that S will not become empty during the deletion process.
```python
# Censoring (Bronze)
input = open('censor.in', 'r').read().split('\n')
s = input[0]
c = input[1]
l = len(c)
while True:
    try:
        i = s.index(c)
        s = s[:i] + s[i + l:]
    except:
        break
output = open('censor.out', 'w')
output.write(s)
output.close()
```
How do I speed up the code? I tried to use Numba, but Numba doesn't work in the USACO IDE. This was supposed to solve all the test cases under the time limit, which is 5 seconds.
3 Answers
Assumption: `len(c) << len(s)`. For example, when censoring "moo".
You asked if one can speed up Python, but really this is a matter of "speeding up an algorithm". That is, you have accidentally fallen into a higher time complexity than necessary.
> I expected it to run with O(n) time complexity
Certainly the `while` loop is linear in the size of the article string `s`. But you're restarting that linear `.index()` search from the beginning each time, leading to nested-loop \$O(N \times M)\$ behavior if the censored `c` appears \$M\$ times within the article. Given \$M \ll N\$, an efficient algorithm should see article size dominate the cost, with negligible contribution from the censoring operations.
In addition to `.index()` being linear in `len(s)`, notice that producing a new immutable `str` with string catenation is also \$O(len(s))\$.
`input = open('censor.in', 'r').read().split('\n')`
A more natural method to call here would have been `.readlines()`. Rather than leaving an open file descriptor lying around until end of scope, prefer a `with` context manager so the file is closed when you're done with it. Shadowing the builtin `input()` is definitely not best practice, either.
`s = input[0]`
`c = input[1]`
Usually we give `0` and `1` a pass when it comes to calling out magic numbers. But here they obscure the author's intent; there's no need for such cryptic indexes. Prefer a tuple unpack:

`s, c = input`
Also, even though these are local variables in a very short piece of code, please consider using more informative names such as `article` and `censored`.
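Putting those input-handling points together, a minimal sketch might look like this (using `.splitlines()` is an assumption that also sidesteps any trailing newline in the file):

```python
# read both lines with a context manager; no builtin shadowing,
# no magic indexes, and informative names
with open('censor.in', 'r') as infile:
    article, censored = infile.read().splitlines()
```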
`except:`

No. Do not use a "bare" `except`, as it will sometimes swallow exceptions such as `KeyboardInterrupt` which we really want to be propagated. Prefer to habitually write `except Exception:`. But here we know perfectly well what we're hoping to trap, so spell it out for the gentle reader:

`except ValueError:  # substring not found`
`i = s.index(c)`

Suppose we have a million-character article, and we've already deleted a thousand instances of "moo". This `.index()` call is super expensive: it's trolling through roughly a megabyte of article text that we've already repeatedly scanned before. We know there's a giant prefix which is entirely "moo"-free. `.index()` offers an optional `start` parameter. Use it. You'll have to maintain another temp variable. Since deleting "moo" from "momooo" can expose another instance, you should conservatively resume a little before the deletion point, with `j = s.index(c, j) - len(c)`. This will still yield \$O(len(s))\$ complexity, given our assumption.
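A minimal sketch of that loop, assuming the `article`/`censored` names suggested above (the `max(0, ...)` clamp is an added guard for matches near the start of the article):

```python
search_from = 0
while True:
    try:
        hit = article.index(censored, search_from)
    except ValueError:  # substring not found
        break
    article = article[:hit] + article[hit + len(censored):]
    # a deletion can expose a new occurrence starting slightly earlier,
    # so back up conservatively before resuming the search
    search_from = max(0, hit - len(censored))
```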
`s = s[:i] + s[i + l:]`

As noted above, this is linear in `len(s)`.
Rather than reading a pair of giant immutable strings and writing a new giant immutable string, prefer a tombstone approach. Start with this import:

`from array import array`

Spend \$O(n)\$ time copying article text into an array. If articles are guaranteed ASCII text then we can get away with a byte array, else we need an array of `int`s. Implement your own version of `.index()` which

- compares against article text in the array, and
- knows that it must skip over (ignore) any tombstone entries.

Now "deleting" a censored string from the article is a matter of overwriting each character with a sentinel tombstone value such as `0`. Finally, at the end, you'll need to do linear work with `"".join( ... )` to copy out non-sentinel Unicode code points, producing the final sanitized article that is safe for cows to read.
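One possible sketch of that idea, assuming ASCII input; the `prev_live` links are an illustrative way to make the backward skip over tombstones cheap, and all names here are hypothetical:

```python
from array import array

def censor(article: str, banned: str) -> str:
    TOMB = 0                                   # sentinel tombstone value
    buf = array('B', article.encode('ascii'))  # byte array of article text
    pat = banned.encode('ascii')
    m, n = len(pat), len(buf)
    # prev_live[i] = index of the nearest live (non-tombstone) char before i
    prev_live = array('l', range(-1, n - 1))
    for i in range(n):
        # do the m live characters ending at position i spell the pattern?
        j, k = i, m - 1
        while k >= 0 and j >= 0 and buf[j] == pat[k]:
            j, k = prev_live[j], k - 1
        if k < 0:
            # match: overwrite those m characters with the sentinel
            t = i
            for _ in range(m):
                buf[t], t = TOMB, prev_live[t]
            if i + 1 < n:
                prev_live[i + 1] = t  # splice the hole out of the live links
    # linear pass to copy out the surviving characters
    return bytes(b for b in buf if b != TOMB).decode('ascii')
```

For example, `censor("momooo", "moo")` deletes the first "moo", notices the newly exposed occurrence, and returns the empty string.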
- Can you give me the full code you are talking about? Thanks. — user275173, Jul 28, 2023
> It timed out for test cases 7-15. I expected it to run with O(n) time complexity because of the while loop, which should work, but it didn't. What's wrong?
Let's look at the Common Sequence Operations table:

| Operation | Result | Notes |
| --- | --- | --- |
| ... | ... | ... |
| `s + t` | the concatenation of s and t | (6)(7) |
- Concatenating immutable sequences always results in a new object. This means that building up a sequence by repeated concatenation will have a quadratic runtime cost in the total sequence length. To get a linear runtime cost, you must switch to one of the alternatives below:
  - if concatenating `str` objects, you can build a list and use `str.join()` at the end or else write to an `io.StringIO` instance and retrieve its value when complete
You should build a list using `list.append()`, then build the string with `str.join()`.
Note: untested.

```python
output: list[str] = []
prev = 0
try:
    while True:
        curr = s.index(c, prev)
        output.append(s[prev:curr])
        prev = curr + len(c)
except ValueError:
    output.append(s[prev:])
s = "".join(output)
```
Even better, you can actually just use `str.replace`:

`s = s.replace(c, "")`
- `prev = curr + len(c)` doesn't account for occurrences appearing after deletion. `check = curr - len(c) + 1` needs a whole new logic to assemble the output. — greybeard, Jul 28, 2023
- @greybeard Ah yeah, I didn't see "Note that the deletion of one occurrence might create a new occurrence of T that didn't exist before." in the description. A simple approach is to just wrap `s.replace(c, "")` in a while loop until the replace does nothing... A performant approach, on the other hand, may be a little more intricate to write. — Jul 28, 2023
- (I've come up with two stacks... and ditched it: too much additional space (especially with Python). A counter for incomplete possible censored words is enough.) — greybeard, Jul 28, 2023
- @greybeard I'd probably go with `(s[curr - len(c):curr] + s[curr + len(c):curr + 2*len(c)]).index(c)` in an inner while loop. The algorithm becomes a little more complicated on the whole. I'd use `str.index` over a Trie because Python's so slow I'd expect the `str.index` to perform better, with a lower complexity score. — Jul 28, 2023
Essentially, the issue in the original code is that you are constructing a lot of big strings. Each indexing operation is O(C), each rebuild is O(C), and you may need to rebuild C times, so the original code is O(C²). It may also be helpful for you to separate the solver from the input handling, as I tried to do below:
```python
def solver(string_s, censor_t):
    c = len(string_s)
    d = len(censor_t)
    last_symbol = censor_t[-1]
    # emulate a stack; stack_end is the number of symbols in the stack
    stack_arr = [None] * c
    stack_end = 0
    for symbol in string_s:
        # emulate pushing symbol onto the stack
        stack_arr[stack_end] = symbol
        stack_end += 1
        if stack_end >= d and symbol == last_symbol:
            if "".join(stack_arr[(stack_end - d):stack_end]) == censor_t:
                # emulate popping d elements from the stack
                stack_end -= d
    return "".join(stack_arr[:stack_end])

line1, line2 = open('censor.in', 'r').read().splitlines()
output = open('censor.out', 'w')
output.write(solver(line1, line2))
output.close()
```
Reasoning
Let D be the length of the censor T and let C be the length of the string S. Create an empty list called stack (which is what we'll treat it as). Add S to stack, one character at a time. After each addition, check whether the last D elements (characters) of stack form T; if so, pop those D characters off stack. Repeat until S is completely traversed, then print out stack. Since each check takes D comparisons and we do at most C checks, adds, and removes, this is O(CD) time.
Further improvements:

- only check whenever you add the last character of T to stack.
- only check whenever stack is of size D or greater.
- create a dictionary counting the frequency of each character in stack, and only check when stack has enough of each character to make T (see the sketch after this list).
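A hypothetical sketch of that third improvement; the names `needed` and `have` are illustrative, and a plain list stands in for the preallocated array:

```python
from collections import Counter

def solver_counted(string_s, censor_t):
    d = len(censor_t)
    last_symbol = censor_t[-1]
    needed = Counter(censor_t)  # how many of each character T requires
    have = Counter()            # how many of each character the stack holds
    stack = []
    for symbol in string_s:
        stack.append(symbol)
        have[symbol] += 1
        # compare only when the stack could possibly end in T
        if (len(stack) >= d and symbol == last_symbol
                and all(have[ch] >= k for ch, k in needed.items())):
            if "".join(stack[-d:]) == censor_t:
                for ch in stack[-d:]:
                    have[ch] -= 1
                del stack[-d:]
    return "".join(stack)
```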
Testing
I put 2000000 'a's and then 1000000 'b's in the input, with the censor 'aab'. The initial code takes over 222.49 seconds; the revised code takes 0.64 seconds.
```python
output = open("censor2.in", "w")
value = "aab"
output.write(''.join([char * 1000000 for char in value]))
output.write("\naab")
output.close()
```
Note
If we were in a language like C, you could also achieve O(1) auxiliary space by cannibalizing the original string to make the stack. Do you see how?
- Thanks! This also works. Could you show me how to make the original string into a stack in C++? — user275173, Aug 2, 2023
- Sure, I will give it a shot when I get the time. — Justin Chang, Aug 2, 2023
- Sorry, I'm having some IDE issues. In C, strings are just arrays of characters, so you could delete the declaration of stack_arr and use string_s wherever you subsequently use stack_arr. The pseudo-stack will never run out of space, because string_s initially held all the characters to begin with. — Justin Chang, Aug 10, 2023
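A hypothetical Python analogue of that in-place trick, with a mutable `bytearray` standing in for C's char array (assumes ASCII input; `write_end` plays the role of `stack_end`):

```python
def solver_in_place(string_s, censor_t):
    buf = bytearray(string_s.encode('ascii'))  # one buffer doubles as the stack
    pat = censor_t.encode('ascii')
    d = len(pat)
    write_end = 0  # number of live characters kept so far
    for read in range(len(buf)):
        buf[write_end] = buf[read]  # push: overwrite already-consumed storage
        write_end += 1
        if write_end >= d and buf[write_end - d:write_end] == pat:
            write_end -= d          # pop the d characters that spell the censor
    return buf[:write_end].decode('ascii')
```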