I'm trying to solve the USACO problem Censoring (Bronze), the first problem from the 2015 February contest. My solution passes some test cases but times out on test cases 7-15. I expected it to run with O(n) time complexity because of the `while` loop, which should work, but it didn't.
Problem:
Farmer John has purchased a subscription to Good Hooveskeeping magazine for his cows, so they have plenty of material to read while waiting around in the barn during milking sessions. Unfortunately, the latest issue contains a rather inappropriate article on how to cook the perfect steak, which FJ would rather his cows not see (clearly, the magazine is in need of better editorial oversight).
FJ has taken all of the text from the magazine to create the string S of length at most 10^6 characters. From this, he would like to remove occurrences of a substring T of length <= 100 characters to censor the inappropriate content. To do this, Farmer John finds the first occurrence of T in S and deletes it. He then repeats the process again, deleting the first occurrence of T again, continuing until there are no more occurrences of T in S. Note that the deletion of one occurrence might create a new occurrence of T that didn't exist before.
Please help FJ determine the final contents of S after censoring is complete.
INPUT FORMAT: (file censor.in) The first line will contain S. The second line will contain T. The length of T will be at most that of S, and all characters of S and T will be lower-case alphabet characters (in the range a..z).
OUTPUT FORMAT: (file censor.out) The string S after all deletions are complete. It is guaranteed that S will not become empty during the deletion process.
```python
# Censoring (Bronze)
input = open('censor.in', 'r').read().split('\n')
s = input[0]
c = input[1]
l = len(c)
while True:
    try:
        i = s.index(c)
        s = s[:i] + s[i + l:]
    except:
        break
output = open('censor.out', 'w')
output.write(s)
output.close()
```
How do I speed up the code? I tried to use Numba, but Numba doesn't work in the USACO IDE. This was supposed to solve all the test cases under the time limit, which is 5 seconds.
3 Answers
Assumption: `len(c) << len(s)`. For example, when censoring "moo".
You asked if one can speed up Python, but really this is a matter of "speeding up an algorithm". That is, you have accidentally fallen into a higher time complexity than necessary.
> I expected it to run with O(n) time complexity
Certainly the `while` loop is linear in the size of the article string `s`. But you're restarting that linear `.index()` search from the beginning each time, leading to nested-loop \$O(N \times M)\$ behavior if the censored `c` appears \$M\$ times within the article. Given \$M \ll N\$, an efficient algorithm should see article size dominate the cost, with negligible contribution from the censoring operations.
In addition to `.index()` being linear in `len(s)`, notice that producing a new immutable `str` with string catenation is also \$O(len(s))\$.
`input = open('censor.in', 'r').read().split('\n')`
A more natural method to call here would have been `.readlines()`. Rather than leaving an open file descriptor lying around until end of scope, prefer a `with` context manager so the file is closed when you're done with it. Shadowing the builtin `input()` is definitely not best practice, either.
`s = input[0]`
`c = input[1]`
Usually we give `0` and `1` a pass when it comes to calling out magic numbers. But here they obscure the author's intent; there's no need for such cryptic indexes. Prefer a tuple unpack:

`s, c = input`
Also, even though these are local variables in a very short piece of code, please consider using more informative names such as `article` and `censored`.
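Putting those input-handling points together, a minimal sketch might look like this (using `.splitlines()` is an assumption that also sidesteps any trailing newline in the file):

```python
# read both lines with a context manager; no builtin shadowing,
# no magic indexes, and informative names
with open('censor.in', 'r') as infile:
    article, censored = infile.read().splitlines()
```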
`except:`

No. Do not use a "bare" `except`, as it will sometimes swallow exceptions such as `KeyboardInterrupt` which we really want to be propagated. Prefer to habitually write `except Exception:`. But here we know perfectly well what we're hoping to trap, so spell it out for the gentle reader:

`except ValueError:  # substring not found`
`i = s.index(c)`

Suppose we have a million-character article, and we've already deleted a thousand instances of "moo". This `.index()` call is super expensive: it's trolling through roughly a megabyte of article text that we've already repeatedly scanned before. We know there's a giant prefix which is entirely "moo"-free. `.index()` offers an optional `start` parameter. Use it. You'll have to maintain another temp variable. Since deleting "moo" from "momooo" can expose another instance, you should conservatively resume a little before the deletion point, with `j = s.index(c, j) - len(c)`. This will still yield \$O(len(s))\$ complexity, given our assumption.
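A minimal sketch of that loop, assuming the `article`/`censored` names suggested above (the `max(0, ...)` clamp is an added guard for matches near the start of the article):

```python
search_from = 0
while True:
    try:
        hit = article.index(censored, search_from)
    except ValueError:  # substring not found
        break
    article = article[:hit] + article[hit + len(censored):]
    # a deletion can expose a new occurrence starting slightly earlier,
    # so back up conservatively before resuming the search
    search_from = max(0, hit - len(censored))
```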
`s = s[:i] + s[i + l:]`

As noted above, this is linear in `len(s)`.
Rather than reading a pair of giant immutable strings and writing a new giant immutable string, prefer a tombstone approach. Start with this import:

`from array import array`

Spend \$O(n)\$ time copying article text into an array. If articles are guaranteed ASCII text then we can get away with a byte array, else we need an array of `int`s. Implement your own version of `.index()` which

- compares against article text in the array, and
- knows that it must skip over (ignore) any tombstone entries.

Now "deleting" a censored string from the article is a matter of overwriting each character with a sentinel tombstone value such as `0`. Finally, at the end, you'll need to do linear work with `"".join( ... )` to copy out non-sentinel Unicode code points, producing the final sanitized article that is safe for cows to read.
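One possible sketch of that idea, assuming ASCII input; the `prev_live` links are an illustrative way to make the backward skip over tombstones cheap, and all names here are hypothetical:

```python
from array import array

def censor(article: str, banned: str) -> str:
    TOMB = 0                                   # sentinel tombstone value
    buf = array('B', article.encode('ascii'))  # byte array of article text
    pat = banned.encode('ascii')
    m, n = len(pat), len(buf)
    # prev_live[i] = index of the nearest live (non-tombstone) char before i
    prev_live = array('l', range(-1, n - 1))
    for i in range(n):
        # do the m live characters ending at position i spell the pattern?
        j, k = i, m - 1
        while k >= 0 and j >= 0 and buf[j] == pat[k]:
            j, k = prev_live[j], k - 1
        if k < 0:
            # match: overwrite those m characters with the sentinel
            t = i
            for _ in range(m):
                buf[t], t = TOMB, prev_live[t]
            if i + 1 < n:
                prev_live[i + 1] = t  # splice the hole out of the live links
    # linear pass to copy out the surviving characters
    return bytes(b for b in buf if b != TOMB).decode('ascii')
```

For example, `censor("momooo", "moo")` deletes the first "moo", notices the newly exposed occurrence, and returns the empty string.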
- Can you give me the full code you are talking about? Thanks. — user275173, Jul 28, 2023
> It timed out for test cases 7-15. I expected it to run with O(n) time complexity because of the while loop, which should work, but it didn't. What's wrong?
Let's look at the Common Sequence Operations table:

| Operation | Result | Notes |
| --- | --- | --- |
| ... | ... | ... |
| `s + t` | the concatenation of s and t | (6)(7) |
- Concatenating immutable sequences always results in a new object. This means that building up a sequence by repeated concatenation will have a quadratic runtime cost in the total sequence length. To get a linear runtime cost, you must switch to one of the alternatives below:
  - if concatenating `str` objects, you can build a list and use `str.join()` at the end or else write to an `io.StringIO` instance and retrieve its value when complete
You should build a list using `list.append()`, then build the string with `str.join()`.
Note: untested.

```python
output: list[str] = []
prev = 0
try:
    while True:
        curr = s.index(c, prev)
        output.append(s[prev:curr])
        prev = curr + len(c)
except ValueError:
    output.append(s[prev:])
s = "".join(output)
```
Even better, you can actually just use `str.replace`:

`s = s.replace(c, "")`
- `prev = curr + len(c)` doesn't account for occurrences appearing after deletion. `check = curr - len(c) + 1` needs a whole new logic to assemble the output. — greybeard, Jul 28, 2023
- @greybeard Ah yeah, I didn't see "Note that the deletion of one occurrence might create a new occurrence of T that didn't exist before." in the description. A simple approach is to just wrap `s.replace(c, "")` in a while loop until the replace does nothing... A performant approach, on the other hand, may be a little more intricate to write. — Jul 28, 2023
- (I've come up with two stacks... and ditched it: too much additional space (especially with Python). A counter for incomplete possible censored words is enough.) — greybeard, Jul 28, 2023
- @greybeard I'd probably go with `(s[curr - len(c):curr] + s[curr + len(c):curr + 2*len(c)]).index(c)` in an inner while loop. The algorithm becomes a little more complicated on the whole. I'd use `str.index` over a Trie because Python's so slow I'd expect the `str.index` to perform better, with a lower complexity score. — Jul 28, 2023
Essentially, the issue in the original code is that you are constructing a lot of big strings. Each indexing operation is O(C), each rebuild is O(C), and you may need to rebuild C times, so the original code is O(C²). It may also be helpful for you to separate the solver from the input handling, as I tried to do below:
```python
def solver(string_s, censor_t):
    c = len(string_s)
    d = len(censor_t)
    last_symbol = censor_t[-1]
    # emulate a stack; stack_end is the number of symbols in the stack
    stack_arr = [None] * c
    stack_end = 0
    for symbol in string_s:
        # emulate pushing symbol onto the stack
        stack_arr[stack_end] = symbol
        stack_end += 1
        if stack_end >= d and symbol == last_symbol:
            if "".join(stack_arr[(stack_end - d):stack_end]) == censor_t:
                # emulate popping d elements from the stack
                stack_end -= d
    return "".join(stack_arr[:stack_end])

line1, line2 = open('censor.in', 'r').read().splitlines()
output = open('censor.out', 'w')
output.write(solver(line1, line2))
output.close()
```
Reasoning
Let D be the length of the censor T and let C be the length of the string S. Create an empty list called stack (which is what we'll treat it as). Add S to stack, one character at a time. After each addition, check whether the last D elements (characters) of stack form T; if so, pop those D characters off stack. Repeat until S is completely traversed, then print out stack. Since each check takes D comparisons and we do at most C checks, adds, and removes, this is O(CD) time.
Further improvements:

- only check whenever you add the last character of T to stack.
- only check whenever stack is of size D or greater.
- create a dictionary counting the frequency of each character in stack, and only check when stack has enough of each character to make T (see the sketch after this list).
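A hypothetical sketch of that third improvement; the names `needed` and `have` are illustrative, and a plain list stands in for the preallocated array:

```python
from collections import Counter

def solver_counted(string_s, censor_t):
    d = len(censor_t)
    last_symbol = censor_t[-1]
    needed = Counter(censor_t)  # how many of each character T requires
    have = Counter()            # how many of each character the stack holds
    stack = []
    for symbol in string_s:
        stack.append(symbol)
        have[symbol] += 1
        # compare only when the stack could possibly end in T
        if (len(stack) >= d and symbol == last_symbol
                and all(have[ch] >= k for ch, k in needed.items())):
            if "".join(stack[-d:]) == censor_t:
                for ch in stack[-d:]:
                    have[ch] -= 1
                del stack[-d:]
    return "".join(stack)
```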
Testing
I put 2000000 'a's and then 1000000 'b's in the input, with the censor 'aab'. The initial code takes over 222.49 seconds; the revised code takes 0.64 seconds.
```python
output = open("censor2.in", "w")
value = "aab"
output.write(''.join([char * 1000000 for char in value]))
output.write("\naab")
output.close()
```
Note
If we were in a language like C, you could also achieve O(1) auxiliary space by cannibalizing the original string to make the stack. Do you see how?
- Thanks! This also works. Could you show me how to make the original string into a stack in C++? — user275173, Aug 2, 2023
- Sure, I will give it a shot when I get the time. — Justin Chang, Aug 2, 2023
- Sorry, I'm having some IDE issues. In C, strings are just arrays of characters, so you could delete the declaration of stack_arr and use string_s wherever you subsequently use stack_arr. The pseudo-stack will never run out of space, because string_s initially held all the characters to begin with. — Justin Chang, Aug 10, 2023
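A hypothetical Python analogue of that in-place trick, with a mutable `bytearray` standing in for C's char array (assumes ASCII input; `write_end` plays the role of `stack_end`):

```python
def solver_in_place(string_s, censor_t):
    buf = bytearray(string_s.encode('ascii'))  # one buffer doubles as the stack
    pat = censor_t.encode('ascii')
    d = len(pat)
    write_end = 0  # number of live characters kept so far
    for read in range(len(buf)):
        buf[write_end] = buf[read]  # push: overwrite already-consumed storage
        write_end += 1
        if write_end >= d and buf[write_end - d:write_end] == pat:
            write_end -= d          # pop the d characters that spell the censor
    return buf[:write_end].decode('ascii')
```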