This is my current code:
def search(word,source):
    places = []
    for x in range(0,len(source)):
        if source[x] == word[0]:
            if source[x:x+len(word)] == word:
                places.append(x)
    return places
How would I optimize this without using regular expressions?
- I would also use yield so you can avoid creating a list and execute the appropriate code. – wenzul, Nov 24, 2014 at 19:32
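As an illustration of that suggestion, a minimal sketch of the same search written as a generator (keeping the question's logic, just replacing the list with yield):

def search(word, source):
    for x in range(0, len(source)):
        if source[x] == word[0]:
            if source[x:x+len(word)] == word:
                yield x  # hand back each position as it is found instead of building a list

for place in search("is", "Lorem Ipsum is simply dummy text"):
    print place  # prints 12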
3 Answers
Using timeit to time the function running over 1000 iterations, as follows:
import timeit

def search(word=None, source=None):
    # The arguments are overridden here so timeit can call search with no arguments.
    word = "is"
    source = "Lorem Ipsum is simply dummy text"
    places = []
    for x in range(0, len(source)):
        if source[x] == word[0]:
            if source[x:x + len(word)] == word:
                places.append(x)
    return places

print timeit.timeit(search, number=1000)
This code gives us the following results (over 4 runs):
0.259000062943
0.12299990654
0.174999952316
0.0520000457764
So, we'll use that as our baseline for improvement (further runs stay fairly constant around 0.052 per 1000 calls).
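The first runs vary quite a bit (interpreter warm-up and background noise), so it can also help to repeat the measurement and take the minimum; a small sketch using timeit.repeat (the best_of helper is only illustrative, not part of the code under review):

import timeit

def best_of(func, number=1000, repeat=5):
    # timeit.repeat runs the timing loop several times; the minimum is the least noisy figure.
    return min(timeit.repeat(func, number=number, repeat=repeat))

print best_of(search)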
Let's grab the low-hanging fruit - caching the len calls:
import timeit

def search(word=None, source=None):
    word = "is"
    source = "Lorem Ipsum is simply dummy text"
    places = []
    len_word = len(word)
    len_source = len(source)
    for x in range(0, len_source):
        if source[x] == word[0]:
            if source[x:x + len_word] == word:
                places.append(x)
    return places

print timeit.timeit(search, number=1000)
Improvement:
0.0519998493195
0.047000169754
0.0499999523163
0.0490000247955
Now, what happens if we get rid of places and just return once we find the string, as follows:
import timeit

def search(word=None, source=None):
    word = "is"
    source = "Lorem Ipsum is simply dummy text"
    len_word = len(word)
    len_source = len(source)
    for x in range(0, len_source):
        if source[x] == word[0]:
            if source[x:x + len_word] == word:
                return x

print timeit.timeit(search, number=1000)
Well, this happened:
0.0349998474121
0.0299999713898
0.0279998779297
0.0279998779297
Nice! But wait, there's more! We can just use Python built-ins:
import timeit

def search(word=None, source=None):
    word = "is"
    source = "Lorem Ipsum is simply dummy text"
    return source.lower().find(word.lower())

print timeit.timeit(search, number=1000)
to halve the time it took us with the brute force approach:
0.0140001773834
0.0149999599457
0.0120000839233
0.0110000114441
These are the two final products, one brute force and one using built-ins; both return an int index when the word is found (note that the built-in version lowercases both strings, so it is case-insensitive, and it returns -1 rather than None when the word is absent):
def search(word, source):
    len_word = len(word)
    len_source = len(source)
    for x in range(0, len_source):
        if source[x] == word[0]:
            if source[x:x + len_word] == word:
                return x

def search(word, source):
    return source.lower().find(word.lower())
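If you want every occurrence but something closer to built-in speed, one option is to loop with str.find, moving the start index forward by one each time; a minimal sketch (the name find_all is only illustrative):

def find_all(word, source):
    places = []
    start = source.find(word)
    while start != -1:
        places.append(start)
        # Search again from one character further on, so overlapping matches
        # are kept (the same behaviour as the brute-force version).
        start = source.find(word, start + 1)
    return places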
::EDIT::
In regard to the comment below (the previous version does not do the same thing as the OP's code): turn the function into a generator rather than returning a list. This will give you the speed boost of the prior code and still give you all the places of occurrence:
def search(word, source):
    len_word = len(word)
    len_source = len(source)
    for x in range(0, len_source):
        if source[x] == word[0]:
            if source[x:x + len_word] == word:
                yield x

for occurrence in search("is", "Lorem Ipsum is simply dummy text"):
    print occurrence  # prints 12
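If you do want the results in a list, as in the original function, wrapping the generator works too:

places = list(search("is", "Lorem Ipsum is simply dummy text"))  # [12]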
::EDIT::
Getting rid of if source[x] == word[0] also gives us a nice speedup:
def search(word, source):
    len_word = len(word)
    len_source = len(source)
    for x in range(0, len_source):
        if source[x:x + len_word] == word:
            yield x
::EDIT::
Another quick win: caching the first char:
def search(word, source):
    places = []
    len_word = len(word)
    len_source = len(source)
    first_char = word[0]
    for x in range(0, len_source):
        if source[x] == first_char:
            if source[x:x + len_word] == word:
                places.append(x)
    return places
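One caveat when timing the yield versions above: calling a generator function does not run its body, so consume it (for example with list()) inside whatever you hand to timeit; a sketch, reusing the same word and source as before:

import timeit

word = "is"
source = "Lorem Ipsum is simply dummy text"

def timed_search():
    # Consume the generator so the search work actually happens during timing.
    return list(search(word, source))

print timeit.timeit(timed_search, number=1000)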
- Just returning the first match is not the same as the OP's code - although it does give a nice speed-up! Also, len(source) is only calculated once either way. – jonrsharpe, Nov 24, 2014 at 17:22
- @jonrsharpe Good catch! I've edited my answer to reflect your comment and a solution that maintains the speed-up, although it converts the function into a generator, so a loop is needed externally. – jsanc623, Nov 24, 2014 at 17:32
- Strangely enough, I added back the places and append, but this algorithm is now significantly slower. What is wrong? – Liam Schumm, Nov 24, 2014 at 17:47
- @LiamSchumm there is overhead involved in a generator - the advantage is in not building the whole list at once (space saving), but if you want the whole list it will likely not be faster than a list comprehension (and if you do want to do it iteratively, look into itertools). – jonrsharpe, Nov 24, 2014 at 17:48
- @jonrsharpe In which case, my code with append() compares in time to your code with enumerate(): repl.it/4rh - both hovering around 0.45 per 10k iterations. – jsanc623, Nov 24, 2014 at 18:05
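The laziness mentioned in that exchange is where a generator pays off: if you only need some of the matches, itertools.islice can pull the first few positions without scanning the rest of the string; a sketch, assuming the yield-based search from the edits above:

import itertools

# Take only the first three match positions; the generator is not advanced any further.
first_three = list(itertools.islice(search('foo', 'foobarbaz' * 1000), 3))
print first_three  # [0, 9, 18]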
Using timeit, we can make changes and see how they affect the speed. Baseline:
>>> def search1(word,source):
        places = []
        for x in range(0,len(source)):
            if source[x] == word[0]:
                if source[x:x+len(word)] == word:
                    places.append(x)
        return places
>>> timeit.timeit("search1(word, source)",
                  setup="from __main__ import search1, search2, search3; import re; word, source = 'foo', 'foobarbaz'*1000",
                  number=10000)
15.60436644058882
Note the following:

- You calculate word[0] and len(word) on every loop; these can be factored out;
- A list comprehension is generally faster than for looping and appending; and
- enumerate allows you to get the index and character from source simultaneously.
With those in mind:
>>> def search2(word, source):
        """Find every index in source at which word starts."""
        firstchar = word[0]
        wordlen = len(word)
        return [index for index, char in enumerate(source)
                if firstchar == char and word == source[index:index+wordlen]]
>>> timeit.timeit("search2(word, source)",
                  setup="from __main__ import search1, search2, search3; import re; word, source = 'foo', 'foobarbaz'*1000",
                  number=10000)
9.99040215947457
I have also added a docstring to explain what the function does.
However, if you really want efficiency (and readability, for that matter), ignoring regular expressions is a bad move:
>>> def search3(word, source):
        """Find every index in source at which word starts."""
        return [m.start() for m in re.finditer(word, source)]
>>> timeit.timeit("search3(word, source)",
                  setup="from __main__ import search1, search2, search3; import re; word, source = 'foo', 'foobarbaz'*1000",
                  number=10000)
2.1652778798459167
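One caveat with search3: re.finditer treats word as a regular-expression pattern, so a word containing metacharacters would need escaping with re.escape first, e.g.:

>>> [m.start() for m in re.finditer(re.escape('1+1'), 'x 1+1 y 1+1')]
[2, 8]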
This is a way to do it without regex; it is not especially efficient, but it is not very different from your version:
>>> string = "test test test test"
>>> [i for i in range(len(string)) if string.startswith('test', i)]
[0, 5, 10, 15]
What's the problem with using regex? The re module is built in. I would use it because it is often the most efficient way to handle more complex cases, such as inputs of unpredictable length, and it also matters for runtime. For example, right now you are also finding overlapping matches - do you want that? If you search for long words and do not want overlapping matches, you can skip a lot of iterations; regex does that automatically:
>>> import re
>>> [m.start() for m in re.finditer('test', 'test test test test')]
[0, 5, 10, 15]
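If you want to stay regex-free and skip past each match yourself, a minimal sketch using str.find (the name find_non_overlapping is only illustrative):

def find_non_overlapping(word, source):
    places = []
    start = source.find(word)
    while start != -1:
        places.append(start)
        # Jump past the whole match, so overlapping occurrences are skipped.
        start = source.find(word, start + len(word))
    return places

print find_non_overlapping('test', 'test test test test')  # [0, 5, 10, 15]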
If you want to find overlapping matches, lookahead will do that:
>>> [m.start() for m in re.finditer('(?=tt)', 'ttt')]
[0, 1]
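For a more word-like case, the same lookahead trick picks up the overlapping occurrence that a plain finditer would miss:

>>> [m.start() for m in re.finditer('ana', 'banana')]
[1]
>>> [m.start() for m in re.finditer('(?=ana)', 'banana')]
[1, 3]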
Also look into this.
I also tried just optimizing your solution and ended up with:
>>> def search(word, source):
...     w_len = len(word)
...     for x in range(0, len(source)):
...         if source[x:x+w_len] == word:
...             yield x
...
>>> [i for i in search('foo', 'foobarbazfoobarbaz')]
[0, 9]
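If you want the list directly, wrapping the generator in list() gives the same result:

>>> list(search('foo', 'foobarbazfoobarbaz'))
[0, 9]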