3
\$\begingroup\$

This is my current code:

def search(word,source):
 places = []
 for x in range(0,len(source)):
 if source[x] == word[0]:
 if source[x:x+len(word)] == word:
 places.append(x)
 return places 

How would I optimize this without using regular expressions?

jonrsharpe
14k2 gold badges36 silver badges62 bronze badges
asked Nov 24, 2014 at 16:43
\$\endgroup\$
1
  • 1
    \$\begingroup\$ I would also use yield so you can avoid creating a list and execute the appropriate code. \$\endgroup\$ Commented Nov 24, 2014 at 19:32

3 Answers 3

2
\$\begingroup\$

Using timeit to time the function running over 1000 iterations, as follows:

import timeit
def search(word=None,source=None):
 word = "is"
 source = "Lorem Ipsum is simply dummy text"
 places = []
 for x in range(0, len(source)):
 if source[x] == word[0]:
 if source[x:x + len(word)] == word:
 places.append(x)
 return places
print timeit.timeit(search, number=1000)

This code gives us the following results (over 4 runs):

0.259000062943
0.12299990654
0.174999952316
0.0520000457764

So, we'll use that as our baseline for improvement (further runs stay constant around .052/1000).

Let's grab the low hanging fruit - caching the len calls:

import timeit
def search(word=None,source=None):
 word = "is"
 source = "Lorem Ipsum is simply dummy text"
 places = []
 len_word = len(word)
 len_source = len(source)
 for x in range(0, len_source):
 if source[x] == word[0]:
 if source[x:x + len_word] == word:
 places.append(x)
 return places
print timeit.timeit(search, number=1000)

Improvement:

0.0519998493195
0.047000169754
0.0499999523163
0.0490000247955

Now, what happens if we get rid of places and just return once we find the string, as follows:

import timeit
def search(word=None,source=None):
 word = "is"
 source = "Lorem Ipsum is simply dummy text"
 len_word = len(word)
 len_source = len(source)
 for x in range(0, len_source):
 if source[x] == word[0]:
 if source[x:x + len_word] == word:
 return x
print timeit.timeit(search, number=1000)

Well, this happened:

0.0349998474121
0.0299999713898
0.0279998779297
0.0279998779297

Nice! But wait, there's more! We can just use Python built ins:

import timeit
def search(word=None,source=None):
 word = "is"
 source = "Lorem Ipsum is simply dummy text"
 return source.lower().find(word.lower())
print timeit.timeit(search, number=1000)

to halve the time it took us with the brute force approach:

0.0140001773834
0.0149999599457
0.0120000839233
0.0110000114441

These are the two final products, one brute force, and one using built ins, and both return the same int value when a word is found:

def search(word,source):
 len_word = len(word)
 len_source = len(source)
 for x in range(0, len_source):
 if source[x] == word[0]:
 if source[x:x + len_word] == word:
 return x
def search(word,source):
 return source.lower().find(word.lower())

::EDIT::

In regards to the comment (not same as OPs code), turn the function into an iterator rather than returning a list. This will give you the speed boost of the prior code, and still give you all the places of occurrence:

def search(word,source):
 len_word = len(word)
 len_source = len(source)
 for x in range(0, len_source):
 if source[x] == word[0]:
 if source[x:x + len_word] == word:
 yield x
for occurrence in search():
 print occurrence 

EDIT:::

Getting rid of if source[x] == word[0] also gives us a nice speedup:

def search(word,source):
 len_word = len(word)
 len_source = len(source)
 for x in range(0, len_source):
 if source[x:x + len_word] == word:
 yield x

EDIT:::

Another quick win, caching the first chars:

def search(word,source):
 places = []
 len_word = len(word)
 len_source = len(source)
 first_char = word[0]
 for x in range(0, len_source):
 if source[x] == first_char:
 if source[x:x + len_word] == word:
 places.append(x)
 return places
answered Nov 24, 2014 at 17:21
\$\endgroup\$
11
  • 1
    \$\begingroup\$ Just returning the first match is not the same as the OP's code - although it does give a nice speed-up! Also, len(source) is only calculated once either way. \$\endgroup\$ Commented Nov 24, 2014 at 17:22
  • \$\begingroup\$ @jonrsharpe Good catch! I've edited my answer to reflect your comment and a solution that maintains the speed up, although it converts the function into a generator so a loop is needed externally. \$\endgroup\$ Commented Nov 24, 2014 at 17:32
  • \$\begingroup\$ Strangely enough, I added back the places and append, but this algorithm is now significantly slower. What is wrong? \$\endgroup\$ Commented Nov 24, 2014 at 17:47
  • \$\begingroup\$ @LiamSchumm there is overhead involved in a generator - the advantage is in not building the whole list at once (space saving), but if you want the whole list it will likely not be faster than a list comprehension (and if you do want to do it iteratively, look into itertools). \$\endgroup\$ Commented Nov 24, 2014 at 17:48
  • \$\begingroup\$ @jonrsharpe In which case, my code with append() compares in time to your code with enumerate(): repl.it/4rh - both hovering around 0.45 per 10k iterations. \$\endgroup\$ Commented Nov 24, 2014 at 18:05
5
\$\begingroup\$

Using timeit, we can make changes and see how it affects the speed. Baseline:

>>> def search1(word,source):
 places = []
 for x in range(0,len(source)):
 if source[x] == word[0]:
 if source[x:x+len(word)] == word:
 places.append(x)
 return places
>>> timeit.timeit("search1(word, source)", 
 setup="from __main__ import search1, search2, search3; import re; word, source = 'foo', 'foobarbaz'*1000", 
 number=10000)
15.60436644058882

Note the following:

  • You calculate word[0] and len(word) on every loop, these can be factored out;
  • A list comprehension is generally faster than for looping and appending; and
  • enumerate allows you to get the index and character from source simultaneously.

With those in mind:

>>> def search2(word, source):
 """Find every index in source at which word starts."""
 firstchar = word[0]
 wordlen = len(word)
 return [index for index, char in enumerate(source) 
 if firstchar == char and word == source[index:index+wordlen]]
>>> timeit.timeit("search2(word, source)", 
 setup="from __main__ import search1, search2, search3; import re; word, source = 'foo', 'foobarbaz'*1000", 
 number=10000)
9.99040215947457

I have also added a docstring to explain what the function does.


However, if you really want efficiency (and readability, for that matter), ignoring regular expressions is a bad move:

>>> def search3(word, source):
 """Find every index in source at which word starts."""
 return [m.start() for m in re.finditer(word, source)]
>>> timeit.timeit("search3(word, source)", 
 setup="from __main__ import search1, search2, search3; import re; word, source = 'foo', 'foobarbaz'*1000", 
 number=10000)
2.1652778798459167
answered Nov 24, 2014 at 17:19
\$\endgroup\$
0
2
\$\begingroup\$

This is an inefficient way to do that without regex. But not very different from your version.

>>> string = "test test test test"
>>> [i for i in range(len(string)) if string.startswith('test', i)]
[0, 5, 10, 15]

What's the problem in using regex? It's a built-in module. I would use regex because it's often the most efficient way to do more complex things like unpredictable lengths of inputs. Also regarding runtime. For example right now you are also searching for overlapping solutions. Do you want that? If you will search for long words and you don't want to find overlapping solutions you can skip a lot of iterations. Regex will do that automatically.

>>> [m.start() for m in re.finditer('test', 'test test test test')]
[0, 5, 10, 15]

If you want to find overlapping matches, lookahead will do that:

>>> [m.start() for m in re.finditer('(?=tt)', 'ttt')]
[0, 1]

Also look into this.

I also tried just optimize your solution and ended up with

>>> def search(word, source):
... w_len = len(word)
... for x in range(0, len(source)):
... if source[x:x+w_len] == word:
... yield x
...
>>> [i for i in search('foo', 'foobarbazfoobarbaz')]
[0, 9]
answered Nov 24, 2014 at 19:29
\$\endgroup\$
0

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.