This is my current code:
def search(word,source):
    places = []
    for x in range(0,len(source)):
        if source[x] == word[0]:
            if source[x:x+len(word)] == word:
                places.append(x)
    return places
How would I optimize this without using regular expressions?
- I would also use yield so you can avoid creating a list and execute the appropriate code. – wenzul, Nov 24, 2014 at 19:32
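As an illustration of that suggestion, a minimal sketch of the same search written as a generator (keeping the question's logic, just replacing the list with yield):

def search(word, source):
    for x in range(0, len(source)):
        if source[x] == word[0]:
            if source[x:x+len(word)] == word:
                yield x  # hand back each position as it is found instead of building a list

for place in search("is", "Lorem Ipsum is simply dummy text"):
    print place  # prints 12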
3 Answers
Using timeit to time the function running over 1000 iterations, as follows:
import timeit

def search(word=None, source=None):
    # The arguments are overridden here so timeit can call search with no arguments.
    word = "is"
    source = "Lorem Ipsum is simply dummy text"
    places = []
    for x in range(0, len(source)):
        if source[x] == word[0]:
            if source[x:x + len(word)] == word:
                places.append(x)
    return places

print timeit.timeit(search, number=1000)
This code gives us the following results (over 4 runs):
0.259000062943
0.12299990654
0.174999952316
0.0520000457764
So, we'll use that as our baseline for improvement (further runs stay fairly constant around 0.052 per 1000 calls).
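The first runs vary quite a bit (interpreter warm-up and background noise), so it can also help to repeat the measurement and take the minimum; a small sketch using timeit.repeat (the best_of helper is only illustrative, not part of the code under review):

import timeit

def best_of(func, number=1000, repeat=5):
    # timeit.repeat runs the timing loop several times; the minimum is the least noisy figure.
    return min(timeit.repeat(func, number=number, repeat=repeat))

print best_of(search)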
Let's grab the low-hanging fruit - caching the len calls:
import timeit

def search(word=None, source=None):
    word = "is"
    source = "Lorem Ipsum is simply dummy text"
    places = []
    len_word = len(word)
    len_source = len(source)
    for x in range(0, len_source):
        if source[x] == word[0]:
            if source[x:x + len_word] == word:
                places.append(x)
    return places

print timeit.timeit(search, number=1000)
Improvement:
0.0519998493195
0.047000169754
0.0499999523163
0.0490000247955
Now, what happens if we get rid of places and just return once we find the string, as follows:
import timeit

def search(word=None, source=None):
    word = "is"
    source = "Lorem Ipsum is simply dummy text"
    len_word = len(word)
    len_source = len(source)
    for x in range(0, len_source):
        if source[x] == word[0]:
            if source[x:x + len_word] == word:
                return x

print timeit.timeit(search, number=1000)
Well, this happened:
0.0349998474121
0.0299999713898
0.0279998779297
0.0279998779297
Nice! But wait, there's more! We can just use Python built-ins:
import timeit

def search(word=None, source=None):
    word = "is"
    source = "Lorem Ipsum is simply dummy text"
    return source.lower().find(word.lower())

print timeit.timeit(search, number=1000)
to halve the time it took us with the brute force approach:
0.0140001773834
0.0149999599457
0.0120000839233
0.0110000114441
These are the two final products, one brute force and one using built-ins; both return an int index when the word is found (note that the built-in version lowercases both strings, so it is case-insensitive, and it returns -1 rather than None when the word is absent):
def search(word, source):
    len_word = len(word)
    len_source = len(source)
    for x in range(0, len_source):
        if source[x] == word[0]:
            if source[x:x + len_word] == word:
                return x

def search(word, source):
    return source.lower().find(word.lower())
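If you want every occurrence but something closer to built-in speed, one option is to loop with str.find, moving the start index forward by one each time; a minimal sketch (the name find_all is only illustrative):

def find_all(word, source):
    places = []
    start = source.find(word)
    while start != -1:
        places.append(start)
        # Search again from one character further on, so overlapping matches
        # are kept (the same behaviour as the brute-force version).
        start = source.find(word, start + 1)
    return places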
::EDIT::
In regard to the comment below (the previous version does not do the same thing as the OP's code): turn the function into a generator rather than returning a list. This will give you the speed boost of the prior code and still give you all the places of occurrence:
def search(word, source):
    len_word = len(word)
    len_source = len(source)
    for x in range(0, len_source):
        if source[x] == word[0]:
            if source[x:x + len_word] == word:
                yield x

for occurrence in search("is", "Lorem Ipsum is simply dummy text"):
    print occurrence  # prints 12
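If you do want the results in a list, as in the original function, wrapping the generator works too:

places = list(search("is", "Lorem Ipsum is simply dummy text"))  # [12]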
::EDIT::
Getting rid of if source[x] == word[0] also gives us a nice speedup:
def search(word, source):
    len_word = len(word)
    len_source = len(source)
    for x in range(0, len_source):
        if source[x:x + len_word] == word:
            yield x
::EDIT::
Another quick win: caching the first char:
def search(word, source):
    places = []
    len_word = len(word)
    len_source = len(source)
    first_char = word[0]
    for x in range(0, len_source):
        if source[x] == first_char:
            if source[x:x + len_word] == word:
                places.append(x)
    return places
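One caveat when timing the yield versions above: calling a generator function does not run its body, so consume it (for example with list()) inside whatever you hand to timeit; a sketch, reusing the same word and source as before:

import timeit

word = "is"
source = "Lorem Ipsum is simply dummy text"

def timed_search():
    # Consume the generator so the search work actually happens during timing.
    return list(search(word, source))

print timeit.timeit(timed_search, number=1000)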
- Just returning the first match is not the same as the OP's code - although it does give a nice speed-up! Also, len(source) is only calculated once either way. – jonrsharpe, Nov 24, 2014 at 17:22
- @jonrsharpe Good catch! I've edited my answer to reflect your comment and a solution that maintains the speed-up, although it converts the function into a generator, so a loop is needed externally. – jsanc623, Nov 24, 2014 at 17:32
- Strangely enough, I added back the places and append, but this algorithm is now significantly slower. What is wrong? – Liam Schumm, Nov 24, 2014 at 17:47
- @LiamSchumm there is overhead involved in a generator - the advantage is in not building the whole list at once (space saving), but if you want the whole list it will likely not be faster than a list comprehension (and if you do want to do it iteratively, look into itertools). – jonrsharpe, Nov 24, 2014 at 17:48
- @jonrsharpe In which case, my code with append() compares in time to your code with enumerate(): repl.it/4rh - both hovering around 0.45 per 10k iterations. – jsanc623, Nov 24, 2014 at 18:05
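The laziness mentioned in that exchange is where a generator pays off: if you only need some of the matches, itertools.islice can pull the first few positions without scanning the rest of the string; a sketch, assuming the yield-based search from the edits above:

import itertools

# Take only the first three match positions; the generator is not advanced any further.
first_three = list(itertools.islice(search('foo', 'foobarbaz' * 1000), 3))
print first_three  # [0, 9, 18]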
Using timeit, we can make changes and see how they affect the speed. Baseline:
>>> def search1(word,source):
        places = []
        for x in range(0,len(source)):
            if source[x] == word[0]:
                if source[x:x+len(word)] == word:
                    places.append(x)
        return places
>>> timeit.timeit("search1(word, source)",
                  setup="from __main__ import search1, search2, search3; import re; word, source = 'foo', 'foobarbaz'*1000",
                  number=10000)
15.60436644058882
Note the following:

- You calculate word[0] and len(word) on every loop; these can be factored out;
- A list comprehension is generally faster than for looping and appending; and
- enumerate allows you to get the index and character from source simultaneously.
With those in mind:
>>> def search2(word, source):
        """Find every index in source at which word starts."""
        firstchar = word[0]
        wordlen = len(word)
        return [index for index, char in enumerate(source)
                if firstchar == char and word == source[index:index+wordlen]]
>>> timeit.timeit("search2(word, source)",
                  setup="from __main__ import search1, search2, search3; import re; word, source = 'foo', 'foobarbaz'*1000",
                  number=10000)
9.99040215947457
I have also added a docstring to explain what the function does.
However, if you really want efficiency (and readability, for that matter), ignoring regular expressions is a bad move:
>>> def search3(word, source):
        """Find every index in source at which word starts."""
        return [m.start() for m in re.finditer(word, source)]
>>> timeit.timeit("search3(word, source)",
                  setup="from __main__ import search1, search2, search3; import re; word, source = 'foo', 'foobarbaz'*1000",
                  number=10000)
2.1652778798459167
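One caveat with search3: re.finditer treats word as a regular-expression pattern, so a word containing metacharacters would need escaping with re.escape first, e.g.:

>>> [m.start() for m in re.finditer(re.escape('1+1'), 'x 1+1 y 1+1')]
[2, 8]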
This is a way to do it without regex; it is not especially efficient, but it is not very different from your version:
>>> string = "test test test test"
>>> [i for i in range(len(string)) if string.startswith('test', i)]
[0, 5, 10, 15]
What's the problem with using regex? The re module is built in. I would use it because it is often the most efficient way to handle more complex cases, such as inputs of unpredictable length, and it also matters for runtime. For example, right now you are also finding overlapping matches - do you want that? If you search for long words and do not want overlapping matches, you can skip a lot of iterations; regex does that automatically:
>>> import re
>>> [m.start() for m in re.finditer('test', 'test test test test')]
[0, 5, 10, 15]
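If you want to stay regex-free and skip past each match yourself, a minimal sketch using str.find (the name find_non_overlapping is only illustrative):

def find_non_overlapping(word, source):
    places = []
    start = source.find(word)
    while start != -1:
        places.append(start)
        # Jump past the whole match, so overlapping occurrences are skipped.
        start = source.find(word, start + len(word))
    return places

print find_non_overlapping('test', 'test test test test')  # [0, 5, 10, 15]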
If you want to find overlapping matches, lookahead will do that:
>>> [m.start() for m in re.finditer('(?=tt)', 'ttt')]
[0, 1]
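For a more word-like case, the same lookahead trick picks up the overlapping occurrence that a plain finditer would miss:

>>> [m.start() for m in re.finditer('ana', 'banana')]
[1]
>>> [m.start() for m in re.finditer('(?=ana)', 'banana')]
[1, 3]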
Also look into this.
I also tried just optimizing your solution and ended up with:
>>> def search(word, source):
...     w_len = len(word)
...     for x in range(0, len(source)):
...         if source[x:x+w_len] == word:
...             yield x
...
>>> [i for i in search('foo', 'foobarbazfoobarbaz')]
[0, 9]
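If you want the list directly, wrapping the generator in list() gives the same result:

>>> list(search('foo', 'foobarbazfoobarbaz'))
[0, 9]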