This is my current situation:
- I have a 2.5MB text file with around 250k strings, sorted alphabetically
- Each string is unique
- I don't need to modify the entries in the text file: once the text file is loaded, it is never edited
- The text file is loaded at start and then I just need to search for strings through it
The last point is the problem: I need to search for both complete matches AND partial matches of a string. The algorithm I wrote just involved regular expressions combined with some attempts to make the process faster: for example, I hardcoded into my script the indexes at which each letter of the alphabet starts in the dictionary, and then split the big text file dictionary into 26 smaller dictionaries. That was totally useless; the script is still incredibly slow. Skimming some posts here, I was convinced to try mmap, but it looked useless for finding all the partial matches given a regular expression. Eventually I came to the conclusion that a trie may solve my problem, though I hardly know what it is. Should I go with tries? If so, how should I proceed with the creation of a trie in Python? Is the marisa-trie module good? Thanks to everybody.
EDIT: By "partial match", I mean that I have the prefix of a string. I do not need matches at the end or in the middle, just at the beginning.
- Please go into more detail on what you mean by partial match. – James Thiele, Feb 22, 2013 at 22:52
- Yes, is a partial match a prefix of each string, or does it include any substring of each string? Building a trie will not help if matches need to be substrings. – silverjam, Feb 22, 2013 at 22:55
- A trie won't help you if you need to match in the middle or at the end of the string. – arainchi, Feb 22, 2013 at 22:55
- @user1068051 You could try the whoosh library. You can search for exact strings or do wildcard searches. – Noel Evans, Feb 22, 2013 at 23:03
6 Answers
Easiest and fastest solution:
#!/usr/bin/env python
d = {}
# open your file here, I'm using /etc/hosts as an example...
f = open("/etc/hosts", "r")
for line in f:
    line = line.rstrip()
    l = len(line) + 1
    # store every prefix of the line as a dictionary key
    for i in xrange(1, l):
        d[line[:i]] = True
f.close()

while True:
    w = raw_input('> ')
    if not w:
        break
    if w in d:
        print "match found", w
Here is a slightly more complex, but more memory-efficient one:
#!/usr/bin/env python
d = []

def binary_search(a, x, lo=0, hi=None):
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        midval = a[mid]
        if midval < x:
            lo = mid + 1
        elif midval > x:
            hi = mid
        else:
            return mid
    return -1

f = open("/etc/hosts", "r")
for line in f:
    line = line.rstrip()
    l = len(line) + 1
    # store the hash of every prefix instead of the prefix itself
    for i in xrange(1, l):
        x = hash(line[:i])
        d.append(x)
f.close()
d.sort()

while True:
    w = raw_input('> ')
    if not w:
        break
    # note: a hash() collision could in principle produce a false positive here
    if binary_search(d, hash(w)) != -1:
        print "match found", w
Since the file is already sorted and read in, you can use a binary search on it without needing to resort to any fancy data structures. Python has a binary search function built in: bisect.bisect_left.
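As a minimal sketch of that approach (the filename and helper names here are illustrative, not from the answer): load the sorted lines into a list once, use bisect_left to find where the query would be inserted, and then walk forward while entries still start with the query prefix.

import bisect

# load the sorted word list once at startup (filename is hypothetical)
with open("words.txt") as f:
    words = [line.rstrip() for line in f]

def prefix_matches(words, prefix):
    # bisect_left returns the first index whose entry is >= prefix, in O(log n)
    i = bisect.bisect_left(words, prefix)
    matches = []
    # all entries sharing the prefix sit consecutively from that index
    while i < len(words) and words[i].startswith(prefix):
        matches.append(words[i])
        i += 1
    return matches

def exact_match(words, w):
    i = bisect.bisect_left(words, w)
    return i < len(words) and words[i] == w

Each lookup is O(log n) plus the number of matches, with no extra data structure beyond the list itself.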
Use a trie.
# dictionary is a list of words
def parse_dictionary(dictionary):
    dictionary_trie = {}
    for word in dictionary:
        tmp_trie = dictionary_trie
        for letter in word:
            if letter not in tmp_trie:
                tmp_trie[letter] = {}
            if 'words' not in tmp_trie[letter]:
                tmp_trie[letter]['words'] = []
            # every node remembers all the words that pass through it
            tmp_trie[letter]['words'].append(word)
            tmp_trie = tmp_trie[letter]
    return dictionary_trie

def matches(substring, trie):
    d = trie
    # walk down the trie one letter at a time
    for letter in substring:
        try:
            d = d[letter]
        except KeyError:
            return []
    return d['words']
Usage example:
>>> import pprint
>>> dictionary = ['test', 'testing', 'hello', 'world', 'hai']
>>> trie = parse_dictionary(dictionary)
>>> pprint.pprint(trie)
{'h': {'a': {'i': {'words': ['hai']}, 'words': ['hai']},
       'e': {'l': {'l': {'o': {'words': ['hello']}, 'words': ['hello']},
                   'words': ['hello']},
             'words': ['hello']},
       'words': ['hello', 'hai']},
 't': {'e': {'s': {'t': {'i': {'n': {'g': {'words': ['testing']},
                                     'words': ['testing']},
                               'words': ['testing']},
                         'words': ['test', 'testing']},
                   'words': ['test', 'testing']},
             'words': ['test', 'testing']},
       'words': ['test', 'testing']},
 'w': {'o': {'r': {'l': {'d': {'words': ['world']}, 'words': ['world']},
                   'words': ['world']},
             'words': ['world']},
       'words': ['world']}}
>>> matches('h', trie)
['hello', 'hai']
>>> matches('he', trie)
['hello']
>>> matches('asd', trie)
[]
>>> matches('test', trie)
['test', 'testing']
>>>
You could make a list, with each line as one element of the list, and do a binary search on it.
So to explain arainchi's very nice answer: make a dictionary with an entry for every prefix of every line in your file. Then you can look your search string up directly as a key. Dictionaries are really handy for this kind of searching.
Using a trie still requires you to build one, which means iterating the whole file once, O(n). Taking advantage of the existing sort order instead lets each lookup run in O(log2 n), so the faster solution is a binary search (see below).
That solution still requires you to read the entire file into memory. In an even faster solution, you could pre-process the file and pad all the lines so they're the same length (or build some kind of index structure in the file, to make seeking to the middle of the list feasible). Then seeking to the middle of the file takes you to the middle of the list, and you combine that with the binary search; a sketch of this variant follows the bisect example below. The "even faster" solution would probably only be needed for a really, really big file (hundreds of megabytes or gigabytes).
If the file system supports sparse files, the padding scheme above might not even increase the file's actual blocks used on the disk.
Then, at that point, you're probably approaching a b-tree or b+tree implementation to make the indexing efficient, so you could use a b-tree library.
Something like this:
import bisect

entries = ["a", "b", "c", "cc", "cd", "ce", "d", "e", "f"]

def find_matches(ls, m):
    # find the first entry that is >= the prefix m
    index = bisect.bisect_left(ls, m)
    matches = []
    # collect consecutive entries sharing the prefix,
    # guarding against running off the end of the list
    while index < len(ls) and ls[index].startswith(m):
        matches.append(ls[index])
        index += 1
    return matches

print find_matches(entries, "c")
Output:
['c', 'cc', 'cd', 'ce']
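And for the padded-file idea above, a minimal sketch, assuming a pre-processing pass has already padded every line to the same fixed width with trailing spaces (the width, filename, and helper names are hypothetical): because every record is LINE_WIDTH bytes, seeking to k * LINE_WIDTH lands exactly on line k, so the binary search can run against the file on disk without loading it into memory.

import os

LINE_WIDTH = 64  # hypothetical fixed record width, newline included

def read_entry(f, k):
    # seek straight to record k and strip the space padding
    f.seek(k * LINE_WIDTH)
    return f.read(LINE_WIDTH).rstrip()

def find_first_ge(f, n, prefix):
    # binary search over the n fixed-width records on disk
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if read_entry(f, mid) < prefix:
            lo = mid + 1
        else:
            hi = mid
    return lo

def prefix_matches_on_disk(path, prefix):
    n = os.path.getsize(path) // LINE_WIDTH
    f = open(path, "rb")
    k = find_first_ge(f, n, prefix)
    matches = []
    # walk forward from the first candidate while the prefix still matches
    while k < n:
        entry = read_entry(f, k)
        if not entry.startswith(prefix):
            break
        matches.append(entry)
        k += 1
    f.close()
    return matches

Each query then costs O(log n) seeks instead of holding all 250k strings in memory.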