This is my current situation:
- I have a 2.5MB text file with around 250k strings, sorted alphabetically
- Each string is unique
- I don't need to modify the entries in the text file: once the text file is loaded, it is never edited
- The text file is loaded at start and then I just need to search for strings through it
The last point is the problem: I need to search for both complete matches AND partial matches of a string. The algorithm I wrote just involved regular expressions combined with some attempts to make the process faster: for example, I hardcoded into my script the indexes at which each letter of the alphabet starts in the dictionary, and then split the big text file dictionary into 26 smaller dictionaries. That was totally useless; the script is still incredibly slow. Skimming some posts here, I was convinced to try mmap, but it looked useless for finding all the partial matches given a regular expression. Eventually I came to the conclusion that a trie may solve my problem, though I hardly know what it is. Should I go with tries? If so, how should I proceed with the creation of a trie in Python? Is the marisa-trie module good? Thanks to everybody.
EDIT: By "partial match", I mean that I have the prefix of a string. I do not need matches at the end or in the middle, just at the beginning.
- Please go into more detail on what you mean by partial match. – James Thiele, Feb 22, 2013 at 22:52
- Yes, is a partial match a prefix of each string, or does it include any substring of each string? Building a trie will not help if matches need to be substrings. – silverjam, Feb 22, 2013 at 22:55
- A trie won't help you if you need to match in the middle or at the end of the string. – arainchi, Feb 22, 2013 at 22:55
- @user1068051 You could try the whoosh library. You can search for exact strings or do wildcard searches. – Noel Evans, Feb 22, 2013 at 23:03
6 Answers
Easiest and fastest solution:
#!/usr/bin/env python
d = {}
# open your file here, I'm using /etc/hosts as an example...
f = open("/etc/hosts", "r")
for line in f:
    line = line.rstrip()
    l = len(line) + 1
    # store every prefix of the line as a dictionary key
    for i in xrange(1, l):
        d[line[:i]] = True
f.close()

while True:
    w = raw_input('> ')
    if not w:
        break
    if w in d:
        print "match found", w
Here is a slightly more complex, but more memory-efficient one:
#!/usr/bin/env python
d = []

def binary_search(a, x, lo=0, hi=None):
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        midval = a[mid]
        if midval < x:
            lo = mid + 1
        elif midval > x:
            hi = mid
        else:
            return mid
    return -1

f = open("/etc/hosts", "r")
for line in f:
    line = line.rstrip()
    l = len(line) + 1
    # store the hash of every prefix instead of the prefix itself
    for i in xrange(1, l):
        x = hash(line[:i])
        d.append(x)
f.close()
d.sort()

while True:
    w = raw_input('> ')
    if not w:
        break
    # note: a hash() collision could in principle produce a false positive here
    if binary_search(d, hash(w)) != -1:
        print "match found", w
Since the file is already sorted and read in, you can use a binary search on it without needing to resort to any fancy data structures. Python has a binary search function built in: bisect.bisect_left.
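As a minimal sketch of that approach (the filename and helper names here are illustrative, not from the answer): load the sorted lines into a list once, use bisect_left to find where the query would be inserted, and then walk forward while entries still start with the query prefix.

import bisect

# load the sorted word list once at startup (filename is hypothetical)
with open("words.txt") as f:
    words = [line.rstrip() for line in f]

def prefix_matches(words, prefix):
    # bisect_left returns the first index whose entry is >= prefix, in O(log n)
    i = bisect.bisect_left(words, prefix)
    matches = []
    # all entries sharing the prefix sit consecutively from that index
    while i < len(words) and words[i].startswith(prefix):
        matches.append(words[i])
        i += 1
    return matches

def exact_match(words, w):
    i = bisect.bisect_left(words, w)
    return i < len(words) and words[i] == w

Each lookup is O(log n) plus the number of matches, with no extra data structure beyond the list itself.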
Use a trie.
# dictionary is a list of words
def parse_dictionary(dictionary):
    dictionary_trie = {}
    for word in dictionary:
        tmp_trie = dictionary_trie
        for letter in word:
            if letter not in tmp_trie:
                tmp_trie[letter] = {}
            if 'words' not in tmp_trie[letter]:
                tmp_trie[letter]['words'] = []
            # every node remembers all the words that pass through it
            tmp_trie[letter]['words'].append(word)
            tmp_trie = tmp_trie[letter]
    return dictionary_trie

def matches(substring, trie):
    d = trie
    # walk down the trie one letter at a time
    for letter in substring:
        try:
            d = d[letter]
        except KeyError:
            return []
    return d['words']
Usage example:
>>> import pprint
>>> dictionary = ['test', 'testing', 'hello', 'world', 'hai']
>>> trie = parse_dictionary(dictionary)
>>> pprint.pprint(trie)
{'h': {'a': {'i': {'words': ['hai']}, 'words': ['hai']},
       'e': {'l': {'l': {'o': {'words': ['hello']}, 'words': ['hello']},
                   'words': ['hello']},
             'words': ['hello']},
       'words': ['hello', 'hai']},
 't': {'e': {'s': {'t': {'i': {'n': {'g': {'words': ['testing']},
                                     'words': ['testing']},
                               'words': ['testing']},
                         'words': ['test', 'testing']},
                   'words': ['test', 'testing']},
             'words': ['test', 'testing']},
       'words': ['test', 'testing']},
 'w': {'o': {'r': {'l': {'d': {'words': ['world']}, 'words': ['world']},
                   'words': ['world']},
             'words': ['world']},
       'words': ['world']}}
>>> matches('h', trie)
['hello', 'hai']
>>> matches('he', trie)
['hello']
>>> matches('asd', trie)
[]
>>> matches('test', trie)
['test', 'testing']
>>>
You could make a list, with each line as one element of the list, and do a binary search on it.
So to explain arainchi's very nice answer: make a dictionary with an entry for every prefix of every line in your file. Then you can look your search string up directly as a key. Dictionaries are really handy for this kind of searching.
Using a trie still requires you to build one, which means iterating the whole file once, O(n). Taking advantage of the existing sort order instead lets each lookup run in O(log2 n), so the faster solution is a binary search (see below).
That solution still requires you to read the entire file into memory. In an even faster solution, you could pre-process the file and pad all the lines so they're the same length (or build some kind of index structure in the file, to make seeking to the middle of the list feasible). Then seeking to the middle of the file takes you to the middle of the list, and you combine that with the binary search; a sketch of this variant follows the bisect example below. The "even faster" solution would probably only be needed for a really, really big file (hundreds of megabytes or gigabytes).
If the file system supports sparse files, the padding scheme above might not even increase the file's actual blocks used on the disk.
Then, at that point, you're probably approaching a b-tree or b+tree implementation to make the indexing efficient, so you could use a b-tree library.
Something like this:
import bisect

entries = ["a", "b", "c", "cc", "cd", "ce", "d", "e", "f"]

def find_matches(ls, m):
    # find the first entry that is >= the prefix m
    index = bisect.bisect_left(ls, m)
    matches = []
    # collect consecutive entries sharing the prefix,
    # guarding against running off the end of the list
    while index < len(ls) and ls[index].startswith(m):
        matches.append(ls[index])
        index += 1
    return matches

print find_matches(entries, "c")
Output:
['c', 'cc', 'cd', 'ce']
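And for the padded-file idea above, a minimal sketch, assuming a pre-processing pass has already padded every line to the same fixed width with trailing spaces (the width, filename, and helper names are hypothetical): because every record is LINE_WIDTH bytes, seeking to k * LINE_WIDTH lands exactly on line k, so the binary search can run against the file on disk without loading it into memory.

import os

LINE_WIDTH = 64  # hypothetical fixed record width, newline included

def read_entry(f, k):
    # seek straight to record k and strip the space padding
    f.seek(k * LINE_WIDTH)
    return f.read(LINE_WIDTH).rstrip()

def find_first_ge(f, n, prefix):
    # binary search over the n fixed-width records on disk
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if read_entry(f, mid) < prefix:
            lo = mid + 1
        else:
            hi = mid
    return lo

def prefix_matches_on_disk(path, prefix):
    n = os.path.getsize(path) // LINE_WIDTH
    f = open(path, "rb")
    k = find_first_ge(f, n, prefix)
    matches = []
    # walk forward from the first candidate while the prefix still matches
    while k < n:
        entry = read_entry(f, k)
        if not entry.startswith(prefix):
            break
        matches.append(entry)
        k += 1
    f.close()
    return matches

Each query then costs O(log n) seeks instead of holding all 250k strings in memory.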