
Here's my current situation:

  • I have a 2.5MB text file with around 250k strings, sorted alphabetically
  • Each string is unique
  • I don't need to modify the entries in the text file: once the text file is loaded, it is never edited
  • The text file is loaded at startup; after that I just need to search through it for strings

The last point is the problem. I need to search for complete matches AND for partial matches of a string. The algorithm I wrote involved regular expressions combined with some attempts to make the process faster: for example, I hardcoded into my script the indexes where each initial letter of the alphabet starts, and then split the big dictionary file into 26 smaller dictionaries. That was totally useless; the script is still incredibly slow. Skimming some posts here, I was convinced to try mmap, but it looked useless for finding all the partial matches given a regular expression. Eventually I came to the conclusion that a trie may solve my problem, though I hardly know what that is. Should I go with tries? If so, how should I proceed with creating a trie in Python? Is the marisa-trie module good? Thanks to everybody

EDIT: By "partial match", I mean that I have the prefix of a string. I do not need matches at the end or in the middle, just at the beginning.

asked Feb 22, 2013 at 22:48
  • Please go into more detail on what you mean by partial match. Commented Feb 22, 2013 at 22:52
  • Yes, is a partial match a prefix of each string, or does it include any substring of each string? Building a trie will not help if matches need to be substrings. Commented Feb 22, 2013 at 22:55
  • A trie won't help you if you need to match in the middle or at the end of the string. Commented Feb 22, 2013 at 22:55
  • @user1068051 You could try the whoosh library. You can search for exact strings or do wildcard searches. Commented Feb 22, 2013 at 23:03

6 Answers


Easiest and fastest solution:

#!/usr/bin/env python
d = {}
# open your file here, i'm using /etc/hosts as an example...
f = open("/etc/hosts", "r")
for line in f:
    line = line.rstrip()
    l = len(line) + 1
    for i in xrange(1, l):
        d[line[:i]] = True
f.close()
while True:
    w = raw_input('> ')
    if not w:
        break
    if w in d:
        print "match found", w

Here is a slightly more complex but more memory-efficient one (it stores only hashes, so a rare hash collision could report a false match):

#!/usr/bin/env python
d = []

def binary_search(a, x, lo=0, hi=None):
    # classic binary search: returns the index of x in a, or -1
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        midval = a[mid]
        if midval < x:
            lo = mid + 1
        elif midval > x:
            hi = mid
        else:
            return mid
    return -1

f = open("/etc/hosts", "r")
for line in f:
    line = line.rstrip()
    l = len(line) + 1
    for i in xrange(1, l):
        # store only the hash of each prefix to save memory
        x = hash(line[:i])
        d.append(x)
f.close()
d.sort()
while True:
    w = raw_input('> ')
    if not w:
        break
    if binary_search(d, hash(w)) != -1:
        print "match found", w
answered Feb 22, 2013 at 23:15

2 Comments

Thanks a lot! I couldn't use a dictionary because I needed the strings to be sorted alphabetically, which is not possible with a dict. I will try your last solution.
Instead of a dict, you could try the OrderedDict class from the collections module. It may help in the first solution too. :)

Since the file is already sorted and read in, you can use a binary search on it without needing to resort to any fancy data structures. Python has a binary search function built in: bisect.bisect_left.
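For example, a minimal sketch of that approach (the prefix_matches helper and the sample list are illustrative, not from the question):

import bisect

words = ["apple", "apricot", "banana", "bat"]  # stands in for the 250k sorted lines

def prefix_matches(sorted_words, prefix):
    # all strings sharing the prefix form a contiguous run starting
    # at the insertion point returned by bisect_left
    i = bisect.bisect_left(sorted_words, prefix)
    out = []
    while i < len(sorted_words) and sorted_words[i].startswith(prefix):
        out.append(sorted_words[i])
        i += 1
    return out

print prefix_matches(words, "ap")   # ['apple', 'apricot']
print prefix_matches(words, "bat")  # ['bat'] (a complete match is also a prefix match)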

answered Feb 22, 2013 at 23:15



Use a trie.

# dictionary is a list of words
def parse_dictionary(dictionary):
    dictionary_trie = {}
    for word in dictionary:
        tmp_trie = dictionary_trie
        for letter in word:
            if letter not in tmp_trie:
                tmp_trie[letter] = {}
            if 'words' not in tmp_trie[letter]:
                tmp_trie[letter]['words'] = []
            tmp_trie[letter]['words'].append(word)
            tmp_trie = tmp_trie[letter]
    return dictionary_trie

def matches(substring, trie):
    d = trie
    for letter in substring:
        try:
            d = d[letter]
        except KeyError:
            return []
    return d['words']

Usage example:

>>> import pprint
>>> dictionary = ['test', 'testing', 'hello', 'world', 'hai']
>>> trie = parse_dictionary(dictionary)
>>> pprint.pprint(trie)
{'h': {'a': {'i': {'words': ['hai']}, 'words': ['hai']},
       'e': {'l': {'l': {'o': {'words': ['hello']}, 'words': ['hello']},
                   'words': ['hello']},
             'words': ['hello']},
       'words': ['hello', 'hai']},
 't': {'e': {'s': {'t': {'i': {'n': {'g': {'words': ['testing']},
                                     'words': ['testing']},
                               'words': ['testing']},
                         'words': ['test', 'testing']},
                   'words': ['test', 'testing']},
             'words': ['test', 'testing']},
       'words': ['test', 'testing']},
 'w': {'o': {'r': {'l': {'d': {'words': ['world']}, 'words': ['world']},
                   'words': ['world']},
             'words': ['world']},
       'words': ['world']}}
>>> matches('h', trie)
['hello', 'hai']
>>> matches('he', trie)
['hello']
>>> matches('asd', trie)
[]
>>> matches('test', trie)
['test', 'testing']
>>> 
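If you would rather not hand-roll the trie, the marisa-trie package the question mentions offers the same two operations from a compact, read-only trie. A minimal sketch, assuming the package is installed (it expects unicode keys on Python 2):

import marisa_trie

trie = marisa_trie.Trie([u'test', u'testing', u'hello', u'world', u'hai'])

print u'test' in trie   # complete match: True
print trie.keys(u'te')  # prefix matches: [u'test', u'testing'] (order may vary)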
answered Feb 22, 2013 at 23:12



You could make a list, let each line be one element of the list and do a binary search.

answered Feb 22, 2013 at 23:14

1 Comment

@nhahtdh, the question specifically states that the text to be found is a prefix, i.e. at the start of the line.

So to explain arainchi's very nice answer: make a dictionary with an entry for every prefix of every line in your file, then match your search string directly against the dictionary's keys. Dictionaries are really handy for this kind of searching.

answered Feb 22, 2013 at 23:37



Using a trie still requires you to build the trie, which means iterating over the whole file, O(n). Taking advantage of the sorting instead makes each lookup O(log n), so the faster solution here is a binary search (see below).

This solution still requires you to read the entire file into memory. For an even faster solution, you could pre-process the file and pad all the lines to the same length (or build some kind of index structure in the file) so that seeking to the middle of the file takes you to the middle of the list; you would then combine this with the binary search. The "even faster" solution would probably only be needed for a really, really big file (hundreds of megabytes or gigabytes).

Possibly, if the file system supports sparse files, the padding scheme won't even increase the file's actual blocks used on disk.

At that point you're probably approaching a B-tree or B+tree implementation to make the indexing efficient, so you could use a B-tree library.
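As a minimal sketch of the padded-file idea, assuming the file has already been rewritten so that every line occupies exactly RECORD_LEN bytes, newline included (the name and width are illustrative, not from the answer):

import os

RECORD_LEN = 32  # fixed record width after padding, newline included

def seek_to_first_match(f, prefix):
    # binary search over fixed-width records without loading the file;
    # returns the index of the first record that sorts >= prefix
    f.seek(0, os.SEEK_END)
    lo, hi = 0, f.tell() // RECORD_LEN
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * RECORD_LEN)
        if f.read(RECORD_LEN).rstrip() < prefix:
            lo = mid + 1
        else:
            hi = mid
    return lo

From that index you would read records forward while they still start with the prefix, just as in the in-memory version below.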

For the in-memory version, something like this:

import bisect

entries = ["a", "b", "c", "cc", "cd", "ce", "d", "e", "f"]

def find_matches(ls, m):
    # first index at which m could be inserted; all prefix
    # matches follow contiguously from here
    index = bisect.bisect_left(ls, m)
    matches = []
    while index < len(ls) and ls[index].startswith(m):
        matches.append(ls[index])
        index += 1
    return matches

print find_matches(entries, "c")

Output:

['c', 'cc', 'cd', 'ce']
answered Feb 22, 2013 at 23:33

