
I'm trying to write a spell-checker that should work with a pretty large dictionary. I want an efficient way to index the dictionary so that I can use Damerau-Levenshtein distance to determine which words are closest to a misspelled word.
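For reference, the distance itself is a small dynamic program. Here is a minimal sketch of the restricted (optimal string alignment) variant of Damerau-Levenshtein, which counts insertions, deletions, substitutions, and adjacent transpositions:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # adjacent transposition, e.g. "ca" <-> "ac"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

For example, `damerau_levenshtein("ca", "ac")` is 1 (one transposition), where plain Levenshtein would give 2.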

I'm looking for a data structure that would give me the best compromise between space complexity and runtime complexity.

Based on what I found on the internet, I have a few leads regarding what type of data structure to use:

Trie

This was my first thought: it looks pretty easy to implement and should provide fast lookup and insertion. Approximate search using Damerau-Levenshtein should be simple to implement here as well. But it doesn't look very efficient in terms of space complexity, since you most likely have a lot of overhead from pointer storage.
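To make that concrete, here is a minimal sketch (function names are my own) of a pointer-based trie with an approximate search that carries one edit-distance DP row down each branch and prunes subtrees that can no longer match; for brevity it uses plain Levenshtein rather than the Damerau variant:

```python
class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}  # char -> TrieNode: the pointer overhead noted above
        self.is_word = False

def insert(root: TrieNode, word: str) -> None:
    node = root
    for ch in word:
        node = node.setdefault_child(ch) if False else node.children.setdefault(ch, TrieNode())
    node.is_word = True

def fuzzy_search(root: TrieNode, word: str, max_dist: int):
    """All dictionary words within Levenshtein distance max_dist of `word`."""
    results = []
    first_row = list(range(len(word) + 1))  # distance from "" to each prefix
    for ch, child in root.children.items():
        _search(child, ch, word, first_row, "", results, max_dist)
    return results

def _search(node, ch, word, prev_row, prefix, results, max_dist):
    prefix += ch
    row = [prev_row[0] + 1]  # one more DP row for this trie edge
    for j in range(1, len(word) + 1):
        cost = 0 if word[j - 1] == ch else 1
        row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
    if node.is_word and row[-1] <= max_dist:
        results.append((prefix, row[-1]))
    if min(row) <= max_dist:  # prune: no extension of this prefix can match
        for nxt, child in node.children.items():
            _search(child, nxt, word, row, prefix, results, max_dist)
```

The pruning step is what makes the trie attractive here: a whole subtree is skipped as soon as every cell of the current DP row exceeds the distance bound.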

Patricia Trie

This seems to consume less space than a regular trie, since you're basically avoiding the cost of storing the pointers, but I'm a bit worried about data fragmentation with very large dictionaries like the one I have.

Suffix Tree

I'm not sure about this one; some people seem to find it useful for text mining, but I'm not sure what it would give in terms of performance for a spell checker.

Ternary Search Tree

These look pretty nice, and in terms of complexity they should be close to (or better than?) Patricia tries, but as for fragmentation, I'm not sure whether they would be better or worse than Patricia tries.
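For comparison with the trie above, a ternary search tree node stores one character and just three child pointers (less, equal, greater) instead of a pointer per possible character. A minimal sketch (names are illustrative):

```python
class TSTNode:
    __slots__ = ("ch", "lo", "eq", "hi", "is_word")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None  # three pointers per node, total
        self.is_word = False

def tst_insert(node, word, i=0):
    """Insert word[i:] below node; returns the (possibly new) subtree root."""
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = tst_insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = tst_insert(node.eq, word, i + 1)
    else:
        node.is_word = True
    return node

def tst_contains(node, word, i=0):
    while node is not None:
        ch = word[i]
        if ch < node.ch:
            node = node.lo
        elif ch > node.ch:
            node = node.hi
        elif i + 1 < len(word):
            node = node.eq
            i += 1
        else:
            return node.is_word
    return False
```

Lookup descends one `eq` edge per matched character, plus some `lo`/`hi` hops, so it behaves like a trie whose per-node fan-out has been traded for a small binary search.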

Burst Tree

This seems to be a kind of hybrid, and I'm not sure what advantage it would have over tries and the like, but I've read several times that it's very efficient for text mining.


I would like some feedback on which data structure would be best to use in this context and what makes it better than the others. If I'm missing any data structures that would be even more appropriate for a spell-checker, I'm very interested in those as well.

asked May 2, 2012 at 3:07
  • How does a Patricia trie avoid the cost of storing the pointers? Is it just a radix tree (en.wikipedia.org/wiki/Radix_tree)? If that's the case, then I think it still stores lots of pointers, but you get huge space savings because common prefixes are only stored once. Commented May 2, 2012 at 7:26
  • Without specifics, the comparison you request may be impossible. In particular, what density does your dictionary have? Up to which distance do you want to detect spelling mistakes (you seem to say unbounded)? What is the density of the dictionary plus variants? (Density here means the fraction of all words of length $\geq n$ that is contained in the stored set.) Commented May 9, 2012 at 13:56
  • @linker: Have you tried all the variants on your dictionary? Given a fixed use case, that is probably the fastest way to find out which data structure consumes how much space. Commented May 9, 2012 at 15:52
  • It's just a basic dictionary, a known list of correctly spelled words. Commented May 9, 2012 at 17:19
  • See also this most closely related question. Commented May 29, 2012 at 16:26

2 Answers


I've encountered the same problem, but took a different approach. You can construct a kind of "hash" function that gives the same or a nearby value for similar words.

The problem is that a function that gives "good" results for words with insertions/deletions will give "bad" results for transpositions, and vice versa. Example: map letters to numbers, with similar letters mapped to adjacent numbers, and just sum them over every letter in the word. Then create a hash table mapping each key to a set of words, and find the intersection for a given word.
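That idea can be sketched as follows (assuming lowercase ASCII words; the function and parameter names, including `spread`, are my own):

```python
def signature(word: str) -> int:
    """Sum of letter values; assumes lowercase a-z only."""
    return sum(ord(c) - ord('a') for c in word)

def build_index(words):
    """Hash table mapping each signature to the set of words bearing it."""
    index = {}
    for w in words:
        index.setdefault(signature(w), set()).add(w)
    return index

def candidates(index, word, spread=1):
    """Words whose signature lies within +/- spread of the query's signature.
    A transposition leaves the sum unchanged; a substitution between
    similar (adjacent-valued) letters shifts it only slightly."""
    s = signature(word)
    out = set()
    for k in range(s - spread, s + spread + 1):
        out |= index.get(k, set())
    return out
```

This illustrates the trade-off stated above: anagrams (including transposed words) always land in the same bucket, but an insertion or deletion shifts the signature by the full value of the affected letter, so `spread` must grow to catch it.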

Maybe some results can be achieved if we look at the "space" of edits: X for changing a letter, Y for adding/removing one, Z for a transposition, or something like that.

However, these are only abstract ideas; I haven't had enough time to implement them.

answered May 4, 2012 at 15:50

You may want to use a metric tree for fast search, if brute force is not feasible (always try brute force first). Finding your neighbours will be possible in $\mathcal O(\log n),ドル but the constant hidden in the $\mathcal O$ may be big. You seem to have millions of strings, so it may be a good trade-off.
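One common metric tree for edit distances is the BK-tree, which uses the triangle inequality to prune branches during a query. A minimal sketch (assuming plain Levenshtein as the metric; class and method names are my own):

```python
def levenshtein(a: str, b: str) -> int:
    """Row-by-row Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(cur[-1] + 1, prev[j] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    """Burkhard-Keller tree over any metric d(x, y)."""

    def __init__(self, dist):
        self.dist = dist
        self.root = None  # node = (word, {distance_to_child: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.dist(word, node[0])
            if d in node[1]:
                node = node[1][d]       # descend along the edge labelled d
            else:
                node[1][d] = (word, {})  # new child hangs off edge d
                return

    def query(self, word, max_dist):
        results = []
        stack = [self.root] if self.root else []
        while stack:
            w, children = stack.pop()
            d = self.dist(word, w)
            if d <= max_dist:
                results.append((w, d))
            # triangle inequality: only edges in [d - max_dist, d + max_dist]
            # can lead to a word within max_dist of the query
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return results
```

Each query computes the distance to one word per visited node and discards whole subtrees whose edge labels fall outside the allowed band, which is where the sublinear behaviour comes from in practice.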

Do not store the strings in the metric tree. Just store an index, and store the strings in a Patricia tree.

I am not sure which tree you should use; it will depend on your data and your requirements (do you need fast inserts?). Update your question if you find that one tree is more efficient than the others.

You may also look at specialized tools, like Lucene.

frafl
answered Jun 4, 2013 at 12:08
  • "Spellchecker" pretty much means no inserts. Anyway, does "store the strings in a Patricia tree" make sense? Isn't this going to hog the RAM? Commented Apr 8, 2023 at 23:20
