
I'm trying to write a spell-checker that should work with a pretty large dictionary. I want an efficient way to index the dictionary so that I can use Damerau-Levenshtein distance to determine which words are closest to a misspelled word.
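For reference, the distance itself is a small dynamic program. Here is a minimal sketch of the restricted (optimal string alignment) variant of Damerau-Levenshtein, which counts insertions, deletions, substitutions, and adjacent transpositions:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # adjacent transposition, e.g. "ca" <-> "ac"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

For example, `damerau_levenshtein("ca", "ac")` is 1 (one transposition), where plain Levenshtein would give 2.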

I'm looking for a data structure that would give me the best compromise between space complexity and runtime complexity.

Based on what I found on the internet, I have a few leads regarding what type of data structure to use:

Trie

This was my first thought: it looks pretty easy to implement and should provide fast lookup and insertion. Approximate search using Damerau-Levenshtein should be simple to implement here as well. But it doesn't look very efficient in terms of space complexity, since you most likely have a lot of overhead from pointer storage.
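To make that concrete, here is a minimal sketch (function names are my own) of a pointer-based trie with an approximate search that carries one edit-distance DP row down each branch and prunes subtrees that can no longer match; for brevity it uses plain Levenshtein rather than the Damerau variant:

```python
class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}  # char -> TrieNode: the pointer overhead noted above
        self.is_word = False

def insert(root: TrieNode, word: str) -> None:
    node = root
    for ch in word:
        node = node.setdefault_child(ch) if False else node.children.setdefault(ch, TrieNode())
    node.is_word = True

def fuzzy_search(root: TrieNode, word: str, max_dist: int):
    """All dictionary words within Levenshtein distance max_dist of `word`."""
    results = []
    first_row = list(range(len(word) + 1))  # distance from "" to each prefix
    for ch, child in root.children.items():
        _search(child, ch, word, first_row, "", results, max_dist)
    return results

def _search(node, ch, word, prev_row, prefix, results, max_dist):
    prefix += ch
    row = [prev_row[0] + 1]  # one more DP row for this trie edge
    for j in range(1, len(word) + 1):
        cost = 0 if word[j - 1] == ch else 1
        row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
    if node.is_word and row[-1] <= max_dist:
        results.append((prefix, row[-1]))
    if min(row) <= max_dist:  # prune: no extension of this prefix can match
        for nxt, child in node.children.items():
            _search(child, nxt, word, row, prefix, results, max_dist)
```

The pruning step is what makes the trie attractive here: a whole subtree is skipped as soon as every cell of the current DP row exceeds the distance bound.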

Patricia Trie

This seems to consume less space than a regular trie, since you're basically avoiding the cost of storing the pointers, but I'm a bit worried about data fragmentation with very large dictionaries like the one I have.

Suffix Tree

I'm not sure about this one; some people seem to find it useful for text mining, but I'm not sure what it would give in terms of performance for a spell checker.

Ternary Search Tree

These look pretty nice, and in terms of complexity they should be close to (or better than?) Patricia tries, but as for fragmentation, I'm not sure whether they would be better or worse than Patricia tries.
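For comparison with the trie above, a ternary search tree node stores one character and just three child pointers (less, equal, greater) instead of a pointer per possible character. A minimal sketch (names are illustrative):

```python
class TSTNode:
    __slots__ = ("ch", "lo", "eq", "hi", "is_word")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None  # three pointers per node, total
        self.is_word = False

def tst_insert(node, word, i=0):
    """Insert word[i:] below node; returns the (possibly new) subtree root."""
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = tst_insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = tst_insert(node.eq, word, i + 1)
    else:
        node.is_word = True
    return node

def tst_contains(node, word, i=0):
    while node is not None:
        ch = word[i]
        if ch < node.ch:
            node = node.lo
        elif ch > node.ch:
            node = node.hi
        elif i + 1 < len(word):
            node = node.eq
            i += 1
        else:
            return node.is_word
    return False
```

Lookup descends one `eq` edge per matched character, plus some `lo`/`hi` hops, so it behaves like a trie whose per-node fan-out has been traded for a small binary search.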

Burst Tree

This seems to be a kind of hybrid, and I'm not sure what advantage it would have over tries and the like, but I've read several times that it's very efficient for text mining.


I would like some feedback on which data structure would be best to use in this context and what makes it better than the others. If I'm missing any data structures that would be even more appropriate for a spell-checker, I'm very interested in those as well.

asked May 2, 2012 at 3:07
  • How does a Patricia trie avoid the cost of storing the pointers? Is it just a radix tree (en.wikipedia.org/wiki/Radix_tree)? If that's the case, then I think it still stores lots of pointers, but you get huge space savings because common prefixes are only stored once. Commented May 2, 2012 at 7:26
  • Without specifics, the comparison you request may be impossible. In particular, what density does your dictionary have? Up to which distance do you want to detect spelling mistakes (you seem to say unbounded)? What is the density of the dictionary plus variants? (Density here means the fraction of all words of length $\geq n$ that is contained in the stored set.) Commented May 9, 2012 at 13:56
  • @linker: Have you tried all the variants on your dictionary? Given a fixed use case, that is probably the fastest way to find out which data structure consumes how much space. Commented May 9, 2012 at 15:52
  • It's just a basic dictionary, a known list of correctly spelled words. Commented May 9, 2012 at 17:19
  • See also this most closely related question. Commented May 29, 2012 at 16:26

2 Answers


I've encountered the same problem, but took a different approach. You can construct a kind of "hash" function that gives the same or a nearby value for similar words.

The problem is that a function that gives "good" results for words with insertions/deletions will give "bad" results for transpositions, and vice versa. Example: map letters to numbers, with similar letters mapped to adjacent numbers, and just sum them over every letter in the word. Then create a hash table mapping each key to a set of words, and find the intersection for a given word.
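That idea can be sketched as follows (assuming lowercase ASCII words; the function and parameter names, including `spread`, are my own):

```python
def signature(word: str) -> int:
    """Sum of letter values; assumes lowercase a-z only."""
    return sum(ord(c) - ord('a') for c in word)

def build_index(words):
    """Hash table mapping each signature to the set of words bearing it."""
    index = {}
    for w in words:
        index.setdefault(signature(w), set()).add(w)
    return index

def candidates(index, word, spread=1):
    """Words whose signature lies within +/- spread of the query's signature.
    A transposition leaves the sum unchanged; a substitution between
    similar (adjacent-valued) letters shifts it only slightly."""
    s = signature(word)
    out = set()
    for k in range(s - spread, s + spread + 1):
        out |= index.get(k, set())
    return out
```

This illustrates the trade-off stated above: anagrams (including transposed words) always land in the same bucket, but an insertion or deletion shifts the signature by the full value of the affected letter, so `spread` must grow to catch it.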

Maybe some results can be achieved if we look at the "space" of edits: X for changing a letter, Y for adding/removing one, Z for a transposition, or something like that.

However, these are only abstract ideas; I haven't had enough time to implement them.

answered May 4, 2012 at 15:50

You may want to use a metric tree for fast search, if brute force is not feasible (always try brute force first). Finding your neighbours will be possible in $\mathcal O(\log n),ドル but the constant hidden in the $\mathcal O$ may be big. You seem to have millions of strings, so it may be a good trade-off.
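One common metric tree for edit distances is the BK-tree, which uses the triangle inequality to prune branches during a query. A minimal sketch (assuming plain Levenshtein as the metric; class and method names are my own):

```python
def levenshtein(a: str, b: str) -> int:
    """Row-by-row Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(cur[-1] + 1, prev[j] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    """Burkhard-Keller tree over any metric d(x, y)."""

    def __init__(self, dist):
        self.dist = dist
        self.root = None  # node = (word, {distance_to_child: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.dist(word, node[0])
            if d in node[1]:
                node = node[1][d]       # descend along the edge labelled d
            else:
                node[1][d] = (word, {})  # new child hangs off edge d
                return

    def query(self, word, max_dist):
        results = []
        stack = [self.root] if self.root else []
        while stack:
            w, children = stack.pop()
            d = self.dist(word, w)
            if d <= max_dist:
                results.append((w, d))
            # triangle inequality: only edges in [d - max_dist, d + max_dist]
            # can lead to a word within max_dist of the query
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return results
```

Each query computes the distance to one word per visited node and discards whole subtrees whose edge labels fall outside the allowed band, which is where the sublinear behaviour comes from in practice.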

Do not store the strings in the metric tree. Just store an index, and store the strings in a Patricia tree.

I am not sure which tree you should use; it will depend on your data and your requirements (do you need fast inserts?). Update your question if you find that one tree is more efficient than the others.

You may also look at specialized tools, like Lucene.

frafl
answered Jun 4, 2013 at 12:08
  • "Spellchecker" pretty much means no inserts. Anyway, does "store the strings in a Patricia tree" make sense? Isn't this going to hog the RAM? Commented Apr 8, 2023 at 23:20
