Creating a "spell check" that checks against a database with a reasonable runtime

c# database algorithm runtime spell-checking

Calculating Levenshtein distance doesn't have to be as costly as you might think. The code in the Norvig article can be thought of as psuedocode to help the reader understand the algorithm. A much more efficient implementation (in my case, approx 300 times faster on a 20,000 term data set) is to walk a trie. The performance difference is mostly attributed to removing the need to allocate millions of strings in order to do dictionary lookups, spending much less time in the GC, and you also get better locality of reference so have fewer CPU cache misses. With this approach I am able to do lookups in around 2ms on my web server. An added bonus is the ability to return all results that start with the provided string easily.

The downside is that creating the trie is slow (can take a second or so), so if the source data changes regularly then you need to decide whether to rebuild the whole thing or apply deltas. At any rate, you want to reuse the structure as much as possible once it's built.

c# database algorithm runtime spell-checking

As Darcara said, a BK-Tree is a good first take. They are very easy to implement. There are several free implementations easily found via Google, but a better introduction to the algorithm can be found here: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees.

Unfortunately, calculating the Levenshtein distance is pretty costly, and you'll be doing it a lot if you're using a BK-Tree with a large dictionary. For better performance, you might consider Levenshtein Automata. A bit harder to implement, but also more efficient, and they can be used to solve your problem. The same awesome blogger has the details: http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata. This paper might also be interesting: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652.

c# database algorithm runtime spell-checking

I guess the Levenshtein distance is more useful here than the Hamming distance.

Let's take an example: We take the word example and restrict ourselves to a Levenshtein distance of 1. Then we can enumerate all possible misspellings that exist:

1 insertion (208)
- aexample
- bexample
- cexample
- ...
- examplex
- exampley
- examplez
1 deletion (7)
- xample
- eample
- exmple
- ...
- exampl
1 substitution (182)
- axample
- bxample
- cxample
- ...
- examplz

You could store each misspelling in the database, and link that to the correct spelling, example. That works and would be quite fast, but creates a huge database.

Notice how most misspellings occur by doing the same operation with a different character:

1 insertion (8)
- ?example
- e?xample
- ex?ample
- exa?mple
- exam?ple
- examp?le
- exampl?e
- example?
1 deletion (7)
- xample
- eample
- exmple
- exaple
- examle
- exampe
- exampl
1 substitution (7)
- ?xample
- e?ample
- ex?mple
- exa?ple
- exam?le
- examp?e
- exampl?

That looks quite manageable. You could generate all these "hints" for each word and store them in the database. When the user enters a word, generate all "hints" from that and query the database.

Example: User enters exaple (notice missing m).

SELECT DISTINCT word           FROM dictionary          WHERE hint = '?exaple'             OR hint = 'e?xaple'             OR hint = 'ex?aple'             OR hint = 'exa?ple'             OR hint = 'exap?le'             OR hint = 'exapl?e'             OR hint = 'exaple?'             OR hint = 'xaple'             OR hint = 'eaple'             OR hint = 'exple'             OR hint = 'exale'             OR hint = 'exape'             OR hint = 'exapl'             OR hint = '?xaple'             OR hint = 'e?aple'             OR hint = 'ex?ple'             OR hint = 'exa?le'             OR hint = 'exap?e'             OR hint = 'exapl?'

exaple with 1 insertion == exa?ple == example with 1 substitution

CodeHunter

Creating a "spell check" that checks against a database with a reasonable runtime

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last