Javascript fuzzy search that makes sense
I tried using existing fuzzy libraries like fuse.js and also found them to be terrible, so I wrote one which behaves basically like sublime's search. https://github.com/farzher/fuzzysort
The only typo it allows is a transpose. It's pretty solid (1k stars, 0 issues), very fast, and handles your case easily:
fuzzysort.go('int', ['international', 'splint', 'tinder'])// [{highlighted: '*int*ernational', score: 10}, {highlighted: 'spl*int*', socre: 3003}]
Good question! But my thought is that, rather than trying to modify Levenshtein-Demerau, you might be better to try a different algorithm or combine/ weight the results from two algorithms.
It strikes me that exact or close matches to the "starting prefix" are something Levenshtein-Demerau gives no particular weight to -- but your apparent user expectations would.
I searched for "better than Levenshtein" and, among other things, found this:
http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
This mentions a number of "string distance" measures. Three which looked particularly relevant to your requirement, would be:
Longest Common Substring distance: Minimum number of symbols that have to be removed in both strings until resulting substrings are identical.
q-gram distance: Sum of absolute differences between N-gram vectors of both strings.
Jaccard distance: 1 minues the quotient of shared N-grams and all observed N-grams.
Maybe you could use a weighted combination (or minimum) of these metrics, with Levenshtein -- common substring, common N-gram or Jaccard will all strongly prefer similar strings -- or perhaps try just using Jaccard?
Depending on the size of your list/ database, these algorithms can be moderately expensive. For a fuzzy search I implemented, I used a configurable number of N-grams as "retrieval keys" from the DB then ran the expensive string-distance measure to sort them in preference order.
I wrote some notes on Fuzzy String Search in SQL. See:
Here is a technique I have used a few times...It gives pretty good results. Does not do everything you asked for though. Also, this can be expensive if the list is massive.
get_bigrams = (string) -> s = string.toLowerCase() v = new Array(s.length - 1) for i in [0..v.length] by 1 v[i] = s.slice(i, i + 2) return vstring_similarity = (str1, str2) -> if str1.length > 0 and str2.length > 0 pairs1 = get_bigrams(str1) pairs2 = get_bigrams(str2) union = pairs1.length + pairs2.length hit_count = 0 for x in pairs1 for y in pairs2 if x is y hit_count++ if hit_count > 0 return ((2.0 * hit_count) / union) return 0.0
Pass two strings to string_similarity
which will return a number between 0
and 1.0
depending on how similar they are. This example uses Lo-Dash
Usage Example....
query = 'jenny Jackson'names = ['John Jackson', 'Jack Johnson', 'Jerry Smith', 'Jenny Smith']results = []for name in names relevance = string_similarity(query, name) obj = {name: name, relevance: relevance} results.push(obj)results = _.first(_.sortBy(results, 'relevance').reverse(), 10)console.log results
Also....have a fiddle
Make sure your console is open or you wont see anything :)