
fuzzy string matching with term weights


Here's a weird idea for you:

Compress your input and diff that.

You could use, e.g., a Huffman or dictionary coder to compress your input, which automatically takes care of common terms. It may not do so well with typos, though: in your example, London is probably a relatively common word while the misspelt Lundon is not at all, so the dissimilarity between the compressed terms is much higher than between the raw terms.
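A rough sketch of how one might try this with the standard library. It does not diff the compressed bytes directly; instead it swaps in a related technique, the normalized compression distance, which compares compressed sizes. The `COMMON_TERMS` preset dictionary and the function names are assumptions made up for illustration. Note that zlib's fixed overhead makes the numbers crude on strings this short.

```python
import zlib

# Hypothetical preset dictionary of terms considered common in the data.
COMMON_TERMS = b"the company for limited of and"


def compressed_size(text, zdict=None):
    """Return the zlib-compressed size of text, optionally with a preset dictionary."""
    if zdict:
        compressor = zlib.compressobj(level=9, zdict=zdict)
    else:
        compressor = zlib.compressobj(level=9)
    data = compressor.compress(text.encode("utf-8")) + compressor.flush()
    return len(data)


def ncd(a, b, zdict=None):
    """Normalized compression distance: smaller means more similar."""
    size_a = compressed_size(a, zdict)
    size_b = compressed_size(b, zdict)
    size_ab = compressed_size(a + " " + b, zdict)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


if __name__ == "__main__":
    full = "the london company for leather"
    print(ncd(full, "london company for leather", COMMON_TERMS))
    print(ncd(full, "acme widgets of zurich", COMMON_TERMS))
```

Words that appear in the preset dictionary cost only a few bytes to encode, so they contribute little to the distance. A typo such as Lundon is not in the dictionary and is encoded in full, which is the weakness mentioned above.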


How about splitting each string into a list of words and running your comparison on each pair of words, to get a list that holds the scores of the word matches? Then you can average the scores, find the lowest/highest indirect match or partials...

This gives you the ability to add your own weights.

You would of course need to handle offsets, such as:

"the london company for leather"

and

"london company for leather"
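A minimal sketch of that idea, assuming `difflib.SequenceMatcher` as the per-word comparison and a made-up `TERM_WEIGHTS` table (both are illustrative choices, not something the question specifies). Each word is scored against its best match on the other side, so a leading extra word such as "the" does not shift the alignment.

```python
from difflib import SequenceMatcher

# Hypothetical weights: common, uninformative terms count for less.
TERM_WEIGHTS = {"the": 0.1, "company": 0.3, "for": 0.1}
DEFAULT_WEIGHT = 1.0


def word_score(a, b):
    """Similarity of two words in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def directed_score(words_a, words_b, weights):
    """Weighted average of each word's best match on the other side."""
    total = 0.0
    weight_sum = 0.0
    for word in words_a:
        best = max(word_score(word, other) for other in words_b)
        weight = weights.get(word, DEFAULT_WEIGHT)
        total += weight * best
        weight_sum += weight
    return total / weight_sum


def weighted_similarity(left, right, weights=TERM_WEIGHTS):
    """Symmetric weighted word-level similarity in [0, 1]."""
    left_words = left.lower().split()
    right_words = right.lower().split()
    if not left_words or not right_words:
        return 0.0
    forward = directed_score(left_words, right_words, weights)
    backward = directed_score(right_words, left_words, weights)
    return (forward + backward) / 2


if __name__ == "__main__":
    print(weighted_similarity("the london company for leather",
                              "london company for leather"))
    print(weighted_similarity("lundon company", "london company"))
```

Taking the best match per word ignores word order entirely; if order matters for your data, you would need an alignment step instead.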


In my opinion, a general solution will never match your idea of similarity. As soon as you have some implicit knowledge about your data, you have to put that into code somehow, which immediately disqualifies a fixed existing solution.

Perhaps you should have a look at http://nltk.org/ to get an idea of some NLP techniques. You don't tell us enough about your data, but a POS tagger might help to identify more and less relevant terms. Available databases with names of cities, countries, ... might help to clean up the data before processing it further.
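As an illustration of the cleanup step, here is a small sketch that snaps near-misses onto entries from a list of known place names before any further matching. The `KNOWN_PLACES` list is a hypothetical stand-in for a real database, and the cutoff of 0.8 is an arbitrary assumption you would need to tune.

```python
from difflib import get_close_matches

# Hypothetical stand-in for a real database of city names.
KNOWN_PLACES = ["london", "paris", "berlin", "madrid"]


def normalize_places(text, places=KNOWN_PLACES, cutoff=0.8):
    """Replace words that closely resemble a known place with that place."""
    cleaned = []
    for word in text.lower().split():
        match = get_close_matches(word, places, n=1, cutoff=cutoff)
        cleaned.append(match[0] if match else word)
    return " ".join(cleaned)


if __name__ == "__main__":
    print(normalize_places("The Lundon Company for Leather"))
```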

There are many tools available, but to get high quality output, you will need a solution which is customized for your data and use case.