fuzzy string matching with term weights

Here's a weird idea for you:

Compress your input and diff that.

You could use e.g. Huffman or dictionary coder to compress your input, that automatically takes care of common terms. It may not do so well for typos though, in your example, London is probably a relatively common word, while misspelt Lundon is not at all, and dissimilarity between compressed terms is much higher than between raw terms.

python string information-retrieval

how about splitting each string into a list of words, and running your comparison on each word to get a list which holds the scores of word matches. then you can average the scores, find the lowest/highest indirect match or partials...

gives you the ability to add your own weight.

you would of course need to handle offsets like..

"the london company for leather"

and

"london company for leather"

python string information-retrieval

In my opinion, a general solution will never match your idea of similarity. As soon as you have some implicit knowledge about your data, you have to put that somehow into code. Which imediately disqualifies a fixed existing solution.

Perhaps you should have look at http://nltk.org/ to get an idea of some NLP techniques. You don't tell us enough about your data, but a POS tagger might help to identify more and less relevant terms. Available databases with names of cities, countries, ... might help to clean up the data before processing it further.

There are many tools available, but to get high quality output, you will need a solution which is customized for your data and use case.

CodeHunter

fuzzy string matching with term weights

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last