Cosine similarity vs Hamming distance [closed]

A Hamming distance should be done between two strings of equal length and with the order taken into account.

As your documents are certainly of different length and if the words places do not count, cosine similarity is better (please note that depending your needs, better solutions exist). :)

Here is a cosine similarity function of 2 arrays of words:

function cosineSimilarity($tokensA, $tokensB){    $a = $b = $c = 0;    $uniqueTokensA = $uniqueTokensB = array();    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;    foreach ($uniqueMergedTokens as $token) {        $x = isset($uniqueTokensA[$token]) ? 1 : 0;        $y = isset($uniqueTokensB[$token]) ? 1 : 0;        $a += $x * $y;        $b += $x;        $c += $y;    }    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;}

It is fast (isset() instead of in_array() is a killer on large arrays).

As you can see, the results does not take into account the "magnitude" of each the word.

I use it to detect multi-posted messages of "almost" copy-pasted texts. It works well. :)

The best link about string similarity metrics:http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

For further interesting readings:

http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html http://bioinformatics.oxfordjournals.org/cgi/content/full/22/18/2298

php relationship similarity

Unless I'm mistaken, I think you've got an algorithm halfway between the two algorithms. For Hamming distance, use:

function check ($terms1, $terms2) {    $counts1 = array_count_values($terms1);    $totalScore = 0;    foreach ($terms2 as $term) {        if (isset($counts1[$term])) $totalScore += 1;    }    return $totalScore * 500 / (count($terms1) * count($terms2));}

(Note that you're only adding 1 for each matched element in the token vectors.)

And for cosine similarity, use:

function check ($terms1, $terms2) {    $counts1 = array_count_values($terms1);    $counts2 = array_count_values($terms2);    $totalScore = 0;    foreach ($terms2 as $term) {        if (isset($counts1[$term])) $totalScore += $counts1[$term] * $counts2[$term];    }    return $totalScore / (count($terms1) * count($terms2));}

(Note that you're adding the product of the token counts between the two documents.)

The main difference between the two is that cosine similarity will yield a stronger indicator when two documents have the same word multiple times in the documents, while Hamming distance doesn't care how often the individual tokens come up.

Edit: just noticed your query about removing function words etc. I do advise this if you're going to use cosine similarity - as function words are quite frequent (in English, at least), you might skew a result by not filtering them out. If you use Hamming distance, the effect will not be quite as great, but it could still be appreciable in some cases. Also, if you have access to a lemmatizer, it will reduce the misses when one document contains "galaxies" and the other contains "galaxy", for instance.

Whichever way you go, good luck!

php relationship similarity

I apologize for ignoring the fact that you said that you didn't want to use any other algorithms, but seriously, Levenshtein distance and Damerau-Levenshtein distance are way more freakin' useful than Hamming distance. Here's a D-L distance implementation in PHP, and if you don't like PHP's native levenshtein() function, which I think you won't because it has a length limit, here's a non-length-limited version:

function levenshtein_distance($text1, $text2) {    $len1 = strlen($text1);    $len2 = strlen($text2);    for($i = 0; $i <= $len1; $i++)        $distance[$i][0] = $i;    for($j = 0; $j <= $len2; $j++)        $distance[0][$j] = $j;    for($i = 1; $i <= $len1; $i++)        for($j = 1; $j <= $len2; $j++)            $distance[$i][$j] = min($distance[$i - 1][$j] + 1, $distance[$i][$j - 1] + 1, $distance[$i - 1][$j - 1] + ($text1[$i - 1] != $text2[$j - 1]));    return $distance[$len1][$len2];}

CodeHunter

Cosine similarity vs Hamming distance [closed]

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last