Levenshtein: MySQL + PHP Levenshtein: MySQL + PHP mysql mysql

Levenshtein: MySQL + PHP


You need a levenshtein function in MySQL and query like

$word = mysql_real_escape_string($word);mysql_qery("SELECT `term` FROM `words` WHERE levenshtein('$word', `term`) BETWEEN 0 AND 4");


There are two ways to implement a Levenshtein function in MySQL. The first is to create a STORED FUNCTION which operates much like a STORED TRANSACTION, except it has distinct inputs and an output. This is fine for small datasets, but a little slow on anything approaching several thousand rows.

CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )RETURNS INTDETERMINISTICBEGINDECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;DECLARE s1_char CHAR;-- max strlen=255DECLARE cv0, cv1 VARBINARY(256);SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;IF s1 = s2 THEN  RETURN 0;ELSEIF s1_len = 0 THEN  RETURN s2_len;ELSEIF s2_len = 0 THEN  RETURN s1_len;ELSE  WHILE j <= s2_len DO    SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;  END WHILE;  WHILE i <= s1_len DO    SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;    WHILE j <= s2_len DO    SET c = c + 1;    IF s1_char = SUBSTRING(s2, j, 1) THEN      SET cost = 0; ELSE SET cost = 1;    END IF;    SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;    IF c > c_temp THEN SET c = c_temp; END IF;      SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;      IF c > c_temp THEN        SET c = c_temp;      END IF;      SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;    END WHILE;    SET cv1 = cv0, i = i + 1;  END WHILE;END IF;RETURN c;END//

Store the above code in a .sql file and import it into your database like so:

source /tmp/mysql_udf.sql

The second method is to implement a User Defined Function in C/C++ and link it into MySQL as a shared library (*.so file). This method also uses a STORED FUNCTION to call the library, which means the actual query for this or the first method may be identical (providing the inputs to both functions are the same). You can find out more about this method here: http://samjlevy.com/mysql-levenshtein-and-damerau-levenshtein-udfs/

With either of these methods, your query would be something like:

SELECT term FROM words WHERE levenshtein(term, 'term') < 5;

Also, remember that the 'threshold' value should change in relation to the original word length. It's better to think of it in terms of a percentage value, i.e. half your word = 50%, half of 'term' = 2.


If you have a huge database, you can filter the words first using SOUNDEX:

$word = strtolower(mysql_real_escape_string($_GET['term']));$rs = mysql_query("SELECT LOWER(`term`) FROM `words` WHERE SOUNDEX(term) = SOUNDEX(" . $word . ")");while ($row = mysql_fetch_assoc($rs)) {     $lev = levenshtein($word, $row['term']);    ....}

If you have time enough to play with a C extension or procedure, you may achieve better performance, but filtering the records on mysql before applying real levenshtein will make things faster with almost no effort.