
Efficiently build a graph of words with given Hamming distance


Assuming you store your dictionary in a set(), lookup is O(1) on average (O(n) in the worst case).
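For example, a minimal sketch of building such a set (the path is just an illustration, using the system word list mentioned in another answer below):

# Sketch: read a word list into a set so that membership tests
# are O(1) on average; the file path is only an example.
with open('/usr/share/dict/words') as f:
    words = set(f.read().split())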

You can generate all the valid words at Hamming distance 1 from a word:

>>> import string
>>> words = {'bot', 'lol', 'lot'}
>>> def neighbours(word):
...     for j in range(len(word)):
...         for d in string.ascii_lowercase:
...             word1 = ''.join(d if i == j else c for i, c in enumerate(word))
...             if word1 != word and word1 in words:
...                 yield word1
...
>>> {word: list(neighbours(word)) for word in words}
{'bot': ['lot'], 'lol': ['lot'], 'lot': ['bot', 'lol']}

If M is the length of a word, L the size of the alphabet (i.e. 26 for lowercase letters), and N the number of words in the dictionary, the worst-case time complexity of finding all neighbouring words with this approach is O(L*M*N).

The time complexity of the "easy way" approach (comparing every pair of words) is O(N^2).

When is this approach better? When L*M < N, i.e. if considering only lowercase letters, when M < N/26. (I considered only the worst case here.)

Note: the average length of an English word is 5.1 letters, and 26 * 5.1 ≈ 132. Thus, you should consider this approach if your dictionary contains more than about 132 words.

It is probably possible to achieve better performance than this, but this approach was really simple to implement.

Experimental benchmark:

The "easy way" algorithm (A1):

from itertools import zip_longest

def hammingdist(w1, w2):
    # Count the positions where the two words differ; zip_longest also
    # counts the extra characters when the words have different lengths.
    return sum(c1 != c2 for c1, c2 in zip_longest(w1, w2))

def graph1(words):
    # Compare every word against every other word: O(N^2) comparisons.
    return {word: [n for n in words if hammingdist(word, n) == 1] for word in words}

This algorithm (A2):

def graph2(words):
    return {word: list(neighbours(word)) for word in words}

Benchmarking code:

import random
import string
from timeit import Timer

for dict_size in range(100, 6000, 100):
    words = set(''.join(random.choice(string.ascii_lowercase) for x in range(3))
                for _ in range(dict_size))
    t1 = Timer(lambda: graph1(words)).timeit(10)
    t2 = Timer(lambda: graph2(words)).timeit(10)
    print('%d,%f,%f' % (dict_size, t1, t2))

Output:

100,0.119276,0.136940
200,0.459325,0.233766
300,0.958735,0.325848
400,1.706914,0.446965
500,2.744136,0.545569
600,3.748029,0.682245
700,5.443656,0.773449
800,6.773326,0.874296
900,8.535195,0.996929
1000,10.445875,1.126241
1100,12.510936,1.179570
...

[data plot: t1 (A1) and t2 (A2) vs dictionary size N]

I ran another benchmark with smaller steps of N to take a closer look:

10,0.002243,0.026343
20,0.010982,0.070572
30,0.023949,0.073169
40,0.035697,0.090908
50,0.057658,0.114725
60,0.079863,0.135462
70,0.107428,0.159410
80,0.142211,0.176512
90,0.182526,0.210243
100,0.217721,0.218544
110,0.268710,0.256711
120,0.334201,0.268040
130,0.383052,0.291999
140,0.427078,0.312975
150,0.501833,0.338531
160,0.637434,0.355136
170,0.635296,0.369626
180,0.698631,0.400146
190,0.904568,0.444710
200,1.024610,0.486549
210,1.008412,0.459280
220,1.056356,0.501408
...

[data plot 2: t1 (A1) and t2 (A2) vs dictionary size N, smaller steps]

You can see that the crossover point is very low (around N = 100 for dictionaries of words with length 3). For small dictionaries the O(N^2) algorithm performs slightly better, but it is easily beaten by the O(L*M*N) algorithm as N grows.

For dictionaries with longer words, the O(L*M*N) algorithm remains linear in N, it just has a different slope, so the crossover point moves slightly to the right (around 130 for length 5).


There's no need to take a dependency on the alphabet size. Given a word bot, for example, insert it into a dictionary of word lists under the keys ?ot, b?t, bo?. Then, for each word list, connect all pairs.

import collections

d = collections.defaultdict(list)
with open('/usr/share/dict/words') as f:
    for line in f:
        for word in line.split():
            if len(word) == 6:
                # Bucket the word under each of its wildcard patterns,
                # e.g. 'botany' goes under ' otany', 'b tany', ...
                for i in range(len(word)):
                    d[word[:i] + ' ' + word[i + 1:]].append(word)

# Any two distinct words in the same bucket differ in exactly one position.
pairs = [(word1, word2)
         for s in d.values()
         for word1 in s
         for word2 in s
         if word1 < word2]
print(len(pairs))
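If you want the adjacency lists rather than just the pair count, the same buckets can feed a graph directly; a small sketch (the names graph and bucket are mine):

# Sketch: build an adjacency-list graph from the buckets in d,
# instead of only counting the pairs.
graph = collections.defaultdict(set)
for bucket in d.values():
    for word1 in bucket:
        for word2 in bucket:
            if word1 != word2:
                graph[word1].add(word2)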


A Ternary Search Trie supports near-neighbour searching pretty well.

If your dictionary is stored as a TST then, I believe, the average complexity of the lookups while building your graph would be close to O(N*log(N)) on real-world word dictionaries.
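For illustration, here is a minimal toy sketch of a TST with a near-neighbour search in the style of Sedgewick's classic algorithm; this is my own code, not the article's, and the names Node, insert, and near are made up:

class Node:
    def __init__(self, ch):
        self.ch = ch
        self.left = self.mid = self.right = None
        self.is_word = False

def insert(node, word, i=0):
    # Standard recursive TST insertion.
    if node is None:
        node = Node(word[i])
    if word[i] < node.ch:
        node.left = insert(node.left, word, i)
    elif word[i] > node.ch:
        node.right = insert(node.right, word, i)
    elif i < len(word) - 1:
        node.mid = insert(node.mid, word, i + 1)
    else:
        node.is_word = True
    return node

def near(node, word, i, d, prefix, out):
    # Collect words of the same length within Hamming distance d of 'word'.
    # The budget d is spent when following a mid link on a mismatch; side
    # branches are pruned as soon as the budget cannot cover the detour.
    if node is None or d < 0:
        return
    if d > 0 or word[i] < node.ch:
        near(node.left, word, i, d, prefix, out)
    d2 = d if word[i] == node.ch else d - 1
    if d2 >= 0:
        if i == len(word) - 1:
            if node.is_word:
                out.append(prefix + node.ch)
        else:
            near(node.mid, word, i + 1, d2, prefix + node.ch, out)
    if d > 0 or word[i] > node.ch:
        near(node.right, word, i, d, prefix, out)

root = None
for w in ('bot', 'lol', 'lot'):
    root = insert(root, w)
out = []
near(root, 'lot', 0, 1, '', out)
print(out)  # ['bot', 'lol', 'lot'] -- drop the word itself when building the graph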

Also check the article Efficient auto-complete with a ternary search tree.