Pythonic way to merge two overlapping lists, preserving order Pythonic way to merge two overlapping lists, preserving order python-3.x python-3.x

Pythonic way to merge two overlapping lists, preserving order


You can try the following:

>>> a = [1, 3, 9, 8, 3, 4, 5]>>> b = [3, 4, 5, 7, 8]>>> matches = (i for i in xrange(len(b), 0, -1) if b[:i] == a[-i:])>>> i = next(matches, 0)>>> a + b[i:][1, 3, 9, 8, 3, 4, 5, 7, 8]

The idea is we check the first i elements of b (b[:i]) with the last i elements of a (a[-i:]). We take i in decreasing order, starting from the length of b until 1 (xrange(len(b), 0, -1)) because we want to match as much as possible. We take the first such i by using next and if we don't find it we use the zero value (next(..., 0)). From the moment we found the i, we add to a the elements of b from index i.


There are a couple of easy optimizations that are possible.

  1. You don't need to start at master[1], since the longest overlap starts at master[-len(addition)]

  2. If you add a call to list.index you can avoid creating sub-lists and comparing lists for each index:

This approach keeps the code pretty understandable too (and easier to optimize by using cython or pypy):

master = [1,3,9,8,3,4,5]addition = [3,4,5,7,8]def merge(master, addition):    first = addition[0]    n = max(len(master) - len(addition), 1)  # (1)    while 1:        try:            n = master.index(first, n)       # (2)        except ValueError:            return master + addition        if master[-n:] == addition[:n]:            return master + addition[n:]        n += 1


One trivial optimization is not iterating over the whole master list. I.e., replace while n < len(master) with for n in range(min(len(addition), len(master))) (and don't increment n in the loop). If there is no match, your current code will iterate over the entire master list, even if the slices being compared aren't even of the same length.

Another concern is that you're taking slices of master and addition in order to compare them, which creates two new lists every time, and isn't really necessary. This solution (inspired by Boyer-Moore) doesn't use slicing:

def merge(master, addition):    overlap_lens = (i + 1 for i, e in enumerate(addition) if e == master[-1])    for overlap_len in overlap_lens:        for i in range(overlap_len):            if master[-overlap_len + i] != addition[i]:                break        else:            return master + addition[overlap_len:]    return master + addition

The idea here is to generate all the indices of the last element of master in addition, and add 1 to each. Since a valid overlap must end with the last element of master, only those values are lengths of possible overlaps. Then we can check for each of them if the elements before it also line up.

The function currently assumes that master is longer than addition (you'll probably get an IndexError at master[-overlap_len + i] if it isn't). Add a condition to the overlap_lens generator if you can't guarantee it.

It's also non-greedy, i.e. it looks for the smallest non-empty overlap (merge([1, 2, 2], [2, 2, 3]) will return [1, 2, 2, 2, 3]). I think that's what you meant by "to merge at the last possible valid position". If you want a greedy version, reverse the overlap_lens generator.