Replace list of list with "condensed" list of list while maintaining order
Here's a brute-force approach (it might be easier to understand):
from itertools import chaindef condense(*lists): # remember original positions positions = {} for pos, item in enumerate(chain(*lists)): if item not in positions: positions[item] = pos # condense disregarding order sets = condense_sets(map(set, lists)) # restore order result = [sorted(s, key=positions.get) for s in sets] return result if len(result) != 1 else result[0]def condense_sets(sets): result = [] for candidate in sets: for current in result: if candidate & current: # found overlap current |= candidate # combine (merge sets) # new items from candidate may create an overlap # between current set and the remaining result sets result = condense_sets(result) # merge such sets break else: # no common elements found (or result is empty) result.append(candidate) return result
Example
>>> condense([1,2,3], [10,5], [3,8,5])[1, 2, 3, 10, 5, 8]>>> a = [1,2,3]>>> b = [3,4]>>> i = [21,22]>>> c = [88,7,8]>>> e = [5,4]>>> d = [3, 50]>>> f = [8,9]>>> g= [9,10]>>> h = [20,21]>>> condense(*[a,b,c,i,e,d,f,g,h,a,c,i]*1000)[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]>>> condense([1], [2, 3, 2])[[1], [2, 3]]
Performance comparison
condense_*()
functions are from the answers to this question. lst_OP
input list from the question (different size), lst_BK
- the test list from @Blckknght's answer (different size). See the source.
Measurements show that solutions based on "disjoint sets" and "connected components of undirected graph" concepts perform similar on both types of input.
name time ratio commentcondense_hynekcer 5.79 msec 1.00 lst_OPcondense_hynekcer2 7.4 msec 1.28 lst_OPcondense_pillmuncher2 11.5 msec 1.99 lst_OPcondense_blckknght 16.8 msec 2.91 lst_OPcondense_jfs 26 msec 4.49 lst_OPcondense_BK 30.5 msec 5.28 lst_OPcondense_blckknght2 30.9 msec 5.34 lst_OPcondense_blckknght3 32.5 msec 5.62 lst_OPname time ratio commentcondense_blckknght 964 usec 1.00 lst_BKcondense_blckknght2 1.41 msec 1.47 lst_BKcondense_blckknght3 1.46 msec 1.51 lst_BKcondense_hynekcer2 1.95 msec 2.03 lst_BKcondense_pillmuncher2 2.1 msec 2.18 lst_BKcondense_hynekcer 2.12 msec 2.20 lst_BKcondense_BK 3.39 msec 3.52 lst_BKcondense_jfs 544 msec 563.66 lst_BKname time ratio commentcondense_hynekcer 6.86 msec 1.00 lst_rndcondense_jfs 16.8 msec 2.44 lst_rndcondense_blckknght 28.6 msec 4.16 lst_rndcondense_blckknght2 56.1 msec 8.18 lst_rndcondense_blckknght3 56.3 msec 8.21 lst_rndcondense_BK 70.2 msec 10.23 lst_rndcondense_pillmuncher2 324 msec 47.22 lst_rndcondense_hynekcer2 334 msec 48.61 lst_rnd
To reproduce results clone gist and run measure-performance-condense-lists.py
Here's my approach. It uses the concept of a "disjoint set" to first identify which sublists overlap with each other, then it joins them together, eliminating duplicates.
from collections import OrderedDictdef disjoint_set_find(djs, node): # disjoint set find, with path compression if node not in djs: # base case, node is a root of a set return node djs[node] = disjoint_set_find(djs, djs[node]) # recurse, caching results return djs[node]def disjoint_set_union(djs, first, second): first = disjoint_set_find(djs, first) # find root of first set second = disjoint_set_find(djs, second) # and of second set if first < second: # make smaller root the root of the new combined set djs[second] = first elif second < first: djs[first] = second # deliberately ignore the case where first == second (same set)def condenseBK(*master_list): values = OrderedDict() # maps values to the first sublist containing them overlaps = {} # maps sublist indexes to each other to form a disjoint set for i, sublist in enumerate(master_list): for v in sublist: if v not in values: # no overlap, so just store value values[v] = i else: # overlap detected, do a disjoint set union disjoint_set_union(overlaps, values[v], i) output = [] # results output_map = {} # map from original indexes to output indexes for v, i, in values.items(): # iterate over values in order root = disjoint_set_find(overlaps, i) if root not in output_map: output_map[i] = len(output) output.append([]) # create new output sublists as necessary output[output_map[root]].append(v) return output
Sample output:
>>> a = [1,2,3]>>> b = [3,4]>>> c = [88,7,8]>>> d = [3, 50]>>> e = [5,4]>>> f = [8,9]>>> g = [9,10]>>> h = [20,21]>>> i = [21,22]>>> lst = [a,b,c,i,e,d,f,g,h,a,c,i]*1000>>> condenseBK(*lst)[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]
An explanation of the algorithm:
By request, here's an explanation for how my code works.
The first two functions implement the find
and union
operations of a disjoint set. The data structure is implemented with a dictionary mapping nodes to their parent nodes. Any node that is not a key of the dictionary is the root
of a set. The find
operation returns the root node of the set containing a given node
. To help performance a bit, I've implemented "path compression", which reduces the number of recursive steps needed over time. The union
operation combines the sets containing its arguments first
and second
.
The main condense
function has two parts. First, it sets up a couple of data structures, then it uses them to build the output.
values
is an OrderedDictionary that maps from each value to the index of the first sublist it is contained in. The order each value is added is used to produce the output in the correct order.
overlaps
is the dictionary used as for the disjoint set. Its nodes are the integer indexes of overlapping sublists.
The first loops fill up those two data structures. They loop over the sublists and their contents. If a value has not been seen before, it is added to the values
dictionary. If it has been seen, the current sublist is overlapping a previous sublist containing that value.
To resolve the overlap, the code does a union of the disjoint sets that contain the two sublists.
The output is built in the output
list. Because there are likely to be fewer output sublists than there were in the input, we need an additional dictionary to map between the old indexes (from the input) to the new indexes that apply to the output list.
To fill up the output list, we iterate over the values, which happens in the order they were added thanks to using the OrderedDict class. Using the disjoint set, it can combine the overlapping lists correctly.
This algorithm has very good performance when there are a lot of lists to be processed that don't overlap immediately. For instance, this set of 200 three-element lists ultimately all overlaps, but you only start seeing the overlaps appear after the first 100 have been processed:
lst2 = [list(range(4*i, 4*(i+1)-1)) for i in range(100)] + \ [list(range(4*i+2, 4*(i+1)+1)) for i in range(100)]
I am sure there is a cleaner way to do this but I started down a certain path and did what I had to to make it work without any refactoring.
lookup = {}out = []index = 0for grp in lst: keys = [lookup.get(num, None) for num in grp] keys = [key for key in keys if key is not None] if len(keys): if len(set(keys)) != 1: for num in grp: out[keys[0]].append(num) seen = set() keys = [key for key in keys if key not in seen and not seen.add(key)] for key in keys[1:]: out[keys[0]].extend(out[key]) del out[key] seen = set() out[keys[0]] = [item for item in out[keys[0]] if item not in seen and not seen.add(item)] else: for num in grp: lookup[num] = keys[0] out[keys[0]].extend(grp) seen = set() out[keys[0]] = [item for item in out[keys[0]] if item not in seen and not seen.add(item)] else: out.append(grp) for num in grp: lookup[num] = index index += 1print out
Thanks to @Steven for the list reduction technique with the set.
OUTPUT
[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]