How do I find the duplicates in a list and create another list with them? How do I find the duplicates in a list and create another list with them? python python

How do I find the duplicates in a list and create another list with them?


To remove duplicates use set(a). To print duplicates, something like:

a = [1,2,3,2,1,5,6,5,5,5]import collectionsprint([item for item, count in collections.Counter(a).items() if count > 1])## [1, 2, 5]

Note that Counter is not particularly efficient (timings) and probably overkill here. set will perform better. This code computes a list of unique elements in the source order:

seen = set()uniq = []for x in a:    if x not in seen:        uniq.append(x)        seen.add(x)

or, more concisely:

seen = set()uniq = [x for x in a if x in seen or seen.add(x)]    

I don't recommend the latter style, because it is not obvious what not seen.add(x) is doing (the set add() method always returns None, hence the need for not).

To compute the list of duplicated elements without libraries:

seen = {}dupes = []for x in a:    if x not in seen:        seen[x] = 1    else:        if seen[x] == 1:            dupes.append(x)        seen[x] += 1

If list elements are not hashable, you cannot use sets/dicts and have to resort to a quadratic time solution (compare each with each). For example:

a = [[1], [2], [3], [1], [5], [3]]no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]print no_dupes # [[1], [2], [3], [5]]dupes = [x for n, x in enumerate(a) if x in a[:n]]print dupes # [[1], [3]]


>>> l = [1,2,3,4,4,5,5,6,1]>>> set([x for x in l if l.count(x) > 1])set([1, 4, 5])


You don't need the count, just whether or not the item was seen before. Adapted that answer to this problem:

def list_duplicates(seq):  seen = set()  seen_add = seen.add  # adds all elements it doesn't know yet to seen and all other to seen_twice  seen_twice = set( x for x in seq if x in seen or seen_add(x) )  # turn the set into a list (as requested)  return list( seen_twice )a = [1,2,3,2,1,5,6,5,5,5]list_duplicates(a) # yields [1, 2, 5]

Just in case speed matters, here are some timings:

# file: test.pyimport collectionsdef thg435(l):    return [x for x, y in collections.Counter(l).items() if y > 1]def moooeeeep(l):    seen = set()    seen_add = seen.add    # adds all elements it doesn't know yet to seen and all other to seen_twice    seen_twice = set( x for x in l if x in seen or seen_add(x) )    # turn the set into a list (as requested)    return list( seen_twice )def RiteshKumar(l):    return list(set([x for x in l if l.count(x) > 1]))def JohnLaRooy(L):    seen = set()    seen2 = set()    seen_add = seen.add    seen2_add = seen2.add    for item in L:        if item in seen:            seen2_add(item)        else:            seen_add(item)    return list(seen2)l = [1,2,3,2,1,5,6,5,5,5]*100

Here are the results: (well done @JohnLaRooy!)

$ python -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'10000 loops, best of 3: 74.6 usec per loop$ python -mtimeit -s 'import test' 'test.moooeeeep(test.l)'10000 loops, best of 3: 91.3 usec per loop$ python -mtimeit -s 'import test' 'test.thg435(test.l)'1000 loops, best of 3: 266 usec per loop$ python -mtimeit -s 'import test' 'test.RiteshKumar(test.l)'100 loops, best of 3: 8.35 msec per loop

Interestingly, besides the timings itself, also the ranking slightly changes when pypy is used. Most interestingly, the Counter-based approach benefits hugely from pypy's optimizations, whereas the method caching approach I have suggested seems to have almost no effect.

$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'100000 loops, best of 3: 17.8 usec per loop$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'10000 loops, best of 3: 23 usec per loop$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'10000 loops, best of 3: 39.3 usec per loop

Apparantly this effect is related to the "duplicatedness" of the input data. I have set l = [random.randrange(1000000) for i in xrange(10000)] and got these results:

$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'1000 loops, best of 3: 495 usec per loop$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'1000 loops, best of 3: 499 usec per loop$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'1000 loops, best of 3: 1.68 msec per loop