How do I find the duplicates in a list and create another list with them? How do I find the duplicates in a list and create another list with them? python python

# How do I find the duplicates in a list and create another list with them?

To remove duplicates use `set(a)`. To print duplicates, something like:

``a = [1,2,3,2,1,5,6,5,5,5]import collectionsprint([item for item, count in collections.Counter(a).items() if count > 1])## [1, 2, 5]``

Note that `Counter` is not particularly efficient (timings) and probably overkill here. `set` will perform better. This code computes a list of unique elements in the source order:

``seen = set()uniq = []for x in a:    if x not in seen:        uniq.append(x)        seen.add(x)``

or, more concisely:

``seen = set()uniq = [x for x in a if x in seen or seen.add(x)]    ``

I don't recommend the latter style, because it is not obvious what `not seen.add(x)` is doing (the set `add()` method always returns `None`, hence the need for `not`).

To compute the list of duplicated elements without libraries:

``seen = {}dupes = []for x in a:    if x not in seen:        seen[x] = 1    else:        if seen[x] == 1:            dupes.append(x)        seen[x] += 1``

If list elements are not hashable, you cannot use sets/dicts and have to resort to a quadratic time solution (compare each with each). For example:

``a = [, , , , , ]no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]print no_dupes # [, , , ]dupes = [x for n, x in enumerate(a) if x in a[:n]]print dupes # [, ]``

``>>> l = [1,2,3,4,4,5,5,6,1]>>> set([x for x in l if l.count(x) > 1])set([1, 4, 5])``

You don't need the count, just whether or not the item was seen before. Adapted that answer to this problem:

``def list_duplicates(seq):  seen = set()  seen_add = seen.add  # adds all elements it doesn't know yet to seen and all other to seen_twice  seen_twice = set( x for x in seq if x in seen or seen_add(x) )  # turn the set into a list (as requested)  return list( seen_twice )a = [1,2,3,2,1,5,6,5,5,5]list_duplicates(a) # yields [1, 2, 5]``

Just in case speed matters, here are some timings:

``# file: test.pyimport collectionsdef thg435(l):    return [x for x, y in collections.Counter(l).items() if y > 1]def moooeeeep(l):    seen = set()    seen_add = seen.add    # adds all elements it doesn't know yet to seen and all other to seen_twice    seen_twice = set( x for x in l if x in seen or seen_add(x) )    # turn the set into a list (as requested)    return list( seen_twice )def RiteshKumar(l):    return list(set([x for x in l if l.count(x) > 1]))def JohnLaRooy(L):    seen = set()    seen2 = set()    seen_add = seen.add    seen2_add = seen2.add    for item in L:        if item in seen:            seen2_add(item)        else:            seen_add(item)    return list(seen2)l = [1,2,3,2,1,5,6,5,5,5]*100``

Here are the results: (well done @JohnLaRooy!)

``\$ python -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'10000 loops, best of 3: 74.6 usec per loop\$ python -mtimeit -s 'import test' 'test.moooeeeep(test.l)'10000 loops, best of 3: 91.3 usec per loop\$ python -mtimeit -s 'import test' 'test.thg435(test.l)'1000 loops, best of 3: 266 usec per loop\$ python -mtimeit -s 'import test' 'test.RiteshKumar(test.l)'100 loops, best of 3: 8.35 msec per loop``

Interestingly, besides the timings itself, also the ranking slightly changes when pypy is used. Most interestingly, the Counter-based approach benefits hugely from pypy's optimizations, whereas the method caching approach I have suggested seems to have almost no effect.

``\$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'100000 loops, best of 3: 17.8 usec per loop\$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'10000 loops, best of 3: 23 usec per loop\$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'10000 loops, best of 3: 39.3 usec per loop``

Apparantly this effect is related to the "duplicatedness" of the input data. I have set `l = [random.randrange(1000000) for i in xrange(10000)]` and got these results:

``\$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'1000 loops, best of 3: 495 usec per loop\$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'1000 loops, best of 3: 499 usec per loop\$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'1000 loops, best of 3: 1.68 msec per loop`` 