How do I find the duplicates in a list and create another list with them?
To remove duplicates use set(a)
. To print duplicates, something like:
a = [1,2,3,2,1,5,6,5,5,5]import collectionsprint([item for item, count in collections.Counter(a).items() if count > 1])## [1, 2, 5]
Note that Counter
is not particularly efficient (timings) and probably overkill here. set
will perform better. This code computes a list of unique elements in the source order:
seen = set()uniq = []for x in a: if x not in seen: uniq.append(x) seen.add(x)
or, more concisely:
seen = set()uniq = [x for x in a if x in seen or seen.add(x)]
I don't recommend the latter style, because it is not obvious what not seen.add(x)
is doing (the set add()
method always returns None
, hence the need for not
).
To compute the list of duplicated elements without libraries:
seen = {}dupes = []for x in a: if x not in seen: seen[x] = 1 else: if seen[x] == 1: dupes.append(x) seen[x] += 1
If list elements are not hashable, you cannot use sets/dicts and have to resort to a quadratic time solution (compare each with each). For example:
a = [[1], [2], [3], [1], [5], [3]]no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]print no_dupes # [[1], [2], [3], [5]]dupes = [x for n, x in enumerate(a) if x in a[:n]]print dupes # [[1], [3]]
>>> l = [1,2,3,4,4,5,5,6,1]>>> set([x for x in l if l.count(x) > 1])set([1, 4, 5])
You don't need the count, just whether or not the item was seen before. Adapted that answer to this problem:
def list_duplicates(seq): seen = set() seen_add = seen.add # adds all elements it doesn't know yet to seen and all other to seen_twice seen_twice = set( x for x in seq if x in seen or seen_add(x) ) # turn the set into a list (as requested) return list( seen_twice )a = [1,2,3,2,1,5,6,5,5,5]list_duplicates(a) # yields [1, 2, 5]
Just in case speed matters, here are some timings:
# file: test.pyimport collectionsdef thg435(l): return [x for x, y in collections.Counter(l).items() if y > 1]def moooeeeep(l): seen = set() seen_add = seen.add # adds all elements it doesn't know yet to seen and all other to seen_twice seen_twice = set( x for x in l if x in seen or seen_add(x) ) # turn the set into a list (as requested) return list( seen_twice )def RiteshKumar(l): return list(set([x for x in l if l.count(x) > 1]))def JohnLaRooy(L): seen = set() seen2 = set() seen_add = seen.add seen2_add = seen2.add for item in L: if item in seen: seen2_add(item) else: seen_add(item) return list(seen2)l = [1,2,3,2,1,5,6,5,5,5]*100
Here are the results: (well done @JohnLaRooy!)
$ python -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'10000 loops, best of 3: 74.6 usec per loop$ python -mtimeit -s 'import test' 'test.moooeeeep(test.l)'10000 loops, best of 3: 91.3 usec per loop$ python -mtimeit -s 'import test' 'test.thg435(test.l)'1000 loops, best of 3: 266 usec per loop$ python -mtimeit -s 'import test' 'test.RiteshKumar(test.l)'100 loops, best of 3: 8.35 msec per loop
Interestingly, besides the timings itself, also the ranking slightly changes when pypy is used. Most interestingly, the Counter-based approach benefits hugely from pypy's optimizations, whereas the method caching approach I have suggested seems to have almost no effect.
$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'100000 loops, best of 3: 17.8 usec per loop$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'10000 loops, best of 3: 23 usec per loop$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'10000 loops, best of 3: 39.3 usec per loop
Apparantly this effect is related to the "duplicatedness" of the input data. I have set l = [random.randrange(1000000) for i in xrange(10000)]
and got these results:
$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'1000 loops, best of 3: 495 usec per loop$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'1000 loops, best of 3: 499 usec per loop$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'1000 loops, best of 3: 1.68 msec per loop