# How do I find the duplicates in a list and create another list with them?

## Question:

How do I find the duplicates in a list of integers and create another list of the duplicates?

To remove duplicates use `set(a)`. To print duplicates, something like:

``````a = [1,2,3,2,1,5,6,5,5,5]

import collections
print([item for item, count in collections.Counter(a).items() if count > 1])

## [1, 2, 5]
``````

Note that `Counter` is not particularly efficient (timings) and probably overkill here. `set` will perform better. This code computes a list of unique elements in the source order:

``````seen = set()
uniq = []
for x in a:
if x not in seen:
uniq.append(x)
``````

or, more concisely:

``````seen = set()
uniq = [x for x in a if x not in seen and not seen.add(x)]
``````

I don’t recommend the latter style, because it is not obvious what `not seen.add(x)` is doing (the set `add()` method always returns `None`, hence the need for `not`).

To compute the list of duplicated elements without libraries:

``````seen = set()
dupes = []

for x in a:
if x in seen:
dupes.append(x)
else:
``````

or, more concisely:

``````seen = set()
dupes = [x for x in a if x in seen or seen.add(x)]
``````

If list elements are not hashable, you cannot use sets/dicts and have to resort to a quadratic time solution (compare each with each). For example:

``````a = [, , , , , ]

no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]
print no_dupes # [, , , ]

dupes = [x for n, x in enumerate(a) if x in a[:n]]
print dupes # [, ]
``````

collections.Counter is new in python 2.7:

``````
Python 2.5.4 (r254:67916, May 31 2010, 15:03:39)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
a = [1,2,3,2,1,5,6,5,5,5]
import collections
print [x for x, y in collections.Counter(a).items() if y > 1]
File "", line 1, in
AttributeError: 'module' object has no attribute 'Counter'
>>>
``````

In an earlier version you can use a conventional dict instead:

``````a = [1,2,3,2,1,5,6,5,5,5]
d = {}
for elem in a:
if elem in d:
d[elem] += 1
else:
d[elem] = 1

print [x for x, y in d.items() if y > 1]
``````

You don’t need the count, just whether or not the item was seen before. Adapted that answer to this problem:

``````def list_duplicates(seq):
seen = set()
# adds all elements it doesn't know yet to seen and all other to seen_twice
seen_twice = set( x for x in seq if x in seen or seen_add(x) )
# turn the set into a list (as requested)
return list( seen_twice )

a = [1,2,3,2,1,5,6,5,5,5]
list_duplicates(a) # yields [1, 2, 5]
``````

Just in case speed matters, here are some timings:

``````# file: test.py
import collections

def thg435(l):
return [x for x, y in collections.Counter(l).items() if y > 1]

def moooeeeep(l):
seen = set()
# adds all elements it doesn't know yet to seen and all other to seen_twice
seen_twice = set( x for x in l if x in seen or seen_add(x) )
# turn the set into a list (as requested)
return list( seen_twice )

def RiteshKumar(l):
return list(set([x for x in l if l.count(x) > 1]))

def JohnLaRooy(L):
seen = set()
seen2 = set()
for item in L:
if item in seen:
else:
return list(seen2)

l = [1,2,3,2,1,5,6,5,5,5]*100
``````

Here are the results: (well done @JohnLaRooy!)

``````\$ python -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
10000 loops, best of 3: 74.6 usec per loop
\$ python -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 91.3 usec per loop
\$ python -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 266 usec per loop
\$ python -mtimeit -s 'import test' 'test.RiteshKumar(test.l)'
100 loops, best of 3: 8.35 msec per loop
``````

Interestingly, besides the timings itself, also the ranking slightly changes when pypy is used. Most interestingly, the Counter-based approach benefits hugely from pypy’s optimizations, whereas the method caching approach I have suggested seems to have almost no effect.

``````\$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
100000 loops, best of 3: 17.8 usec per loop
\$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
10000 loops, best of 3: 23 usec per loop
\$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 39.3 usec per loop
``````

Apparantly this effect is related to the “duplicatedness” of the input data. I have set `l = [random.randrange(1000000) for i in xrange(10000)]` and got these results:

``````\$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
1000 loops, best of 3: 495 usec per loop
\$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
1000 loops, best of 3: 499 usec per loop
\$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 1.68 msec per loop
``````

A very simple solution, but with complexity O(n*n).

``````>>> xs = [1,2,3,4,4,5,5,6,1]
>>> set([x for x in xs if xs.count(x) > 1])
set([1, 4, 5])
``````
``````list2 = [1, 2, 3, 4, 1, 2, 3]
lset = set()
for item in list2 if item not in lset]
print list(lset)
``````

A bit late, but maybe helpful for some.
For a largish list, I found this worked for me.

``````l=[1,2,3,5,4,1,3,1]
s=set(l)
d=[]
for x in l:
if x in s:
s.remove(x)
else:
d.append(x)
d
[1,3,1]
``````

Shows just and all duplicates and preserves order.

One line solution:

``````set([i for i in list if sum([1 for a in list if a == i]) > 1])
``````

I came across this question whilst looking in to something related – and wonder why no-one offered a generator based solution? Solving this problem would be:

``````>>> print list(getDupes_9([1,2,3,2,1,5,6,5,5,5]))
[1, 2, 5]
``````

I was concerned with scalability, so tested several approaches, including naive items that work well on small lists, but scale horribly as lists get larger (note- would have been better to use timeit, but this is illustrative).

I included @moooeeeep for comparison (it is impressively fast: fastest if the input list is completely random) and an itertools approach that is even faster again for mostly sorted lists… Now includes pandas approach from @firelynx — slow, but not horribly so, and simple. Note – sort/tee/zip approach is consistently fastest on my machine for large mostly ordered lists, moooeeeep is fastest for shuffled lists, but your mileage may vary.

• very quick simple to test for ‘any’ duplicates using the same code

Assumptions

• Duplicates should be reported once only
• Duplicate order does not need to be preserved
• Duplicate might be anywhere in the list

Fastest solution, 1m entries:

``````def getDupes(c):
'''sort/tee/izip'''
a, b = itertools.tee(sorted(c))
next(b, None)
r = None
for k, g in itertools.izip(a, b):
if k != g: continue
if k != r:
yield k
r = k
``````

Approaches tested

``````import itertools
import time
import random

def getDupes_1(c):
'''naive'''
for i in xrange(0, len(c)):
if c[i] in c[:i]:
yield c[i]

def getDupes_2(c):
'''set len change'''
s = set()
for i in c:
l = len(s)
if len(s) == l:
yield i

def getDupes_3(c):
'''in dict'''
d = {}
for i in c:
if i in d:
if d[i]:
yield i
d[i] = False
else:
d[i] = True

def getDupes_4(c):
'''in set'''
s,r = set(),set()
for i in c:
if i not in s:
elif i not in r:
yield i

def getDupes_5(c):
c = sorted(c)
r = None
for i in xrange(1, len(c)):
if c[i] == c[i - 1]:
if c[i] != r:
yield c[i]
r = c[i]

def getDupes_6(c):
'''sort/groupby'''
def multiple(x):
try:
x.next()
x.next()
return True
except:
return False
for k, g in itertools.ifilter(lambda x: multiple(x), itertools.groupby(sorted(c))):
yield k

def getDupes_7(c):
'''sort/zip'''
c = sorted(c)
r = None
for k, g in zip(c[:-1],c[1:]):
if k == g:
if k != r:
yield k
r = k

def getDupes_8(c):
'''sort/izip'''
c = sorted(c)
r = None
for k, g in itertools.izip(c[:-1],c[1:]):
if k == g:
if k != r:
yield k
r = k

def getDupes_9(c):
'''sort/tee/izip'''
a, b = itertools.tee(sorted(c))
next(b, None)
r = None
for k, g in itertools.izip(a, b):
if k != g: continue
if k != r:
yield k
r = k

def getDupes_a(l):
'''moooeeeep'''
seen = set()
# adds all elements it doesn't know yet to seen and all other to seen_twice
for x in l:
if x in seen or seen_add(x):
yield x

def getDupes_b(x):
'''iter*/sorted'''
x = sorted(x)
def _matches():
for k,g in itertools.izip(x[:-1],x[1:]):
if k == g:
yield k
for k, n in itertools.groupby(_matches()):
yield k

def getDupes_c(a):
'''pandas'''
import pandas as pd
vc = pd.Series(a).value_counts()
i = vc[vc > 1].index
for _ in i:
yield _

def hasDupes(fn,c):
try:
if fn(c).next(): return True    # Found a dupe
except StopIteration:
pass
return False

def getDupes(fn,c):
return list(fn(c))

STABLE = True
if STABLE:
print 'Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array'
else:
print 'Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array'
for location in (50,250000,500000,750000,999999):
for test in (getDupes_2, getDupes_3, getDupes_4, getDupes_5, getDupes_6,
getDupes_8, getDupes_9, getDupes_a, getDupes_b, getDupes_c):
print 'Test %-15s:%10d - '%(test.__doc__ or test.__name__,location),
deltas = []
for FIRST in (True,False):
for i in xrange(0, 5):
c = range(0,1000000)
if STABLE:
c = location
else:
c.append(location)
random.shuffle(c)
start = time.time()
if FIRST:
print '.' if location == test(c).next() else '!',
else:
print '.' if [location] == list(test(c)) else '!',
deltas.append(time.time()-start)
print ' -- %0.3f  '%(sum(deltas)/len(deltas)),
print
print
``````

The results for the ‘all dupes’ test were consistent, finding “first” duplicate then “all” duplicates in this array:

``````Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array
Test set len change :    500000 -  . . . . .  -- 0.264   . . . . .  -- 0.402
Test in dict        :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.250
Test in set         :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.249
Test sort/adjacent  :    500000 -  . . . . .  -- 0.159   . . . . .  -- 0.229
Test sort/groupby   :    500000 -  . . . . .  -- 0.860   . . . . .  -- 1.286
Test sort/izip      :    500000 -  . . . . .  -- 0.165   . . . . .  -- 0.229
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.145   . . . . .  -- 0.206  *
Test moooeeeep      :    500000 -  . . . . .  -- 0.149   . . . . .  -- 0.232
Test iter*/sorted   :    500000 -  . . . . .  -- 0.160   . . . . .  -- 0.221
Test pandas         :    500000 -  . . . . .  -- 0.493   . . . . .  -- 0.499
``````

When the lists are shuffled first, the price of the sort becomes apparent – the efficiency drops noticeably and the @moooeeeep approach dominates, with set & dict approaches being similar but lessor performers:

``````Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array
Test set len change :    500000 -  . . . . .  -- 0.321   . . . . .  -- 0.473
Test in dict        :    500000 -  . . . . .  -- 0.285   . . . . .  -- 0.360
Test in set         :    500000 -  . . . . .  -- 0.309   . . . . .  -- 0.365
Test sort/adjacent  :    500000 -  . . . . .  -- 0.756   . . . . .  -- 0.823
Test sort/groupby   :    500000 -  . . . . .  -- 1.459   . . . . .  -- 1.896
Test sort/izip      :    500000 -  . . . . .  -- 0.786   . . . . .  -- 0.845
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.743   . . . . .  -- 0.804
Test moooeeeep      :    500000 -  . . . . .  -- 0.234   . . . . .  -- 0.311  *
Test iter*/sorted   :    500000 -  . . . . .  -- 0.776   . . . . .  -- 0.840
Test pandas         :    500000 -  . . . . .  -- 0.539   . . . . .  -- 0.540
``````

I would do this with pandas, because I use pandas a lot

``````import pandas as pd
a = [1,2,3,3,3,4,5,6,6,7]
vc = pd.Series(a).value_counts()
vc[vc > 1].index.tolist()
``````

Gives

``````[3,6]
``````

Probably isn’t very efficient, but it sure is less code than a lot of the other answers, so I thought I would contribute

this is the way I had to do it because I challenged myself not to use other methods:

``````def dupList(oldlist):
if type(oldlist)==type((2,2)):
oldlist=[x for x in oldlist]
newList=[]
newList=newList+oldlist
oldlist=oldlist
forbidden=[]
checkPoint=0
for i in range(len(oldlist)):
#print 'start i', i
if i in forbidden:
continue
else:
for j in range(len(oldlist)):
#print 'start j', j
if j in forbidden:
continue
else:
#print 'after Else'
if i!=j:
#print 'i,j', i,j
#print oldlist
#print newList
if oldlist[j]==oldlist[i]:
#print 'oldlist[i],oldlist[j]', oldlist[i],oldlist[j]
forbidden.append(j)
#print 'forbidden', forbidden
del newList[j-checkPoint]
#print newList
checkPoint=checkPoint+1
return newList
``````

so your sample works as:

``````>>>a = [1,2,3,3,3,4,5,6,6,7]
>>>dupList(a)
[1, 2, 3, 4, 5, 6, 7]
``````

Use the `sort()` function. Duplicates can be identified by looping over it and checking `l1[i] == l1[i+1]`.

the third example of the accepted answer give an erroneous answer and does not attempt to give duplicates. Here is the correct version :

``````number_lst = [1, 1, 2, 3, 5, ...]

seen_set = set()
duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
unique_set = seen_set - duplicate_set
``````

There are a lot of answers up here, but I think this is relatively a very readable and easy to understand approach:

``````def get_duplicates(sorted_list):
duplicates = []
last = sorted_list
for x in sorted_list[1:]:
if x == last:
duplicates.append(x)
last = x
return set(duplicates)
``````

Notes:

• If you wish to preserve duplication count, get rid of the cast
to ‘set’ at the bottom to get the full list
• If you prefer to use generators, replace duplicates.append(x) with yield x and the return statement at the bottom (you can cast to set later)

Here’s a fast generator that uses a dict to store each element as a key with a boolean value for checking if the duplicate item has already been yielded.

For lists with all elements that are hashable types:

``````def gen_dupes(array):
unique = {}
for value in array:
if value in unique and unique[value]:
unique[value] = False
yield value
else:
unique[value] = True

array = [1, 2, 2, 3, 4, 1, 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, 1, 6]
``````

For lists that might contain lists:

``````def gen_dupes(array):
unique = {}
for value in array:
is_list = False
if type(value) is list:
value = tuple(value)
is_list = True

if value in unique and unique[value]:
unique[value] = False
if is_list:
value = list(value)

yield value
else:
unique[value] = True

array = [1, 2, 2, [1, 2], 3, 4, [1, 2], 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, [1, 2], 6]
``````

Very simple and quick way of finding dupes with one iteration in Python is:

``````testList = ['red', 'blue', 'red', 'green', 'blue', 'blue']

testListDict = {}

for item in testList:
try:
testListDict[item] += 1
except:
testListDict[item] = 1

print testListDict
``````

Output will be as follows:

``````>>> print testListDict
{'blue': 3, 'green': 1, 'red': 2}
``````

This and more in my blog http://www.howtoprogramwithpython.com

Here’s a neat and concise solution –

``````for x in set(li):
li.remove(x)

li = list(set(li))
``````

Using pandas:

``````>>> import pandas as pd
>>> a = [1, 2, 1, 3, 3, 3, 0]
>>> pd.Series(a)[pd.Series(a).duplicated()].values
array([1, 3, 3])
``````

You can use `iteration_utilities.duplicates`:

``````>>> from iteration_utilities import duplicates

>>> list(duplicates([1,1,2,1,2,3,4,2]))
[1, 1, 2, 2]
``````

or if you only want one of each duplicate this can be combined with `iteration_utilities.unique_everseen`:

``````>>> from iteration_utilities import unique_everseen

>>> list(unique_everseen(duplicates([1,1,2,1,2,3,4,2])))
[1, 2]
``````

It can also handle unhashable elements (however at the cost of performance):

``````>>> list(duplicates([, , , , ]))
[, ]

>>> list(unique_everseen(duplicates([, , , , ])))
[]
``````

That’s something that only a few of the other approaches here can handle.

### Benchmarks

I did a quick benchmark containing most (but not all) of the approaches mentioned here.

The first benchmark included only a small range of list-lengths because some approaches have `O(n**2)` behavior.

In the graphs the y-axis represents the time, so a lower value means better. It’s also plotted log-log so the wide range of values can be visualized better: Removing the `O(n**2)` approaches I did another benchmark up to half a million elements in a list: As you can see the `iteration_utilities.duplicates` approach is faster than any of the other approaches and even chaining `unique_everseen(duplicates(...))` was faster or equally fast than the other approaches.

One additional interesting thing to note here is that the pandas approaches are very slow for small lists but can easily compete for longer lists.

However as these benchmarks show most of the approaches perform roughly equally, so it doesn’t matter much which one is used (except for the 3 that had `O(n**2)` runtime).

``````from iteration_utilities import duplicates, unique_everseen
from collections import Counter
import pandas as pd
import itertools

def georg_counter(it):
return [item for item, count in Counter(it).items() if count > 1]

def georg_set(it):
seen = set()
uniq = []
for x in it:
if x not in seen:
uniq.append(x)

def georg_set2(it):
seen = set()
return [x for x in it if x not in seen and not seen.add(x)]

def georg_set3(it):
seen = {}
dupes = []

for x in it:
if x not in seen:
seen[x] = 1
else:
if seen[x] == 1:
dupes.append(x)
seen[x] += 1

def RiteshKumar_count(l):
return set([x for x in l if l.count(x) > 1])

def moooeeeep(seq):
seen = set()
# adds all elements it doesn't know yet to seen and all other to seen_twice
seen_twice = set( x for x in seq if x in seen or seen_add(x) )
# turn the set into a list (as requested)
return list( seen_twice )

def F1Rumors_implementation(c):
a, b = itertools.tee(sorted(c))
next(b, None)
r = None
for k, g in zip(a, b):
if k != g: continue
if k != r:
yield k
r = k

def F1Rumors(c):
return list(F1Rumors_implementation(c))

def Edward(a):
d = {}
for elem in a:
if elem in d:
d[elem] += 1
else:
d[elem] = 1
return [x for x, y in d.items() if y > 1]

def wordsmith(a):
return pd.Series(a)[pd.Series(a).duplicated()].values

def NikhilPrabhu(li):
li = li.copy()
for x in set(li):
li.remove(x)

return list(set(li))

def firelynx(a):
vc = pd.Series(a).value_counts()
return vc[vc > 1].index.tolist()

def HenryDev(myList):
newList = set()

for i in myList:
if myList.count(i) >= 2:

return list(newList)

def yota(number_lst):
seen_set = set()
duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
return seen_set - duplicate_set

def IgorVishnevskiy(l):
s=set(l)
d=[]
for x in l:
if x in s:
s.remove(x)
else:
d.append(x)
return d

def it_duplicates(l):
return list(duplicates(l))

def it_unique_duplicates(l):
return list(unique_everseen(duplicates(l)))
``````

### Benchmark 1

``````from simple_benchmark import benchmark
import random

funcs = [
georg_counter, georg_set, georg_set2, georg_set3, RiteshKumar_count, moooeeeep,
F1Rumors, Edward, wordsmith, NikhilPrabhu, firelynx,
HenryDev, yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 12)}

b = benchmark(funcs, args, 'list size')

b.plot()
``````

### Benchmark 2

``````funcs = [
georg_counter, georg_set, georg_set2, georg_set3, moooeeeep,
F1Rumors, Edward, wordsmith, firelynx,
yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 20)}

b = benchmark(funcs, args, 'list size')
b.plot()
``````

### Disclaimer

1 This is from a third-party library I have written: `iteration_utilities`.

How about simply loop through each element in the list by checking the number of occurrences, then adding them to a set which will then print the duplicates. Hope this helps someone out there.

``````myList  = [2 ,4 , 6, 8, 4, 6, 12];
newList = set()

for i in myList:
if myList.count(i) >= 2:

print(list(newList))
## [4 , 6]
``````

We can use `itertools.groupby` in order to find all the items that have dups:

``````from itertools import groupby

myList  = [2, 4, 6, 8, 4, 6, 12]
# when the list is sorted, groupby groups by consecutive elements which are similar
for x, y in groupby(sorted(myList)):
#  list(y) returns all the occurences of item x
if len(list(y)) > 1:
print x
``````

The output will be:

``````4
6
``````
``````def removeduplicates(a):
seen = set()

for i in a:
if i not in seen:
return seen

print(removeduplicates([1,1,2,2]))
``````

Without converting to list and probably the simplest way would be something like below.
This may be useful during a interview when they ask not to use sets

``````a=[1,2,3,3,3]
dup=[]
for each in a:
if each not in dup:
dup.append(each)
print(dup)
``````

======= else to get 2 separate lists of unique values and duplicate values

``````a=[1,2,3,3,3]
uniques=[]
dups=[]

for each in a:
if each not in uniques:
uniques.append(each)
else:
dups.append(each)
print("Unique values are below:")
print(uniques)
print("Duplicate values are below:")
print(dups)
``````

Some other tests. Of course to do…

``````set([x for x in l if l.count(x) > 1])
``````

…is too costly. It’s about 500 times faster (the more long array gives better results) to use the next final method:

``````def dups_count_dict(l):
d = {}

for item in l:
if item not in d:
d[item] = 0

d[item] += 1

result_d = {key: val for key, val in d.iteritems() if val > 1}

return result_d.keys()
``````

Only 2 loops, no very costly `l.count()` operations.

Here is a code to compare the methods for example. The code is below, here is the output:

``````dups_count: 13.368s # this is a function which uses l.count()
dups_count_dict: 0.014s # this is a final best function (of the 3 functions)
dups_count_counter: 0.024s # collections.Counter
``````

The testing code:

``````import numpy as np
from time import time
from collections import Counter

class TimerCounter(object):
def __init__(self):
self._time_sum = 0

def start(self):
self.time = time()

def stop(self):
self._time_sum += time() - self.time

def get_time_sum(self):
return self._time_sum

def dups_count(l):
return set([x for x in l if l.count(x) > 1])

def dups_count_dict(l):
d = {}

for item in l:
if item not in d:
d[item] = 0

d[item] += 1

result_d = {key: val for key, val in d.iteritems() if val > 1}

return result_d.keys()

def dups_counter(l):
counter = Counter(l)

result_d = {key: val for key, val in counter.iteritems() if val > 1}

return result_d.keys()

def gen_array():
np.random.seed(17)
return list(np.random.randint(0, 5000, 10000))

def assert_equal_results(*results):
primary_result = results
other_results = results[1:]

for other_result in other_results:
assert set(primary_result) == set(other_result) and len(primary_result) == len(other_result)

if __name__ == '__main__':
dups_count_time = TimerCounter()
dups_count_dict_time = TimerCounter()
dups_count_counter = TimerCounter()

l = gen_array()

for i in range(3):
dups_count_time.start()
result1 = dups_count(l)
dups_count_time.stop()

dups_count_dict_time.start()
result2 = dups_count_dict(l)
dups_count_dict_time.stop()

dups_count_counter.start()
result3 = dups_counter(l)
dups_count_counter.stop()

assert_equal_results(result1, result2, result3)

print 'dups_count: %.3f' % dups_count_time.get_time_sum()
print 'dups_count_dict: %.3f' % dups_count_dict_time.get_time_sum()
print 'dups_count_counter: %.3f' % dups_count_counter.get_time_sum()
``````

When using toolz:

``````from toolz import frequencies, valfilter

a = [1,2,2,3,4,5,4]
>>> list(valfilter(lambda count: count > 1, frequencies(a)).keys())
[2,4]
``````

I am entering much much late in to this discussion. Even though, I would like to deal with this problem with one liners . Because that’s the charm of Python.
if we just want to get the duplicates in to a separate list (or any collection),I would suggest to do as below.Say we have a duplicated list which we can call as ‘target’

``````    target=[1,2,3,4,4,4,3,5,6,8,4,3]
``````

Now if we want to get the duplicates,we can use the one liner as below:

``````    duplicates=dict(set((x,target.count(x)) for x in filter(lambda rec : target.count(rec)>1,target)))
``````

This code will put the duplicated records as key and count as value in to the dictionary ‘duplicates’.’duplicate’ dictionary will look like as below:

``````    {3: 3, 4: 4} #it saying 3 is repeated 3 times and 4 is 4 times
``````

If you just want all the records with duplicates alone in a list, its again much shorter code:

``````    duplicates=filter(lambda rec : target.count(rec)>1,target)
``````

Output will be:

``````    [3, 4, 4, 4, 3, 4, 3]
``````

This works perfectly in python 2.7.x + versions

Method 1:

``````list(set([val for idx, val in enumerate(input_list) if val in input_list[idx+1:]]))
``````

Explanation:
[val for idx, val in enumerate(input_list) if val in input_list[idx+1:]] is a list comprehension, that returns an element, if the same element is present from it’s current position, in list, the index.

Example:
input_list = [42,31,42,31,3,31,31,5,6,6,6,6,6,7,42]

starting with the first element in list, 42, with index 0, it checks if the element 42, is present in input_list[1:] (i.e., from index 1 till end of list)
Because 42 is present in input_list[1:], it will return 42.

Then it goes to the next element 31, with index 1, and checks if element 31 is present in the input_list[2:] (i.e., from index 2 till end of list),
Because 31 is present in input_list[2:], it will return 31.

similarly it goes through all the elements in the list, and will return only the repeated/duplicate elements into a list.

Then because we have duplicates, in a list, we need to pick one of each duplicate, i.e. remove duplicate among duplicates, and to do so, we do call a python built-in named set(), and it removes the duplicates,

Then we are left with a set, but not a list, and hence to convert from a set to list, we use, typecasting, list(), and that converts the set of elements to a list.

Method 2:

``````def dupes(ilist):
temp_list = [] # initially, empty temporary list
dupe_list = [] # initially, empty duplicate list
for each in ilist:
if each in temp_list: # Found a Duplicate element
if not each in dupe_list: # Avoid duplicate elements in dupe_list
dupe_list.append(each) # Add duplicate element to dupe_list
else:
temp_list.append(each) # Add a new (non-duplicate) to temp_list

return dupe_list
``````

Explanation:
Here We create two empty lists, to start with.
Then keep traversing through all the elements of the list, to see if it exists in temp_list (initially empty). If it is not there in the temp_list, then we add it to the temp_list, using append method.

If it already exists in temp_list, it means, that the current element of the list is a duplicate, and hence we need to add it to dupe_list using append method.

``````raw_list = [1,2,3,3,4,5,6,6,7,2,3,4,2,3,4,1,3,4,]

clean_list = list(set(raw_list))
duplicated_items = []

for item in raw_list:
try:
clean_list.remove(item)
except ValueError:
duplicated_items.append(item)

print(duplicated_items)
# [3, 6, 2, 3, 4, 2, 3, 4, 1, 3, 4]
``````

You basically remove duplicates by converting to set (`clean_list`), then iterate the `raw_list`, while removing each `item` in the clean list for occurrence in `raw_list`. If `item` is not found, the raised `ValueError` Exception is caught and the `item` is added to `duplicated_items` list.

If the index of duplicated items is needed, just `enumerate` the list and play around with the index. (`for index, item in enumerate(raw_list):`) which is faster and optimised for large lists (like thousands+ of elements)

use of `list.count()` method in the list to find out the duplicate elements of a given list

``````arr=[]
dup =[]
for i in range(int(input("Enter range of list: "))):
arr.append(int(input("Enter Element in a list: ")))
for i in arr:
if arr.count(i)>1 and i not in dup:
dup.append(i)
print(dup)
``````

one-liner, for fun, and where a single statement is required.

``````(lambda iterable: reduce(lambda (uniq, dup), item: (uniq, dup | {item}) if item in uniq else (uniq | {item}, dup), iterable, (set(), set())))(some_iterable)
``````

I guess the most effective way to find duplicates in a list is:

``````from collections import Counter

def duplicates(values):
dups = Counter(values) - Counter(set(values))
return list(dups.keys())

print(duplicates([1,2,3,6,5,2]))
``````

It uses `Counter` once on all the elements, and then on all unique elements. Subtracting the first one with the second will leave out the duplicates only.

Using Set Function
eg:-

``````arr=[1,4,2,5,2,3,4,1,4,5,2,3]
arr2=list(set(arr))
print(arr2)
``````

Output:- [1, 2, 3, 4, 5]

1. Remove Duplicates using array

eg:-

``````arr=[1,4,2,5,2,3,4,1,4,5,2,3]
arr3=[]
for i in arr:
if(i not in arr3):
arr3.append(i)
print(arr3)
``````

Output:-

[1, 4, 2, 5, 3]

1. Using Lambda Function

eg:-

``````rem_duplicate_func=lambda arr:set(arr)
print(rem_duplicate_func(arr))
``````

Output:-

{1, 2, 3, 4, 5}

1. Remove Duplicate Values from dictionary

eg:-

``````dict1={
'car':["Ford","Toyota","Ford","Toyota"],
'brand':["Mustang","Ranz","Mustang","Ranz"] } dict2={} for key,value in dict1.items():
dict2[key]=set(value) print(dict2)
``````

Output:-

{‘car’: {‘Toyota’, ‘Ford’}, ‘brand’: {‘Ranz’, ‘Mustang’}}

1. Symmetric difference- Remove duplicate elements

eg:-

``````set1={1,2,4,5}
set2={2,1,5,7}
rem_dup_ele=set1.symmetric_difference(set2)
print(rem_dup_ele)
``````

Output:-

{4, 7}

Python 3.8 one-liner if you don’t care to write your own algorithm or use libraries:

``````l = [1,2,3,2,1,5,6,5,5,5]

res = [(x, count) for x, g in groupby(sorted(l)) if (count := len(list(g))) > 1]

print(res)
``````

Prints item and count:

``````[(1, 2), (2, 2), (5, 4)]
``````

`groupby` takes a grouping function so you can define your groupings in different ways and return additional `Tuple` fields as needed.

I do not see a solution that is purely using iterators, so here we go

This requires the list to be sorted, which may be the drawback here.

``````a = [1,2,3,2,1,5,6,5,5,5]
a.sort()
set(map(lambda x: x, filter(lambda x: x == x, zip(a, a[1:]))))

{1, 2, 5}
``````

You can easily check how fast this is on your machine with a million potential duplicates with this piece of code:

First generate the data

``````import random
from itertools import chain
a = list(chain(*[[n] * random.randint(1, 2) for n in range(1000000)]))
``````

And run the test:

``````set(map(lambda x: x, filter(lambda x: x == x, zip(a, a[1:]))))
``````

Needless to say, this solution is only good if your list is already sorted.

Try this For check duplicates

``````>>> def checkDuplicate(List):
duplicate={}
for i in List:
## checking whether the item is already present in dictionary or not
## increasing count if present
## initializing count to 1 if not present

duplicate[i]=duplicate.get(i,0)+1

return [k for k,v in duplicate.items() if v>1]

>>> checkDuplicate([1,2,3,"s",1,2,3])
[1, 2, 3]
``````

This seems somewhat competitive despite its O(n log n) complexity, see benchmarks below.

``````a = sorted(a)
dupes = list(set(a[::2]) & set(a[1::2]))
``````

Sorting brings duplicates next to each other, so they’re both at an even index and at an odd index. Unique values are only at an even or at an odd index, not both. So the intersection of even-index values and odd-index values is the duplicates.

Benchmark results: This uses MSeifert’s benchmark, but only with the solutions from the accepted answer (the georgs), the slowest solutions, the fastest solution (excluding `it_duplicates` as it doesn’t uniquify the duplicates), and mine. Otherwise it’d be too crowded and the colors too similar.

First line could be `a.sort()` if we’re allowed to modify the given list, that would be a bit faster. But the benchmark re-uses the same list multiple times, so modifying it would mess with the benchmark.

And apparently `set(a[::2]).intersection(a[1::2])` wouldn’t create a second set and be a bit faster, but meh, it’s also a bit longer.

Without help of any data structure of python you can simply try the following code of mine. This will work for finding duplicates of various kind of input like string,list etc.

``````# finding duplicates in unsorted an array
def duplicates(numbers):
store=[]
checked=[]
for i in range(len(numbers)):
counter =1
for j in range(i+1,len(numbers)):
if numbers[i] not in checked and numbers[j]==numbers[i] :
counter +=1
if counter > 1 :
store.append(numbers[i])
checked.append(numbers[i])
return store

print(duplicates([1,2,2,3,3,3,4,4,5]))  # output:  [2, 3, 4]
print(duplicates("madam"))              # output:  ['m', 'a']
``````

Simply check, for all list items, if the first index of an item is equal to the last index of that item:

``````>>> lastindex = lambda arr, el: len(arr) - arr[::-1].index(el) -1
>>> is_duplicate  = lambda arr, el: arr.index(el) != lastindex(arr, el)
>>> duplicates = lambda arr: [*set(x for x in arr if is_duplicate(arr, x))]
>>>
>>> a=[2,3,5,7,11,13, 2,17,7,7,17,18,3,19,5,2,7,48,48,2,19]
>>> duplicates(a)
[2, 3, 5, 7, 48, 17, 19]
>>>
``````

So assuming we have this list of elements:

``````a = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]
``````

We can work just with sets in order to find the unique elements:

``````unique = set()
for num in a:
if num not in unique:
else:
unique = unique - set([num])
``````

and finally:

``````>>> unique
{3, 6}
``````

If you want to get the duplicates you can simply do:

``````>>> duplicates = set(a) - unique
>>> duplicates
{1, 2, 5}
``````

Notes:

• Element lookup in sets is O(1)
• Element removal from sets is O(1)
``````some_list = ['a', 'b', 'c', 'b', 'd', 'm', 'n', 'n']
some_dictionary = {}

for element in some_list:
if element not in some_dictionary:
some_dictionary[element] = 1
else:
some_dictionary[element] += 1

for key, value in some_dictionary.items():
if value > 1:
print(key, end = ' ')

# another way
duplicates = []

for x in some_list:
if some_list.count(x) > 1 and x not in duplicates:
duplicates.append(x)

print()
print(duplicates)
``````

Source: HERE

One other solution is as following without using any collection library.

``````a = [1,2,3,5,4,6,4,21,4,6,3,32,5,2,23,5]
duplicates = []

for i in a:
if a.count(i) > 1 and i not in duplicates:
duplicates.append(i)

print(duplicates)
``````

The output is `[2, 3, 5, 4, 6]`

I noticed that most solutions have a complexity of O(n * n) and are very slow for large lists. So I thought I’d share the function I wrote which supports integers or strings and is O(n) in the best case scenario. For a list of 100k elements, the top solution took over 30s whereas mine finishes in .12s

``````def get_duplicates(list1):
'''Return all duplicates given a list. O(n) complexity for best case scenario.
input: [1, 1, 1, 2, 3, 4, 4]
output: [1, 1, 4]
'''
dic = {}
for el in list1:
try:
dic[el] += 1
except:
dic[el] = 1
dupes = []
for key in dic.keys():
for i in range(dic[key] - 1):
dupes.append(key)
return dupes

list1 = [1, 1, 1, 2, 3, 4, 4]
> print(get_duplicates(list1))
[1, 1, 4]
``````

Or to get unique duplicates:

``````> print(list(set(get_duplicates(list1))))
[1, 4]
``````

To achieve this problem we can go with multiple different methods to solve it, these two are common solutions but while implementing them in real scenarios we have to think about time complexity also.

``````import random
import time

dupl_list = [random.randint(1,1000) for x in range(500)]
print("List with duplicate integers")
print (dupl_list)

#Method 1
print("******************Method 1 *************")

def Repeat_num(x):
_size = len(x)
repeated = []
for i in range(_size):
# print(i)
k = i + 1
for j in range(k, _size):
# print(j)
if x[i] == x[j] and x[i] not in repeated:
repeated.append(x[i])
return repeated

start = time.time()
print(Repeat_num(dupl_list))
end = time.time()
print("The time of execution of above program is :",(end-start) * 10**3, "ms")

print("***************Method 2****************")

#method 2 - using count()
def repeast_count(dup_list):
new = []
for a in dup_list:
# print(a)
# checking the occurrence of elements
n = dup_list.count(a)
# if the occurrence is more than
# one we add it to the output list
if n > 1:
if new.count(a) == 0:  # condition to check
new.append(a)
return new

start = time.time()
print(repeast_count(dupl_list))
end = time.time()
print("The time of execution of above program is :",(end-start) * 10**3, "ms")
``````

##Sample output::

``````List with duplicate integers
[5, 45, 28, 81, 32, 98, 8, 83, 47, 95, 41, 49, 4, 1, 85, 26, 38, 82, 54, 11]
******************Method 1 *************
[]
The time of execution of above program is : 1.1069774627685547 ms
***************Method 2****************
[]
The time of execution of above program is : 0.1881122589111328 ms
``````

For the general understanding, method 1 is good, but for real implementation, I will prefer method 2 as it is taking less time than method 1.

Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.