Determining duplicate values in an array
Question:
Suppose I have an array
a = np.array([1, 2, 1, 3, 3, 3, 0])
How can I (efficiently, Pythonically) find which elements of a
are duplicates (i.e., non-unique values)? In this case the result would be array([1, 3, 3])
or possibly array([1, 3])
if efficient.
I’ve come up with a few methods that appear to work:
Masking
m = np.zeros_like(a, dtype=bool)
m[np.unique(a, return_index=True)[1]] = True
a[~m]
Set operations
a[~np.in1d(np.arange(len(a)), np.unique(a, return_index=True)[1], assume_unique=True)]
This one is cute but probably illegal (as a
isn’t actually unique):
np.setxor1d(a, np.unique(a), assume_unique=True)
Histograms
u, i = np.unique(a, return_inverse=True)
u[np.bincount(i) > 1]
Sorting
s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]
Pandas
s = pd.Series(a)
s[s.duplicated()]
Is there anything I’ve missed? I’m not necessarily looking for a numpy-only solution, but it has to work with numpy data types and be efficient on medium-sized data sets (up to 10 million in size).
Conclusions
Testing with a 10 million size data set (on a 2.8GHz Xeon):
a = np.random.randint(10**7, size=10**7)
The fastest is sorting, at 1.1s. The dubious xor1d
is second at 2.6s, followed by masking and Pandas Series.duplicated
at 3.1s, bincount
at 5.6s, and in1d
and senderle’s setdiff1d
both at 7.3s. Steven’s Counter
is only a little slower, at 10.5s; trailing behind are Burhan’s Counter.most_common
at 110s and DSM’s Counter
subtraction at 360s.
I’m going to use sorting for performance, but I’m accepting Steven’s answer because the performance is acceptable and it feels clearer and more Pythonic.
Edit: discovered the Pandas solution. If Pandas is available it’s clear and performs well.
Answers:
For Python 2.7+
>>> import numpy
>>> from collections import Counter
>>> n = numpy.array([1,1,2,3,3,3,0])
>>> [x[1] for x in Counter(n).most_common() if x[0] > 1]
[3, 1]
I think this is most clear done outside of numpy
. You’ll have to time it against your numpy
solutions if you are concerned with speed.
>>> import numpy as np
>>> from collections import Counter
>>> a = np.array([1, 2, 1, 3, 3, 3, 0])
>>> [item for item, count in Counter(a).items() if count > 1]
[1, 3]
note: This is similar to Burhan Khalid’s answer, but the use of items
without subscripting in the condition should be faster.
Here’s another approach using set operations that I think is a bit more straightforward than the ones you offer:
>>> indices = np.setdiff1d(np.arange(len(a)), np.unique(a, return_index=True)[1])
>>> a[indices]
array([1, 3, 3])
I suppose you’re asking for numpy
-only solutions, since if that’s not the case, it’s very difficult to argue with just using a Counter
instead. I think you should make that requirement explicit though.
If a
is made up of small integers you can use numpy.bincount directly:
import numpy as np
a = np.array([3, 2, 2, 0, 4, 3])
counts = np.bincount(a)
print np.where(counts > 1)[0]
# array([2, 3])
This is very similar your “histogram” method, which is the one I would use if a
was not made up of small integers.
People have already suggested Counter
variants, but here’s one which doesn’t use a listcomp:
>>> from collections import Counter
>>> a = [1, 2, 1, 3, 3, 3, 0]
>>> (Counter(a) - Counter(set(a))).keys()
[1, 3]
[Posted not because it’s efficient — it’s not — but because I think it’s cute that you can subtract Counter
instances.]
I’m adding my solution to the pile for this 3 year old question because none of the solutions fit what I wanted or used libs besides numpy. This method finds both the indices of duplicates and values for distinct sets of duplicates.
import numpy as np
A = np.array([1,2,3,4,4,4,5,6,6,7,8])
# Record the indices where each unique element occurs.
list_of_dup_inds = [np.where(a == A)[0] for a in np.unique(A)]
# Filter out non-duplicates.
list_of_dup_inds = filter(lambda inds: len(inds) > 1, list_of_dup_inds)
for inds in list_of_dup_inds: print inds, A[inds]
# >> [3 4 5] [4 4 4]
# >> [7 8] [6 6]
If the array is a sorted numpy array, then just do:
a = np.array([1, 2, 2, 3, 4, 5, 5, 6])
rep_el = a[np.diff(a) == 0]
As of numpy version 1.9.0, np.unique
has an argument return_counts
which greatly simplifies your task:
u, c = np.unique(a, return_counts=True)
dup = u[c > 1]
This is similar to using Counter
, except you get a pair of arrays instead of a mapping. I’d be curious to see how they perform relative to each other.
It’s probably worth mentioning that even though np.unique
is quite fast in practice due to its numpyness, it has worse algorithmic complexity than the Counter
solution. np.unique
is sort-based, so runs asymptotically in O(n log n)
time. Counter
is hash-based, so has O(n)
complexity. This will not matter much for anything but the largest datasets.
>>> import numpy as np
>>> a=np.array([1,2,2,2,2,3])
>>> uniques, uniq_idx, counts = np.unique(a,return_index=True,return_counts=True)
>>> duplicates = a[ uniq_idx[counts>=2] ] # <--- Get duplicates
If you also want to get the orphans:
>>> orphans = a[ uniq_idx[counts==1] ]
Combination of Pandas and Numpy (Utilizing value_counts():
import pandas as pd
import numpy as np
arr=np.array(('a','b','b','c','a'))
pd.Series(arr).value_counts()
OUTPUT:
a 2
b 2
c 1
Using the subtract method of a counter circumvents creating a second counter like in DSM’s answer and to get only the positive counts (ie: duplicates) use the unary +
operator on the resulting Counter
a = [1, 2, 1, 3, 3, 3, 0]
c = Counter(a)
c.subtract(d.keys())
dupes = (+c).keys()
I found this to be the best performing solution in my testing
Suppose I have an array
a = np.array([1, 2, 1, 3, 3, 3, 0])
How can I (efficiently, Pythonically) find which elements of a
are duplicates (i.e., non-unique values)? In this case the result would be array([1, 3, 3])
or possibly array([1, 3])
if efficient.
I’ve come up with a few methods that appear to work:
Masking
m = np.zeros_like(a, dtype=bool)
m[np.unique(a, return_index=True)[1]] = True
a[~m]
Set operations
a[~np.in1d(np.arange(len(a)), np.unique(a, return_index=True)[1], assume_unique=True)]
This one is cute but probably illegal (as a
isn’t actually unique):
np.setxor1d(a, np.unique(a), assume_unique=True)
Histograms
u, i = np.unique(a, return_inverse=True)
u[np.bincount(i) > 1]
Sorting
s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]
Pandas
s = pd.Series(a)
s[s.duplicated()]
Is there anything I’ve missed? I’m not necessarily looking for a numpy-only solution, but it has to work with numpy data types and be efficient on medium-sized data sets (up to 10 million in size).
Conclusions
Testing with a 10 million size data set (on a 2.8GHz Xeon):
a = np.random.randint(10**7, size=10**7)
The fastest is sorting, at 1.1s. The dubious xor1d
is second at 2.6s, followed by masking and Pandas Series.duplicated
at 3.1s, bincount
at 5.6s, and in1d
and senderle’s setdiff1d
both at 7.3s. Steven’s Counter
is only a little slower, at 10.5s; trailing behind are Burhan’s Counter.most_common
at 110s and DSM’s Counter
subtraction at 360s.
I’m going to use sorting for performance, but I’m accepting Steven’s answer because the performance is acceptable and it feels clearer and more Pythonic.
Edit: discovered the Pandas solution. If Pandas is available it’s clear and performs well.
For Python 2.7+
>>> import numpy
>>> from collections import Counter
>>> n = numpy.array([1,1,2,3,3,3,0])
>>> [x[1] for x in Counter(n).most_common() if x[0] > 1]
[3, 1]
I think this is most clear done outside of numpy
. You’ll have to time it against your numpy
solutions if you are concerned with speed.
>>> import numpy as np
>>> from collections import Counter
>>> a = np.array([1, 2, 1, 3, 3, 3, 0])
>>> [item for item, count in Counter(a).items() if count > 1]
[1, 3]
note: This is similar to Burhan Khalid’s answer, but the use of items
without subscripting in the condition should be faster.
Here’s another approach using set operations that I think is a bit more straightforward than the ones you offer:
>>> indices = np.setdiff1d(np.arange(len(a)), np.unique(a, return_index=True)[1])
>>> a[indices]
array([1, 3, 3])
I suppose you’re asking for numpy
-only solutions, since if that’s not the case, it’s very difficult to argue with just using a Counter
instead. I think you should make that requirement explicit though.
If a
is made up of small integers you can use numpy.bincount directly:
import numpy as np
a = np.array([3, 2, 2, 0, 4, 3])
counts = np.bincount(a)
print np.where(counts > 1)[0]
# array([2, 3])
This is very similar your “histogram” method, which is the one I would use if a
was not made up of small integers.
People have already suggested Counter
variants, but here’s one which doesn’t use a listcomp:
>>> from collections import Counter
>>> a = [1, 2, 1, 3, 3, 3, 0]
>>> (Counter(a) - Counter(set(a))).keys()
[1, 3]
[Posted not because it’s efficient — it’s not — but because I think it’s cute that you can subtract Counter
instances.]
I’m adding my solution to the pile for this 3 year old question because none of the solutions fit what I wanted or used libs besides numpy. This method finds both the indices of duplicates and values for distinct sets of duplicates.
import numpy as np
A = np.array([1,2,3,4,4,4,5,6,6,7,8])
# Record the indices where each unique element occurs.
list_of_dup_inds = [np.where(a == A)[0] for a in np.unique(A)]
# Filter out non-duplicates.
list_of_dup_inds = filter(lambda inds: len(inds) > 1, list_of_dup_inds)
for inds in list_of_dup_inds: print inds, A[inds]
# >> [3 4 5] [4 4 4]
# >> [7 8] [6 6]
If the array is a sorted numpy array, then just do:
a = np.array([1, 2, 2, 3, 4, 5, 5, 6])
rep_el = a[np.diff(a) == 0]
As of numpy version 1.9.0, np.unique
has an argument return_counts
which greatly simplifies your task:
u, c = np.unique(a, return_counts=True)
dup = u[c > 1]
This is similar to using Counter
, except you get a pair of arrays instead of a mapping. I’d be curious to see how they perform relative to each other.
It’s probably worth mentioning that even though np.unique
is quite fast in practice due to its numpyness, it has worse algorithmic complexity than the Counter
solution. np.unique
is sort-based, so runs asymptotically in O(n log n)
time. Counter
is hash-based, so has O(n)
complexity. This will not matter much for anything but the largest datasets.
>>> import numpy as np
>>> a=np.array([1,2,2,2,2,3])
>>> uniques, uniq_idx, counts = np.unique(a,return_index=True,return_counts=True)
>>> duplicates = a[ uniq_idx[counts>=2] ] # <--- Get duplicates
If you also want to get the orphans:
>>> orphans = a[ uniq_idx[counts==1] ]
Combination of Pandas and Numpy (Utilizing value_counts():
import pandas as pd
import numpy as np
arr=np.array(('a','b','b','c','a'))
pd.Series(arr).value_counts()
OUTPUT:
a 2
b 2
c 1
Using the subtract method of a counter circumvents creating a second counter like in DSM’s answer and to get only the positive counts (ie: duplicates) use the unary +
operator on the resulting Counter
a = [1, 2, 1, 3, 3, 3, 0]
c = Counter(a)
c.subtract(d.keys())
dupes = (+c).keys()
I found this to be the best performing solution in my testing