Efficient way to remove half of the duplicate items in a list
Question:
If I have a list say l = [1, 8, 8, 8, 1, 3, 3, 8]
and it’s guaranteed that every element occurs an even number of times, how do I make a list with all elements of l
now occurring n/2
times. So since 1
occurred 2
times, it should now occur once. Since 8
occurs 4
times, it should now occur twice. Since 3
occurred twice, it should occur once.
So the new list will be something like k=[1,8,8,3]
What is the fastest way to do this?
I did list.count()
for every element but it was very slow.
Answers:
Use a counter to keep track of the count of each element
from collections import Counter
l = [1,8,8,8,1,3,3,8]
res = []
count = Counter(l) # its like dict(1: 2, 8: 4, 3: 2)
for key, val in count.items():
res.extend(val//2 * [key])
print(res)
# output
[1, 8, 8, 3]
If you are not concerned about preserving relative order, you can first get a count of each element using collections.Counter
, then create a new list with each element duplicated half as many times.
>>> from collections import Counter
>>> from itertools import chain
>>> list(chain.from_iterable([key]*(count//2) for key, count in Counter(l).items()))
[1, 8, 8, 3]
Maybe this.
newList = []
for number in l:
if(newList.count(number) < l.count(number)/2):
newList.append(number)
print(newList)
If order isn’t important, a way would be to get the odd or even indexes only after a sort. Those lists will be the same so you only need one of them.
l = [1,8,8,8,1,3,3,8]
l.sort()
# Get all odd indexes
odd = l[1::2]
# Get all even indexes
even = l[::2]
print(odd)
print(odd == even)
Result:
[1, 3, 8, 8]
True
Since you guarantee that each element of the list occurs a multiple of 2, then it is faster to build the counter as you build the output list, rather than building a counter (or sort) first and using it later.
l = [1,8,8,8,1,3,3,8]
count={}
res=[]
for i in l:
if i in count: count[i]+=1
else: count[i]=1
if count[i]%2: res.append(i)
print(res)
Output
[1,8,8,3]
EDIT Comparing time/expense of each method
Using the timeit
module shows that this approach is 2.7 times faster than using a counter first.
i.e.
def one():
l = [1,8,8,8,1,3,3,8]
count={}
res=[]
for i in l:
if i in count: count[i]+=1
else: count[i]=1
if count[i]%2: res.append(i)
#print(res)
def two():
from collections import Counter
l = [1,8,8,8,1,3,3,8]
res = []
count = Counter(l) # its like dict(1: 2, 8: 4, 3: 2)
for key, val in count.items():
res.extend(val//2 * [key])
o=timeit.Timer(one)
t=timeit.Timer(two)
print(o.timeit(100000))
print(t.timeit(100000))
print(o.timeit(100000))
print(t.timeit(100000))
Output (seconds)
0.28666
0.80822
0.28678
0.80113
If order isn’t important, then Wimanicesir’s method would be preferred with 4x greater speedup, with result of 0.07037 (~11 times faster than with counter approach).
UPDATE
I suspected that using the Counter
method in two
(unordered) may come with significant bloat or slow down in import, so I tested the "count first, compile result later" method while counting with the simple method here from one
(ordered)
count={}
for i in l:
if i in count: count[i]+=1
else: count[i]=1
which was much faster than Counter
. Replacing Counter
in two
of the tests defined resulted in a time of 0.31 instead of 0.80. Still slightly faster to compile (ordered) result during counting as in two
, however. And much faster for unordered result to use Wimanicesir’s method.
you keep a list of all items that have been visited an uneven number of times.
then you iterate over all list items.
in other langauges would probably use some map() or filter() method, but here is some simple code since i don’t know python well enough! 🙂
l = [1,8,8,8,1,3,3,8]
seen = []
result = []
for num in l:
if num in seen:
seen.remove(num)
#result.append(num) #print every even appearance
else:
seen.append(num)
result.append(num) #print every odd appearance
if len(seen)==0:
print(result)
else:
print("Error: uneven elements found:", seen)
at the end the visited-array should be empty, so you can use that as a sanity check before returning the result-array.
edit: here’s a version with filter that returns the odd appearances
l = [1,8,8,8,1,3,3,8]
seen = []
result = list(filter(lambda x: seen.append(x) is None if x not in seen else not seen.remove(x) is None, l))
if len(seen)==0:
print(result)
else:
print("Error: uneven elements found:", seen)
and this one returns the even appearances:
l = [1,8,8,8,1,3,3,8]
seen = []
result = list(filter(lambda x: seen.remove(x) is None if x in seen else not seen.append(x) is None, l))
if len(seen)==0:
print(result)
else:
print("Error: uneven elements found:", seen)
Instead of using a counter, which keeps track of an integer for each possible element of the list, try mapping elements to booleans using a dictionary. Map to true the first time they’re seen, and then every time after that flip the bit, and if it’s true skip the element.
This is a classic use case of sets and I’m quite surprised that no one else tried it out to see how it stacks against the Counter
and dict
implementations.
I implemented a solution using set
instead as follows:
def set_impl(l):
bag = set()
res = []
for i in l:
if i in bag:
res.append(i)
bag.remove(i)
else:
bag.add(i)
This implementation is about 28% faster than using Counter
and 51% faster than using a dictionary.
The sort and slice implementation given by Wimanicesir is the fastest, giving results 17 times faster than when using set
. Note though, that because it sorts the items before removing duplicates, the order of appearance is not preserved unlike the other three.
Here are all the suggested implementations with timing for evaluation of comparative performance.
https://repl.it/@franzalex/StackOverflow-py#removeDuplicateHalf.py
import random
import statistics as stats
from collections import Counter as counter
from timeit import Timer
def slice_impl(l):
l.sort()
res = l[::2]
def dict_impl(l):
count={}
res=[]
for i in l:
if i in count:
count[i] += 1
else:
count[i] = 1
if count[i] % 2:
res.append(i)
def counter_impl(l):
count = counter(l)
res = []
for key, val in count.items():
res.extend(val//2 * [key])
def set_impl(l):
bag = set()
res = []
for i in l:
if i in bag:
res.append(i)
bag.remove(i)
else:
bag.add(i)
def timed_run():
for name, func in {"Sort and Slice": slice_impl,
"Dictionary": dict_impl,
"Counter": counter_impl,
"Set": set_impl}.items():
seq = list(range(50))*2
results = []
print(f"{name} Implementation Results")
for i in range(50):
if len(results) % 10: random.shuffle(seq) # shuffle after 10 runs
results.append(Timer(lambda: func(seq)).timeit(10**4))
# print(f"Run {i+1:02}: {results[i]:.6f}")
print("")
print(f"Median: {stats.median(results):.6f}")
print(f"Mean: {stats.mean(results):.6f}")
print(f"Std Dev: {stats.stdev(results):.6f}")
print("nn")
timed_run()
Sample run result
Sort and Slice Implementation Results
Median: 0.009686
Mean: 0.009721
Std Dev: 0.000529
Dictionary Implementation Results
Median: 0.230081
Mean: 0.227631
Std Dev: 0.014584
Counter Implementation Results
Median: 0.192730
Mean: 0.194577
Std Dev: 0.008015
Set Implementation Results
Median: 0.149604
Mean: 0.151227
Std Dev: 0.006838
import itertools
st=time.time()
lst = [1,8,8,8,1,3,3,8]
list(itertools.chain.from_iterable(itertools.repeat(x, int(lst.count(x)/2)) for x in list(set(lst)) if lst.count(x)%2==0))
This gives sorted list
If the order matters, following code can work with O(N):
import collections
c = collections.Counter(l)
c2 = collections.Counter()
i, n = 0, len(l)
res=[]
for x in l:
if i == n//2:break
if c2[x] < c[x] // 2:
res.append(x)
c2[x] += 1
i += 1
I know this has been answered and there are some quite lengthy solutions.
And it specifically mentioned Python.
However, I thought a Powershell solution might be interesting (and simple!) for some:
Version 1 (grouping items – less efficient)
$OriginalArray = @("1","8","8","8","1","3","3","8")
$NewArray = New-ObjectSystem.Collections.ArrayList
$ArrayGroup = $OriginalArray | Group-Object | Select-Object Count,Name
ForEach ($EachNumber in $ArrayGroup) {
$HalfTheCount = (1..([Math]::Round($EachNumber.Count / 2)))
ForEach ($Item in $HalfTheCount) {$NewArray.Add($EachNumber.Name) | Out-Null}
}
$NewArray
Version 2 (picking every other item from a sorted array – more efficient)
$OriginalArray = @("1","8","8","8","1","3","3","8")
$NewArray = New-Object System.Collections.ArrayList
$OddOrEven = "Even"
ForEach ($SortedItem in ($OriginalArray | Sort-Object)) {
If ($OddOrEven -eq "Even") {$NewArray.Add($SortedItem);$EvenNumber = $True}
If ($OddOrEven -eq "Odd") {$EvenNumber = $False}
If ($EvenNumber -eq $True) {$OddOrEven = "Odd"} Else {$OddOrEven = "Even"}
}
$NewArray
I like using a trie set, as you need to detect duplicates to remove them, or a big hash set (lots of buckets). The trie does not go unbalanced and you do not need to know the size of the final set. An alternative is a very parallel sort — brute force.
If I have a list say l = [1, 8, 8, 8, 1, 3, 3, 8]
and it’s guaranteed that every element occurs an even number of times, how do I make a list with all elements of l
now occurring n/2
times. So since 1
occurred 2
times, it should now occur once. Since 8
occurs 4
times, it should now occur twice. Since 3
occurred twice, it should occur once.
So the new list will be something like k=[1,8,8,3]
What is the fastest way to do this?
I did list.count()
for every element but it was very slow.
Use a counter to keep track of the count of each element
from collections import Counter
l = [1,8,8,8,1,3,3,8]
res = []
count = Counter(l) # its like dict(1: 2, 8: 4, 3: 2)
for key, val in count.items():
res.extend(val//2 * [key])
print(res)
# output
[1, 8, 8, 3]
If you are not concerned about preserving relative order, you can first get a count of each element using collections.Counter
, then create a new list with each element duplicated half as many times.
>>> from collections import Counter
>>> from itertools import chain
>>> list(chain.from_iterable([key]*(count//2) for key, count in Counter(l).items()))
[1, 8, 8, 3]
Maybe this.
newList = []
for number in l:
if(newList.count(number) < l.count(number)/2):
newList.append(number)
print(newList)
If order isn’t important, a way would be to get the odd or even indexes only after a sort. Those lists will be the same so you only need one of them.
l = [1,8,8,8,1,3,3,8]
l.sort()
# Get all odd indexes
odd = l[1::2]
# Get all even indexes
even = l[::2]
print(odd)
print(odd == even)
Result:
[1, 3, 8, 8]
True
Since you guarantee that each element of the list occurs a multiple of 2, then it is faster to build the counter as you build the output list, rather than building a counter (or sort) first and using it later.
l = [1,8,8,8,1,3,3,8]
count={}
res=[]
for i in l:
if i in count: count[i]+=1
else: count[i]=1
if count[i]%2: res.append(i)
print(res)
Output
[1,8,8,3]
EDIT Comparing time/expense of each method
Using the timeit
module shows that this approach is 2.7 times faster than using a counter first.
i.e.
def one():
l = [1,8,8,8,1,3,3,8]
count={}
res=[]
for i in l:
if i in count: count[i]+=1
else: count[i]=1
if count[i]%2: res.append(i)
#print(res)
def two():
from collections import Counter
l = [1,8,8,8,1,3,3,8]
res = []
count = Counter(l) # its like dict(1: 2, 8: 4, 3: 2)
for key, val in count.items():
res.extend(val//2 * [key])
o=timeit.Timer(one)
t=timeit.Timer(two)
print(o.timeit(100000))
print(t.timeit(100000))
print(o.timeit(100000))
print(t.timeit(100000))
Output (seconds)
0.28666
0.80822
0.28678
0.80113
If order isn’t important, then Wimanicesir’s method would be preferred with 4x greater speedup, with result of 0.07037 (~11 times faster than with counter approach).
UPDATE
I suspected that using the Counter
method in two
(unordered) may come with significant bloat or slow down in import, so I tested the "count first, compile result later" method while counting with the simple method here from one
(ordered)
count={}
for i in l:
if i in count: count[i]+=1
else: count[i]=1
which was much faster than Counter
. Replacing Counter
in two
of the tests defined resulted in a time of 0.31 instead of 0.80. Still slightly faster to compile (ordered) result during counting as in two
, however. And much faster for unordered result to use Wimanicesir’s method.
you keep a list of all items that have been visited an uneven number of times.
then you iterate over all list items.
in other langauges would probably use some map() or filter() method, but here is some simple code since i don’t know python well enough! 🙂
l = [1,8,8,8,1,3,3,8]
seen = []
result = []
for num in l:
if num in seen:
seen.remove(num)
#result.append(num) #print every even appearance
else:
seen.append(num)
result.append(num) #print every odd appearance
if len(seen)==0:
print(result)
else:
print("Error: uneven elements found:", seen)
at the end the visited-array should be empty, so you can use that as a sanity check before returning the result-array.
edit: here’s a version with filter that returns the odd appearances
l = [1,8,8,8,1,3,3,8]
seen = []
result = list(filter(lambda x: seen.append(x) is None if x not in seen else not seen.remove(x) is None, l))
if len(seen)==0:
print(result)
else:
print("Error: uneven elements found:", seen)
and this one returns the even appearances:
l = [1,8,8,8,1,3,3,8]
seen = []
result = list(filter(lambda x: seen.remove(x) is None if x in seen else not seen.append(x) is None, l))
if len(seen)==0:
print(result)
else:
print("Error: uneven elements found:", seen)
Instead of using a counter, which keeps track of an integer for each possible element of the list, try mapping elements to booleans using a dictionary. Map to true the first time they’re seen, and then every time after that flip the bit, and if it’s true skip the element.
This is a classic use case of sets and I’m quite surprised that no one else tried it out to see how it stacks against the Counter
and dict
implementations.
I implemented a solution using set
instead as follows:
def set_impl(l):
bag = set()
res = []
for i in l:
if i in bag:
res.append(i)
bag.remove(i)
else:
bag.add(i)
This implementation is about 28% faster than using Counter
and 51% faster than using a dictionary.
The sort and slice implementation given by Wimanicesir is the fastest, giving results 17 times faster than when using set
. Note though, that because it sorts the items before removing duplicates, the order of appearance is not preserved unlike the other three.
Here are all the suggested implementations with timing for evaluation of comparative performance.
https://repl.it/@franzalex/StackOverflow-py#removeDuplicateHalf.py
import random
import statistics as stats
from collections import Counter as counter
from timeit import Timer
def slice_impl(l):
l.sort()
res = l[::2]
def dict_impl(l):
count={}
res=[]
for i in l:
if i in count:
count[i] += 1
else:
count[i] = 1
if count[i] % 2:
res.append(i)
def counter_impl(l):
count = counter(l)
res = []
for key, val in count.items():
res.extend(val//2 * [key])
def set_impl(l):
bag = set()
res = []
for i in l:
if i in bag:
res.append(i)
bag.remove(i)
else:
bag.add(i)
def timed_run():
for name, func in {"Sort and Slice": slice_impl,
"Dictionary": dict_impl,
"Counter": counter_impl,
"Set": set_impl}.items():
seq = list(range(50))*2
results = []
print(f"{name} Implementation Results")
for i in range(50):
if len(results) % 10: random.shuffle(seq) # shuffle after 10 runs
results.append(Timer(lambda: func(seq)).timeit(10**4))
# print(f"Run {i+1:02}: {results[i]:.6f}")
print("")
print(f"Median: {stats.median(results):.6f}")
print(f"Mean: {stats.mean(results):.6f}")
print(f"Std Dev: {stats.stdev(results):.6f}")
print("nn")
timed_run()
Sample run result
Sort and Slice Implementation Results Median: 0.009686 Mean: 0.009721 Std Dev: 0.000529 Dictionary Implementation Results Median: 0.230081 Mean: 0.227631 Std Dev: 0.014584 Counter Implementation Results Median: 0.192730 Mean: 0.194577 Std Dev: 0.008015 Set Implementation Results Median: 0.149604 Mean: 0.151227 Std Dev: 0.006838
import itertools
st=time.time()
lst = [1,8,8,8,1,3,3,8]
list(itertools.chain.from_iterable(itertools.repeat(x, int(lst.count(x)/2)) for x in list(set(lst)) if lst.count(x)%2==0))
This gives sorted list
If the order matters, following code can work with O(N):
import collections
c = collections.Counter(l)
c2 = collections.Counter()
i, n = 0, len(l)
res=[]
for x in l:
if i == n//2:break
if c2[x] < c[x] // 2:
res.append(x)
c2[x] += 1
i += 1
I know this has been answered and there are some quite lengthy solutions.
And it specifically mentioned Python.
However, I thought a Powershell solution might be interesting (and simple!) for some:
Version 1 (grouping items – less efficient)
$OriginalArray = @("1","8","8","8","1","3","3","8")
$NewArray = New-ObjectSystem.Collections.ArrayList
$ArrayGroup = $OriginalArray | Group-Object | Select-Object Count,Name
ForEach ($EachNumber in $ArrayGroup) {
$HalfTheCount = (1..([Math]::Round($EachNumber.Count / 2)))
ForEach ($Item in $HalfTheCount) {$NewArray.Add($EachNumber.Name) | Out-Null}
}
$NewArray
Version 2 (picking every other item from a sorted array – more efficient)
$OriginalArray = @("1","8","8","8","1","3","3","8")
$NewArray = New-Object System.Collections.ArrayList
$OddOrEven = "Even"
ForEach ($SortedItem in ($OriginalArray | Sort-Object)) {
If ($OddOrEven -eq "Even") {$NewArray.Add($SortedItem);$EvenNumber = $True}
If ($OddOrEven -eq "Odd") {$EvenNumber = $False}
If ($EvenNumber -eq $True) {$OddOrEven = "Odd"} Else {$OddOrEven = "Even"}
}
$NewArray
I like using a trie set, as you need to detect duplicates to remove them, or a big hash set (lots of buckets). The trie does not go unbalanced and you do not need to know the size of the final set. An alternative is a very parallel sort — brute force.