Efficient way to remove half of the duplicate items in a list

Question:

If I have a list say l = [1, 8, 8, 8, 1, 3, 3, 8] and it’s guaranteed that every element occurs an even number of times, how do I make a list with all elements of l now occurring n/2 times. So since 1 occurred 2 times, it should now occur once. Since 8 occurs 4 times, it should now occur twice. Since 3 occurred twice, it should occur once.

So the new list will be something like k=[1,8,8,3]

What is the fastest way to do this?
I did list.count() for every element but it was very slow.

Asked By: NePtUnE

||

Answers:

Use a counter to keep track of the count of each element

from collections import Counter
l = [1,8,8,8,1,3,3,8]
res = []
count = Counter(l) # its like dict(1: 2, 8: 4, 3: 2)
for key, val in count.items():
    res.extend(val//2 * [key])
print(res)
# output
[1, 8, 8, 3]
Answered By: user13392187

If you are not concerned about preserving relative order, you can first get a count of each element using collections.Counter, then create a new list with each element duplicated half as many times.

>>> from collections import Counter
>>> from itertools import chain
>>> list(chain.from_iterable([key]*(count//2) for key, count in Counter(l).items()))
[1, 8, 8, 3]
Answered By: Cory Kramer

Maybe this.

    newList = []
    for number in l:
        if(newList.count(number) < l.count(number)/2):
            newList.append(number)

print(newList)
Answered By: faiz-e

If order isn’t important, a way would be to get the odd or even indexes only after a sort. Those lists will be the same so you only need one of them.

l = [1,8,8,8,1,3,3,8]
l.sort()

# Get all odd indexes
odd = l[1::2]

# Get all even indexes
even = l[::2]

print(odd)
print(odd == even)

Result:

[1, 3, 8, 8]
True
Answered By: Wimanicesir

Since you guarantee that each element of the list occurs a multiple of 2, then it is faster to build the counter as you build the output list, rather than building a counter (or sort) first and using it later.

l = [1,8,8,8,1,3,3,8]
count={}
res=[]
for i in l:
  if i in count: count[i]+=1
  else: count[i]=1
  if count[i]%2: res.append(i)

print(res)

Output

[1,8,8,3]

EDIT Comparing time/expense of each method

Using the timeit module shows that this approach is 2.7 times faster than using a counter first.

i.e.

def one():
  l = [1,8,8,8,1,3,3,8]
  count={}
  res=[]
  for i in l:
    if i in count: count[i]+=1
    else: count[i]=1
    if count[i]%2: res.append(i)

  #print(res)


def two():
  from collections import Counter
  l = [1,8,8,8,1,3,3,8]
  res = []
  count = Counter(l) # its like dict(1: 2, 8: 4, 3: 2)
  for key, val in count.items():
    res.extend(val//2 * [key])

o=timeit.Timer(one)

t=timeit.Timer(two)

print(o.timeit(100000))

print(t.timeit(100000))

print(o.timeit(100000))

print(t.timeit(100000))

Output (seconds)

0.28666
0.80822
0.28678
0.80113

If order isn’t important, then Wimanicesir’s method would be preferred with 4x greater speedup, with result of 0.07037 (~11 times faster than with counter approach).

UPDATE
I suspected that using the Counter method in two (unordered) may come with significant bloat or slow down in import, so I tested the "count first, compile result later" method while counting with the simple method here from one (ordered)

count={}
for i in l:
  if i in count: count[i]+=1
  else: count[i]=1

which was much faster than Counter. Replacing Counter in two of the tests defined resulted in a time of 0.31 instead of 0.80. Still slightly faster to compile (ordered) result during counting as in two, however. And much faster for unordered result to use Wimanicesir’s method.

Answered By: jpf

you keep a list of all items that have been visited an uneven number of times.
then you iterate over all list items.

in other langauges would probably use some map() or filter() method, but here is some simple code since i don’t know python well enough! 🙂

l = [1,8,8,8,1,3,3,8]
seen = []
result = []
for num in l:
  if num in seen:
    seen.remove(num)
    #result.append(num) #print every even appearance
  else:
    seen.append(num)
    result.append(num) #print every odd appearance

if len(seen)==0:
  print(result)
else:
  print("Error: uneven elements found:", seen)

at the end the visited-array should be empty, so you can use that as a sanity check before returning the result-array.

edit: here’s a version with filter that returns the odd appearances

l = [1,8,8,8,1,3,3,8]
seen = []
result = list(filter(lambda x: seen.append(x) is None if x not in seen else not seen.remove(x) is None, l))

if len(seen)==0:
  print(result)
else:
  print("Error: uneven elements found:", seen)

and this one returns the even appearances:

l = [1,8,8,8,1,3,3,8]
seen = []
result = list(filter(lambda x: seen.remove(x) is None if x in seen else not seen.append(x) is None, l))

if len(seen)==0:
  print(result)
else:
  print("Error: uneven elements found:", seen)
Answered By: the great me

Instead of using a counter, which keeps track of an integer for each possible element of the list, try mapping elements to booleans using a dictionary. Map to true the first time they’re seen, and then every time after that flip the bit, and if it’s true skip the element.

Answered By: Jad Ghalayini

This is a classic use case of sets and I’m quite surprised that no one else tried it out to see how it stacks against the Counter and dict implementations.

I implemented a solution using set instead as follows:

def set_impl(l):
  bag = set()
  res = []
  for i in l:
    if i in bag:
      res.append(i)
      bag.remove(i)
    else:
      bag.add(i)

This implementation is about 28% faster than using Counter and 51% faster than using a dictionary.

The sort and slice implementation given by Wimanicesir is the fastest, giving results 17 times faster than when using set. Note though, that because it sorts the items before removing duplicates, the order of appearance is not preserved unlike the other three.

Here are all the suggested implementations with timing for evaluation of comparative performance.
https://repl.it/@franzalex/StackOverflow-py#removeDuplicateHalf.py

import random
import statistics as stats
from collections import Counter as counter
from timeit import Timer

def slice_impl(l):
  l.sort()
  res = l[::2]

def dict_impl(l):
  count={}
  res=[]
  for i in l:
    if i in count:
      count[i] += 1
    else:
      count[i] = 1
    if count[i] % 2:
      res.append(i)

def counter_impl(l):
  count = counter(l)
  res = []
  for key, val in count.items():
    res.extend(val//2 * [key])

def set_impl(l):
  bag = set()
  res = []
  for i in l:
    if i in bag:
      res.append(i)
      bag.remove(i)
    else:
      bag.add(i)

def timed_run():
  for name, func in {"Sort and Slice": slice_impl, 
                     "Dictionary": dict_impl, 
                     "Counter": counter_impl, 
                     "Set": set_impl}.items():
    seq = list(range(50))*2
    results = []
    print(f"{name} Implementation Results")
    for i in range(50):
      if len(results) % 10: random.shuffle(seq) # shuffle after 10 runs
      results.append(Timer(lambda: func(seq)).timeit(10**4))
      # print(f"Run {i+1:02}: {results[i]:.6f}")
    print("")
    print(f"Median:  {stats.median(results):.6f}")
    print(f"Mean:    {stats.mean(results):.6f}")
    print(f"Std Dev: {stats.stdev(results):.6f}")
    print("nn")

timed_run()

Sample run result

Sort and Slice Implementation Results

Median:  0.009686
Mean:    0.009721
Std Dev: 0.000529


Dictionary Implementation Results

Median:  0.230081
Mean:    0.227631
Std Dev: 0.014584


Counter Implementation Results

Median:  0.192730
Mean:    0.194577
Std Dev: 0.008015


Set Implementation Results

Median:  0.149604
Mean:    0.151227
Std Dev: 0.006838
Answered By: Alex Essilfie
import itertools

st=time.time()
lst = [1,8,8,8,1,3,3,8]
list(itertools.chain.from_iterable(itertools.repeat(x, int(lst.count(x)/2)) for x in list(set(lst)) if lst.count(x)%2==0))

This gives sorted list

Answered By: JUGAL KISHORE

If the order matters, following code can work with O(N):

import collections
c = collections.Counter(l)
c2 = collections.Counter()
i, n = 0, len(l)
res=[]
for x in l:
    if i == n//2:break
    if c2[x] < c[x] // 2:
        res.append(x)
        c2[x] += 1
        i += 1
Answered By: Yang Liu

I know this has been answered and there are some quite lengthy solutions.
And it specifically mentioned Python.
However, I thought a Powershell solution might be interesting (and simple!) for some:

Version 1 (grouping items – less efficient)

$OriginalArray = @("1","8","8","8","1","3","3","8") 
$NewArray = New-ObjectSystem.Collections.ArrayList
$ArrayGroup = $OriginalArray | Group-Object | Select-Object Count,Name

ForEach ($EachNumber in $ArrayGroup) {
    $HalfTheCount = (1..([Math]::Round($EachNumber.Count / 2)))
    ForEach ($Item in $HalfTheCount) {$NewArray.Add($EachNumber.Name) | Out-Null}   
    } 
$NewArray

Version 2 (picking every other item from a sorted array – more efficient)

$OriginalArray = @("1","8","8","8","1","3","3","8") 

$NewArray = New-Object System.Collections.ArrayList

$OddOrEven = "Even"
ForEach ($SortedItem in ($OriginalArray | Sort-Object)) {
    If ($OddOrEven -eq "Even") {$NewArray.Add($SortedItem);$EvenNumber = $True}
    If ($OddOrEven -eq "Odd") {$EvenNumber = $False}
    If ($EvenNumber -eq $True) {$OddOrEven = "Odd"} Else {$OddOrEven = "Even"} 
}
$NewArray
Answered By: Andy Pyne

I like using a trie set, as you need to detect duplicates to remove them, or a big hash set (lots of buckets). The trie does not go unbalanced and you do not need to know the size of the final set. An alternative is a very parallel sort — brute force.

Answered By: David G. Pickett
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.