Remove consecutive duplicates from nested list in Python?

Question:

I have a nested list that has the following structure:

mylist = [['A', 'Car', '15'], ['A', 'Car', '15'], ['A', 'Plane', '16'], ['A', 'Bike', '20'], ['A', 'Car', '16'], ['A', 'Boat', '16']]

It’s super long, with around 10 million elements. And I have many of these lists. What I want to do is:

If consecutive elements of mylist have the same third item (the number stored as a string), keep only the first of them and remove the rest.

For example:

['A', 'Car', '15'] and ['A', 'Car', '15'] are consecutive elements from mylist, and they both contain '15', so they are consecutive duplicates, and one should be removed.

Similarly, ['A', 'Car', '16'] and ['A', 'Boat', '16'] are consecutive and both contain '16', so one should be removed.

So, what I would end up with is:

newlist = [['A', 'Car', '15'], ['A', 'Plane', '16'], ['A', 'Bike', '20'], ['A', 'Car', '16']]

I initially wrote this:

for ele in mylist:
    eleindex = mylist.index(ele)
    previousele = mylist[eleindex-1]
    if float(ele[2]) != float(previousele[2]):
        newlist.append(ele)

Unfortunately, the code I wrote took way too long for such long lists. So I began looking online and learned that the itertools library (using groupby) is useful and very fast at doing these kinds of things. I then found some examples that I tried emulating. However, they were mainly for simple lists, not something a little more complicated like my situation. After tinkering around, I wasn’t able to figure out how to use it for my nested lists.

So, does anyone know how to do this very quickly? Also, if you have a solution that will be faster than itertools, that’s even better!

Asked By: George Orwell

Answers:

A solution with itertools.groupby:

from itertools import groupby

mylist = [['A', 'Car', '15'], ['A', 'Car', '15'], ['A', 'Plane', '16'], ['A', 'Bike', '20'], ['A', 'Car', '16'], ['A', 'Boat', '16']]

out = [next(g) for _, g in groupby(mylist, lambda k: k[2])]

print(out)

Prints:

[['A', 'Car', '15'], ['A', 'Plane', '16'], ['A', 'Bike', '20'], ['A', 'Car', '16']]
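
Note that the key above compares the third items as strings, so '15' and '15.0' would count as different values. The loop in the question compared them as floats. If numeric comparison is what is wanted, the key can convert first. A minimal sketch, assuming every third item parses as a number (dedupe_by_number is a hypothetical helper name, not part of itertools):

```python
from itertools import groupby

def dedupe_by_number(rows):
    # Group consecutive rows whose third item is numerically equal,
    # then keep only the first row of each group.
    return [next(group) for _, group in groupby(rows, key=lambda row: float(row[2]))]

sample = [['A', 'Car', '15'], ['A', 'Car', '15.0'], ['A', 'Plane', '16']]
print(dedupe_by_number(sample))
# [['A', 'Car', '15'], ['A', 'Plane', '16']]
```

The float conversion adds work per element, so the string key is probably the better choice when the numbers are always formatted the same way.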

Benchmark (with a 10_000_000-item list):

from timeit import timeit
from random import randint
from itertools import groupby

mylist = []
for i in range(10_000_000):
    mylist.append(['X', 'X', str(randint(0, 20))])

def f1():
    out = [next(g) for _, g in groupby(mylist, lambda k: k[2])]
    return out

t1 = timeit(lambda: f1(), number=1)

print(t1)

This prints on my machine (AMD 2400G, Python 3.8):

2.408908904006239
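
For comparison, two variants that might be worth timing with the same benchmark. The loop in the question calls mylist.index(ele) for every element, which searches the list from the start each time (and returns the first equal element rather than the current one), so a single pass that just remembers the previous key avoids that cost. The second variant is the same groupby approach with operator.itemgetter in place of the lambda. This is only a sketch: dedupe_loop and dedupe_itemgetter are hypothetical names, and no timings are claimed for either.

```python
from itertools import groupby
from operator import itemgetter

def dedupe_loop(rows):
    # Keep a row only when its third item differs from the previous row's.
    out = []
    previous = object()  # sentinel that never equals a real key
    for row in rows:
        key = row[2]
        if key != previous:
            out.append(row)
            previous = key
    return out

def dedupe_itemgetter(rows):
    # Same groupby approach, with itemgetter(2) as the key function.
    return [next(group) for _, group in groupby(rows, key=itemgetter(2))]

mylist = [['A', 'Car', '15'], ['A', 'Car', '15'], ['A', 'Plane', '16'],
          ['A', 'Bike', '20'], ['A', 'Car', '16'], ['A', 'Boat', '16']]

print(dedupe_loop(mylist))
print(dedupe_itemgetter(mylist))
```

Both calls print the same four-element list as the groupby solution above. Which version is fastest depends on the data and the Python version, so it should be measured rather than assumed.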

Answered By: Andrej Kesely