Remove duplicates in a list of objects with Python

Question:

I’ve got a list of objects and I’ve got a db table full of records. My list of objects has a title attribute and I want to remove any objects with duplicate titles from the list (leaving the original).

Then I want to check if my list of objects has any duplicates of any records in the database and if so, remove those items from list before adding them to the database.

I have seen solutions for removing duplicates from a list like this: myList = list(set(myList)), but I’m not sure how to do that with a list of objects?

I need to maintain the order of my list of objects too. I was also thinking maybe I could use difflib to check for differences in the titles.

Asked By: imns


Answers:

Since your objects presumably aren’t hashable, you can’t put them in a set directly. Their titles should be, though.

Here’s the first part.

seen_titles = set()
new_list = []
for obj in myList:
    if obj.title not in seen_titles:
        new_list.append(obj)
        seen_titles.add(obj.title)

You’re going to need to describe what database/ORM etc. you’re using for the second part though.
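In the meantime, here is a database-agnostic sketch of the second part: fetch the titles that already exist into a set once, then filter the deduplicated list against it before inserting. `fetch_existing_titles` is a hypothetical stand-in for whatever query your database/ORM provides.

```python
class Obj:
    def __init__(self, title):
        self.title = title

def fetch_existing_titles():
    # Hypothetical stand-in: replace with a real query against your
    # database/ORM that returns the set of titles already stored.
    return {"Dune", "Neuromancer"}

new_list = [Obj("Dune"), Obj("Hyperion")]
existing = fetch_existing_titles()
to_insert = [obj for obj in new_list if obj.title not in existing]
print([obj.title for obj in to_insert])  # ['Hyperion']
```

A single set lookup per object keeps this O(n) instead of querying the database once per item.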

Answered By: aaronasterling

This seems pretty minimal:

new_dict = dict()
for obj in myList:
    if obj.title not in new_dict:
        new_dict[obj.title] = obj
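To get an ordered list back out of that dict (dicts preserve insertion order in Python 3.7+), take its values. A quick self-contained check with a minimal stand-in class:

```python
class Obj:
    def __init__(self, title):
        self.title = title

myList = [Obj("a"), Obj("b"), Obj("a")]

new_dict = dict()
for obj in myList:
    if obj.title not in new_dict:
        new_dict[obj.title] = obj  # keeps the first object seen per title

deduped = list(new_dict.values())
print([obj.title for obj in deduped])  # ['a', 'b']
```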
Answered By: hughdbrown

Using set(list_of_objects) will only remove duplicates if Python knows what a duplicate is; that is, you need to define uniqueness for your objects.

In order to do that, you’ll need to make the object hashable, which means defining both the __hash__ and __eq__ methods. Here is how:

http://docs.python.org/glossary.html#term-hashable

Note that in Python 3, a class that defines __eq__ without __hash__ has its __hash__ set to None and becomes unhashable, so you will need to define both methods.

EDIT: How to implement the __eq__ method:

You’ll need to know, as I mentioned, the uniqueness definition for your object. Suppose we have a Book with attributes author_name and title whose combination is unique (so we can have many books authored by Stephen King, and many books named The Shining, but only one book named The Shining by Stephen King); then the implementation is as follows:

def __eq__(self, other):
    return (self.author_name == other.author_name
            and self.title == other.title)

Similarly, this is how I sometimes implement the __hash__ method:

def __hash__(self):
    return hash(('title', self.title,
                 'author_name', self.author_name))

You can check that if you create a list of 2 books with the same author and title, ~~the book objects will be the same (with the is operator) and~~ they will be equal (with the == operator). Also, when set() is used, it will remove one book.
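Putting the two methods together, here is a complete sketch of such a Book class (the full class is my illustration, not part of the original answer):

```python
class Book:
    def __init__(self, author_name, title):
        self.author_name = author_name
        self.title = title

    def __eq__(self, other):
        return (self.author_name == other.author_name
                and self.title == other.title)

    def __hash__(self):
        return hash(('title', self.title,
                     'author_name', self.author_name))

a = Book("Stephen King", "The Shining")
b = Book("Stephen King", "The Shining")
print(a == b)       # True: equal by author and title
print(a is b)       # False: still two distinct objects
print(len({a, b}))  # 1: the set keeps only one of them
```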

EDIT: This is an old answer of mine, but I only now notice that it contained the error corrected with strikethrough above: objects with the same hash() will not compare as identical with is. Hashability does matter, however, if you intend to use objects as elements of a set or as keys in a dictionary.

Answered By: vonPetrushev

It’s quite easy, friends (note that this does not preserve the original order, and only works for hashable elements):

a = [5,6,7,32,32,32,32,32,32,32,32]

a = list(set(a))

print (a)

[5,6,7,32]

That’s it! 🙂

Answered By: Spiderman

If you want to preserve the original order, use this:

seen = {}
new_list = [seen.setdefault(x, x) for x in my_list if x not in seen]

If you don’t care about ordering, then use this:

new_list = list(set(my_list))
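A quick check of the order-preserving version, using plain hashable values for illustration:

```python
my_list = [3, 1, 3, 2, 1]

seen = {}
new_list = [seen.setdefault(x, x) for x in my_list if x not in seen]
print(new_list)  # [3, 1, 2]: first occurrences, original order kept
```

The setdefault call both records the item in seen and returns it, so the comprehension filters and builds the result in one pass.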
Answered By: Amir

Both __hash__ and __eq__ are needed for this.

__hash__ is needed to add an object to a set, since Python’s sets are implemented as hash tables. By default, immutable objects like numbers, strings, and tuples are hashable.

However, hash collisions (two distinct objects hashing to the same value) are inevitable, due to the pigeonhole principle. So, two objects cannot be distinguished only using their hash, and the user must specify their own __eq__ function. Thus, the actual hash function the user provides is not crucial, though it is best to try to avoid hash collisions for performance (see What's a correct and good way to implement __hash__()?).
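To see that __eq__, not the hash alone, decides set membership, here is a deliberately collision-heavy sketch: every instance hashes to the same value, yet the set still separates unequal objects (at the cost of an equality comparison on every lookup):

```python
class Collider:
    """Every instance collides on purpose; __eq__ does the real work."""

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 0  # pathological but legal: all instances share one bucket

    def __eq__(self, other):
        return isinstance(other, Collider) and self.value == other.value

s = {Collider(1), Collider(1), Collider(2)}
print(len(s))  # 2: equal objects merged, unequal ones kept despite colliding
```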

Answered By: qwr

I recently ended up using the code below. Like the other answers, it iterates over the list and records what it has seen, but instead of building a new list it deletes duplicates from the original list. Note that it iterates over a copy (objList[:]), because removing items from a list while iterating over it skips elements:

seen = {}
for obj in objList[:]:  # iterate over a copy so removal doesn't skip items
    if obj["key-property"] in seen:
        objList.remove(obj)
    else:
        seen[obj["key-property"]] = 1
Answered By: binW

If you can’t (or won’t) define __eq__ for the objects, you can use a dict-comprehension to achieve the same end:

unique = list({item.attribute:item for item in mylist}.values())

Note that this keeps the last instance of each key, e.g.
for mylist = [Item(attribute=1, tag='first'), Item(attribute=1, tag='second'), Item(attribute=2, tag='third')] you get [Item(attribute=1, tag='second'), Item(attribute=2, tag='third')]. You can get around this by iterating over mylist[::-1] instead (and reversing the result to restore the original order).
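A concrete run, with a namedtuple standing in for the Item class, including the reversed-input trick to keep the first instance instead of the last:

```python
from collections import namedtuple

Item = namedtuple("Item", ["attribute", "tag"])

mylist = [Item(1, "first"), Item(1, "second"), Item(2, "third")]

# The plain comprehension keeps the last instance per key:
last = list({item.attribute: item for item in mylist}.values())
print([i.tag for i in last])  # ['second', 'third']

# Feeding the reversed list keeps the first instance instead;
# reverse again to restore the original order:
first = list({item.attribute: item for item in mylist[::-1]}.values())[::-1]
print([i.tag for i in first])  # ['first', 'third']
```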

Answered By: Dave

For non-hashable types you can use a dictionary comprehension to remove duplicate objects based on a field shared by all objects. This is particularly useful for Pydantic, whose models are not hashable by default:

{ row.title : row for row in rows }.values()

Note that this will consider duplicates based solely on row.title, and will keep the last matched object for each title. This means that if your rows may have the same title but different values in other attributes, this won’t work.

e.g. [{"title": "test", "myval": 1}, {"title": "test", "myval": 2}] ==> [{"title": "test", "myval": 2}]

If you wanted to match against multiple fields in row, you could extend this further:

{ f"{row.title}\0{row.value}" : row for row in rows }.values()

The null character is used as a separator between fields. This assumes that the null character isn’t used in either row.title or row.value.
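A self-contained check of the multi-field key, with a plain class standing in for the Pydantic model and \0 as the separator:

```python
class Row:
    def __init__(self, title, value):
        self.title = title
        self.value = value

rows = [Row("test", 1), Row("test", 1), Row("test", 2)]

# Keys like "test\x001" combine both fields; the last match per key wins.
unique = list({f"{row.title}\0{row.value}": row for row in rows}.values())
print([(r.title, r.value) for r in unique])  # [('test', 1), ('test', 2)]
```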

Answered By: Thomas Anderson