Filtering dataclass instances by unique attribute value

Question:

I have a list of dataclass instances in the form of:

dataclass_list = [DataEntry(company="Microsoft", users=["Jane Doe", "John Doe"]), DataEntry(company="Google", users=["Bob Whoever"]), DataEntry(company="Microsoft", users=[])]

Now I would like to filter that list and get only unique instances by a certain key (company in this case).

The desired list:

new_list = [DataEntry(company="Microsoft", users=["Jane Doe", "John Doe"]), DataEntry(company="Google", users=["Bob Whoever"])]

My original idea was to use something like Python's set() or filter(), but neither seems to work here.

My working solution so far:

tup_list = [(dataclass, dataclass.company) for dataclass in dataclass_list]
new_list = []
check_list = []
for tup in tup_list:
    if tup[1].lower() not in check_list:
        new_list.append(tup[0])
        check_list.append(tup[1].lower())

This gives me the desired output, but I was wondering whether there is a more Pythonic or elegant solution.

Asked By: Arthuro

Answers:

Here's another solution; whether you find it more elegant is up to you:

unique = {}
for dc in dataclass_list:
    if dc.company not in unique:
        unique[dc.company] = dc
new_list = list(unique.values())
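
Note that the loop above compares company names case-sensitively, while the loop in the question lowercases them first. If that matters, the same idea could be sketched with dict.setdefault and a lowercased key. The DataEntry definition below is an assumption about what the question's dataclass looks like, and the lowercase "microsoft" entry is made up to show the effect:

```python
from dataclasses import dataclass, field


@dataclass
class DataEntry:
    # Assumed shape of the dataclass from the question
    company: str
    users: list = field(default_factory=list)


dataclass_list = [
    DataEntry(company="Microsoft", users=["Jane Doe", "John Doe"]),
    DataEntry(company="Google", users=["Bob Whoever"]),
    DataEntry(company="microsoft", users=[]),  # differs only in case
]

unique = {}
for dc in dataclass_list:
    # setdefault stores dc only if the lowercased key is not present yet,
    # so the first entry per company wins
    unique.setdefault(dc.company.lower(), dc)

new_list = list(unique.values())
print(new_list)
```

Because dicts preserve insertion order, new_list keeps the entries in the order in which each company first appeared.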
Answered By: LukasNeugebauer

To make set() work on your DataEntry dataclass, you need to override the __eq__(...) and __hash__(...) methods. In them you specify which attribute is used to compute an object's hash value and to decide when two objects are considered equal.

Below is a short example in which the name attribute of the class Company is used by default to determine whether two objects are equal. I have also extended your case with an option to choose, when constructing the object, which attribute is considered for uniqueness. Note that all objects being compared need to have the same comparison_attr.

import pprint

class Company:

    def __init__(self, name, location, comparison_attr="name") -> None:
        # By default we use the attribute `name` for comparison
        self.name = name
        self.location = location
        self.__comparison_attr = comparison_attr

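    def __hash__(self) -> int:
        # Hash only the attribute that was selected for comparison
        return hash(getattr(self, self.__comparison_attr))

    def __eq__(self, other: object) -> bool:
        # Let Python handle comparisons with unrelated types
        if not isinstance(other, Company):
            return NotImplemented
        return getattr(self, self.__comparison_attr) == getattr(other, self.__comparison_attr)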

    def __repr__(self) -> str:
        return f"name={self.name}, location={self.location}"

for attribute_name in ["name", "location"]:
    companies = [
        Company("Google", "Palo Alto", comparison_attr=attribute_name), 
        Company("Google", "Berlin", comparison_attr=attribute_name),
        Company("Microsoft", "Berlin", comparison_attr=attribute_name),
        Company("Microsoft", "San Francisco", comparison_attr=attribute_name),
        Company("IBM", "Palo Alto", comparison_attr=attribute_name),
    ]

    print(f"Attribute considered for uniqueness: {attribute_name}")
    pprint.pprint(set(companies))

Output (the order of the elements within each set may vary):

Attribute considered for uniqueness: name
{name=Microsoft, location=Berlin,
 name=Google, location=Palo Alto,
 name=IBM, location=Palo Alto}

Attribute considered for uniqueness: location
{name=Microsoft, location=San Francisco,
 name=Google, location=Berlin,
 name=Google, location=Palo Alto}
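
Since the question uses a dataclass, the same effect can probably be had without hand-written __eq__ and __hash__ methods. The following is a minimal sketch, assuming DataEntry has just the two fields shown in the question: field(compare=False) excludes users from the generated comparison, and unsafe_hash=True asks the decorator to generate a matching __hash__. dict.fromkeys is used instead of set() here so that the original order and the first entry per company are kept:

```python
from dataclasses import dataclass, field


@dataclass(unsafe_hash=True)
class DataEntry:
    # Assumed fields; only `company` takes part in __eq__ and __hash__
    company: str
    users: list = field(default_factory=list, compare=False)


entries = [
    DataEntry(company="Microsoft", users=["Jane Doe", "John Doe"]),
    DataEntry(company="Google", users=["Bob Whoever"]),
    DataEntry(company="Microsoft", users=[]),
]

# Equal keys collapse into one dict key; the first object seen is the one kept
unique_entries = list(dict.fromkeys(entries))
print(unique_entries)
```

Two caveats: with this definition, two entries with the same company compare equal everywhere in your program, not only while deduplicating, and changing company on an object that is already stored in a set or dict will break lookups for it.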
Answered By: AndrejH

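The shortest and most readable solution is a dict comprehension. Note that it keeps the last entry for each company rather than the first, as the output shows: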

dataclass_list = [DataEntry(company="Microsoft", users=["Jane Doe", "John Doe"]), DataEntry(company="Google", users=["Bob Whoever"]), DataEntry(company="Microsoft", users=[])]

unique_companies = {data_entry.company: data_entry for data_entry in dataclass_list}.values()

print(unique_companies)
# output: dict_values([DataEntry(company='Microsoft', users=[]), DataEntry(company='Google', users=['Bob Whoever'])])
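
If the first entry per company should win, as in the desired list from the question, one option is a small generator that remembers which keys it has already seen. This is only a sketch: unique_by is a hypothetical helper name, not a standard library function, and the DataEntry definition is assumed from the question.

```python
from dataclasses import dataclass, field
from operator import attrgetter
from typing import Callable, Hashable, Iterable, Iterator, TypeVar

T = TypeVar("T")


@dataclass
class DataEntry:
    # Assumed shape of the dataclass from the question
    company: str
    users: list = field(default_factory=list)


def unique_by(items: Iterable[T], key: Callable[[T], Hashable]) -> Iterator[T]:
    """Yield each item whose key has not been seen before (hypothetical helper)."""
    seen = set()
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            yield item


dataclass_list = [
    DataEntry(company="Microsoft", users=["Jane Doe", "John Doe"]),
    DataEntry(company="Google", users=["Bob Whoever"]),
    DataEntry(company="Microsoft", users=[]),
]

first_per_company = list(unique_by(dataclass_list, key=attrgetter("company")))
print(first_per_company)
```

Passing key=lambda e: e.company.lower() instead would reproduce the case-insensitive comparison from the question's own loop.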

Answered By: Oenomaus