Need to remove (and partially merge) nearly duplicate items from a list of dictionaries

Question:

I have a list of dictionaries in this form (example): [{'name': 'aa', 'year': 2022}, {'name': 'aa', 'year': 2021}, {'name': 'bb', 'year': 2016}, {'name': 'cc', 'year': 2015}]. What I need is to remove the items where the name is the same, but collect their years into a list (every year can be in a list, even a single one; for my purposes this doesn't matter). So the example list of dictionaries would look like this: [{'name': 'aa', 'year': [2022, 2021]}, {'name': 'bb', 'year': [2016]}, {'name': 'cc', 'year': [2015]}]. My current code looks like this:

def read_csv_file(self, path):
    book_list = []
    with open(path) as f:
        read_dict = csv.DictReader(f)
        for i in read_dict:
            book_list.append(i)
           

    bestsellers = []
    for i in book_list:
        seen_books = []
        years_list = []
        if i["Name"] not in seen_books:
            years_list.append(i["Year"])
            seen_books.append(i)
        else:
            years_list.append(i["Year"])

        if i['Genre'] == 'Fiction':
            bestsellers.append(FictionBook(i["Name"], i["Author"], float(i["User Rating"]), int(i["Reviews"]), float(i["Price"]), years_list, i["Genre"]))
        else:
            bestsellers.append(NonFictionBook(i["Name"], i["Author"], float(i["User Rating"]), int(i["Reviews"]), float(i["Price"]), years_list, i["Genre"]))
    for i in bestsellers:
        print(i.title)

Ultimately my code needs to extract data from a CSV file and then create instances of the class FictionBook or NonFictionBook depending on the genre. I think I have the CSV reading and the book creation finished; I just need to filter the near-duplicate dictionaries and merge their years into lists, if that makes sense. If anything is unclear please let me know, so I can explain further.

Asked By: Lijpe

||

Answers:

This works:

dict_list = [{'name': 'aa', 'year': 2022}, {'name': 'aa', 'year': 2021}, {'name': 'bb', 'year': 2016}, {'name': 'cc', 'year': 2015}]

new_dict_list = []
names_seen = set()
for name in [d['name'] for d in dict_list]:
    if name not in names_seen:
        # collect every year recorded under this name
        new_dict_list.append({'name': name, 'year': [d['year'] for d in dict_list if d['name'] == name]})
    names_seen.add(name)

new_dict_list
# Output:
# [{'name': 'aa', 'year': [2022, 2021]},
#  {'name': 'bb', 'year': [2016]},
#  {'name': 'cc', 'year': [2015]}]
Answered By: Swifty

Use dict.setdefault() to create a list if the key has not yet been seen:

lod=[{'name': 'aa', 'year': 2022}, {'name': 'aa', 'year': 2021}, {'name': 'bb', 'year': 2016}, {'name': 'cc', 'year': 2015}]

result={}
for d in lod:
    result.setdefault(d['name'], []).append(d['year'])

>>> result
{'aa': [2022, 2021], 'bb': [2016], 'cc': [2015]}

Then put the list back together:

>>> [{'name': n, 'year': v} for n,v in result.items()]
[{'name': 'aa', 'year': [2022, 2021]}, {'name': 'bb', 'year': [2016]}, {'name': 'cc', 'year': [2015]}]
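
As a variation on the same idea, collections.defaultdict can stand in for setdefault(). This is only a sketch of an equivalent approach; the names years_by_name and merged are made up for the illustration:

```python
from collections import defaultdict

lod = [
    {'name': 'aa', 'year': 2022},
    {'name': 'aa', 'year': 2021},
    {'name': 'bb', 'year': 2016},
    {'name': 'cc', 'year': 2015},
]

# A missing key starts out as an empty list, so no setdefault() call is needed.
years_by_name = defaultdict(list)
for d in lod:
    years_by_name[d['name']].append(d['year'])

# Rebuild the list of dictionaries, one per unique name, in first-seen order.
merged = [{'name': n, 'year': ys} for n, ys in years_by_name.items()]
```

Both versions make a single pass over the input, so which one to use is mostly a matter of taste.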

From comments:

Great answer, thanks. How would I go about implementing this in my system if I have more than 2 key/value pairs per dictionary? For example, {name: aa, singer: bb, album: gg, year: 2022}.
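
Taken literally, the setdefault() approach can be stretched to cover extra keys. The sketch below assumes that rows with the same name agree on every other field, so the values from the first row seen are kept; merge_by_key and its parameters are hypothetical names for the illustration:

```python
def merge_by_key(rows, key='name', collect='year'):
    """Merge rows sharing `key`; gather each row's `collect` value into a list."""
    merged = {}
    for row in rows:
        # First time this key is seen: store a copy of the row with an empty list.
        # Later rows with the same key get the stored copy back instead.
        entry = merged.setdefault(row[key], {**row, collect: []})
        entry[collect].append(row[collect])
    return list(merged.values())

rows = [
    {'name': 'aa', 'singer': 'bb', 'album': 'gg', 'year': 2022},
    {'name': 'aa', 'singer': 'bb', 'album': 'gg', 'year': 2021},
    {'name': 'cc', 'singer': 'dd', 'album': 'hh', 'year': 2016},
]
result = merge_by_key(rows)
```

The input rows are copied rather than modified, so the original list stays usable. If rows with the same name can disagree on other fields, this silently keeps the first values, which may or may not be what is wanted.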

I would do what you are describing differently. It appears you are creating a database of books, albums and authors. Use a class to describe each piece of data that you want to catalog.

Consider this simple entry for a piece of art, book, etc:

class Entry:
    def __init__(self, n, name=None, author=None, singer=None, title=None, year=None):
        self.num=n
        self.title=title
        self.singer=singer
        self.name=name
        self.year=year
        self.author=author
        # etc
        
    def __repr__(self):   # allows each item to be printed
        return f"Entry(num={self.num}, year={self.year}, author={self.author!r})"

Now create some dummy entries:

import random

entries=[Entry(i, 
            author=random.choice(['Bob', 'Carol', 'Ted', 'Alice', 'Lisa']),
            year=random.randint(1700, 2022)
        ) 
        for i in range(3_000_000)]

Creating 3,000,000 entries (a bit more than 1% of the Library of Congress book catalog) takes about 5 seconds.

You could query it like so:

# book for 1799 with an author with 'a' in the name?

[e for e in entries if e.year==1799 and 'a' in e.author.lower() ]

That query took about 1.4 secs on my computer.

It would be monumentally faster with a better data structure than a list of objects (whether those objects are dicts or instances of the class shown here).

A candidate would be a form of a tree but it all depends on what you are looking to query from this data. The Dewey Decimal System is a particular form of a tree.
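
As one small illustration of the point, a plain dictionary index (a hash lookup, not the tree mentioned above) already avoids scanning every entry. This is a sketch under assumptions: the Entry class here is a cut-down stand-in, and by_year is a made-up name:

```python
from collections import defaultdict

class Entry:
    # Cut-down stand-in for the Entry class above (hypothetical, for illustration).
    def __init__(self, n, author=None, year=None):
        self.num = n
        self.author = author
        self.year = year

entries = [
    Entry(0, author='Bob', year=1799),
    Entry(1, author='Carol', year=1799),
    Entry(2, author='Ted', year=1850),
    Entry(3, author='Alice', year=1799),
]

# Build the index once: year -> list of entries for that year.
by_year = defaultdict(list)
for e in entries:
    by_year[e.year].append(e)

# The query now scans only the 1799 entries instead of the whole list.
hits = [e for e in by_year.get(1799, []) if 'a' in e.author.lower()]
hit_nums = [e.num for e in hits]
```

The index costs one extra pass and some memory up front, and it only helps queries that filter on the indexed field, so the right structure still depends on what you want to ask of the data.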

Answered By: dawg