Merge two (or more) lists of dictionaries pairing using a specific key

Question:

I have a set of different lists of dictionaries (actually obtained by reading Excel worksheets) and I need to do a join on them (the result I want keeps unmatched records too, so in SQL terms it is a full outer join rather than an "inner join"):

  • each list is equivalent to a database table (each dict is a record)
  • each record has a specific key guaranteed to be unique within its list (the "idx" column in the example below)
  • I need to produce another list of dictionaries where each dictionary has a given "idx" and all the other key/value pairs found in every list where "idx" matches

To exemplify:

a = [{'idx': 1, 'foo': 'xx1', 'bar': 'yy1'},
     {'idx': 0, 'foo': 'xx0', 'bar': 'yy0'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'},
     {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'}]

and I want to have:

c = [{'idx': 0, 'foo': 'xx0', 'bar': 'yy0', 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 1, 'foo': 'xx1', 'bar': 'yy1', 'fie': 'zz1', 'fom': 'kk1'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]

Of course, the problem is that the various lists may have different lengths and may not be sorted nicely.

Is there an easy way to do this or should I do nested loops explicitly searching for the matching record?

This actually works, but I’m VERY unsure it’s the "most pythonic way":

a = [{'idx': 0, 'foo': 'xx0', 'bar': 'yy0'},
     {'idx': 1, 'foo': 'xx1', 'bar': 'yy1'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]

c = [{'idx': 0, 'foo': 'xx0', 'bar': 'yy0', 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 1, 'foo': 'xx1', 'bar': 'yy1', 'fie': 'zz1', 'fom': 'kk1'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]

li = [a, b]
t = [{z['idx']: z for z in w} for w in li]
r = {}
for k in t:
    for j in k:
        if j in r:
            r[j].update(k[j])
        else:
            r[j] = k[j]
r = [t for t in r.values()]

print(r)
[{'idx': 0, 'foo': 'xx0', 'bar': 'yy0', 'fie': 'zz0', 'fom': 'kk0'}, {'idx': 1, 'foo': 'xx1', 'bar': 'yy1', 'fie': 'zz1', 'fom': 'kk1'}, {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}, {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]

Can someone come up with something better?

Asked By: ZioByte

||

Answers:

If you are using Python 3.9 or later you can use the dict union operator (|); in older versions use update(). Both loops below do the same job, so pick one. (I added a third, shorter list to the example.)

a = [{'idx': 0, 'foo': 'xx0', 'bar': 'yy0'}, {'idx': 1, 'foo': 'xx1', 'bar': 'yy1'}, {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'}, {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'}, {'idx': 2, 'fie': 'zz2', 'fom': 'kk2'}]
c = [{'idx': 0, 'ief': 'zz0', 'mof': 'kk0'}, {'idx': 1, 'ief': 'zz1', 'mof': 'kk1'}]

lists = [b, c]

# with union
for lst in lists:
    for i, d in enumerate(lst):
        a[i] = a[i] | d

# with update
for lst in lists:
    for i, d in enumerate(lst):
        a[i].update(d)

print(a)
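
To make the difference between the two forms concrete, here is a minimal sketch, assuming Python 3.9+ for the | operator. The names left, right and union are made up for the illustration. The operator builds a new dictionary and leaves both operands alone, while update() changes the left-hand dictionary in place and returns None.

```python
left = {'idx': 0, 'foo': 'xx0'}
right = {'idx': 0, 'fie': 'zz0'}

# The | operator (Python 3.9+) returns a new dict; left and right are unchanged.
union = left | right
left_unchanged = left == {'idx': 0, 'foo': 'xx0'}

# update() mutates left in place and returns None.
returned = left.update(right)

print(union)
print(left_unchanged, returned, left == union)
```

In the loops above this is why the union version has to assign back to a[i], while the update version does not.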

Edit:

If the lists are not sorted or don’t contain the same keys, you can sort them during the merge and append the records that are missing:

a = [{'idx': 1, 'foo': 'xx1', 'bar': 'yy1'},
     {'idx': 0, 'foo': 'xx0', 'bar': 'yy0'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'},
     {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'}]

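a.sort(key=lambda x: x['idx'])
lists = [b]  # add any further lists here
for lst in lists:
    lst.sort(key=lambda x: x['idx'])
    for i, d in enumerate(lst):
        # records are compared by position after sorting, so this relies on
        # matching records lining up at the same index in both lists
        if i < len(a) and d['idx'] == a[i]['idx']:
            a[i] = a[i] | d
        else:
            a.append(d)
print(a)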

Output

[{'idx': 0, 'foo': 'xx0', 'bar': 'yy0', 'fie': 'zz0', 'fom': 'kk0'},
 {'idx': 1, 'foo': 'xx1', 'bar': 'yy1', 'fie': 'zz1', 'fom': 'kk1'},
 {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'},
 {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]
Answered By: Guy

I’m not sure if this is more efficient than your solution:

from operator import itemgetter
from itertools import chain, groupby

a = [{'idx': 1, 'foo': 'xx1', 'bar': 'yy1'},
     {'idx': 0, 'foo': 'xx0', 'bar': 'yy0'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'},
     {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'}]

c = sorted(a + b, key=itemgetter('idx'))
c = [
    dict(chain(*(record.items() for record in group)))
    for _, group in groupby(c, key=itemgetter('idx'))
]

Result:

[{'idx': 0, 'foo': 'xx0', 'bar': 'yy0', 'fie': 'zz0', 'fom': 'kk0'},
 {'idx': 1, 'foo': 'xx1', 'bar': 'yy1', 'fie': 'zz1', 'fom': 'kk1'},
 {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'},
 {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]
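
The sorted() call in front of groupby is what makes this work: groupby only merges runs of adjacent items with equal keys, so unsorted input yields one group per run rather than one per idx. A minimal sketch with made-up rows (rows, unsorted_keys and sorted_keys are names invented for the illustration):

```python
from itertools import groupby
from operator import itemgetter

rows = [{'idx': 1, 'foo': 'xx1'}, {'idx': 0, 'foo': 'xx0'}, {'idx': 1, 'fie': 'zz1'}]
by_idx = itemgetter('idx')

# Without sorting, the two idx == 1 rows are not adjacent and land in separate groups.
unsorted_keys = [key for key, _ in groupby(rows, key=by_idx)]

# After sorting, each idx appears in exactly one group.
sorted_keys = [key for key, _ in groupby(sorted(rows, key=by_idx), key=by_idx)]

print(unsorted_keys)  # [1, 0, 1]
print(sorted_keys)    # [0, 1]
```

On efficiency: the sort costs O(n log n), whereas a dict keyed on idx merges in O(n) on average, so for large inputs the dict-based approach should not be slower.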
Answered By: Timus

As far as the algorithm goes, this is basically the same as your code. You had the right idea using O(1) dict lookups and update() to merge the dicts.

from itertools import chain
from collections import defaultdict
from pprint import pprint

a = [{'idx': 1, 'foo': 'xx1', 'bar': 'yy1'},
     {'idx': 0, 'foo': 'xx0', 'bar': 'yy0'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'},
     {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'}]

KEY = 'idx'
merged = defaultdict(dict)
for row in chain(a, b):
    merged[row[KEY]].update(row)

pprint(list(merged.values()))

  • I tried not to use any single-letter variable names (besides the original inputs).
  • itertools.chain lets you iterate over several iterables as one.
  • defaultdict hides some of that "if it’s in there already, do this, otherwise do that".
  • [x for x in iterable] could be written list(iterable).
  • The "merged" data structure is more useful. It’s a shame to dump it out to an inefficient list, but that was the requirement.

If possible, you could return merged.values(), which is an iterable view object https://docs.python.org/3.7/library/stdtypes.html?highlight=dict%20values#dictionary-view-objects
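
A small sketch of what "view object" means in practice, using made-up rows: the view returned by values() tracks later changes to the dictionary, while list() takes a snapshot at the moment it is called. The names view and snapshot are only for the illustration.

```python
from collections import defaultdict

merged = defaultdict(dict)
merged[0].update({'idx': 0, 'foo': 'xx0'})

view = merged.values()   # live view onto merged, nothing is copied
snapshot = list(view)    # list frozen at this moment (the row dicts are shared)

# Adding a new key afterwards is visible through the view but not the list.
merged[1].update({'idx': 1, 'foo': 'xx1'})

print(len(view))      # 2
print(len(snapshot))  # 1
```

Whether the view is enough depends on the caller: it supports iteration and len(), but not indexing.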

Concerns:

  • This could be handled in a database or pandas, which are designed for this exact function.
  • What if the rows happen to have a conflict on one of the data fields? You’ll never know, as update will just overwrite.
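
If silent overwrites are a worry, one option is to check each field before writing it. This is only a sketch under my own assumptions: merge_strict is a hypothetical helper (not from any library), and raising ValueError on the first disagreement is just one possible policy. It also copies fields into fresh dicts, so the input rows are not modified.

```python
from collections import defaultdict
from itertools import chain

def merge_strict(*tables, key='idx'):
    """Merge rows by key; raise ValueError if two rows disagree on a field."""
    merged = defaultdict(dict)
    for row in chain.from_iterable(tables):
        target = merged[row[key]]
        for field, value in row.items():
            # a field seen before must carry the same value
            if field in target and target[field] != value:
                raise ValueError(
                    f"conflict for {key}={row[key]!r} on {field!r}: "
                    f"{target[field]!r} != {value!r}"
                )
            target[field] = value
    return list(merged.values())

a = [{'idx': 0, 'foo': 'xx0'}, {'idx': 1, 'foo': 'xx1'}]
b = [{'idx': 0, 'fie': 'zz0'}, {'idx': 1, 'foo': 'CHANGED'}]

clean = merge_strict(a, [{'idx': 0, 'fie': 'zz0'}])
try:
    merge_strict(a, b)
    conflict = None
except ValueError as error:
    conflict = str(error)

print(clean)
print(conflict)
```

Collecting all conflicts into a list instead of raising on the first one would be a small change if a full report is more useful.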

Answered By: Kenny Ostrom

from collections import defaultdict
from operator import itemgetter

l1 = [{'id': 1, 'City': 'Calcutta'}, {'id': 3, 'Country': 'Germany'}]
l2 = [{'id': 1, 'Country': 'India'}, {'id': 2, 'City': 'Delhi'}, {'id': 3, 'City': 'Berlin'}]

def merge1(l1, l2):
    # collect the fields of every record under its 'id'
    d = defaultdict(dict)
    for lst in (l1, l2):
        for record in lst:
            d[record['id']].update(record)
    # return the merged records ordered by 'id'
    return sorted(d.values(), key=itemgetter('id'))

print(merge1(l1, l2))

"""
[{'id': 1, 'City': 'Calcutta', 'Country': 'India'}, {'id': 2, 'City': 'Delhi'}, {'id': 3, 'Country': 'Germany', 'City': 'Berlin'}]
"""
"""
This second version handles a slightly different structure: each record is nested under a key ('A') that is common to all of them.
"""

l1 = [{'A': {'id': 1, 'City': 'Calcutta'}}, {'A': {'id': 3, 'Country': 'Germany'}}]
l2 = [{'A': {'id': 1, 'Country': 'India'}}, {'A': {'id': 2, 'City': 'Delhi'}}, {'A': {'id': 3, 'City': 'Berlin'}}]

def merge2(l1, l2):
    d = defaultdict(dict)
    for lst in (l1, l2):
        for record in lst:
            # the actual fields live one level down, under 'A'
            d[record['A']['id']].update(record['A'])
    return sorted(d.values(), key=itemgetter('id'))

print(merge2(l1, l2))
Answered By: Soudipta Dutta