Intersecting two dictionaries

Question:

I am working on a search program over an inverted index. The index itself is a dictionary whose keys are terms and whose values are themselves dictionaries of short documents, with ID numbers as keys and their text content as values.

To perform an ‘AND’ search for two terms, I thus need to intersect their postings lists (dictionaries). What is a clear (not necessarily overly clever) way to do this in Python? I started out by trying it the long way with iter:

p1 = index[term1]  
p2 = index[term2]
i1 = iter(p1)
i2 = iter(p2)
while ...  # not sure of the 'iter != end 'syntax in this case
...
Asked By: norman

||

Answers:

In general, to construct the intersection of dictionaries in Python, you can first use the & operator to calculate the intersection of sets of the dictionary keys (dictionary keys are set-like objects in Python 3):

dict_a = {"a": 1, "b": 2}
dict_b = {"a": 2, "c": 3} 

intersection = dict_a.keys() & dict_b.keys()  # {'a'}

On Python 2 you have to convert the dictionary keys to sets yourself:

keys_a = set(dict_a.keys())
keys_b = set(dict_b.keys())
intersection = keys_a & keys_b

Then given the intersection of the keys, you can then build the intersection of your values however is desired. You have to make a choice here, since the concept of set intersection doesn’t tell you what to do if the associated values differ. (This is presumably why the & intersection operator is not defined directly for dictionaries in Python).

In this case it sounds like your values for the same key would be equal, so you can just choose the value from one of the dictionaries:

dict_of_dicts_a = {"a": {"x":1}, "b": {"y":3}}
dict_of_dicts_b = {"a": {"x":1}, "c": {"z":4}} 

shared_keys = dict_of_dicts_a.keys() & dict_of_dicts_b.keys()

# values equal so choose values from a:
dict_intersection = {k: dict_of_dicts_a[k] for k in shared_keys }  # {"a":{"x":1}}

Other reasonable methods of combining values would depend on the types of the values in your dictionaries, and what they represent. For example you might also want the union of values for shared keys of dictionaries of dictionaries. Since the union of dictionaries doesn’t depend on the values, it is well defined, and in python you can get it using the | operator:

# union of values for each key in the intersection:
dict_intersection_2 = { k: dict_of_dicts_a[k] | dict_of_dicts_b[k] for k in shared_keys }

Which in this case, with identical dictionary values for key "a" in both, would be the same result.

Answered By: James

Just wrap the dictionary instances with a simple class that gets both of the values you want

class DictionaryIntersection(object):
    def __init__(self,dictA,dictB):
        self.dictA = dictA
        self.dictB = dictB

    def __getitem__(self,attr):
        if attr not in self.dictA or attr not in self.dictB:
            raise KeyError('Not in both dictionaries,key: %s' % attr)

        return self.dictA[attr],self.dictB[attr]

x = {'foo' : 5, 'bar' :6}
y = {'bar' : 'meow' , 'qux' : 8}

z = DictionaryIntersection(x,y)

print z['bar']
Answered By: Eric Urban

A little known fact is that you don’t need to construct sets to do this:

Python 3

d1 = {'a': 1, 'b': 2}    
d2 = {'b': 2, 'c': 3}    
print(d1.keys() & d2.keys()) # {'b'}

Python 2

In Python 2, we replace keys with viewkeys. The same applies to values (viewvalues) and items(viewitems).

In [78]: d1 = {'a': 1, 'b': 2}

In [79]: d2 = {'b': 2, 'c': 3}

In [80]: d1.viewkeys() & d2.viewkeys()
Out[80]: {'b'}

From the documentation of viewitems:

In [113]: d1.viewitems??
Type:       builtin_function_or_method
String Form:<built-in method viewitems of dict object at 0x64a61b0>
Docstring:  D.viewitems() -> a set-like object providing a view on D's items

For larger dicts this also slightly faster than constructing sets and then intersecting them:

In [122]: d1 = {i: rand() for i in range(10000)}

In [123]: d2 = {i: rand() for i in range(10000)}

In [124]: timeit d1.viewkeys() & d2.viewkeys()
1000 loops, best of 3: 714 µs per loop

In [125]: %%timeit
s1 = set(d1)
s2 = set(d2)
res = s1 & s2

1000 loops, best of 3: 805 µs per loop

For smaller `dict`s `set` construction is faster:

In [126]: d1 = {'a': 1, 'b': 2}

In [127]: d2 = {'b': 2, 'c': 3}

In [128]: timeit d1.viewkeys() & d2.viewkeys()
1000000 loops, best of 3: 591 ns per loop

In [129]: %%timeit
s1 = set(d1)
s2 = set(d2)
res = s1 & s2

1000000 loops, best of 3: 477 ns per loop

We’re comparing nanoseconds here, which may or may not matter to you. In any case, you get back a set, so using viewkeys/keys eliminates a bit of clutter.

Answered By: Phillip Cloud
In [1]: d1 = {'a':1, 'b':4, 'f':3}

In [2]: d2 = {'a':1, 'b':4, 'd':2}

In [3]: d = {x:d1[x] for x in d1 if x in d2}

In [4]: d
Out[4]: {'a': 1, 'b': 4}
Answered By: emnoor

Okay, here is a generalized version of code above in Python3.
It is optimized to use comprehensions and set-like dict views which are fast enough.

Function intersects arbitrary many dicts and returns a dict with common keys and a set of common values for each common key:

def dict_intersect(*dicts):
    comm_keys = dicts[0].keys()
    for d in dicts[1:]:
        # intersect keys first
        comm_keys &= d.keys()
    # then build a result dict with nested comprehension
    result = {key:{d[key] for d in dicts} for key in comm_keys}
    return result

Usage example:

a = {1: 'ba', 2: 'boon', 3: 'spam', 4:'eggs'}
b = {1: 'ham', 2:'baboon', 3: 'sausages'}
c = {1: 'more eggs', 3: 'cabbage'}

res = dict_intersect(a, b, c)
# Here is res (the order of values may vary) :
# {1: {'ham', 'more eggs', 'ba'}, 3: {'spam', 'sausages', 'cabbage'}}

Here the dict values must be hashable, if they aren’t you could simply change set parentheses { } to list [ ]:

result = {key:[d[key] for d in dicts] for key in comm_keys}
Answered By: thodnev

In Python 3, you can use

intersection = dict(dict1.items() & dict2.items())
union = dict(dict1.items() | dict2.items())
difference = dict(dict1.items() ^ dict2.items())
Answered By: dccsillag

Your question isn’t precise enough to give single answer.

1. Key Intersection

If you want to intersect IDs from posts (credits to James) do:

common_ids = p1.keys() & p2.keys()

However if you want to iterate documents you have to consider which post has a priority, I assume it’s p1. To iterate documents for common_ids, collections.ChainMap will be most useful:

from collections import ChainMap
intersection = {id: document
                for id, document in ChainMap(p1, p2)
                if id in common_ids}
for id, document in intersection:
    ...

Or if you don’t want to create separate intersection dictionary:

from collections import ChainMap
posts = ChainMap(p1, p2)
for id in common_ids:
    document = posts[id]

2. Items Intersection

If you want to intersect items of both posts, which means to match IDs and documents, use code below (credits to DCPY). However this is only useful if you’re looking for duplicates in terms.

duplicates = dict(p1.items() & p2.items())
for id, document in duplicates:
    ...

3. Iterate over p1 ‘AND’ p2.

In case when by “‘AND’ search” and using iter you meant to search both posts then again collections.ChainMap is the best to iterate over (almost) all items in multiple posts:

from collections import ChainMap
for id, document in ChainMap(p1, p2):
    ...
Answered By: WloHu
def two_keys(term_a, term_b, index):
    doc_ids = set(index[term_a].keys()) & set(index[term_b].keys())
    doc_store = index[term_a] # index[term_b] would work also
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}

def n_keys(terms, index):
    doc_ids = set.intersection(*[set(index[term].keys()) for term in terms])
    doc_store = index[term[0]]
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}

In [0]: index = {'a': {1: 'a b'}, 
                 'b': {1: 'a b'}}

In [1]: two_keys('a','b', index)
Out[1]: {1: 'a b'}

In [2]: n_keys(['a','b'], index)
Out[2]: {1: 'a b'}

I would recommend changing your index from

index = {term: {doc_id: doc}}

to two indexes one for the terms and then a separate index to hold the values

term_index = {term: set([doc_id])}
doc_store = {doc_id: doc}

that way you don’t store multiple copies of the same data

Answered By: Aaron Goldman

To find full intersection by keys and values

d1 = {'a':1}
d2 = {'b':2, 'a':1}
{x:d1[x] for x in d1 if x in d2 and d1[x] == d2[x]}

>> {'a':1}
Answered By: soldovskij

None of the solutions so far solve the general case of intersecting N dictionaries.

So, if you want to handle the intersection of N arbitrary dictionaries:

from functools import reduce

def dict_intersection(*dict_list):
    return reduce(lambda a,b: dict(a.items() & b.items()), dict_list)

a = {k:k for k in range(0,5)} # {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}
b = {k:k for k in range(2,7)} # {2: 2, 3: 3, 4: 4, 5: 5, 6: 6}
c = {k:k for k in range(3,8)} # {3: 3, 4: 4, 5: 5, 6: 6, 7: 7}

dict_intersection(a,b,c)  # {3:3, 4:4}
# or if you have a list of dicts
dicts = [{k:k for k in range(0+n,5+n)} for n in (0,2,3)] # == [a,b,c]
dict_intersection(*dicts) # {3:3, 4:4}

Using functools.reduce allows for completing the operation within a single iteration over the list of dictionaries instead of the multiple loops in some solutions. It also doesn’t perform any additional conditional statements.

Trade-offs

Changing dict_intersection_v1 to dict_intersection_v2 we can see it performs faster for a larger lists of dictionaries and/or dictionaries (designing a proper experiment to test out which is a larger factor is outside the scope of this solution). This performance gain is due to reducing the amount of dictionary instantiations.

def dict_intersection_v1(*dict_list):
    return reduce(lambda a,b: dict(a.items() & b.items()),  dict_list)

def dict_intersection_v2(*dict_list):
    return dict(reduce(lambda a,b: a & b, (d.items() for d in dict_list)))

dict_lst1 = [{k:k for k in range(0+n,5+n)} for n in (0,2,3)] # = [a,b,c]
dict_lst2 = [{k:k for k in range(0,50,n)} for n in range(1,5)]]
dict_lst3 = [{k:k for k in range(0,500,n)} for n in range(40)]
dict_lst4 = [{k:k for k in range(0+n,500+n)} for n in range(400)]
dict list kv pair count dict_intersection_v1 dict_intersection_v2 relative difference
1 15 808 ns ± 4.31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 821 ns ± 0.785 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) + 1.6%
2 105 3.14 µs ± 11.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 2.38 µs ± 5.76 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) -24.2%
3 2155 36.9 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 25.1 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) -32.0%
4 200_000 9.08 ms ± 22 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 4.88 ms ± 5.31 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) -46.3%

The regression for result dict_lst1 is mainly due to difference in overhead between creating a dictionary after every intersection and the overhead due to dict.items() calls within the generator (and python’s general function call overhead).

NB: I did test using the pre-calculated list of dict.items() for the for a dictionary instead of v2’s building the generator on the fly.

I tested both passing in the pre-calculated list outside of the timings and within the timings, and, while it’s statistically significant, it’s less than 30 μs and 10 μs respectively. If you are trying to get these gains looking at a different language or Cython might be better.

Answered By: plunker