Finding matching keys in two large dictionaries and doing it fast

Question:

I am trying to find corresponding keys in two different dictionaries. Each has about 600k entries.

Say for example:

    myRDP = { 'Actinobacter': 'GATCGA...TCA', 'subtilus sp.': 'ATCGATT...ACT' }
    myNames = { 'Actinobacter': '8924342' }

I want to print out the value for Actinobacter (8924342) since it matches a value in myRDP.

The following code works, but is very slow:

    for key in myRDP:
        for jey in myNames:
            if key == jey:
                print key, myNames[key]

I’ve tried the following but it always results in a KeyError:

    for key in myRDP:
        print myNames[key]

Is there perhaps a function implemented in C for doing this? I’ve googled around but nothing seems to work.

Asked By: Austin Richardson

||

Answers:

You could do this:

for key in myRDP:
    if key in myNames:
        print key, myNames[key]

Your first attempt was slow because you were comparing every key in myRDP with every key in myNames. In algorithmic jargon, if myRDP has n elements and myNames has m elements, then that algorithm would take O(n×m) operations. For 600k elements each, this is 360,000,000,000 comparisons!

But testing whether a particular element is a key of a dictionary is fast — in fact, this is one of the defining characteristics of dictionaries. In algorithmic terms, the key in dict test is O(1), or constant-time. So my algorithm will take O(n) time, which is one 600,000th of the time.

Answered By: John Fouhy
for key in myRDP:
    name = myNames.get(key, None)
    if name:
        print key, name

dict.get returns the default value you give it (in this case, None) if the key doesn’t exist.

Answered By: Andrew Keeton

Use the get method instead:

 for key in myRDP:
    value = myNames.get(key)
    if value != None:
      print key, "=", value
Answered By: João Silva

Copy both dictionaries into one dictionary/array. This makes sense as you have 1:1 related values. Then you need only one search, no comparison loop, and can access the related value directly.

Example Resulting Dictionary/Array:

[Name][Value1][Value2]

[Actinobacter][GATCGA...TCA][8924342]

[XYZbacter][BCABCA...ABC][43594344]

Answered By: Alex

Use sets, because they have a built-in intersection method which ought to be quick:

myRDP = { 'Actinobacter': 'GATCGA...TCA', 'subtilus sp.': 'ATCGATT...ACT' }
myNames = { 'Actinobacter': '8924342' }

rdpSet = set(myRDP)
namesSet = set(myNames)

for name in rdpSet.intersection(namesSet):
    print name, myNames[name]

# Prints: Actinobacter 8924342
Answered By: RichieHindle

You could start by finding the common keys and then iterating over them. Set operations should be fast because they are implemented in C, at least in modern versions of Python.

common_keys = set(myRDP).intersection(myNames)
for key in common_keys:
    print key, myNames[key]
Answered By: Roberto Bonvallet

Here is my code for doing intersections, unions, differences, and other set operations on dictionaries:

class DictDiffer(object):
    """
    Calculate the difference between two dictionaries as:
    (1) items added
    (2) items removed
    (3) keys same in both but changed values
    (4) keys same in both and unchanged values
    """
    def __init__(self, current_dict, past_dict):
        self.current_dict, self.past_dict = current_dict, past_dict
        self.set_current, self.set_past = set(current_dict.keys()), set(past_dict.keys())
        self.intersect = self.set_current.intersection(self.set_past)
    def added(self):
        return self.set_current - self.intersect 
    def removed(self):
        return self.set_past - self.intersect 
    def changed(self):
        return set(o for o in self.intersect if self.past_dict[o] != self.current_dict[o])
    def unchanged(self):
        return set(o for o in self.intersect if self.past_dict[o] == self.current_dict[o])

if __name__ == '__main__':
    import unittest
    class TestDictDifferNoChanged(unittest.TestCase):
        def setUp(self):
            self.past = dict((k, 2*k) for k in range(5))
            self.current = dict((k, 2*k) for k in range(3,8))
            self.d = DictDiffer(self.current, self.past)
        def testAdded(self):
            self.assertEqual(self.d.added(), set((5,6,7)))
        def testRemoved(self):      
            self.assertEqual(self.d.removed(), set((0,1,2)))
        def testChanged(self):
            self.assertEqual(self.d.changed(), set())
        def testUnchanged(self):
            self.assertEqual(self.d.unchanged(), set((3,4)))
    class TestDictDifferNoCUnchanged(unittest.TestCase):
        def setUp(self):
            self.past = dict((k, 2*k) for k in range(5))
            self.current = dict((k, 2*k+1) for k in range(3,8))
            self.d = DictDiffer(self.current, self.past)
        def testAdded(self):
            self.assertEqual(self.d.added(), set((5,6,7)))
        def testRemoved(self):      
            self.assertEqual(self.d.removed(), set((0,1,2)))
        def testChanged(self):
            self.assertEqual(self.d.changed(), set((3,4)))
        def testUnchanged(self):
            self.assertEqual(self.d.unchanged(), set())
    unittest.main()
Answered By: hughdbrown

in python 3 you can just do

myNames.keys() & myRDP.keys()

Answered By: boson

Best and easiest way would be simply perform common set operations(Python 3).

a = {"a": 1, "b":2, "c":3, "d":4}
b = {"t1": 1, "b":2, "e":5, "c":3}
res = a.items() & b.items() # {('b', 2), ('c', 3)} For common Key and Value
res = {i[0]:i[1] for i in res}  # In dict format
common_keys = a.keys() & b.keys()  # {'b', 'c'}

Cheers!

Answered By: vikas0713
def combine_two_json(json_request, json_request2):
intersect = {}

for item in json_request.keys():
    if item in json_request2.keys():
        intersect[item]=json_request2.get(item)
return intersect
Answered By: balaji karthi keyan

You can simply write this code and it will save the common key in a list.
common = [i for i in myRDP.keys() if i in myNames.keys()]

Answered By: Aqeel Yasin
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.