Filter a list of dictionaries based on another list of strings

Question:

I have a list of dictionaries such as the one below:

data = [{'Person1':['a', 'b', 'c']}, {'Person2':['1', '2', '3']}, {'Person3':['x', 'y', 'z']}]

and it consists of almost 7,000,000 dictionaries. I also have a list of strings such as

people = ['person1', 'person3']

that is of length 450,000. All of the strings in this list exist as keys in the list of dictionaries.

What is the fastest/most efficient way to filter the list of dictionaries based on this list of strings, so as to get back a new dictionary that only contains the keys that correspond to the strings in the list, such as

d = {'Person1':['a', 'b', 'c'], 'Person3':['x', 'y', 'z']}

This is my code, but it takes a really long time to run, and I was wondering what the best way to approach this is.

d = {}

for p in people:
    for i in data:
        for k in i:
            if p == k.lower():
                d[p] = i[k]
Asked By: Paschalis

Answers:

A first suggestion using dict-comprehension:

from collections import ChainMap

data = [
    {'Person1':['a', 'b', 'c']}, 
    {'Person2':['1', '2', '3']}, 
    {'Person3':['x', 'y', 'z']}
]
people = ['Person1', 'Person3']

big_dict = dict(ChainMap(*data))

# drop duplicates
people = list(set(people))

smaller_dict = {person: big_dict[person] for person in people}

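ChainMap is part of the collections module in the Python standard library.
After dropping duplicates I converted people back to a list (rather than keeping it as a set) because iterating over a list has been reported to be slightly faster than iterating over a set.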
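One caveat on the lookup above: in the question, people is lower-case ('person1') while the keys are capitalised ('Person1'), so big_dict[person] would raise a KeyError on the original data. Lookups on a ChainMap also try the underlying maps one after another, so building it over millions of one-key dictionaries may become slow. A minimal sketch of a single-pass alternative, assuming that lower-casing the keys is an acceptable way to match names (as the question's own loop does); build_index and select_people are hypothetical helper names, not part of any library:

```python
def build_index(data):
    """Merge the one-key dicts into one dict keyed by lower-cased name.

    Each value is stored as (original_key, value) so the original
    spelling of the key can be restored in the result.
    """
    index = {}
    for entry in data:
        for key, value in entry.items():
            index[key.lower()] = (key, value)
    return index


def select_people(data, people):
    """Return {original_key: value} for every name in people.

    Matching is case-insensitive; a name missing from data raises KeyError.
    """
    index = build_index(data)
    result = {}
    for name in people:
        original_key, value = index[name.lower()]
        result[original_key] = value
    return result


data = [
    {'Person1': ['a', 'b', 'c']},
    {'Person2': ['1', '2', '3']},
    {'Person3': ['x', 'y', 'z']},
]
people = ['person1', 'person3']

d = select_people(data, people)
print(d)
```

This walks data once and people once, instead of comparing every name against every dictionary.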

Answered By: Durtal

A few benchmarks:

Nested For loops

%%timeit

out = []
for i in data:
    for j in people:
        if list(i.keys())[0]==j:
            out.append(i)
            
#1.78 µs ± 55.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)            

List comprehension with in

%%timeit

out = [i for i in data if list(i.keys())[0] in people]

#1.02 µs ± 36.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

List comprehension with set.intersection

%%timeit

out = [i for i in data if set(i).intersection(people)]

#1.04 µs ± 27.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
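
At the sizes given in the question, the `in people` test on a list has to scan up to 450,000 strings for each dictionary. A sketch of the same list comprehension with people converted to a set once, assuming exact (case-sensitive) key matches as in the benchmarks above; people_set is a name introduced here for illustration, and no timing is claimed for it:

```python
data = [
    {'Person1': ['a', 'b', 'c']},
    {'Person2': ['1', '2', '3']},
    {'Person3': ['x', 'y', 'z']},
]
people = ['Person1', 'Person3']

# Build the set once; membership tests on a set take constant time on average.
people_set = set(people)

# next(iter(i)) reads the single key of each dict without building a list of keys.
out = [i for i in data if next(iter(i)) in people_set]
print(out)
```

Like the benchmarks above, this returns a list of the matching one-key dictionaries rather than a single merged dictionary.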
Answered By: Akshay Sehgal