Filter a list of dictionaries based on another list of strings
Question:
I have a list of dictionaries like the one below:
data = [{'Person1': ['a', 'b', 'c']}, {'Person2': ['1', '2', '3']}, {'Person3': ['x', 'y', 'z']}]
The full list consists of almost 7,000,000 dictionaries. I also have a list of strings such as
people = ['person1', 'person3']
that is 450,000 entries long. Every string in this list matches a key in the list of dictionaries (ignoring case).
What is the fastest/most efficient way to filter the list of dictionaries by this list of strings, so that I get back a single dictionary containing only the keys that correspond to the strings in the list, such as
d = {'Person1': ['a', 'b', 'c'], 'Person3': ['x', 'y', 'z']}
This is my code, but it takes a really long time to run, and I was wondering what the best way to approach this is.
d = {}
for p in people:
    for i in data:
        for k in i:
            if p == k.lower():
                d[p] = i[k]
Answers:
A first suggestion, using ChainMap and a dict comprehension:
from collections import ChainMap
data = [
    {'Person1': ['a', 'b', 'c']},
    {'Person2': ['1', '2', '3']},
    {'Person3': ['x', 'y', 'z']}
]
people = ['Person1', 'Person3']
big_dict = dict(ChainMap(*data))
# drop duplicates
people = list(set(people))
smaller_dict = {person: big_dict[person] for person in people}
For ChainMap, see https://docs.python.org/3/library/collections.html#collections.ChainMap.
I used people as a list (not as a set) because lists have been reported to be slightly faster to iterate over in cases like this.
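Note that in the original question the names in people are lowercase ('person1') while the dictionary keys are capitalised ('Person1'), which is why the asker's code calls k.lower(). A minimal case-insensitive sketch of the same idea, assuming keys are unique once lowercased:
big_dict = {k.lower(): v for record in data for k, v in record.items()}
smaller_dict = {p.lower(): big_dict[p.lower()] for p in set(people)}
Building big_dict with a single comprehension also avoids unpacking all 7,000,000 dictionaries into one ChainMap call.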
A few benchmarks:
Nested for loops
%%timeit
out = []
for i in data:
    for j in people:
        if list(i.keys())[0] == j:
            out.append(i)
#1.78 µs ± 55.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
List comprehension with in
%%timeit
out = [i for i in data if list(i.keys())[0] in people]
#1.02 µs ± 36.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
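The in test above rescans the whole people list for every dictionary, so it costs O(len(people)) per element. Converting people to a set once makes each membership test O(1) on average; a sketch (not part of the benchmarks above):
people_set = set(people)  # one-off conversion
out = [i for i in data if next(iter(i)) in people_set]
next(iter(i)) fetches the single key directly, which also saves building the intermediate list that list(i.keys())[0] creates.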
List comprehension with set.intersection
%%timeit
out = [i for i in data if set(i).intersection(people)]
#1.04 µs ± 27.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
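These timings come from the three-element sample, so they mostly measure per-loop overhead. At the sizes stated in the question (roughly 7,000,000 dictionaries and 450,000 names), any variant that walks the people list once per dictionary is effectively quadratic; note that set(i).intersection(people) still iterates all of people for each dictionary, because the one-key set is on the left. A sketch that puts the set on the people side instead, so each dictionary costs a single O(1) check (isdisjoint accepts any iterable and short-circuits on the first common element):
people_set = set(people)
out = [i for i in data if not people_set.isdisjoint(i)]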