Python: Elegantly merge dictionaries with sum() of values
Question:
I’m trying to merge logs from several servers. Each log is a list of tuples (date
, count
). date
may appear more than once, and I want the resulting dictionary to hold the sum of all counts from all servers.
Here’s my attempt, with some data for example:
from collections import defaultdict
a=[("13.5",100)]
b=[("14.5",100), ("15.5", 100)]
c=[("15.5",100), ("16.5", 100)]
input=[a,b,c]
output=defaultdict(int)
for d in input:
for item in d:
output[item[0]]+=item[1]
print dict(output)
Which gives:
{'14.5': 100, '16.5': 100, '13.5': 100, '15.5': 200}
As expected.
I’m about to go bananas because of a colleague who saw the code. She insists that there must be a more Pythonic and elegant way to do it, without these nested for loops. Any ideas?
Answers:
You could use itertools’ groupby:
from itertools import groupby, chain
a=[("13.5",100)]
b=[("14.5",100), ("15.5", 100)]
c=[("15.5",100), ("16.5", 100)]
input = sorted(chain(a,b,c), key=lambda x: x[0])
output = {}
for k, g in groupby(input, key=lambda x: x[0]):
output[k] = sum(x[1] for x in g)
print output
The use of groupby
instead of two loops and a defaultdict
will make your code clearer.
from collections import Counter
a = [("13.5",100)]
b = [("14.5",100), ("15.5", 100)]
c = [("15.5",100), ("16.5", 100)]
inp = [dict(x) for x in (a,b,c)]
count = Counter()
for y in inp:
count += Counter(y)
print(count)
output:
Counter({'15.5': 200, '14.5': 100, '16.5': 100, '13.5': 100})
Edit:
As duncan suggested you can replace these 3 lines with a single line:
count = Counter()
for y in inp:
count += Counter(y)
replace by : count = sum((Counter(y) for y in inp), Counter())
Doesn’t get simpler than this, I think:
a=[("13.5",100)]
b=[("14.5",100), ("15.5", 100)]
c=[("15.5",100), ("16.5", 100)]
input=[a,b,c]
from collections import Counter
print sum(
(Counter(dict(x)) for x in input),
Counter())
Note that Counter
(also known as a multiset) is the most natural data structure for your data (a type of set to which elements can belong more than once, or equivalently – a map with semantics Element -> OccurrenceCount. You could have used it in the first place, instead of lists of tuples.
Also possible:
from collections import Counter
from operator import add
print reduce(add, (Counter(dict(x)) for x in input))
Using reduce(add, seq)
instead of sum(seq, initialValue)
is generally more flexible and allows you to skip passing the redundant initial value.
Note that you could also use operator.and_
to find the intersection of the multisets instead of the sum.
The above variant is terribly slow, because a new Counter is created on every step. Let’s fix that.
We know that Counter+Counter
returns a new Counter
with merged data. This is OK, but we want to avoid extra creation. Let’s use Counter.update
instead:
update(self, iterable=None, **kwds) unbound collections.Counter method
Like dict.update() but add counts instead of replacing them.
Source can be an iterable, a dictionary, or another Counter instance.
That’s what we want. Let’s wrap it with a function compatible with reduce
and see what happens.
def updateInPlace(a,b):
a.update(b)
return a
print reduce(updateInPlace, (Counter(dict(x)) for x in input))
This is only marginally slower than the OP’s solution.
Benchmark: http://ideone.com/7IzSx (Updated with yet another solution, thanks to astynax)
(Also: If you desperately want an one-liner, you can replace updateInPlace
by lambda x,y: x.update(y) or x
which works the same way and even proves to be a split second faster, but fails at readability. Don’t :-))
You can use Counter or defaultdict, or you can try my variant:
def merge_with(d1, d2, fn=lambda x, y: x + y):
res = d1.copy() # "= dict(d1)" for lists of tuples
for key, val in d2.iteritems(): # ".. in d2" for lists of tuples
try:
res[key] = fn(res[key], val)
except KeyError:
res[key] = val
return res
>>> merge_with({'a':1, 'b':2}, {'a':3, 'c':4})
{'a': 4, 'c': 4, 'b': 2}
Or even more generic:
def make_merger(fappend=lambda x, y: x + y, fempty=lambda x: x):
def inner(*dicts):
res = dict((k, fempty(v)) for k, v
in dicts[0].iteritems()) # ".. in dicts[0]" for lists of tuples
for dic in dicts[1:]:
for key, val in dic.iteritems(): # ".. in dic" for lists of tuples
try:
res[key] = fappend(res[key], val)
except KeyError:
res[key] = fempty(val)
return res
return inner
>>> make_merger()({'a':1, 'b':2}, {'a':3, 'c':4})
{'a': 4, 'c': 4, 'b': 2}
>>> appender = make_merger(lambda x, y: x + [y], lambda x: [x])
>>> appender({'a':1, 'b':2}, {'a':3, 'c':4}, {'b':'BBB', 'c':'CCC'})
{'a': [1, 3], 'c': [4, 'CCC'], 'b': [2, 'BBB']}
Also you can subclass the dict
and implement a __add__
method:
I’m trying to merge logs from several servers. Each log is a list of tuples (date
, count
). date
may appear more than once, and I want the resulting dictionary to hold the sum of all counts from all servers.
Here’s my attempt, with some data for example:
from collections import defaultdict
a=[("13.5",100)]
b=[("14.5",100), ("15.5", 100)]
c=[("15.5",100), ("16.5", 100)]
input=[a,b,c]
output=defaultdict(int)
for d in input:
for item in d:
output[item[0]]+=item[1]
print dict(output)
Which gives:
{'14.5': 100, '16.5': 100, '13.5': 100, '15.5': 200}
As expected.
I’m about to go bananas because of a colleague who saw the code. She insists that there must be a more Pythonic and elegant way to do it, without these nested for loops. Any ideas?
You could use itertools’ groupby:
from itertools import groupby, chain
a=[("13.5",100)]
b=[("14.5",100), ("15.5", 100)]
c=[("15.5",100), ("16.5", 100)]
input = sorted(chain(a,b,c), key=lambda x: x[0])
output = {}
for k, g in groupby(input, key=lambda x: x[0]):
output[k] = sum(x[1] for x in g)
print output
The use of groupby
instead of two loops and a defaultdict
will make your code clearer.
from collections import Counter
a = [("13.5",100)]
b = [("14.5",100), ("15.5", 100)]
c = [("15.5",100), ("16.5", 100)]
inp = [dict(x) for x in (a,b,c)]
count = Counter()
for y in inp:
count += Counter(y)
print(count)
output:
Counter({'15.5': 200, '14.5': 100, '16.5': 100, '13.5': 100})
Edit:
As duncan suggested you can replace these 3 lines with a single line:
count = Counter()
for y in inp:
count += Counter(y)
replace by : count = sum((Counter(y) for y in inp), Counter())
Doesn’t get simpler than this, I think:
a=[("13.5",100)]
b=[("14.5",100), ("15.5", 100)]
c=[("15.5",100), ("16.5", 100)]
input=[a,b,c]
from collections import Counter
print sum(
(Counter(dict(x)) for x in input),
Counter())
Note that Counter
(also known as a multiset) is the most natural data structure for your data (a type of set to which elements can belong more than once, or equivalently – a map with semantics Element -> OccurrenceCount. You could have used it in the first place, instead of lists of tuples.
Also possible:
from collections import Counter
from operator import add
print reduce(add, (Counter(dict(x)) for x in input))
Using reduce(add, seq)
instead of sum(seq, initialValue)
is generally more flexible and allows you to skip passing the redundant initial value.
Note that you could also use operator.and_
to find the intersection of the multisets instead of the sum.
The above variant is terribly slow, because a new Counter is created on every step. Let’s fix that.
We know that Counter+Counter
returns a new Counter
with merged data. This is OK, but we want to avoid extra creation. Let’s use Counter.update
instead:
update(self, iterable=None, **kwds) unbound collections.Counter method
Like dict.update() but add counts instead of replacing them.
Source can be an iterable, a dictionary, or another Counter instance.
That’s what we want. Let’s wrap it with a function compatible with reduce
and see what happens.
def updateInPlace(a,b):
a.update(b)
return a
print reduce(updateInPlace, (Counter(dict(x)) for x in input))
This is only marginally slower than the OP’s solution.
Benchmark: http://ideone.com/7IzSx (Updated with yet another solution, thanks to astynax)
(Also: If you desperately want an one-liner, you can replace updateInPlace
by lambda x,y: x.update(y) or x
which works the same way and even proves to be a split second faster, but fails at readability. Don’t :-))
You can use Counter or defaultdict, or you can try my variant:
def merge_with(d1, d2, fn=lambda x, y: x + y):
res = d1.copy() # "= dict(d1)" for lists of tuples
for key, val in d2.iteritems(): # ".. in d2" for lists of tuples
try:
res[key] = fn(res[key], val)
except KeyError:
res[key] = val
return res
>>> merge_with({'a':1, 'b':2}, {'a':3, 'c':4})
{'a': 4, 'c': 4, 'b': 2}
Or even more generic:
def make_merger(fappend=lambda x, y: x + y, fempty=lambda x: x):
def inner(*dicts):
res = dict((k, fempty(v)) for k, v
in dicts[0].iteritems()) # ".. in dicts[0]" for lists of tuples
for dic in dicts[1:]:
for key, val in dic.iteritems(): # ".. in dic" for lists of tuples
try:
res[key] = fappend(res[key], val)
except KeyError:
res[key] = fempty(val)
return res
return inner
>>> make_merger()({'a':1, 'b':2}, {'a':3, 'c':4})
{'a': 4, 'c': 4, 'b': 2}
>>> appender = make_merger(lambda x, y: x + [y], lambda x: [x])
>>> appender({'a':1, 'b':2}, {'a':3, 'c':4}, {'b':'BBB', 'c':'CCC'})
{'a': [1, 3], 'c': [4, 'CCC'], 'b': [2, 'BBB']}
Also you can subclass the dict
and implement a __add__
method: