Optimization of average calculation from a list of dictionaries
Question:
I have a list of dictionaries with keys 'a', 'n', 'o', and 'u'.
Is there a way to speed up this calculation, for instance with NumPy? There are tens of thousands of items in the list.
The data is drawn from a database, so I have to live with the fact that it originally arrives as a list of dictionaries.
x = n = o = u = loops = 0
for entry in indata:
    x += entry['a'] * entry['n']  # n - number of data points
    n += entry['n']
    o += entry['o']
    u += entry['u']
    loops += 1
average = int(round(x / n)), n, o, u
Answers:
If all you're looking to do is get an average value for something, why not
import math

sum_for_average = math.fsum(your_item)
average_of_list = sum_for_average / len(your_item)
No mucking about with NumPy at all.
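The catch is that your_item would have to be a flat list of numbers, while the question has a list of dictionaries. A minimal sketch of adapting the idea to the question's data (indata and the weighted average are taken from the question; an illustration, not benchmarked):

import math

# math.fsum accepts any iterable of floats, so generator expressions work:
x = math.fsum(d['a'] * d['n'] for d in indata)  # weighted sum
n = math.fsum(d['n'] for d in indata)           # total number of data points
average = int(round(x / n))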
I doubt this will be much faster, but I suppose it’s a candidate for timeit
…
from operator import itemgetter

x = n = o = u = 0
items = itemgetter('a', 'n', 'o', 'u')
for entry in indata:
    A, N, O, U = items(entry)
    x += A * N  # n - number of data points
    n += N
    o += O  # don't know what you're doing with O or U, but I'll leave them
    u += U
average = int(round(x / n)), n, o, u
At the very least, it saves a lookup of entry['n'], since I've now saved it to a variable.
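Since timeit came up, here is one way to compare the two loops; loop_plain and loop_itemgetter are hypothetical names for the question's loop and the version above, each wrapped in a function taking indata:

import timeit

# Each call reports the total time in seconds for 10 runs of the candidate.
print(timeit.timeit(lambda: loop_plain(indata), number=10))
print(timeit.timeit(lambda: loop_itemgetter(indata), number=10))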
You could try something like this:
import numpy as np

mean_a = np.sum(np.array([d['a'] for d in data]) * np.array([d['n'] for d in data])) / len(data)
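A possible refinement (a sketch only, not part of the benchmark below) is to build a single 2-D array in one pass instead of one list comprehension per key:

import numpy as np

# One row per dictionary; the four columns are a, n, o, u.
arr = np.array([(d['a'], d['n'], d['o'], d['u']) for d in data])
a, n, o, u = arr.T
mean_a = np.sum(a * n) / len(data)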
EDIT: Actually, the method above from @mgilson is faster:
import numpy as np
from operator import itemgetter
from pandas import *

data = []
for i in range(100000):
    data.append({'a': np.random.random(), 'n': np.random.random(),
                 'o': np.random.random(), 'u': np.random.random()})
def func1(data):
    x = n = o = u = 0
    items = itemgetter('a', 'n', 'o', 'u')
    for entry in data:
        A, N, O, U = items(entry)
        x += A * N  # n - number of data points
        n += N
        o += O  # don't know what you're doing with O or U, but I'll leave them
        u += U
    average = int(round(x / n)), n, o, u
    return average
def func2(data):
    mean_a = np.sum(np.array([d['a'] for d in data]) * np.array([d['n'] for d in data])) / len(data)
    return (mean_a,
            np.sum([d['n'] for d in data]),
            np.sum([d['o'] for d in data]),
            np.sum([d['u'] for d in data]))
def func3(data):
    dframe = DataFrame(data)
    return (np.sum(dframe["a"] * dframe["n"]) / dframe.shape[0],
            np.sum(dframe["n"]), np.sum(dframe["o"]), np.sum(dframe["u"]))
In [3]: %timeit func1(data)
10 loops, best of 3: 59.6 ms per loop
In [4]: %timeit func2(data)
10 loops, best of 3: 138 ms per loop
In [5]: %timeit func3(data)
10 loops, best of 3: 129 ms per loop
If you are doing other operations on the data, I would definitely look into using the pandas package. Its DataFrame object is a nice match for the list of dictionaries you are working with. I think the majority of the overhead is in getting the data into NumPy arrays or DataFrame objects in the first place.
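For what it's worth, a sketch of how the list of dictionaries maps onto a DataFrame (column names come from the question; note this variant divides by the total weight, as the question's original loop does, rather than by the row count):

from pandas import DataFrame

df = DataFrame(data)  # each dict becomes a row; the keys become columns
weighted_avg = (df['a'] * df['n']).sum() / df['n'].sum()
totals = df[['n', 'o', 'u']].sum()  # sums for n, o, u as a Series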