What's the best way to sum all values in a Pandas dataframe?
Question:
I figured out these two methods. Is there a better one?
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [5, 6, 7], 'B': [7, 8, 9]})
>>> print(df.sum().sum())
42
>>> print(df.values.sum())
42
Just want to make sure I’m not missing something more obvious.
Answers:
Updated for Pandas 0.24+:
df.to_numpy().sum()
Prior to Pandas 0.24:
df.values
is the underlying NumPy array, and
df.values.sum()
calls the NumPy sum method on it, which is faster.
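One caveat before swapping one for the other: as far as I know, the pandas sum skips NaN by default (skipna=True), while the NumPy sum propagates it, so the two can disagree on data with missing values. A minimal sketch, using a made-up frame that is not from the question:

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data) with one missing value.
df = pd.DataFrame({'A': [5.0, 6.0, np.nan], 'B': [7.0, 8.0, 9.0]})

# pandas skips NaN by default (skipna=True) at both levels of the sum.
pandas_total = df.sum().sum()

# A plain NumPy sum propagates NaN through the result.
numpy_total = df.to_numpy().sum()

# np.nansum is NumPy's NaN-ignoring variant.
nan_safe_total = np.nansum(df.to_numpy())

print(pandas_total, numpy_total, nan_safe_total)
```

If the frame can contain missing values, np.nansum (or staying with df.sum().sum()) keeps the result comparable.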
Adding some numbers to support this:
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(int(1e6)).reshape(500000, 2), columns=list("ab"))

def pandas_test():
    # Sum one column with the pandas Series method
    return df['a'].sum()

def numpy_test():
    # Sum the same column via its NumPy array
    return df['a'].to_numpy().sum()

timeit.timeit(numpy_test, number=1000)   # 0.5032469799989485
timeit.timeit(pandas_test, number=1000)  # 0.6035906639990571
So on my machine the NumPy route is roughly 17% faster (the pandas call takes about 20% longer), even for a plain Series summation!
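The timings above cover a single column. To check the whole-frame case from the question, the same comparison could be sketched as below. The function names are my own, the explicit int64 dtype is an assumption added to avoid platform-dependent integer widths, and the numbers will vary by machine and pandas version:

```python
import timeit
import numpy as np
import pandas as pd

# Same shape as the benchmark frame, with an explicit 64-bit integer dtype.
df = pd.DataFrame(
    np.arange(1_000_000, dtype=np.int64).reshape(500000, 2),
    columns=list("ab"),
)

def pandas_frame_sum():
    # Column sums first, then the sum of those sums.
    return df.sum().sum()

def numpy_frame_sum():
    # Single reduction over the whole 2-D array.
    return df.to_numpy().sum()

# Both approaches should agree on this all-integer frame.
assert pandas_frame_sum() == numpy_frame_sum()

pandas_time = timeit.timeit(pandas_frame_sum, number=100)
numpy_time = timeit.timeit(numpy_frame_sum, number=100)
print(f"pandas: {pandas_time:.4f}s, numpy: {numpy_time:.4f}s")
```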