Why is pandas.Series.std() different from numpy.std()?

Question:

This is what I am trying to explain:

>>> a = pd.Series([7, 20, 22, 22])
>>> a.std()
7.2284161474004804
>>> np.std(a)
6.2599920127744575

I have data about many different restaurants. For simplicity, I have extracted just one restaurant with four items:

>>> df
    restaurant_id  price
id                      
1           10407      7
3           10407     20
6           10407     22
13          10407     22

For each restaurant, I want to get the standard deviation; however, pandas returns the wrong values.

>>> df.groupby('restaurant_id').std()
                  price
restaurant_id          
10407          7.228416

We can get the correct value with np.std():

>>> np.std(df['price'])
6.2599920127744575

But obviously, this is not a solution when I have more than one restaurant. How do I do this properly?


Just to make sure, I checked that df['price'].mean() == np.mean(df['price']).

There is a related discussion here, but the suggestions there do not work either.

Asked By: Sergey Orshanskiy


Answers:

Pandas' std() applies Bessel's correction by default, i.e. it computes the sample standard deviation, dividing by N - 1 in the denominator, while np.std() defaults to the population formula, dividing by N. To make pandas divide by N as well, pass ddof=0:

a.std(ddof=0) == np.std(a)
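
To make the difference concrete, here is a minimal sketch that reproduces both numbers by hand (the DataFrame is rebuilt here just to keep the snippet self-contained). Groupby aggregations accept the same ddof argument, which handles many restaurants at once:

>>> import pandas as pd
>>> a = pd.Series([7, 20, 22, 22])
>>> n = len(a)
>>> # Sample standard deviation: divide by n - 1 (Bessel's correction)
>>> (((a - a.mean()) ** 2).sum() / (n - 1)) ** 0.5
7.2284161474004804
>>> # Population standard deviation: divide by n
>>> (((a - a.mean()) ** 2).sum() / n) ** 0.5
6.2599920127744575
>>> # The same ddof parameter works through groupby:
>>> df = pd.DataFrame({'restaurant_id': [10407] * 4, 'price': [7, 20, 22, 22]})
>>> df.groupby('restaurant_id')['price'].std(ddof=0)
restaurant_id
10407    6.259992
Name: price, dtype: float64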
Answered By: Sergey Orshanskiy