Pandas groupby multiple fields then diff

Question:

So my dataframe looks like this:

         date    site country  score
0  2018-01-01  google      us    100
1  2018-01-01  google      ch     50
2  2018-01-02  google      us     70
3  2018-01-03  google      us     60
4  2018-01-02  google      ch     10
5  2018-01-01      fb      us     50
6  2018-01-02      fb      us     55
7  2018-01-03      fb      us    100
8  2018-01-01      fb      es    100
9  2018-01-02      fb      gb    100

Each site has a different score depending on the country. I’m trying to find the 1/3/5-day difference of scores for each site/country combination.

Output should be:

          date    site country  score  diff
8  2018-01-01      fb      es    100   0.0
9  2018-01-02      fb      gb    100   0.0
5  2018-01-01      fb      us     50   0.0
6  2018-01-02      fb      us     55   5.0
7  2018-01-03      fb      us    100  45.0
1  2018-01-01  google      ch     50   0.0
4  2018-01-02  google      ch     10 -40.0
0  2018-01-01  google      us    100   0.0
2  2018-01-02  google      us     70 -30.0
3  2018-01-03  google      us     60 -10.0

I first tried sorting by site/country/date, then grouping by site and country but I’m not able to wrap my head around getting a difference from a grouped object.

Asked By: Craig

||

Answers:

First, sort the DataFrame and then all you need is groupby.diff():

df = df.sort_values(by=['site', 'country', 'date'])

df['diff'] = df.groupby(['site', 'country'])['score'].diff().fillna(0)

df
Out: 
         date    site country  score  diff
8  2018-01-01      fb      es    100   0.0
9  2018-01-02      fb      gb    100   0.0
5  2018-01-01      fb      us     50   0.0
6  2018-01-02      fb      us     55   5.0
7  2018-01-03      fb      us    100  45.0
1  2018-01-01  google      ch     50   0.0
4  2018-01-02  google      ch     10 -40.0
0  2018-01-01  google      us    100   0.0
2  2018-01-02  google      us     70 -30.0
3  2018-01-03  google      us     60 -10.0

sort_values doesn’t support arbitrary orderings. If you need to sort arbitrarily (google before fb for example) you need to store them in a collection and set your column as categorical. Then sort_values will respect the ordering you provided there.

Answered By: ayhan

You can shift and substract grouped values:

df.sort_values(['site', 'country', 'date'], inplace=True)

df['diff'] = df['score'] - df.groupby(['site', 'country'])['score'].shift()

Result:

         date    site country  score  diff
8  2018-01-01      fb      es    100   NaN
9  2018-01-02      fb      gb    100   NaN
5  2018-01-01      fb      us     50   NaN
6  2018-01-02      fb      us     55   5.0
7  2018-01-03      fb      us    100  45.0
1  2018-01-01  google      ch     50   NaN
4  2018-01-02  google      ch     10 -40.0
0  2018-01-01  google      us    100   NaN
2  2018-01-02  google      us     70 -30.0
3  2018-01-03  google      us     60 -10.0

To fill NaN with 0 use df['diff'].fillna(0, inplace=True).

Answered By: Mykola Zotko
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.