Pandas Groupby: Count and mean combined

Question:

I am working with pandas and trying to summarise a data frame as a count of certain categories, together with the mean sentiment score for each category.

The table holds strings with different sentiment scores, and I want to group by text source, reporting how many posts each source has as well as the average sentiment of those posts.

My (simplified) data frame looks like this:

source    text              sent
--------------------------------
bar       some string       0.13
foo       alt string        -0.8
bar       another str       0.7
foo       some text         -0.2
foo       more text         -0.5

The output from this should be something like this:

source    count     mean_sent
-----------------------------
foo       3         -0.5
bar       2         0.415

The answer is somewhere along the lines of:

df['sent'].groupby(df['source']).mean()

Yet that only gives each source and its mean, with no column headers.

Asked By: Lewis Anderson

Answers:

You can use groupby with aggregate:

# Wrap the chain in parentheses so it can span several lines.
df = (df.groupby('source')
        .agg({'text':'size', 'sent':'mean'})
        .rename(columns={'text':'count','sent':'mean_sent'})
        .reset_index())
print (df)
  source  count  mean_sent
0    bar      2      0.415
1    foo      3     -0.500
Answered By: jezrael

I think this should provide the output that you wanted:

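# Number of rows per source, stored in a column named 'count'.
result = pd.DataFrame({'count': df.groupby('source').size()})

# Mean sentiment per source, aligned on the same 'source' index.
result['mean_score'] = df.groupby('source').sent.mean()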
Answered By: galitbw

In pandas 0.25 and later you no longer need the rename; use named aggregation instead:

df = (df.groupby('source')
        .agg(count=('text', 'size'), mean_sent=('sent', 'mean'))
        .reset_index())

print (df)
  source  count  mean_sent
0    bar      2      0.415
1    foo      3     -0.500
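
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies the same named aggregation. The variable name `summary` is my own choice; everything else comes from the question and the snippet above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "source": ["bar", "foo", "bar", "foo", "foo"],
    "text": ["some string", "alt string", "another str", "some text", "more text"],
    "sent": [0.13, -0.8, 0.7, -0.2, -0.5],
})

# Named aggregation: new_column=(input_column, function).
summary = (
    df.groupby("source")
      .agg(count=("text", "size"), mean_sent=("sent", "mean"))
      .reset_index()  # turn the 'source' index back into a regular column
)
print(summary)
```

Groups come back sorted by key, so `bar` is listed before `foo`.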
Answered By: neves

This one-liner should work as well:

df[['source','sent']].groupby('source').agg(['count','mean'])
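
One thing to be aware of: passing a list of functions to a DataFrame selection gives two-level column labels, with `sent` on top. The sketch below, using the question's sample data, shows that and a possible way to get flat headers by selecting `sent` as a Series first. The names `nested` and `flat` are just for illustration. Also note that `'count'` skips missing values, whereas `'size'` counts every row.

```python
import pandas as pd

# Sample data copied from the question (the text column is not needed here).
df = pd.DataFrame({
    "source": ["bar", "foo", "bar", "foo", "foo"],
    "sent": [0.13, -0.8, 0.7, -0.2, -0.5],
})

# DataFrame selection + list of functions -> two-level column labels.
nested = df[["source", "sent"]].groupby("source").agg(["count", "mean"])
print(nested.columns.tolist())  # [('sent', 'count'), ('sent', 'mean')]

# Series selection + list of functions -> flat column labels.
flat = (
    df.groupby("source")["sent"]
      .agg(["count", "mean"])
      .rename(columns={"mean": "mean_sent"})
      .reset_index()
)
print(flat)
```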
Answered By: Ojha

A shorter version to achieve this is:

df.groupby('source')['sent'].agg(count='size', mean_sent='mean').reset_index()

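The nice thing about this is that you can extend it to take the mean of several variables while counting only once. In that case, group the whole frame and name each output column explicitly (current pandas rejects a dictionary that maps new names to functions, and selecting several columns on a groupby needs double brackets since pandas 2.0):

df.groupby('source').agg(count=('sent1', 'size'), mean_sent1=('sent1', 'mean'), mean_sent2=('sent2', 'mean')).reset_index()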
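
Here is a rough sketch of the "count once, average several columns" idea on a recent pandas, building the named-aggregation arguments in a loop so it scales to more columns. The columns `sent1` and `sent2` and their values are invented for the example, and `value_cols`, `means` and `summary` are hypothetical names.

```python
import pandas as pd

# Hypothetical data: sent1 and sent2 are made-up score columns.
df = pd.DataFrame({
    "source": ["bar", "foo", "bar", "foo", "foo"],
    "sent1": [0.1, -0.8, 0.7, -0.2, -0.5],
    "sent2": [1.0, 2.0, 3.0, 4.0, 6.0],
})

value_cols = ["sent1", "sent2"]

# One (column, function) pair per output column, e.g. mean_sent1=('sent1', 'mean').
means = {f"mean_{col}": (col, "mean") for col in value_cols}

summary = (
    df.groupby("source")
      .agg(count=(value_cols[0], "size"), **means)  # count rows once, average each column
      .reset_index()
)
print(summary)
```

Because `'size'` counts rows regardless of missing values, it does not matter which column the count is attached to.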
Answered By: gasteigerjo

For those who are looking for aggregations over more than two columns (as I was): just add them to ‘agg’.

df = df.groupby(['id']).agg({'texts': 'size', 'char_num': 'mean', 'bytes': 'mean'}).reset_index()
Answered By: João