Pandas Groupby: Count and mean combined
Question:
I am using pandas to summarise a data frame as a count per category, together with the mean sentiment score for each category.
I have a table of strings, each with a sentiment score, and I want to group by text source to get how many posts each source has and the average sentiment of those posts.
My (simplified) data frame looks like this:
source  text         sent
--------------------------
bar     some string   0.13
foo     alt string   -0.8
bar     another str   0.7
foo     some text    -0.2
foo     more text    -0.5
The output from this should be something like this:
source  count  mean_sent
------------------------
foo     3      -0.5
bar     2       0.415
The answer is somewhere along the lines of:
df['sent'].groupby(df['source']).mean()
But this only gives each source and its mean, with no column headers.
Answers:
I think this should provide the output that you wanted:
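# Count rows per source and name the resulting column 'count'
result = df.groupby('source').size().to_frame('count')
# Add the mean sentiment per source; pandas aligns on the 'source' index
result['mean_sent'] = df.groupby('source')['sent'].mean()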
In pandas 0.25 and later you can use named aggregation, which names the output columns directly, so no separate rename step is needed:
# Parentheses let the method chain span several lines
df = (df.groupby('source')
        .agg(count=('text', 'size'), mean_sent=('sent', 'mean'))
        .reset_index())
print(df)
  source  count  mean_sent
0    bar      2      0.415
1    foo      3     -0.500
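To try the named-aggregation approach end to end, here is a minimal sketch that rebuilds the sample frame from the question. The variable name `summary` is my own choice; everything else comes from the question and the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'source': ['bar', 'foo', 'bar', 'foo', 'foo'],
    'text': ['some string', 'alt string', 'another str', 'some text', 'more text'],
    'sent': [0.13, -0.8, 0.7, -0.2, -0.5],
})

# Named aggregation: output_column=(input_column, function).
summary = (
    df.groupby('source')
      .agg(count=('text', 'size'), mean_sent=('sent', 'mean'))
      .reset_index()
)
print(summary)
```

Groups come back sorted by key, so `bar` is listed before `foo`. If you want the largest group first, as in the question's desired output, something like `summary.sort_values('count', ascending=False)` should do it.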
The following should also work:
df[['source','sent']].groupby('source').agg(['count','mean'])
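Two things are worth knowing about the list form. It returns two-level column labels such as `('sent', 'count')`, and `'count'` skips missing values, whereas `size` counts every row. The sketch below illustrates both; the `NaN` in the data is made up for the demonstration and is not part of the question's frame.

```python
import numpy as np
import pandas as pd

# Question's data with one score deliberately set to NaN (an assumption for this sketch).
df = pd.DataFrame({
    'source': ['bar', 'foo', 'bar', 'foo', 'foo'],
    'sent': [0.13, -0.8, 0.7, np.nan, -0.5],
})

stats = df.groupby('source').agg(['count', 'mean'])
# Columns are a two-level MultiIndex: ('sent', 'count'), ('sent', 'mean').
# Join the levels to get flat names like 'sent_count'.
stats.columns = ['_'.join(col) for col in stats.columns]
stats = stats.reset_index()

# size() counts rows per group, including rows where 'sent' is missing.
sizes = df.groupby('source').size()

print(stats)
print(sizes)
```

Here `foo` has three rows but only two non-missing scores, so `sent_count` is 2 while `size()` reports 3. Which one you want depends on whether a post without a score should still count as a post.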
A shorter version to achieve this is:
df.groupby('source')['sent'].agg(count='size', mean_sent='mean').reset_index()
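The nice thing about this approach is that you can extend it to take the mean of several variables while counting only once. Older pandas versions accepted a renaming dictionary here, but that form was removed in pandas 1.0, so name each output column with named aggregation on the whole group instead:
df.groupby('source').agg(count=('sent1', 'size'), mean_sent1=('sent1', 'mean'), mean_sent2=('sent2', 'mean')).reset_index()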
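When there are many score columns, the aggregation spec can be built programmatically rather than typed out. This is a minimal sketch, assuming a frame with hypothetical `sent1` and `sent2` columns; the `sent2` values, `score_cols`, `specs` and `summary` are all invented for illustration.

```python
import pandas as pd

# Hypothetical frame with two score columns; the sent2 values are made up.
df = pd.DataFrame({
    'source': ['bar', 'foo', 'bar', 'foo', 'foo'],
    'sent1': [0.13, -0.8, 0.7, -0.2, -0.5],
    'sent2': [1.0, 2.0, 3.0, 4.0, 6.0],
})

score_cols = ['sent1', 'sent2']

# One (column, 'mean') spec per score column, keyed by the output column name.
specs = {f'mean_{col}': (col, 'mean') for col in score_cols}

summary = (
    df.groupby('source')
      # Count rows once (any column works for 'size'), then unpack the mean specs.
      .agg(count=(score_cols[0], 'size'), **specs)
      .reset_index()
)
print(summary)
```

The result has one `count` column followed by one `mean_...` column per entry in `score_cols`, so adding a third score only means extending that list.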
For those looking to aggregate more than two columns (as I was): just add them to the dictionary passed to agg, keyed by column name.
df = df.groupby(['id']).agg({'texts': 'size', 'char_num': 'mean', 'bytes': 'mean'}).reset_index()