Pandas Sum of Duplicate Attributes
Question:
I’m using Pandas to manipulate a CSV file with several rows and columns that looks like the following:
Fullname Amount Date Zip State .....
John Joe 1 1/10/1900 55555 Confusion
Betty White 5 . . Alaska
Bruce Wayne 10 . . Frustration
John Joe 20 . . .
Betty White 25 . . .
I’d like to create a new column entitled Total with the sum of Amount for each person (identified by Fullname and Zip). I’m having difficulty finding the correct solution.
Let’s just call my csv import csvfile. Here is what I have.
import pandas
df = pandas.read_csv('csvfile.csv', header=0)
df = df.sort_values('Fullname')  # df.sort() has been removed from pandas; sort_values returns a new frame
I think I have to use iterrows to do what I want. The problem with dropping duplicates is that I would lose the amounts, which may differ from row to row.
Answers:
I think you want this:
df['Total'] = df.groupby(['Fullname', 'Zip'])['Amount'].transform('sum')
Here groupby groups by the Fullname and Zip columns, as you’ve stated. We then call transform on the Amount column and compute the total by passing in the string 'sum'. This returns a Series whose index is aligned with the original df, so it can be assigned directly as a new column. You can then drop the duplicates afterwards, e.g.:
new_df = df.drop_duplicates(subset=['Fullname', 'Zip'])
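To make the transform approach concrete, here is a minimal sketch on a hand-built frame. The Zip values are invented for illustration, since the question elides them; only the Fullname and Amount values come from the sample above.

```python
import pandas as pd

# Hypothetical sample data: Zip values are made up for illustration.
df = pd.DataFrame({
    'Fullname': ['John Joe', 'Betty White', 'Bruce Wayne', 'John Joe', 'Betty White'],
    'Amount': [1, 5, 10, 20, 25],
    'Zip': [55555, 11111, 22222, 55555, 11111],
})

# transform('sum') returns one value per original row, so it can be
# assigned straight back as a new column.
df['Total'] = df.groupby(['Fullname', 'Zip'])['Amount'].transform('sum')

# Keep the first row of each (Fullname, Zip) pair; Total is identical
# on every row of a group, so nothing is lost there.
new_df = df.drop_duplicates(subset=['Fullname', 'Zip'])
print(new_df[['Fullname', 'Zip', 'Total']])
```

Note that the Amount column in new_df still holds the amount from the first row of each group, so Total is the column to read afterwards.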
Consider using one of the following:
df = df.groupby(['Fullname', 'Zip'], as_index=False)['Amount'].sum()
df = df.groupby(['Fullname', 'Zip'], as_index=False)['Amount'].agg('sum')
Both store the result in the Amount column, with one row per (Fullname, Zip) group. Note that cumsum() is not equivalent here: df.groupby(['Fullname', 'Zip'])['Amount'].cumsum() returns a running total for each original row rather than one total per group. Since the meaning of the column changes, you could rename it with df.rename():
df = df.rename(columns={'Amount':'Total'})
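As a rough sketch of the aggregate-then-rename route, again on a hand-built frame with invented Zip values, the snippet below also shows how cumsum() behaves on the same groups, so the difference is visible.

```python
import pandas as pd

# Hypothetical sample data: Zip values are made up for illustration.
df = pd.DataFrame({
    'Fullname': ['John Joe', 'Betty White', 'Bruce Wayne', 'John Joe', 'Betty White'],
    'Amount': [1, 5, 10, 20, 25],
    'Zip': [55555, 11111, 22222, 55555, 11111],
})

# One row per (Fullname, Zip) group; as_index=False keeps the keys as columns.
totals = (
    df.groupby(['Fullname', 'Zip'], as_index=False)['Amount']
      .sum()
      .rename(columns={'Amount': 'Total'})
)
print(totals)

# For contrast: cumsum() does not collapse the groups. It returns a
# running total per group, aligned to the original rows.
running = df.groupby(['Fullname', 'Zip'])['Amount'].cumsum()
print(running)
```

By default groupby sorts the group keys, so the rows of totals come out in key order rather than in order of first appearance.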
If you want to keep one value from the other columns, you could use agg(), which accepts a dict mapping column labels to functions, specifying which operation to perform on each column.
df.groupby(['Fullname', 'Zip'], as_index=False).agg({'Amount': 'sum', 'State': 'first'})
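Here is a small sketch of the dict form of agg() in action. The Zip values, and the State values on the repeated rows, are assumptions made for illustration because the question leaves them blank. The second call uses named aggregation, which I believe requires pandas 0.25 or later, to produce the Total column name in one step.

```python
import pandas as pd

# Hypothetical sample data: Zip values and the repeated State values are invented.
df = pd.DataFrame({
    'Fullname': ['John Joe', 'Betty White', 'Bruce Wayne', 'John Joe', 'Betty White'],
    'Amount': [1, 5, 10, 20, 25],
    'Zip': [55555, 11111, 22222, 55555, 11111],
    'State': ['Confusion', 'Alaska', 'Frustration', 'Confusion', 'Alaska'],
})

# Sum Amount and keep the first State seen in each group.
result = df.groupby(['Fullname', 'Zip'], as_index=False).agg(
    {'Amount': 'sum', 'State': 'first'}
)
print(result)

# Named aggregation: output column name on the left, (input column, function) on the right.
named = df.groupby(['Fullname', 'Zip'], as_index=False).agg(
    Total=('Amount', 'sum'),
    State=('State', 'first'),
)
print(named)
```

Columns not listed in the dict (or in the named arguments) are dropped from the result, so every column you want to keep needs an entry.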