Python – How to do accumulative sums depending on the value of a column
Question:
I have a DataFrame and I want to add a column holding the cumulative sum of one column, but counting only the rows where another column has a specific value.
For example, my dataframe is as follows:
| Type | Quantity |
| A | 30 |
| B | 10 |
| B | 5 |
| A | 3 |
I would like to add a column SumA that computes the cumulative sum of the quantities only when Type == A.
I have tried this:
data['SumA'] = data['Quantity'].cumsum() if data[(data['Type'] == 'A')]
I keep getting errors and I’m not sure how I can solve them, could someone please give me a hand?
I would like to get something like this:
| Type | Quantity | Sum A | Sum B |
| A | 30 | 30 | 0 |
| B | 10 | 30 | 10 |
| B | 5 | 30 | 15 |
| A | 3 | 33 | 15 |
Answers:
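The error you are getting here is a SyntaxError: in Python, an expression of the form x if condition must also have an else branch. Even with an else it would not do what you want, because if tests a single truth value and cannot be used to select rows of a DataFrame.
Instead, to select the rows you want, you can use a boolean mask:
data[(data['Type'] == 'A')]['Quantity']
This returns the Quantity column for the rows whose Type equals 'A'.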
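As a small check of that selection, here is a minimal sketch that rebuilds the sample data from the question (the variable names mask and quantities_a are only illustrative). It uses .loc with the same boolean mask, which is equivalent for reading values:

```python
import pandas as pd

# Sample data copied from the question
data = pd.DataFrame({"Type": ["A", "B", "B", "A"], "Quantity": [30, 10, 5, 3]})

# Boolean Series: True where Type is 'A'
mask = data["Type"] == "A"

# Select the Quantity values of the matching rows only
quantities_a = data.loc[mask, "Quantity"]
print(quantities_a)
```

Note that the selected rows keep their original index labels (0 and 3). That is why assigning the result back to the DataFrame lines up with the right rows and leaves NaN in the others.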
So in your case, the assignment becomes:
data['sumA'] = data[(data['Type'] == 'A')]['Quantity'].cumsum()
Rows whose Type is not 'A' get NaN in the new column, because the assignment aligns on the index. To get the expected output, do this once for each of types A and B, then fill the missing NaN values.
data['sumA'] = data[(data['Type'] == 'A')]['Quantity'].cumsum()
data['sumB'] = data[(data['Type'] == 'B')]['Quantity'].cumsum()
# Fill NaN values with the previously available value
# (fillna(method='ffill') is deprecated in recent pandas versions; ffill() does the same)
data.ffill(inplace=True)
# The first values don't have any previous value, so fill with zero
data.fillna(value=0, inplace=True)
This returns the expected output.
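If you would rather avoid the NaN values altogether, one alternative (not the approach above, just a sketch under the assumption that the data looks like the sample in the question) is to replace non-matching quantities with 0 using Series.where before accumulating:

```python
import pandas as pd

# Sample data copied from the question
data = pd.DataFrame({"Type": ["A", "B", "B", "A"], "Quantity": [30, 10, 5, 3]})

# Keep Quantity where Type matches, use 0 elsewhere, then accumulate
data["sumA"] = data["Quantity"].where(data["Type"] == "A", 0).cumsum()
data["sumB"] = data["Quantity"].where(data["Type"] == "B", 0).cumsum()
print(data)
```

Since no NaN is ever introduced, the new columns stay integer-typed and no fill step is needed.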
I thought of a somewhat more general solution, which can surely be optimized (I will try to keep working on it):
We iterate over the unique values of the Type column and create a sum{value} column for each one. Each column holds the cumsum of its own Type value, while non-matching rows are NaN.
Then I fill the NaN values with the nearest preceding valid value. The last line handles the special case where the first item in the column is NaN and needs to be 0.
for column in data['Type'].unique():
    column_name = f'sum{column}'
    data[column_name] = data[data['Type'] == column]['Quantity'].cumsum()
    # Assign the result back instead of filling in place on a column selection
    data[column_name] = data[column_name].ffill()
    data[column_name] = data[column_name].fillna(0)
Output:
  Type  Quantity  sumA  sumB
0    A        30  30.0   0.0
1    B        10  30.0  10.0
2    B         5  30.0  15.0
3    A         3  33.0  15.0
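The same per-type columns can also be built without an explicit loop. The following is only a sketch of a different technique (indicator columns from pd.get_dummies multiplied by Quantity), assuming the sample data from the question; the names indicators, running and result are illustrative:

```python
import pandas as pd

# Sample data copied from the question
data = pd.DataFrame({"Type": ["A", "B", "B", "A"], "Quantity": [30, 10, 5, 3]})

# One 0/1 column per distinct Type value
indicators = pd.get_dummies(data["Type"], dtype=int)

# Zero out quantities of other types row by row, then accumulate each column
running = indicators.mul(data["Quantity"], axis=0).cumsum().add_prefix("sum")

# Attach the new columns to the original frame
result = data.join(running)
print(result)
```

This handles any number of distinct Type values in one pass and keeps the sums as integers, at the cost of building one temporary column per type.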