Add rows in Pandas based on condition (grouping)

Question:

I’ve googled quite some bit regarding this and could not find an answer that applied to my problem. The issue Im having is that I’ve got a dataframe and each row has a variable and I want to continuously insert rows with variable C which is the value of variable A + B. Example:

      TOWN        YEAR           Var     Value                 
      Amsterdam   2019            A        1
      Amsterdam   2019            B        2
      Amsterdam   2020            A        1
      Amsterdam   2020            B        3
      Rotterdam   2019            A        4
      Rotterdam   2019            B        4
      Rotterdam   2020            A        5
      Rotterdam   2020            B        2

Where the desired output would insert a row and sum A and B respectively for rows that are identical in the other columns. My attempt right now backfired as I used groupby and sum, then converted it into a list and then just tried to append it as a seperate column (var_C). The reason it backfired is because I had to duplicate each value to match the length of the original dataset. In the end the length of the list did not match the length of the original dataset.

data_current = data[data['var'].isin(['A', 'B'])]
data_var_c = data_current.groupby(['TOWN', 'year'])['value'].sum()
values = data_var_c.tolist()
values_dup = [val for val in values for _ in (0, 1)]
len(values_dup)

Any feedback would be appreciated!

Asked By: ajayy

||

Answers:

You can use groupby and pd.concat:

result = (
    pd.concat([
        df,
        df.groupby(['TOWN', 'YEAR'], as_index=False)
        .agg(sum)
        .assign(Var = 'C')
        ])
    )

result = result.sort_values(['TOWN', 'YEAR', 'Var'])

OUTPUT:

       TOWN  YEAR Var  Value
0  Amsterdam  2019   A      1
1  Amsterdam  2019   B      2
0  Amsterdam  2019   C      3
2  Amsterdam  2020   A      1
3  Amsterdam  2020   B      3
1  Amsterdam  2020   C      4
4  Rotterdam  2019   A      4
5  Rotterdam  2019   B      4
2  Rotterdam  2019   C      8
6  Rotterdam  2020   A      5
7  Rotterdam  2020   B      2
3  Rotterdam  2020   C      7
Answered By: Nk03

I overcomplicated it, it is as simple as grouping by TOWN and Year, taking the value column and apply a sum function to get the overall sum:

data['c'] = data_current.groupby(['TOWN', 'year'])['value'].transform('sum')

This, however, is not the desired output as it adds the summation as another column. Whereas, Nk03’s answer adds the summation as a seperate row.

Answered By: ajayy

A pivot stack option:

import pandas as pd

df = pd.DataFrame({
    'TOWN': {0: 'Amsterdam', 1: 'Amsterdam', 2: 'Amsterdam', 3: 'Amsterdam',
             4: 'Rotterdam', 5: 'Rotterdam', 6: 'Rotterdam', 7: 'Rotterdam'},
    'YEAR': {0: 2019, 1: 2019, 2: 2020, 3: 2020, 4: 2019, 5: 2019, 6: 2020,
             7: 2020},
    'Var': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'A', 5: 'B', 6: 'A', 7: 'B'},
    'Value': {0: 1, 1: 2, 2: 1, 3: 3, 4: 4, 5: 4, 6: 5, 7: 2}
})

new_df = df.pivot(index=['TOWN', 'YEAR'], columns='Var')['Value'] 
    .assign(C=lambda x: x.agg('sum', axis=1)) 
    .stack() 
    .rename('Value') 
    .reset_index()

print(new_df)

new_df:

         TOWN  YEAR Var  Value
0   Amsterdam  2019   A      1
1   Amsterdam  2019   B      2
2   Amsterdam  2019   C      3
3   Amsterdam  2020   A      1
4   Amsterdam  2020   B      3
5   Amsterdam  2020   C      4
6   Rotterdam  2019   A      4
7   Rotterdam  2019   B      4
8   Rotterdam  2019   C      8
9   Rotterdam  2020   A      5
10  Rotterdam  2020   B      2
11  Rotterdam  2020   C      7
Answered By: Henry Ecker
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.