Getting row sums for specific columns – Python

Question:

I created a simple dataset df with three columns, Area, Year_2010, Year_2020. The related code:

# Import pandas library
import pandas as pd
  
# initialize list of lists
data = [['Netherlands', 100, 200], ['Belgium', 15, 80], ['Germany', 125, 300]]
  
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Area', 'Year_2010', 'Year_2020'])
  
# print dataframe.
df

#prints
          Area  Year_2010  Year_2020
0  Netherlands        100        200
1      Belgium         15         80
2      Germany        125        300

For two of the three columns (Year_2010 and Year_2020), I would like to compute the sum over all rows. The sums should appear in a new row at the bottom of the table.

Ideally, the output reads:

                        Area  Year_2010  Year_2020
0                Netherlands        100        200
1                    Belgium         15         80
2                    Germany        125        300
3                                   240        580

However, if I apply

total = df.sum(axis=0)
df = df.append(total, ignore_index=True)

The Area column is aggregated as well, which concatenates the country names:

                        Area  Year_2010  Year_2020
0                Netherlands        100        200
1                    Belgium         15         80
2                    Germany        125        300
3  NetherlandsBelgiumGermany        240        580

How can I get the row sums for specific columns, while leaving out other columns?

Instead of leaving the Area cell of the new row blank (where NetherlandsBelgiumGermany now appears), it could also hold the label Total.

Asked By: Foeke Boersma

||

Answers:

One solution could be to filter the columns by type and use indexing:

cols = df.select_dtypes('number').columns

df.loc[len(df), cols] = df[cols].sum()

updated DataFrame:

          Area  Year_2010  Year_2020
0  Netherlands      100.0      200.0
1      Belgium       15.0       80.0
2      Germany      125.0      300.0
3          NaN      240.0      580.0
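The year columns show up as floats in the output above, presumably because the new row holds a missing value in Area when it is created. A minimal sketch, assuming the year columns only hold whole numbers, of casting them back to integers afterwards:

```python
import pandas as pd

data = [['Netherlands', 100, 200], ['Belgium', 15, 80], ['Germany', 125, 300]]
df = pd.DataFrame(data, columns=['Area', 'Year_2010', 'Year_2020'])

# Select the numeric columns and write their sums into a new row
cols = df.select_dtypes('number').columns
df.loc[len(df), cols] = df[cols].sum()

# Assumption: the numeric columns contain no missing values,
# so converting them back to integers is safe
df[cols] = df[cols].astype(int)
```

The Area cell of the new row stays NaN with this approach.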

Alternatively, if you want an empty string instead of NaN:

cols = df.select_dtypes('number').columns

df.loc[len(df)] = df[cols].sum().reindex(df.columns, fill_value='')

output:

          Area  Year_2010  Year_2020
0  Netherlands        100        200
1      Belgium         15         80
2      Germany        125        300
3                     240        580
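The question also mentions Total as an acceptable label. As a sketch of the same reindex idea, only the fill value changes:

```python
import pandas as pd

data = [['Netherlands', 100, 200], ['Belgium', 15, 80], ['Germany', 125, 300]]
df = pd.DataFrame(data, columns=['Area', 'Year_2010', 'Year_2020'])

cols = df.select_dtypes('number').columns

# Sum the numeric columns, then give every other column (here: Area)
# the label 'Total' before writing the new row
df.loc[len(df)] = df[cols].sum().reindex(df.columns, fill_value='Total')
```

Note that fill_value is applied to every non-numeric column, which is fine here because Area is the only one.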

NB. This assumes a range index (0, 1, 2, ...), so that len(df) is the next free row label. If that is not the case, please update the example so that alternative ways can be suggested (using pandas.concat, for example).
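For illustration, a rough sketch of what the pandas.concat route might look like when the index is not a range index. The index labels 'nl', 'be', 'de' and the row label 'total' are made up for this example:

```python
import pandas as pd

data = [['Netherlands', 100, 200], ['Belgium', 15, 80], ['Germany', 125, 300]]
# Hypothetical non-range index, only to show the idea
df = pd.DataFrame(data, columns=['Area', 'Year_2010', 'Year_2020'],
                  index=['nl', 'be', 'de'])

cols = df.select_dtypes('number').columns

# Build a one-row DataFrame with the sums; 'total' is an assumed row label
total_row = df[cols].sum().to_frame('total').T

# Append the row; Area is missing in total_row, so it becomes NaN there
out = pd.concat([df, total_row])

# Replace the missing Area value with an empty string
out['Area'] = out['Area'].fillna('')
```

Since the new row gets its own label, this does not rely on len(df) being unused in the index.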

Answered By: mozway

You could do that manually, once the totals row has been added:

df.at[3, "Area"] = "" # or whatever else you want

Answered By: Jock
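The manual fix only works once the totals row exists. A minimal end-to-end sketch, assuming a recent pandas version: DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead, and numeric_only=True is one way to keep the Area column out of the sum:

```python
import pandas as pd

data = [['Netherlands', 100, 200], ['Belgium', 15, 80], ['Germany', 125, 300]]
df = pd.DataFrame(data, columns=['Area', 'Year_2010', 'Year_2020'])

# Sum only the numeric columns, so Area is left out of the aggregation
total = df.sum(numeric_only=True)

# Turn the sums into a one-row frame and append it (replacement for df.append)
df = pd.concat([df, total.to_frame().T], ignore_index=True)

# Manually set the Area cell of the last row, as suggested above
df.at[len(df) - 1, 'Area'] = 'Total'  # or '' for a blank cell
```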