How to set a value of a column to the value of grouping with pandas?
Question:
The request
My current solution is totally whacky and inefficient with nested loops. There must be a proper way to get the same results.
I wonder if all this can be achieved in just 1 step, or the first step with grouping is necessary?
My solution:
Original data:
Important: This is only the head of the data. Real data contains all the countries. Note this to avoid confusion about the results.
year rank country continent gdp gdp_percent
0 1960 1 USA America 543300000000 0.468483
1 1960 2 UK Europe 73233967692 0.063149
2 1960 3 France Europe 62225478000 0.053656
3 1960 4 China Asia 59716467625 0.051493
4 1960 5 Japan Asia 44307342950 0.038206
The problem
I want to add a column that will have gdp percent of a continent. It’s a simple operation:
1 / SUM(GDP_PERCENT)(grouped by year+continent) * gdp_percent
Code:
## Get the sum
df_grouped = df.groupby(['year', 'continent'])['gdp_percent'].sum().to_frame()
## Nested loops ...
for grouped_df_index, grouped_df_row in df_grouped.iterrows():
for df_index, df_row in df.iterrows():
if df_row[0] == grouped_df_row[0] and df_row[3] == grouped_df_row[1]:
df.loc[df_index, 'gdp_percent_of_continent'] = grouped_df_row[2]
Final result (the last column):
year rank country continent gdp gdp_percent gdp_percent_of_continent
0 1960 1 USA America 543300000000 0.468483 0.848412
1 1960 2 UK Europe 73233967692 0.063149 0.257104
2 1960 3 France Europe 62225478000 0.053656 0.218456
3 1960 4 China Asia 59716467625 0.051493 0.333881
4 1960 5 Japan Asia 44307342950 0.038206 0.247727
Answers:
try:
df['gdp_percent_sum'] = df.groupby(['year', 'continent'])['gdp_percent'].transform(sum)
df['gdp_percent_of_continent'] = (1/df['gdp_percent_sum'])*df['gdp_percent']
df = df.drop(columns="gdp_percent_sum")
year rank country continent gdp gdp_percent gdp_percent_of_continent
0 1960 1 USA America 54330000000 0.468483 1.000000
1 1960 2 UK Europe 73233967692 0.063149 0.540636
2 1960 3 France Europe 62225478000 0.053656 0.459364
3 1960 4 China Asia 59716467625 0.051493 0.574064
4 1960 5 Japan Asia 44307342950 0.038206 0.425936
You can group
the dataframe by year
and continent
then for gdp_percent
column, you can call transform
and pass a lambda function with the formula:
df['gdp_percent_of_continent'] = (
df
.groupby(['year', 'continent'])['gdp_percent']
.transform(lambda x: (1 / x.sum()) * x)
)
year rank country continent gdp gdp_percent
0 1960 1 USA America 543300000000 0.468483
1 1960 2 UK Europe 73233967692 0.063149
2 1960 3 France Europe 62225478000 0.053656
3 1960 4 China Asia 59716467625 0.051493
4 1960 5 Japan Asia 44307342950 0.038206
gdp_percent_of_continent
0 1.000000
1 0.540636
2 0.459364
3 0.574064
4 0.425936
The request
My current solution is totally whacky and inefficient with nested loops. There must be a proper way to get the same results.
I wonder if all this can be achieved in just 1 step, or the first step with grouping is necessary?
My solution:
Original data:
Important: This is only the head of the data. Real data contains all the countries. Note this to avoid confusion about the results.
year rank country continent gdp gdp_percent
0 1960 1 USA America 543300000000 0.468483
1 1960 2 UK Europe 73233967692 0.063149
2 1960 3 France Europe 62225478000 0.053656
3 1960 4 China Asia 59716467625 0.051493
4 1960 5 Japan Asia 44307342950 0.038206
The problem
I want to add a column that will have gdp percent of a continent. It’s a simple operation:
1 / SUM(GDP_PERCENT)(grouped by year+continent) * gdp_percent
Code:
## Get the sum
df_grouped = df.groupby(['year', 'continent'])['gdp_percent'].sum().to_frame()
## Nested loops ...
for grouped_df_index, grouped_df_row in df_grouped.iterrows():
for df_index, df_row in df.iterrows():
if df_row[0] == grouped_df_row[0] and df_row[3] == grouped_df_row[1]:
df.loc[df_index, 'gdp_percent_of_continent'] = grouped_df_row[2]
Final result (the last column):
year rank country continent gdp gdp_percent gdp_percent_of_continent
0 1960 1 USA America 543300000000 0.468483 0.848412
1 1960 2 UK Europe 73233967692 0.063149 0.257104
2 1960 3 France Europe 62225478000 0.053656 0.218456
3 1960 4 China Asia 59716467625 0.051493 0.333881
4 1960 5 Japan Asia 44307342950 0.038206 0.247727
try:
df['gdp_percent_sum'] = df.groupby(['year', 'continent'])['gdp_percent'].transform(sum)
df['gdp_percent_of_continent'] = (1/df['gdp_percent_sum'])*df['gdp_percent']
df = df.drop(columns="gdp_percent_sum")
year rank country continent gdp gdp_percent gdp_percent_of_continent
0 1960 1 USA America 54330000000 0.468483 1.000000
1 1960 2 UK Europe 73233967692 0.063149 0.540636
2 1960 3 France Europe 62225478000 0.053656 0.459364
3 1960 4 China Asia 59716467625 0.051493 0.574064
4 1960 5 Japan Asia 44307342950 0.038206 0.425936
You can group
the dataframe by year
and continent
then for gdp_percent
column, you can call transform
and pass a lambda function with the formula:
df['gdp_percent_of_continent'] = (
df
.groupby(['year', 'continent'])['gdp_percent']
.transform(lambda x: (1 / x.sum()) * x)
)
year rank country continent gdp gdp_percent
0 1960 1 USA America 543300000000 0.468483
1 1960 2 UK Europe 73233967692 0.063149
2 1960 3 France Europe 62225478000 0.053656
3 1960 4 China Asia 59716467625 0.051493
4 1960 5 Japan Asia 44307342950 0.038206
gdp_percent_of_continent
0 1.000000
1 0.540636
2 0.459364
3 0.574064
4 0.425936