Creating a Calculated Row Based on Multiple Conditions in a Dataframe with Python
Question:
Trying to find the best way to generate an ‘Other’ row in my pandas dataframe. ‘Other’ is calculated by adding up all the Source values that are not ‘Total’ and then subtracting by the ‘Total’ values.
Ex: ‘Other’ = Total – (Souce_1 + Souce_2 + Souce_3)
Here’s an example of what I am starting with:
Name
Source
Lead
Sale
Prop_A
Source_1
100
3
Prop_A
Source_2
50
5
Prop_A
Source_3
20
0
Prop_A
Total
300
11
Prop_B
Source_1
200
10
Prop_B
Source_2
300
6
Prop_B
Source_3
20
0
Prop_B
Total
700
23
And this is what I am try to create:
Name
Source
Lead
Sale
Prop_A
Source_1
100
3
Prop_A
Source_2
50
5
Prop_A
Source_3
20
0
Prop_A
Other
130
3
Prop_A
Total
300
11
Prop_B
Source_1
200
10
Prop_B
Source_2
300
6
Prop_B
Source_3
20
0
Prop_B
Other
180
7
Prop_B
Total
700
23
I was able to calculate the ‘Other’ row by using following code, but know this isn’t the best way to do it. Wondering if anyone knows a better way?
Total_df = df[df['Source'] == 'Total']
All_Sources_df = df[df['Source'] != 'Total']
All_Sources_df = All_Sources_df.groupby(['Name'], as_index=False).sum()
result = pd.merge(Total_df, All_Sources_df, on=['Name'])
result['Lead'] = result['Lead_x'] - result['Lead_y']
result['Sale'] = result['Sale_x'] - result['Sale_y']
result = result[['Name', 'Lead', 'Sale']]
result['Source'] = 'Other'
result = result[['Name','Source','Lead','Sale']]
Answers:
You could try this. I assumed you may have more columns other than Total, so in this case we can use ~
with pd.DataFrame.isin
to refer to all the other columns other that Total :
(df.set_index('Name')
.pivot(columns='Source')
.stack(level=0)
.assign(other = lambda d: d['Total'] - (d.loc[:, ~ d.columns.isin(['Total'])].sum(axis=1)))
.stack()
.reset_index()
.pivot(index=['Name', 'Source'], columns='level_1')[0])
level_1 Lead Sale
Name Source
Prop_A Source_1 100 3
Source_2 50 5
Source_3 20 0
Total 300 11
other 130 3
Prop_B Source_1 200 10
Source_2 300 6
Source_3 20 0
Total 700 23
other 180 7
You can use:
cols = ['Name', 'Lead', 'Sale']
m = df['Source'] == 'Total'
other = (df.loc[m, cols].set_index('Name')
.sub(df.loc[~m, cols].groupby('Name').sum())
.assign(Source='Other').reset_index())
result = pd.concat([df, other]).sort_values(['Name', 'Source'], ignore_index=True)
print(result)
# Output
Name Source Lead Sale
0 Prop_A Other 130 3
1 Prop_A Source_1 100 3
2 Prop_A Source_2 50 5
3 Prop_A Source_3 20 0
4 Prop_A Total 300 11
5 Prop_B Other 180 7
6 Prop_B Source_1 200 10
7 Prop_B Source_2 300 6
8 Prop_B Source_3 20 0
9 Prop_B Total 700 23
With custom apply
function and pd.concat
(to arrange records for each group):
def f(x):
m = x['Source'].eq('Total') # mask denoting Total record
t = x.loc[m, ['Lead', 'Sale']] # 'Total' values
return pd.concat([x[~m], t.sub(x.loc[~m, ['Lead', 'Sale']].sum())
.assign(Source='Other', Name=x.Name), x[m]])
df = df.groupby('Name', as_index=False).apply(f).reset_index(drop=True)
Name Source Lead Sale
0 Prop_A Source_1 100 3
1 Prop_A Source_2 50 5
2 Prop_A Source_3 20 0
3 Prop_A Other 130 3
4 Prop_A Total 300 11
5 Prop_B Source_1 200 10
6 Prop_B Source_2 300 6
7 Prop_B Source_3 20 0
8 Prop_B Other 180 7
9 Prop_B Total 700 23
Trying to find the best way to generate an ‘Other’ row in my pandas dataframe. ‘Other’ is calculated by adding up all the Source values that are not ‘Total’ and then subtracting by the ‘Total’ values.
Ex: ‘Other’ = Total – (Souce_1 + Souce_2 + Souce_3)
Here’s an example of what I am starting with:
Name | Source | Lead | Sale |
---|---|---|---|
Prop_A | Source_1 | 100 | 3 |
Prop_A | Source_2 | 50 | 5 |
Prop_A | Source_3 | 20 | 0 |
Prop_A | Total | 300 | 11 |
Prop_B | Source_1 | 200 | 10 |
Prop_B | Source_2 | 300 | 6 |
Prop_B | Source_3 | 20 | 0 |
Prop_B | Total | 700 | 23 |
And this is what I am try to create:
Name | Source | Lead | Sale |
---|---|---|---|
Prop_A | Source_1 | 100 | 3 |
Prop_A | Source_2 | 50 | 5 |
Prop_A | Source_3 | 20 | 0 |
Prop_A | Other | 130 | 3 |
Prop_A | Total | 300 | 11 |
Prop_B | Source_1 | 200 | 10 |
Prop_B | Source_2 | 300 | 6 |
Prop_B | Source_3 | 20 | 0 |
Prop_B | Other | 180 | 7 |
Prop_B | Total | 700 | 23 |
I was able to calculate the ‘Other’ row by using following code, but know this isn’t the best way to do it. Wondering if anyone knows a better way?
Total_df = df[df['Source'] == 'Total']
All_Sources_df = df[df['Source'] != 'Total']
All_Sources_df = All_Sources_df.groupby(['Name'], as_index=False).sum()
result = pd.merge(Total_df, All_Sources_df, on=['Name'])
result['Lead'] = result['Lead_x'] - result['Lead_y']
result['Sale'] = result['Sale_x'] - result['Sale_y']
result = result[['Name', 'Lead', 'Sale']]
result['Source'] = 'Other'
result = result[['Name','Source','Lead','Sale']]
You could try this. I assumed you may have more columns other than Total, so in this case we can use ~
with pd.DataFrame.isin
to refer to all the other columns other that Total :
(df.set_index('Name')
.pivot(columns='Source')
.stack(level=0)
.assign(other = lambda d: d['Total'] - (d.loc[:, ~ d.columns.isin(['Total'])].sum(axis=1)))
.stack()
.reset_index()
.pivot(index=['Name', 'Source'], columns='level_1')[0])
level_1 Lead Sale
Name Source
Prop_A Source_1 100 3
Source_2 50 5
Source_3 20 0
Total 300 11
other 130 3
Prop_B Source_1 200 10
Source_2 300 6
Source_3 20 0
Total 700 23
other 180 7
You can use:
cols = ['Name', 'Lead', 'Sale']
m = df['Source'] == 'Total'
other = (df.loc[m, cols].set_index('Name')
.sub(df.loc[~m, cols].groupby('Name').sum())
.assign(Source='Other').reset_index())
result = pd.concat([df, other]).sort_values(['Name', 'Source'], ignore_index=True)
print(result)
# Output
Name Source Lead Sale
0 Prop_A Other 130 3
1 Prop_A Source_1 100 3
2 Prop_A Source_2 50 5
3 Prop_A Source_3 20 0
4 Prop_A Total 300 11
5 Prop_B Other 180 7
6 Prop_B Source_1 200 10
7 Prop_B Source_2 300 6
8 Prop_B Source_3 20 0
9 Prop_B Total 700 23
With custom apply
function and pd.concat
(to arrange records for each group):
def f(x):
m = x['Source'].eq('Total') # mask denoting Total record
t = x.loc[m, ['Lead', 'Sale']] # 'Total' values
return pd.concat([x[~m], t.sub(x.loc[~m, ['Lead', 'Sale']].sum())
.assign(Source='Other', Name=x.Name), x[m]])
df = df.groupby('Name', as_index=False).apply(f).reset_index(drop=True)
Name Source Lead Sale
0 Prop_A Source_1 100 3
1 Prop_A Source_2 50 5
2 Prop_A Source_3 20 0
3 Prop_A Other 130 3
4 Prop_A Total 300 11
5 Prop_B Source_1 200 10
6 Prop_B Source_2 300 6
7 Prop_B Source_3 20 0
8 Prop_B Other 180 7
9 Prop_B Total 700 23