Pandas groupby.agg with condition
Question:
I have a pandas data frame similar to this:
name
sales
profit
profit_flag
Joe
200
100
True
Joe
300
150
False
Mark
200
100
True
Mark
300
150
True
Judy
300
150
False
The actual data frame has 100 columns.
The idea is: I want to group by name, and aggregate all the columns. However, certain columns depend on a flag. In this case, sales
will be aggregated no matter what, but profit
should be included in the aggregation only if profit_flag
is True.
It should look like this if we use sum
:
name
sales
profit
Joe
500
100
Judy
300
Nan
Mark
500
250
Is there anyway I can do this from one line using df.groupby('name').agg()
?
Right now I’m using:
grouped = pd.DataFrame()
grouped['sales'] = df.groupby('name').sales.sum()
grouped['profit'] = df[df.profit_flag].groupby('name').profit.sum()
I’m getting the correct result, but since the actual data frame has many more columns, I wanted to know if I could somehow write something like this to avoid the clutter:
grouped = df.groupby('name').agg({
'sales': 'sum',
'profit' 'sum' #if profit_flag })
Is this even possible or should I just group ‘flag dependent columns’ in separate statemetns?
Answers:
You can mask the values prior to aggregation:
(df.assign(profit=lambda d: d['profit'].where(d['profit_flag']))
.groupby('name', as_index=False)[['sales', 'profit']].sum(min_count=1)
)
Output:
name sales profit
0 Joe 500 100.0
1 Judy 300 NaN
2 Mark 500 250.0
I have a pandas data frame similar to this:
name | sales | profit | profit_flag |
---|---|---|---|
Joe | 200 | 100 | True |
Joe | 300 | 150 | False |
Mark | 200 | 100 | True |
Mark | 300 | 150 | True |
Judy | 300 | 150 | False |
The actual data frame has 100 columns.
The idea is: I want to group by name, and aggregate all the columns. However, certain columns depend on a flag. In this case, sales
will be aggregated no matter what, but profit
should be included in the aggregation only if profit_flag
is True.
It should look like this if we use sum
:
name | sales | profit |
---|---|---|
Joe | 500 | 100 |
Judy | 300 | Nan |
Mark | 500 | 250 |
Is there anyway I can do this from one line using df.groupby('name').agg()
?
Right now I’m using:
grouped = pd.DataFrame()
grouped['sales'] = df.groupby('name').sales.sum()
grouped['profit'] = df[df.profit_flag].groupby('name').profit.sum()
I’m getting the correct result, but since the actual data frame has many more columns, I wanted to know if I could somehow write something like this to avoid the clutter:
grouped = df.groupby('name').agg({
'sales': 'sum',
'profit' 'sum' #if profit_flag })
Is this even possible or should I just group ‘flag dependent columns’ in separate statemetns?
You can mask the values prior to aggregation:
(df.assign(profit=lambda d: d['profit'].where(d['profit_flag']))
.groupby('name', as_index=False)[['sales', 'profit']].sum(min_count=1)
)
Output:
name sales profit
0 Joe 500 100.0
1 Judy 300 NaN
2 Mark 500 250.0