In Pandas, how to retrieve the rows which created each group, after aggregation and filtering?
Question:
Let
import pandas as pd
df = pd.DataFrame(
{
'a': ['A', 'A', 'B', 'B', 'B', 'C'],
'b': [True, True, True, False, False, True]
}
)
print(df)
groups = df.groupby('a') # "A", "B", "C"
agg_groups = groups.agg({'b':lambda x: all(x)}) # "A": True, "B": False, "C": True
agg_df = agg_groups.reset_index()
filtered_df = agg_df[agg_df["b"]] # "A": True, "C": True
print(filtered_df)
# Now I want to get back the original df's rows, but only the remaining ones after group filtering
Current output:
a b
0 A True
1 A True
2 B True
3 B False
4 B False
5 C True
a b
0 A True
2 C True
Required:
a b
0 A True
1 A True
2 B True
3 B False
4 B False
5 C True
a b
0 A True
2 C True
a b
0 A True
1 A True
5 C True
Answers:
Use GroupBy.transform with 'all' to build a boolean mask the same length as the original DataFrame (True for every row whose group is all True), then filter with boolean indexing:
df1 = df[df.groupby('a')['b'].transform('all')]
#alternative
#f = lambda x: x.all()
#df1 = df[df.groupby('a')['b'].transform(f)]
print (df1)
a b
0 A True
1 A True
5 C True
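To make the mask visible, here is a minimal sketch that splits the one-liner into two steps. The variable name `mask` is only illustrative; the data is the frame from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [True, True, True, False, False, True],
})

# transform('all') computes all() per group and broadcasts the result
# back to every row of that group, so the mask keeps df's original index.
mask = df.groupby('a')['b'].transform('all')
print(mask.tolist())  # [True, True, False, False, False, True]

# Boolean indexing keeps only the rows whose group passed the test.
df1 = df[mask]
print(df1)
```

Because the mask shares its index with df, no merge or isin lookup is needed.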
If you prefer to filter on the aggregated output, the result is a boolean Series indexed by group key; select the keys that are True and match them against the original column a:
ids = df.groupby('a')['b'].all()
df1 = df[df.a.isin(ids.index[ids])]
print (df1)
a b
0 A True
1 A True
5 C True
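The intermediate objects in this approach can be inspected with a short sketch like the one below. The name `keep` is illustrative, and `ids[ids].index` is used here as an equivalent spelling of `ids.index[ids]`.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [True, True, True, False, False, True],
})

# One boolean per group, indexed by the group key.
ids = df.groupby('a')['b'].all()
print(ids.to_dict())  # {'A': True, 'B': False, 'C': True}

# Keep only the keys whose aggregated value is True.
keep = ids[ids].index
print(list(keep))  # ['A', 'C']

# Map the surviving keys back onto the original rows.
df1 = df[df['a'].isin(keep)]
print(df1)
```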
Your solution works the same way if you filter on column b of the aggregated frame:
groups = df.groupby('a')
agg_groups = groups.agg({'b':lambda x: all(x)})
df1 = df[df.a.isin(agg_groups.index[agg_groups['b']])]
print (df1)
a b
0 A True
1 A True
5 C True
df[df['a'].isin(filtered_df['a'].unique())]
Results in:
a b
0 A True
1 A True
5 C True
One can filter the original df by keeping the rows whose value in column a is present in column a of filtered_df. There are several ways to do this; two options follow.
Option 1
As per OP’s request to use a custom lambda, one can use pandas.DataFrame.apply as follows:
final_df = df[df.apply(lambda row: row['a'] in filtered_df['a'].values, axis=1)]
[Out]:
a b
0 A True
1 A True
5 C True
Option 2
Another way to solve it is to filter the original df by keeping the rows where column a is present in column a of filtered_df. For that, one can use pandas.Series.isin as follows:
final_df = df[df['a'].isin(filtered_df['a'])]
[Out]:
a b
0 A True
1 A True
5 C True
Notes:
- There are strong opinions on using .apply(). I would recommend reading: When should I (not) want to use pandas apply() in my code?
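As a quick sanity check, the sketch below rebuilds filtered_df from the question's data and confirms that the two options select the same rows. The names `via_apply` and `via_isin` are illustrative only, and filtered_df is built with GroupBy.all rather than the question's lambda, which gives the same values here.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [True, True, True, False, False, True],
})

# Rebuild the aggregated-and-filtered frame from the question.
agg_df = df.groupby('a')['b'].all().reset_index()
filtered_df = agg_df[agg_df['b']]

# Option 1: row-wise membership test through apply (a Python-level loop).
via_apply = df[df.apply(lambda row: row['a'] in filtered_df['a'].values, axis=1)]

# Option 2: vectorised membership test through isin.
via_isin = df[df['a'].isin(filtered_df['a'])]

print(via_apply.equals(via_isin))  # True
```

Both give rows 0, 1 and 5; isin avoids calling a Python function once per row, which is the usual argument against apply with axis=1.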