Apply a groupby only in rows where a condition is met
Question:
How do you apply a groupby in a dataframe only when a condition is met in a row and still keep all other rows that do not meet said condition?
For instance, let’s say I have a dataframe:
Number | Version | Binary | Random column |
---|---|---|---|
1 | 0 | True | a |
1 | 1 | True | b |
1 | 2 | True | c |
2 | 0 | False | a |
2 | 1 | False | b |
With this dataframe, I would like to group by Number for the rows whose Binary value is True and keep only the row with the latest version (that is, the largest Version), while keeping every row that does not meet the condition unchanged.
The result should look like this:
Number | Version | Binary | Random column |
---|---|---|---|
1 | 2 | True | c |
2 | 0 | False | a |
2 | 1 | False | b |
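To make the answers below copy-pasteable, here is a minimal sketch of how the example frame could be built. The variable names `df` and `expected` are my own choice (the answers assume a frame called `df`); the values are copied from the two tables in the question.

```python
import pandas as pd

# Input frame, values copied from the question's first table.
df = pd.DataFrame({
    "Number": [1, 1, 1, 2, 2],
    "Version": [0, 1, 2, 0, 1],
    "Binary": [True, True, True, False, False],
    "Random column": ["a", "b", "c", "a", "b"],
})

# Desired result, written out by hand from the second table
# (hypothetical helper for checking an answer, not part of the question).
expected = pd.DataFrame({
    "Number": [1, 2, 2],
    "Version": [2, 0, 1],
    "Binary": [True, False, False],
    "Random column": ["c", "a", "b"],
})
```

Any answer can then be checked with something like `result.reset_index(drop=True).equals(expected)`.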
Answers:
Here is a way using pd.concat() and groupby().idxmax():
# latest True row per Number, followed by every False row untouched
(pd.concat([df.loc[df.loc[df['Binary']].groupby('Number')['Version'].idxmax()],
            df.loc[~df['Binary']]],
           ignore_index=True))
Output:
   Number  Version  Binary Random column
0       1        2    True             c
1       2        0   False             a
2       2        1   False             b
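The same idea broken into named steps, in case the one-liner is hard to read. This is only a sketch: the frame is rebuilt from the question, the intermediate names (`true_rows`, `latest_idx`, `untouched`) are made up for illustration, and `sort_index()` is swapped in for `ignore_index=True` on the assumption that you want to keep the original row labels and order.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "Number": [1, 1, 1, 2, 2],
    "Version": [0, 1, 2, 0, 1],
    "Binary": [True, True, True, False, False],
    "Random column": ["a", "b", "c", "a", "b"],
})

# 1. Restrict to the rows that meet the condition.
true_rows = df[df["Binary"]]

# 2. For each Number, find the index label of the row with the highest Version.
latest_idx = true_rows.groupby("Number")["Version"].idxmax()

# 3. Rows that do not meet the condition are kept as they are.
untouched = df[~df["Binary"]]

# 4. Stitch both parts together and restore the original row order.
result = pd.concat([df.loc[latest_idx], untouched]).sort_index()
print(result)  # rows with index labels 2, 3 and 4 remain
```

Note that `idxmax()` returns one label per group, so ties on the highest Version keep only the first matching row.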
I would use boolean indexing here:
# is the row the max version of its group?
m1 = df['Version'].eq(df.groupby('Binary')['Version'].transform('max'))
# is the binary not True?
m2 = ~df['Binary']
# keep if either condition above is met
df[m1|m2]
Output:
   Number  Version  Binary Random column
2       1        2    True             c
3       2        0   False             a
4       2        1   False             b
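One caveat worth checking on your own data: the mask above groups by Binary, so all True rows form a single group. That matches the sample, where only Number 1 is True, but with several True Numbers it would keep just the overall highest Version. A hedged sketch of a per-Number variant follows; the two extra rows for Number 3 are hypothetical and only there to show the difference, and grouping by both Number and Binary is my assumption about what "latest version per Number" should mean when a Number mixes True and False rows.

```python
import pandas as pd

# Question's frame plus a hypothetical second True group (Number 3).
df = pd.DataFrame({
    "Number": [1, 1, 1, 2, 2, 3, 3],
    "Version": [0, 1, 2, 0, 1, 0, 1],
    "Binary": [True, True, True, False, False, True, True],
    "Random column": ["a", "b", "c", "a", "b", "d", "e"],
})

# Highest Version within each (Number, Binary) pair, broadcast to every row.
group_max = df.groupby(["Number", "Binary"])["Version"].transform("max")
is_latest = df["Version"].eq(group_max)

# Rows that do not meet the condition are always kept.
keep_anyway = ~df["Binary"]

result = df[is_latest | keep_anyway]

# For comparison: grouping by Binary alone drops Number 3 entirely.
by_binary = df[df["Version"].eq(df.groupby("Binary")["Version"].transform("max"))
               | keep_anyway]
```

Here `result` keeps the latest row of Number 1 and of Number 3 plus both False rows, while `by_binary` loses Number 3 because its highest Version (1) is below the overall True maximum (2).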
Alternative:
If the per-group operation is expensive, you can also run it only on the groups that match the condition and return the other groups unchanged:
# g.name is the group key, i.e. the Binary value of the group
(df.groupby('Binary', group_keys=False, sort=False)
   .apply(lambda g: g.loc[g['Version'].eq(g['Version'].max())]
                    if g.name else g)
)
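If you would rather not rely on `apply` (as far as I know, recent pandas releases changed how `groupby.apply` treats the grouping column, starting with a deprecation warning in 2.2), the same "only touch matching groups" idea can be sketched with a plain loop over the groups. The helper name `keep_latest` is hypothetical and stands in for whatever expensive operation you have.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "Number": [1, 1, 1, 2, 2],
    "Version": [0, 1, 2, 0, 1],
    "Binary": [True, True, True, False, False],
    "Random column": ["a", "b", "c", "a", "b"],
})

def keep_latest(group):
    """Stand-in for the expensive step: keep rows with the highest Version."""
    return group[group["Version"].eq(group["Version"].max())]

# Iterating over a groupby yields (key, sub-frame) pairs; every sub-frame
# still carries all columns, including the grouping column.
pieces = [
    keep_latest(group) if flag else group
    for flag, group in df.groupby("Binary", sort=False)
]

result = pd.concat(pieces)
print(result)  # rows with index labels 2, 3 and 4 remain
```

Like the `apply` version, this treats all True rows as one group, so it only matches the question's intent when a single Number is True.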