Pandas – how to drop rows that are in the top n% by a certain column's value?
Question:
I have a DataFrame with two columns and about a million rows:
userID | count
A | 15
B | 12
I would like to filter out the userIDs whose count values are in the top n%, as I suspect they are bot activity.
I tried sorting by count, but I can only figure out how to filter the top n rows, not the top n% of rows.
What pandas trick can I use to filter based on a percentage?
Answers:
Assuming this input, and that you want to drop the users that make up the top 50% (0.5) of the total count:
  userID  count
0      A     15
1      B     12
2      C      5
3      D     25
4      E     22
5      F      3
6      G      7
7      H      9
8      I     11
9      J      7
You can use sort_values to sort by descending count, then compute each row's share of the total count and its cumsum, and finally use boolean indexing to filter out the top values:
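target = 0.5  # cumulative share of the total count to drop
(df.sort_values(by='count', ascending=False)
   # cumulative share of the total count, largest counts first
   .assign(pct=lambda d: d['count'].div(d['count'].sum()).cumsum())
   # keep a row only once the rows above it already exceed the target
   .loc[lambda d: d['pct'].gt(target).shift(fill_value=False)]
   .drop(columns='pct')  # helper column is not part of the result
)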
Output:
  userID  count
1      B     12
8      I     11
7      H      9
6      G      7
9      J      7
2      C      5
5      F      3
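To reproduce the output above end to end, here is a minimal, self-contained sketch. It rebuilds the sample input and wraps the same sort / cumulative share / boolean mask steps in a function. The name drop_top_share and its parameters are hypothetical (not a pandas API), kind="stable" is added only to make tie order deterministic, and the mask is built with shift(fill_value=False) so that the row crossing the threshold is dropped too.

```python
import pandas as pd

# Sample data copied from the example input above.
df = pd.DataFrame({
    "userID": list("ABCDEFGHIJ"),
    "count": [15, 12, 5, 25, 22, 3, 7, 9, 11, 7],
})


def drop_top_share(frame, target, col="count"):
    """Drop the largest rows until their cumulative share of `col` exceeds `target`.

    Hypothetical helper for illustration, not part of pandas.
    """
    ordered = frame.sort_values(by=col, ascending=False, kind="stable")
    # Cumulative share of the column total, largest values first.
    share = ordered[col].div(ordered[col].sum()).cumsum()
    # Keep a row only if the rows above it already exceed the target,
    # so the row that crosses the threshold is dropped as well.
    keep = share.gt(target).shift(fill_value=False)
    return ordered[keep]


result = drop_top_share(df, 0.5)
print(result)
```

On a frame with about a million rows this stays a single sort plus vectorized operations, so it should scale reasonably, though that is not measured here.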
Intermediate (pct is the cumulative share of the total count, keep is the boolean mask used for filtering):
  userID  count       pct   keep
3      D     25  0.215517  False
4      E     22  0.405172  False
0      A     15  0.534483  False
1      B     12  0.637931   True
8      I     11  0.732759   True
7      H      9  0.810345   True
6      G      7  0.870690   True
9      J      7  0.931034   True
2      C      5  0.974138   True
5      F      3  1.000000   True
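Note that the approach above drops the users who account for the top share of the total count. If "top n%" is instead meant as the top n% of rows ranked by count, a different filter is needed. The sketch below shows two possible ways to do that, under that assumption; the variable names (n, cutoff, k, by_quantile, by_rank) and the value n = 0.2 are illustrative only.

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    "userID": list("ABCDEFGHIJ"),
    "count": [15, 12, 5, 25, 22, 3, 7, 9, 11, 7],
})

n = 0.2  # fraction of rows to drop (assumed value for illustration)

# Option 1: use the (1 - n) quantile of count as a cutoff.
# Rows with a count equal to the cutoff are kept.
cutoff = df["count"].quantile(1 - n)
by_quantile = df[df["count"] <= cutoff]

# Option 2: drop exactly int(len(df) * n) rows with the largest counts.
k = int(len(df) * n)
by_rank = df.drop(df["count"].nlargest(k).index)

print(by_quantile)
print(by_rank)
```

With this sample both options remove users D and E. They can differ when several rows tie at the cutoff: the quantile version keeps or drops tied rows together, while the rank version always removes exactly k rows.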