Pandas – how to drop rows that are in the top n% of a certain column's values?

Question:

I have a dataframe of two columns:

userID | count
A      | 15
B      | 12

with about a million rows.
I would like to filter out the userIDs whose count values are in the top n%, as I suspect they are bot activity.

I tried sorting by count, but I could only come up with a way to filter the top n rows, not the top n% of rows.

What pandas trick can I use to filter based on a percentage?

Asked By: Daniel Kim


Answers:

Assuming this input and that you want to drop the top 50% (0.5):

  userID  count
0      A     15
1      B     12
2      C      5
3      D     25
4      E     22
5      F      3
6      G      7
7      H      9
8      I     11
9      J      7

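You can sort_values by descending count, compute each row's share of the total count and its cumulative sum (cumsum), then use boolean indexing to drop the rows up to and including the one where the cumulative share first exceeds the target: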

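target = 0.5  # share of the total count to drop
(df.sort_values(by='count', ascending=False)
   # cumulative share of the total count, largest values first
   .assign(pct=lambda d: d['count'].div(d['count'].sum()).cumsum())
   # keep a row only if the previous row had already passed the target
   .loc[lambda d: d['pct'].gt(target).shift(fill_value=False)]
   .drop(columns='pct')
)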

Output:

  userID  count
1      B     12
8      I     11
7      H      9
6      G      7
9      J      7
2      C      5
5      F      3

Intermediate (pct is the cumulative share of the total count; keep is the boolean mask used for indexing):

  userID  count       pct   keep
3      D     25  0.215517  False
4      E     22  0.405172  False
0      A     15  0.534483  False
1      B     12  0.637931   True
8      I     11  0.732759   True
7      H      9  0.810345   True
6      G      7  0.870690   True
9      J      7  0.931034   True
2      C      5  0.974138   True
5      F      3  1.000000   True
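
For anyone who wants to run the steps end to end, here is a minimal self-contained sketch. It rebuilds the sample input shown above and wraps the logic in a helper called drop_top_share, which is a hypothetical name chosen for illustration and not a pandas function. It assumes the "top n%" is measured as a share of the total count, as in the intermediate table.

```python
import pandas as pd

# Sample data copied from the input shown above.
df = pd.DataFrame({
    "userID": list("ABCDEFGHIJ"),
    "count": [15, 12, 5, 25, 22, 3, 7, 9, 11, 7],
})


def drop_top_share(frame, target, col="count"):
    """Drop the largest rows until their cumulative share of `col` exceeds `target`.

    Hypothetical helper for illustration; not part of pandas.
    """
    ordered = frame.sort_values(by=col, ascending=False)
    # Cumulative share of the column total, largest values first.
    pct = ordered[col].div(ordered[col].sum()).cumsum()
    # A row is kept only if the running share had already passed the
    # target on the previous row, so the row that crosses it is dropped.
    keep = pct.gt(target).shift(fill_value=False)
    return ordered[keep]


result = drop_top_share(df, target=0.5)
print(result)
```

With target=0.5 this should drop D, E and A and keep the remaining seven users. The relative order of rows with equal counts (G and J here) may depend on the sort algorithm used.
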
Answered By: mozway
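
Note: if "top n%" is instead read as the n% of rows with the highest count, rather than the users that make up n% of the total count, a quantile threshold is one possible alternative. The sketch below is a different technique from the answer above; it reuses the same sample data, and the value n = 0.2 is only an illustrative assumption.

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    "userID": list("ABCDEFGHIJ"),
    "count": [15, 12, 5, 25, 22, 3, 7, 9, 11, 7],
})

n = 0.2  # fraction of rows to treat as "top"; illustrative value

# Value below which roughly (1 - n) of the counts fall.
threshold = df["count"].quantile(1 - n)
# Keep only the rows at or below that threshold.
filtered = df[df["count"] <= threshold]
print(filtered)
```

On the sample data this removes the two users with the highest counts (D and E). When many rows tie at the threshold value, the number of rows removed can differ from exactly n% of the rows.
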