[Pandas]how to get top-n% records within each group

Question

This is my dataFrame

df = pd.DataFrame([['@1','A',40],['@2','A',60],['@3','A',47],['@4','B',33],['@5','B',69],['@6','B',22],['@7','B',90],['@8
','C',31],['@9','C',78],['@10','C',12],['@11','C',89],['@12','C',88],['@13','C',99]],columns=['id','channel','score'])

     id channel  score
0    @1       A     40
1    @2       A     60
2    @3       A     47
3    @4       B     33
4    @5       B     69
5    @6       B     22
6    @7       B     90
7    @8       C     31
8    @9       C     78
9   @10       C     12
10  @11       C     89
11  @12       C     88
12  @13       C     99

Each channel has its own total number , I set a percent number = 80%

and I want to take int(channel’num * 0.8) nlargest , so it’s will be

A channel take int(3*0.8) = 2
B channel take int(4*0.8) = 3
C channel take int(6*0.8) = 4

     id channel  score
1    @2       A     60
2    @3       A     47
3    @4       B     33
4    @5       B     69
6    @7       B     90
8    @9       C     78
10  @11       C     89
11  @12       C     88
12  @13       C     99

How can I do that , thanks.

Asked By: Relax ZeroC

||

Source

Answer 1

Use groupby with nlargest:

a = 0.8

df1 = (df.groupby('channel',group_keys=False)
        .apply(lambda x: x.nlargest(int(len(x) * a), 'score')))
print (df1)     
     id channel  score
1    @2       A     60
2    @3       A     47
6    @7       B     90
4    @5       B     69
3    @4       B     33
12  @13       C     99
10  @11       C     89
11  @12       C     88
8    @9       C     78

Another solution with sort_values + groupby + head:

df1 = (df.sort_values('score', ascending=False)
        .groupby('channel',group_keys=False)
        .apply(lambda x: x.head(int(len(x) * a)))
        .reset_index(drop=True))

print (df1)
    id channel  score
0   @2       A     60
1   @3       A     47
2   @7       B     90
3   @5       B     69
4   @4       B     33
5  @13       C     99
6  @11       C     89
7  @12       C     88
8   @9       C     78

Answered By: jezrael

Answer 2

There is another solution that does not require the usage of apply:
Enumerate the rows in each group using cumcount and devide that by the group size to get the percentile the row belongs to in the group. Based on this you can create a mask to select the rows you want from the DataFrame:

key = 'channel'

# Group position for each row
group_idx = df.groupby(key).cumcount()
# Group size for each row
group_size = df.groupby(key)[key].transform('size')

mask = (group_idx/group_size) < 0.8
group_top_pct = df[mask]

Answered By: user1587520

[Pandas]how to get top-n% records within each group

Question:

Answers: