[Pandas]how to get top-n% records within each group

Question:

This is my dataFrame

df = pd.DataFrame([['@1','A',40],['@2','A',60],['@3','A',47],['@4','B',33],['@5','B',69],['@6','B',22],['@7','B',90],['@8
','C',31],['@9','C',78],['@10','C',12],['@11','C',89],['@12','C',88],['@13','C',99]],columns=['id','channel','score'])

     id channel  score
0    @1       A     40
1    @2       A     60
2    @3       A     47
3    @4       B     33
4    @5       B     69
5    @6       B     22
6    @7       B     90
7    @8       C     31
8    @9       C     78
9   @10       C     12
10  @11       C     89
11  @12       C     88
12  @13       C     99

Each channel has its own total number , I set a percent number = 80%

and I want to take int(channel’num * 0.8) nlargest , so it’s will be

A channel take int(3*0.8) = 2
B channel take int(4*0.8) = 3
C channel take int(6*0.8) = 4

     id channel  score
1    @2       A     60
2    @3       A     47
3    @4       B     33
4    @5       B     69
6    @7       B     90
8    @9       C     78
10  @11       C     89
11  @12       C     88
12  @13       C     99

How can I do that , thanks.

Asked By: Relax ZeroC

||

Answers:

Use groupby with nlargest:

a = 0.8

df1 = (df.groupby('channel',group_keys=False)
        .apply(lambda x: x.nlargest(int(len(x) * a), 'score')))
print (df1)     
     id channel  score
1    @2       A     60
2    @3       A     47
6    @7       B     90
4    @5       B     69
3    @4       B     33
12  @13       C     99
10  @11       C     89
11  @12       C     88
8    @9       C     78

Another solution with sort_values + groupby + head:

df1 = (df.sort_values('score', ascending=False)
        .groupby('channel',group_keys=False)
        .apply(lambda x: x.head(int(len(x) * a)))
        .reset_index(drop=True))

print (df1)
    id channel  score
0   @2       A     60
1   @3       A     47
2   @7       B     90
3   @5       B     69
4   @4       B     33
5  @13       C     99
6  @11       C     89
7  @12       C     88
8   @9       C     78
Answered By: jezrael

There is another solution that does not require the usage of apply:
Enumerate the rows in each group using cumcount and devide that by the group size to get the percentile the row belongs to in the group. Based on this you can create a mask to select the rows you want from the DataFrame:

key = 'channel'

# Group position for each row
group_idx = df.groupby(key).cumcount()
# Group size for each row
group_size = df.groupby(key)[key].transform('size')

mask = (group_idx/group_size) < 0.8
group_top_pct = df[mask]
Answered By: user1587520
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.