Ranking order per group in Pandas
Question:
Consider a dataframe with three columns: group_ID, item_ID, and value. Say we have 10 item_IDs total.
I need to rank each item_ID (1 to 10) within each group_ID based on value, and then see the mean rank (and other stats) across groups (e.g. the item_IDs with the highest value across groups would get ranks closer to 1). How can I do this in Pandas?
This answer does something very close with qcut, but not exactly the same.
A data example would look like:
group_ID item_ID value
0 0S00A1HZEy AB 10
1 0S00A1HZEy AY 4
2 0S00A1HZEy AC 35
3 0S03jpFRaC AY 90
4 0S03jpFRaC A5 3
5 0S03jpFRaC A3 10
6 0S03jpFRaC A2 8
7 0S03jpFRaC A4 9
8 0S03jpFRaC A6 2
9 0S03jpFRaC AX 0
which would result in:
group_ID item_ID rank
0 0S00A1HZEy AB 2
1 0S00A1HZEy AY 3
2 0S00A1HZEy AC 1
3 0S03jpFRaC AY 1
4 0S03jpFRaC A5 5
5 0S03jpFRaC A3 2
6 0S03jpFRaC A2 4
7 0S03jpFRaC A4 3
8 0S03jpFRaC A6 6
9 0S03jpFRaC AX 7
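For reference, the example frame can be built directly from the table above (a minimal sketch; values copied verbatim from the question):
import pandas as pd

df = pd.DataFrame({
    'group_ID': ['0S00A1HZEy'] * 3 + ['0S03jpFRaC'] * 7,
    'item_ID': ['AB', 'AY', 'AC', 'AY', 'A5', 'A3', 'A2', 'A4', 'A6', 'AX'],
    'value': [10, 4, 35, 90, 3, 10, 8, 9, 2, 0],
})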
Answers:
There are lots of different arguments you can pass to rank; it looks like you can use rank(method="dense", ascending=False) to get the results you want, after doing a groupby:
>>> df["rank"] = df.groupby("group_ID")["value"].rank(method="dense", ascending=False)
>>> df
group_ID item_ID value rank
0 0S00A1HZEy AB 10 2
1 0S00A1HZEy AY 4 3
2 0S00A1HZEy AC 35 1
3 0S03jpFRaC AY 90 1
4 0S03jpFRaC A5 3 5
5 0S03jpFRaC A3 10 2
6 0S03jpFRaC A2 8 4
7 0S03jpFRaC A4 9 3
8 0S03jpFRaC A6 2 6
9 0S03jpFRaC AX 0 7
But note that if you're not using a global ranking scheme, the mean rank across groups isn't very meaningful: unless a group contains duplicate values (and hence duplicate rank values), all you're doing is measuring how many elements each group has.
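That said, to get the per-item statistics the OP asked for, a minimal sketch (assuming the rank column computed above):
stats = df.groupby('item_ID')['rank'].agg(['mean', 'std', 'count'])  # mean rank per item across groups
print(stats.sort_values('mean'))  # items whose ranks are closest to 1 come first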
Another option is to sort on value and cumulatively count the position of the values within each group.
df['rank'] = df.sort_values(by=['group_ID', 'value']).groupby('group_ID').cumcount(ascending=False) + 1
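Note that cumcount returns a Series that keeps the original (pre-sort) index, so the assignment above aligns back to df's row order. The same computation written out in steps (the intermediate names are illustrative):
sorted_df = df.sort_values(by=['group_ID', 'value'])                 # ascending within each group
ranks = sorted_df.groupby('group_ID').cumcount(ascending=False) + 1  # largest value gets rank 1
df['rank'] = ranks  # index-aligned assignment; df's original row order is preserved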
If you want to ordinally rank values in each group, then you can transform with pd.qcut. This is especially useful if the groups have the same size, the ranks are meaningful across groups, or there are a lot of duplicates in each group.
q = 10 # how many buckets to put the values in
df['rank'] = df.groupby('group_ID')['value'].transform(pd.qcut, q=q, labels=False, duplicates='drop')
# for descending order (smaller numbers have higher rank)
df['rank'] = q - df.groupby('group_ID')['value'].transform(pd.qcut, q=q, labels=False, duplicates='drop')
For the data in the OP, this produces the same ordinal ranking as groupby.rank above.
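To sanity-check the three approaches side by side on the OP's data, a minimal sketch (assuming the df built earlier; the rank_* column names are illustrative):
q = 10
df['rank_dense'] = df.groupby('group_ID')['value'].rank(method='dense', ascending=False)
df['rank_cumcount'] = df.sort_values(['group_ID', 'value']).groupby('group_ID').cumcount(ascending=False) + 1
df['rank_qcut'] = q - df.groupby('group_ID')['value'].transform(pd.qcut, q=q, labels=False, duplicates='drop')
print(df)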