Ranking using multiple columns within groups allowing for tied ranks in Pandas

Question:

Intro and problem

How can I rank observations within groups where the ranking is based on more than just one column and where the ranking allows for tied ranks?

I know how to calculate aggregated group-level statistics using the groupby() method and I also know how to rank using multiple columns without groups (see here, here and here). The main problem seems to be getting both ideas (grouping & ranking) to play nicely together.

This other thread has some ideas on how to solve the problem, but its results don’t show you which rows are tied – it just returns an array of ever-increasing ranks even when the values are identical. The problem is described in more detail in the example I created below.

Minimal reproducible example

import pandas as pd

df = pd.DataFrame({'row_id':[1,2,3,4,5,6,7,8,9,10],
                   'Group':[1,1,1,1,1,2,2,2,2,2],
                   'Var1':[100,100,100,200,200,300,300,400,400,400],
                   'Var2':[5,5,6,7,8,1,1,2,2,3]})
print(df)

#    row_id  Group  Var1  Var2
# 0       1      1   100     5
# 1       2      1   100     5
# 2       3      1   100     6
# 3       4      1   200     7
# 4       5      1   200     8
# 5       6      2   300     1
# 6       7      2   300     1
# 7       8      2   400     2
# 8       9      2   400     2
# 9      10      2   400     3

In the case above, I would like to group using the Group variable and rank using the Var1 and Var2 variables. Therefore, I expect the output to look like this:

#    row_id  Group  Var1  Var2  Rank
# 0       1      1   100     5     1
# 1       2      1   100     5     1
# 2       3      1   100     6     3
# 3       4      1   200     7     4
# 4       5      1   200     8     5
# 5       6      2   300     1     1
# 6       7      2   300     1     1
# 7       8      2   400     2     3
# 8       9      2   400     2     3
# 9      10      2   400     3     5

What I’ve tried

Using the data in the example above, if I would like to group using the Group variable and only rank based on the Var1 column, that would be pretty easy:

df['Rank_Only_Var1'] = df.groupby('Group')['Var1'].rank(method='min', ascending=True)

print(df)

#    row_id  Group  Var1  Var2  Rank_Only_Var1
# 0       1      1   100     5             1.0
# 1       2      1   100     5             1.0
# 2       3      1   100     6             1.0
# 3       4      1   200     7             4.0
# 4       5      1   200     8             4.0
# 5       6      2   300     1             1.0
# 6       7      2   300     1             1.0
# 7       8      2   400     2             3.0
# 8       9      2   400     2             3.0
# 9      10      2   400     3             3.0

However, if I want to group using the Group variable and rank using the Var1 and Var2 variables, things get hairy. Using the approach suggested by this other post, we arrive at the following results:

df = df.sort_values(['Var1', 'Var1'], ascending=[True, True])
df['overall_rank'] = 1
df['overall_rank'] = df.groupby(['Group'])['overall_rank'].cumsum()

print(df)

#    row_id  Group  Var1  Var2  overall_rank
# 0       1      1   100     5             1
# 1       2      1   100     5             2
# 2       3      1   100     6             3
# 3       4      1   200     7             4
# 4       5      1   200     8             5
# 5       6      2   300     1             1
# 6       7      2   300     1             2
# 7       8      2   400     2             3
# 8       9      2   400     2             4
# 9      10      2   400     3             5

Note how the first and second rows have identical values for Var1 and Var2, but the first row is ranked 1 and the second row is ranked 2. Those two rows shouldn’t have different ranks. Their ranks should be identical and tied, because the values the rank is based on are identical and tied. This problem also happens with rows 6 & 7 as well as with rows 8 & 9.

I even tried adapting the solution from this answer, but it doesn’t work when we have a groupby statement.

Back to the heart of the question

How can I rank observations within groups where the ranking is based on more than just one column and where the ranking allows for tied ranks?

Asked By: Felipe D.

||

Answers:

It’s not clear why you can’t use the linked solution with a .groupby

You can also replace the .apply for better performance:

(df.assign(key = list(map(tuple, df[['Var1', 'Var2']].values)))
   .groupby('Group')['key'].rank('dense'))
0    1.0
1    1.0
2    2.0
3    3.0
4    4.0
5    1.0
6    1.0
7    2.0
8    2.0
9    3.0
Name: key, dtype: float64
Answered By: jqurious