Faster way to rank rows in subgroups in pandas dataframe
Question:
I have a pandas data frame that has is composed of different subgroups.
df = pd.DataFrame({
'id':[1, 2, 3, 4, 5, 6, 7, 8],
'group':['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
'value':[.01, .4, .2, .3, .11, .21, .4, .01]
})
I want to find the rank of each id in its group with say, lower values being better. In the example above, in group A, Id 1 would have a rank of 1, Id 2 would have a rank of 4. In group B, Id 5 would have a rank of 2, Id 8 would have a rank of 1 and so on.
Right now I assess the ranks by:
- Sorting by value.
df.sort('value', ascending = True, inplace=True)
- Create a ranker function (it assumes variables already sorted)
def ranker(df):
df['rank'] = np.arange(len(df)) + 1
return df
- Apply the ranker function on each group separately:
df = df.groupby(['group']).apply(ranker)
This process works but it is really slow when I run it on millions of rows of data. Does anyone have any ideas on how to make a faster ranker function.
Answers:
rank is cythonized so should be very fast. And you can pass the same options as df.rank()
here are the docs for rank
. As you can see, tie-breaks can be done in one of five different ways via the method
argument.
Its also possible you simply want the .cumcount()
of the group.
In [12]: df.groupby('group')['value'].rank(ascending=False)
Out[12]:
0 4
1 1
2 3
3 2
4 3
5 2
6 1
7 4
dtype: float64
Working with a big DataFrame (13 million lines), the method rank with groupby maxed out my 8GB of RAM an it took a really long time. I found a workaround less greedy in memory , that I put here just in case:
df.sort_values('value')
tmp = df.groupby('group').size()
rank = tmp.map(range)
rank =[item for sublist in rank for item in sublist]
df['rank'] = rank
I have a pandas data frame that has is composed of different subgroups.
df = pd.DataFrame({
'id':[1, 2, 3, 4, 5, 6, 7, 8],
'group':['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
'value':[.01, .4, .2, .3, .11, .21, .4, .01]
})
I want to find the rank of each id in its group with say, lower values being better. In the example above, in group A, Id 1 would have a rank of 1, Id 2 would have a rank of 4. In group B, Id 5 would have a rank of 2, Id 8 would have a rank of 1 and so on.
Right now I assess the ranks by:
- Sorting by value.
df.sort('value', ascending = True, inplace=True)
- Create a ranker function (it assumes variables already sorted)
def ranker(df):
df['rank'] = np.arange(len(df)) + 1
return df
- Apply the ranker function on each group separately:
df = df.groupby(['group']).apply(ranker)
This process works but it is really slow when I run it on millions of rows of data. Does anyone have any ideas on how to make a faster ranker function.
rank is cythonized so should be very fast. And you can pass the same options as df.rank()
here are the docs for rank
. As you can see, tie-breaks can be done in one of five different ways via the method
argument.
Its also possible you simply want the .cumcount()
of the group.
In [12]: df.groupby('group')['value'].rank(ascending=False)
Out[12]:
0 4
1 1
2 3
3 2
4 3
5 2
6 1
7 4
dtype: float64
Working with a big DataFrame (13 million lines), the method rank with groupby maxed out my 8GB of RAM an it took a really long time. I found a workaround less greedy in memory , that I put here just in case:
df.sort_values('value')
tmp = df.groupby('group').size()
rank = tmp.map(range)
rank =[item for sublist in rank for item in sublist]
df['rank'] = rank