Sort DataFrame by occurrence in one column, while preserving order in other columns
Question:
I would like to sort a DataFrame in a similar fashion to this SO question:
Sorting entire csv by frequency of occurence in one column
However, the counts are not guaranteed to be unique, and when two values share the same count their rows end up interleaved (I’m using the method suggested by EdChum in the above question).
Given the following DataFrame:
cluster_id,distance,url
1,0.15,aaa.com
1,0.25,bbb.com
2,0.05,ccc.com
2,0.10,ccc.com
7,0.1,abc.com
7,0.2,def.com
7,0.3,xyz.com
After sorting, I would like it to be:
cluster_id,distance,url
7,0.1,abc.com
7,0.2,def.com
7,0.3,xyz.com
1,0.15,aaa.com
1,0.25,bbb.com
2,0.05,ccc.com
2,0.10,ccc.com
Note that the cluster_id and distance columns are still in order after sorting by the number of occurrences of cluster_id.
Answers:
We can sort by cluster_id and a new column 'G' that holds the number of rows in each cluster:
df.assign(G=df.groupby('cluster_id').cluster_id.transform('count')).sort_values(['G','cluster_id'],ascending=[False,True]).drop(columns='G')
Out[248]:
   cluster_id  distance      url
4           7      0.10  abc.com
5           7      0.20  def.com
6           7      0.30  xyz.com
0           1      0.15  aaa.com
1           1      0.25  bbb.com
2           2      0.05  ccc.com
3           2      0.10  ccc.com
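For anyone who wants to reuse this, here is a minimal sketch of the same idea wrapped in a function. The name sort_by_frequency and the temporary columns _count and _pos are my own and purely illustrative, not part of pandas. The extra _pos key is an assumption on my part: it records the original row position so that the order inside each group is preserved explicitly rather than depending on how the sort breaks ties.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "cluster_id": [1, 1, 2, 2, 7, 7, 7],
    "distance": [0.15, 0.25, 0.05, 0.10, 0.1, 0.2, 0.3],
    "url": ["aaa.com", "bbb.com", "ccc.com", "ccc.com",
            "abc.com", "def.com", "xyz.com"],
})


def sort_by_frequency(frame, column):
    """Hypothetical helper: most frequent values of `column` first.

    Ties in frequency are broken by the value of `column` (ascending),
    and rows that share a value keep their original relative order.
    """
    # Number of rows in each group, broadcast back onto every row.
    counts = frame.groupby(column)[column].transform("count")
    return (
        frame.assign(_count=counts, _pos=list(range(len(frame))))
        # _count descending, then the key itself, then original position.
        .sort_values(["_count", column, "_pos"],
                     ascending=[False, True, True])
        .drop(columns=["_count", "_pos"])
    )


result = sort_by_frequency(df, "cluster_id")
print(result)
```

Sorting on the key column as the second criterion is what keeps clusters with equal counts (1 and 2 here) from being interleaved.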
Given another DataFrame g:
  pno  dn
0   A  AA
1   B  BB
2   A  AA
To sort by count in ascending order instead:
g.assign(G=g.groupby('dn').dn.transform('count')).sort_values(['G','dn'],ascending=[True,False]).drop(columns='G')
  pno  dn
1   B  BB
0   A  AA
2   A  AA