How to sort pandas dataframe by custom order on string index
Question:
I have the following data frame:
import pandas as pd
# Create DataFrame
df = pd.DataFrame(
    {'id': [2967, 5335, 13950, 6141, 6169],
     'Player': ['Cedric Hunter', 'Maurice Baker',
                'Ratko Varda', 'Ryan Bowen', 'Adrian Caldwell'],
     'Year': [1991, 2004, 2001, 2009, 1997],
     'Age': [27, 25, 22, 34, 31],
     'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
     'G': [6, 7, 60, 52, 81]})
df.set_index('Player', inplace=True)
It shows:
Out[128]:
Age G Tm Year id
Player
Cedric Hunter 27 6 CHH 1991 2967
Maurice Baker 25 7 VAN 2004 5335
Ratko Varda 22 60 TOT 2001 13950
Ryan Bowen 34 52 OKC 2009 6141
Adrian Caldwell 31 81 DAL 1997 6169
What I want to do is sort the 'Player' index in an arbitrary order given by this list (NOTE: not alphabetical order):
reorderlist = [ 'Maurice Baker', 'Adrian Caldwell','Ratko Varda' ,'Ryan Bowen' ,'Cedric Hunter']
How can I do that?
Answers:
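To get a custom sort order on your list of strings, declare it as an ordered categorical and specify the order explicitly through categories (if categories is omitted, pandas infers them from the values and sorts them alphabetically):
player_order = pd.CategoricalIndex(df.index, categories=reorderlist, ordered=True)
In the pandas version this was written against, passing a plain Categorical straight to set_index failed: df.set_index(keys=pd.Categorical(reorderlist, ordered=True), inplace=True)
TypeError: unhashable type: 'Categorical'
So assign the categorical index directly with df.index = player_order and then call df.sort_index(). Note that the level argument of sort_index expects level names or positions, not a Categorical.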
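A minimal sketch of the categorical-index idea, assuming a reasonably recent pandas where `pd.CategoricalIndex` is available. The trimmed-down frame and the names `df_cat` and `df_sorted` are only for illustration and are not from the question:

```python
import pandas as pd

# Reduced version of the question's frame (only two columns kept).
df = pd.DataFrame(
    {
        "Player": ["Cedric Hunter", "Maurice Baker", "Ratko Varda",
                   "Ryan Bowen", "Adrian Caldwell"],
        "Age": [27, 25, 22, 34, 31],
    }
).set_index("Player")

reorderlist = ["Maurice Baker", "Adrian Caldwell", "Ratko Varda",
               "Ryan Bowen", "Cedric Hunter"]

# Passing categories explicitly fixes the sort order; without it the
# categories would be inferred and sorted alphabetically.
df_cat = df.copy()
df_cat.index = pd.CategoricalIndex(
    df_cat.index, categories=reorderlist, ordered=True, name="Player"
)

# sort_index now follows the category order instead of alphabetical order.
df_sorted = df_cat.sort_index()
print(df_sorted)
```

One thing to keep in mind with this approach: the result keeps a categorical index, so any label that is not listed in `categories` becomes a missing value in the index.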
Just reindex:
df.reindex(reorderlist)
Out[89]:
Age G Tm Year id
Player
Maurice Baker 25 7 VAN 2004 5335
Adrian Caldwell 31 81 DAL 1997 6169
Ratko Varda 22 60 TOT 2001 13950
Ryan Bowen 34 52 OKC 2009 6141
Cedric Hunter 27 6 CHH 1991 2967
Update: if you have multiple players with the same name (reindex cannot handle duplicate index labels), sort by position instead:
out = df.iloc[pd.Categorical(df.index,reorderlist).argsort()]
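To illustrate the duplicate-name case, here is a small sketch. The frame `df_dup` and the list `order` are made up for the example (the second "Maurice Baker" row is not real data). It sorts on the category codes with a stable sort so that duplicate names keep their original relative order, which is a small variation on the one-liner above rather than the exact same call:

```python
import pandas as pd

# Hypothetical frame with a duplicated index label.
df_dup = pd.DataFrame(
    {
        "Player": ["Cedric Hunter", "Maurice Baker", "Ratko Varda", "Maurice Baker"],
        "Year": [1991, 2004, 2001, 2010],
    }
).set_index("Player")

order = ["Maurice Baker", "Ratko Varda", "Cedric Hunter"]

# Each index label is replaced by its position in `order`
# (labels missing from `order` would get code -1 and sort first).
codes = pd.Categorical(df_dup.index, categories=order).codes

# A stable argsort keeps rows with the same name in their original order.
out = df_dup.iloc[codes.argsort(kind="stable")]
print(out)
```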
As of pandas 1.1, DataFrame.sort_values has a key param that takes a callable to control sorting. So you could use an approach like the following:
def sorter(column):
    reorder = [
        "Maurice Baker",
        "Adrian Caldwell",
        "Ratko Varda",
        "Ryan Bowen",
        "Cedric Hunter",
    ]
    # This also works:
    # mapper = {name: order for order, name in enumerate(reorder)}
    # return column.map(mapper)
    cat = pd.Categorical(column, categories=reorder, ordered=True)
    return pd.Series(cat)

df_sorted = df.sort_values(by="Player", key=sorter)
There may be some practical differences between using pd.Categorical and the column.map alternative I put in the comments. For example, see these caveats. I'm showing both for completeness. I also haven't tested how this compares performance-wise to the current accepted solution that uses df.reindex. The best approach might be different when you have a MultiIndex in play too.
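Since `Player` is the index in the question, the same `key` idea can also be applied through `DataFrame.sort_index`, which gained a `key` parameter in pandas 1.1 as well. The following is only a sketch on a reduced frame; the name `rank` is chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Player": ["Cedric Hunter", "Maurice Baker", "Ratko Varda",
                   "Ryan Bowen", "Adrian Caldwell"],
        "Age": [27, 25, 22, 34, 31],
    }
).set_index("Player")

reorderlist = ["Maurice Baker", "Adrian Caldwell", "Ratko Varda",
               "Ryan Bowen", "Cedric Hunter"]

# Map each name to its position in the desired order.
rank = {name: pos for pos, name in enumerate(reorderlist)}

# The key callable receives the index and must return an index-like
# object of the same length; Index.map with a dict does exactly that.
df_sorted = df.sort_index(key=lambda idx: idx.map(rank))
print(df_sorted)
```

Unlike the categorical version, the index of the result stays a plain string index. Labels that are missing from `rank` map to NaN, so it is worth checking where they end up if the list may be incomplete.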
To sort in arbitrary order while not including blank rows, I found df.filter to work while testing out BENY's answer. It sorts as desired and, unlike df.reindex, does not add empty (NaN) rows for keys in the list that have no data in the frame; such keys are simply ignored.
df.filter(reorderlist, axis=0)
id Year Age Tm G
Player
Maurice Baker 5335 2004 25 VAN 7
Adrian Caldwell 6169 1997 31 DAL 81
Ratko Varda 13950 2001 22 TOT 60
Ryan Bowen 6141 2009 34 OKC 52
Cedric Hunter 2967 1991 27 CHH 6
# Keys not in the index don't add empty rows; rows whose key was removed from the list are dropped
reorderlist.append('LeBron James')
reorderlist.remove('Adrian Caldwell')
df.filter(reorderlist, axis=0)
id Year Age Tm G
Player
Maurice Baker 5335 2004 25 VAN 7
Ratko Varda 13950 2001 22 TOT 60
Ryan Bowen 6141 2009 34 OKC 52
Cedric Hunter 2967 1991 27 CHH 6
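For comparison, this sketch puts `reindex` and `filter` side by side on a reduced frame, using a key list that contains one name with no data. The variable names `keys`, `with_reindex` and `with_filter` are only for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Player": ["Cedric Hunter", "Maurice Baker", "Ratko Varda"],
        "Age": [27, 25, 22],
    }
).set_index("Player")

# 'LeBron James' is not in the index.
keys = ["Maurice Baker", "LeBron James", "Cedric Hunter"]

# reindex keeps every requested key and fills the unknown one with NaN.
with_reindex = df.reindex(keys)

# filter keeps only the requested keys that exist, in the requested order.
with_filter = df.filter(items=keys, axis=0)

print(with_reindex)
print(with_filter)
```

Which behaviour is preferable depends on whether a missing player should be visible as an empty row or silently skipped.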
If more than one column needs to be sorted, in my experience it works well to use map to convert the string values to numbers and then use sort_values:
# Step 1/3: create a dictionary that maps each string to its sort position
convert_dict = {'Maurice Baker': 1,
                'Adrian Caldwell': 2,
                'Ratko Varda': 3}  # Continue filling in the remaining names
# Step 2/3: create a helper column `new` by mapping `Player`
# (this assumes `Player` is a regular column; call df.reset_index() first if it is the index)
df['new'] = df['Player'].map(convert_dict)
# Step 3/3: sort by the helper column, then drop it
df.sort_values(by=['new'], ignore_index=True, inplace=True)
df.drop(columns=['new'], inplace=True)
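To show what the multi-column case might look like, here is a sketch that sorts by a custom team order first and by age second. The frame `df_cols`, the mapping `team_rank` and the helper column `_rank` are all made up for the example:

```python
import pandas as pd

# Toy data (not the question's frame) with repeated team codes.
df_cols = pd.DataFrame(
    {
        "Tm": ["DAL", "VAN", "DAL", "CHH", "VAN"],
        "Age": [31, 25, 22, 27, 34],
    }
)

# Hypothetical custom order for the teams.
team_rank = {"VAN": 0, "CHH": 1, "DAL": 2}

out = (
    df_cols.assign(_rank=df_cols["Tm"].map(team_rank))   # helper column
    .sort_values(by=["_rank", "Age"], ignore_index=True)  # custom order, then Age
    .drop(columns="_rank")                                # remove the helper
)
print(out)
```

Using `assign` and method chaining avoids modifying the original frame in place, which is a stylistic choice rather than a requirement.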