sorting by a custom list in pandas
Question:
After reading through: http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.sort.html
I still can’t seem to figure out how to sort a column by a custom list. Obviously, the default sort is alphabetical. I’ll give an example. Here is my (very abridged) dataframe:
Player Year Age Tm G
2967 Cedric Hunter 1991 27 CHH 6
5335 Maurice Baker 2004 25 VAN 7
13950 Ratko Varda 2001 22 TOT 60
6141 Ryan Bowen 2009 34 OKC 52
6169 Adrian Caldwell 1997 31 DAL 81
I want to be able to sort by Player, Year and then Tm. The default sort by Player and Year is fine for me, in normal order. However, I do not want Team sorted alphabetically b/c I want TOT always at the top.
Here is the list I created:
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
'WAS', 'WSB']
After reading through the link above, I thought this would work but it didn’t:
df.sort(['Player', 'Year', 'Tm'], ascending = [True, True, sorter])
It still has ATL at the top, meaning that it sorted alphabetically and not according to my custom list. Any help would really be greatly appreciated, I just can’t figure this out.
Answers:
Below is an example that performs lexicographic sort on a dataframe.
The idea is to create an numerical index based on the specific sort.
Then to perform a numerical sort based on the index.
A column is added to the dataframe to do so, and is then removed.
import pandas as pd
# Create DataFrame
df = pd.DataFrame(
{'id':[2967, 5335, 13950, 6141, 6169],
'Player': ['Cedric Hunter', 'Maurice Baker',
'Ratko Varda' ,'Ryan Bowen' ,'Adrian Caldwell'],
'Year': [1991, 2004, 2001, 2009, 1997],
'Age': [27, 25, 22, 34, 31],
'Tm': ['CHH' ,'VAN' ,'TOT' ,'OKC', 'DAL'],
'G': [6, 7, 60, 52, 81]})
# Define the sorter
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL','DEN',
'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
'WAS', 'WSB']
# Create the dictionary that defines the order for sorting
sorterIndex = dict(zip(sorter, range(len(sorter))))
# Generate a rank column that will be used to sort
# the dataframe numerically
df['Tm_Rank'] = df['Tm'].map(sorterIndex)
# Here is the result asked with the lexicographic sort
# Result may be hard to analyze, so a second sorting is
# proposed next
## NOTE:
## Newer versions of pandas use 'sort_values' instead of 'sort'
df.sort_values(['Player', 'Year', 'Tm_Rank'],
ascending = [True, True, True], inplace = True)
df.drop('Tm_Rank', 1, inplace = True)
print(df)
# Here is an example where 'Tm' is sorted first, that will
# give the first row of the DataFrame df to contain TOT as 'Tm'
df['Tm_Rank'] = df['Tm'].map(sorterIndex)
## NOTE:
## Newer versions of pandas use 'sort_values' instead of 'sort'
df.sort_values(['Tm_Rank', 'Player', 'Year'],
ascending = [True , True, True], inplace = True)
df.drop('Tm_Rank', 1, inplace = True)
print(df)
The below answer is an old answer. It still works. Anyhow, another very elegant solution has been posted below , using the key
argument.
I just discovered that with pandas 15.1 it is possible to use categorical series (https://pandas.pydata.org/docs/user_guide/categorical.html)
As for your example, lets define the same data-frame and sorter:
import pandas as pd
data = {
'id': [2967, 5335, 13950, 6141, 6169],
'Player': ['Cedric Hunter', 'Maurice Baker',
'Ratko Varda' ,'Ryan Bowen' ,'Adrian Caldwell'],
'Year': [1991, 2004, 2001, 2009, 1997],
'Age': [27, 25, 22, 34, 31],
'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
'G': [6, 7, 60, 52, 81]
}
# Create DataFrame
df = pd.DataFrame(data)
# Define the sorter
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN', 'WAS', 'WSB']
With the data-frame and sorter, which is a category-order, we can do the following in pandas 15.1:
# Convert Tm-column to category and in set the sorter as categories hierarchy
# Youc could also do both lines in one just appending the cat.set_categories()
df.Tm = df.Tm.astype("category")
df.Tm = f.Tm.cat.set_categories(sorter)
print(df.Tm)
Out[48]:
0 CHH
1 VAN
2 TOT
3 OKC
4 DAL
Name: Tm, dtype: category
Categories (38, object): [TOT < ATL < BOS < BRK ... UTA < VAN < WAS < WSB]
df.sort_values(["Tm"]) ## 'sort' changed to 'sort_values'
Out[49]:
Age G Player Tm Year id
2 22 60 Ratko Varda TOT 2001 13950
0 27 6 Cedric Hunter CHH 1991 2967
4 31 81 Adrian Caldwell DAL 1997 6169
3 34 52 Ryan Bowen OKC 2009 6141
1 25 7 Maurice Baker VAN 2004 5335
My idea is generate sort number by index, then merge sort number into original dataframe
import pandas as pd
df = pd.DataFrame(
{'id':[2967, 5335, 13950, 6141, 6169],
'Player': ['Cedric Hunter', 'Maurice Baker' ,
'Ratko Varda' ,'Ryan Bowen' ,'Adrian Caldwell'],
'Year': [1991 ,2004 ,2001 ,2009 ,1997],
'Age': [27 ,25 ,22 ,34 ,31],
'Tm':['CHH' ,'VAN' ,'TOT' ,'OKC' ,'DAL'],
'G':[6 ,7 ,60 ,52 ,81]})
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
'WAS', 'WSB']
x = pd.DataFrame({'Tm': sorter})
x.index = x.index.set_names('number')
x = x.reset_index()
df = pd.merge(df, x, how='left', on='Tm')
df.sort_values(['Player', 'Year', 'number'],
ascending = [True, True, True], inplace = True)
df.drop('number', 1, inplace = True)
Setting the index then DataFrame.loc
is useful when you need to order by a single custom list. Because loc
will create NaN
rows for values in sorter
that aren’t in the DataFrame we’ll first find the intersection. This prevents any unwanted upcasting. Any rows with values not in the list are removed.
true_sort = [s for s in sorter if s in df.Tm.unique()]
df = df.set_index('Tm').loc[true_sort].reset_index()
Tm id Player Year Age G
0 TOT 13950 Ratko Varda 2001 22 60
1 CHH 2967 Cedric Hunter 1991 27 6
2 DAL 6169 Adrian Caldwell 1997 31 81
3 OKC 6141 Ryan Bowen 2009 34 52
4 VAN 5335 Maurice Baker 2004 25 7
Starting Data:
print(df)
id Player Year Age Tm G
0 2967 Cedric Hunter 1991 27 CHH 6
1 5335 Maurice Baker 2004 25 VAN 7
2 13950 Ratko Varda 2001 22 TOT 60
3 6141 Ryan Bowen 2009 34 OKC 52
4 6169 Adrian Caldwell 1997 31 DAL 81
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN', 'WAS', 'WSB']
According to pandas 1.1.0 documentation, it has become possible to sort with key
parameter like in sorted
function (finally!). Here how we can sort by Tm
import pandas as pd
data = {
'id': [2967, 5335, 13950, 6141, 6169],
'Player': ['Cedric Hunter', 'Maurice Baker',
'Ratko Varda' ,'Ryan Bowen' ,'Adrian Caldwell'],
'Year': [1991, 2004, 2001, 2009, 1997],
'Age': [27, 25, 22, 34, 31],
'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
'G': [6, 7, 60, 52, 81]
}
# Create DataFrame
df = pd.DataFrame(data)
def tm_sorter(column):
"""Sort function"""
teams = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
'WAS', 'WSB']
correspondence = {team: order for order, team in enumerate(teams)}
return column.map(correspondence)
df.sort_values(by='Tm', key=tm_sorter)
Sadly, it looks like we can use this feature only in sorting by 1 column (list with key
s is not acceptable). It can be circumvented by groupby
df.sort_values(['Player', 'Year'])
.groupby(['Player', 'Year'])
.apply(lambda x: x.sort_values(by='Tm', key=tm_sorter))
.reset_index(drop=True)
If you know how to use key
in sort_values
with multiple columns, tell me please
df1 = df.set_index('Tm')
df1.loc[sorter]
as @kstajer commented, after pandas 1.0.0, use reindex instead:
df1.reindex(sorter)
This does the job in just a couple of lines
# Create a dummy df with the required list and the col name to sort on
dummy = pd.Series(sort_list, name = col_name).to_frame()
# Use left merge on the dummy to return a sorted df
sorted_df = pd.merge(dummy, df, on = col_name, how = 'left')
Partial solution for those only interested in sorting by categorical columns:
You can do this with a helper function that creates a sort order mapper from a custom list.
This example only includes values from one column, however it could be extended to include other columns by creating a custom order list that includes values that occur in all columns. Naturally, since you must construct your custom list with all possible values in your sort field, this is good mostly for categorical sorting and would not be suitable for continuous variables (unless the possible values are known up front) and columns with a very high cardinality.
import pandas as pd
# set up a dummy dataframe
df = pd.DataFrame({'a':list('abcde'), 'b':range(5)})
# helper function
def make_sorter(l):
"""
Create a dict from the list to map to 0..len(l)
Returns a mapper to map a series to this custom sort order
"""
sort_order = {k:v for k,v in zip(l, range(len(l)))}
return lambda s: s.map(lambda x: sort_order[x])
# define a custom sort order
my_order = list('bdeca')
df.sort_values('a', key=make_sorter(my_order))
a b
1 b 1
3 d 3
4 e 4
2 c 2
0 a 0
With OP’s data:
df = pd.DataFrame({
'id':[2967, 5335, 13950, 6141, 6169],
'Player': ['Cedric Hunter', 'Maurice Baker',
'Ratko Varda' ,'Ryan Bowen' ,'Adrian Caldwell'],
'Year': [1991, 2004, 2001, 2009, 1997],
'Age': [27, 25, 22, 34, 31],
'Tm': ['CHH' ,'VAN' ,'TOT' ,'OKC', 'DAL'],
'G': [6, 7, 60, 52, 81]
})
# Define the sorter
sorter = [
'TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL',
'DEN', 'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA',
'MIL', 'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL',
'PHI', 'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
'WAS', 'WSB'
]
df.sort_values('Tm', key=make_sorter(sorter))
id Player Year Age Tm G
2 13950 Ratko Varda 2001 22 TOT 60
0 2967 Cedric Hunter 1991 27 CHH 6
4 6169 Adrian Caldwell 1997 31 DAL 81
3 6141 Ryan Bowen 2009 34 OKC 52
1 5335 Maurice Baker 2004 25 VAN 7
Since version 1.1.0 you can use the key
attribute to sort values:
df.sort_values(by="Tm", key=lambda column: column.map(lambda e: sorter.index(e)), inplace=True)
After reading through: http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.sort.html
I still can’t seem to figure out how to sort a column by a custom list. Obviously, the default sort is alphabetical. I’ll give an example. Here is my (very abridged) dataframe:
Player Year Age Tm G
2967 Cedric Hunter 1991 27 CHH 6
5335 Maurice Baker 2004 25 VAN 7
13950 Ratko Varda 2001 22 TOT 60
6141 Ryan Bowen 2009 34 OKC 52
6169 Adrian Caldwell 1997 31 DAL 81
I want to be able to sort by Player, Year and then Tm. The default sort by Player and Year is fine for me, in normal order. However, I do not want Team sorted alphabetically b/c I want TOT always at the top.
Here is the list I created:
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
'WAS', 'WSB']
After reading through the link above, I thought this would work but it didn’t:
df.sort(['Player', 'Year', 'Tm'], ascending = [True, True, sorter])
It still has ATL at the top, meaning that it sorted alphabetically and not according to my custom list. Any help would really be greatly appreciated, I just can’t figure this out.
Below is an example that performs lexicographic sort on a dataframe.
The idea is to create an numerical index based on the specific sort.
Then to perform a numerical sort based on the index.
A column is added to the dataframe to do so, and is then removed.
import pandas as pd
# Create DataFrame
df = pd.DataFrame(
{'id':[2967, 5335, 13950, 6141, 6169],
'Player': ['Cedric Hunter', 'Maurice Baker',
'Ratko Varda' ,'Ryan Bowen' ,'Adrian Caldwell'],
'Year': [1991, 2004, 2001, 2009, 1997],
'Age': [27, 25, 22, 34, 31],
'Tm': ['CHH' ,'VAN' ,'TOT' ,'OKC', 'DAL'],
'G': [6, 7, 60, 52, 81]})
# Define the sorter
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL','DEN',
'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
'WAS', 'WSB']
# Create the dictionary that defines the order for sorting
sorterIndex = dict(zip(sorter, range(len(sorter))))
# Generate a rank column that will be used to sort
# the dataframe numerically
df['Tm_Rank'] = df['Tm'].map(sorterIndex)
# Here is the result asked with the lexicographic sort
# Result may be hard to analyze, so a second sorting is
# proposed next
## NOTE:
## Newer versions of pandas use 'sort_values' instead of 'sort'
df.sort_values(['Player', 'Year', 'Tm_Rank'],
ascending = [True, True, True], inplace = True)
df.drop('Tm_Rank', 1, inplace = True)
print(df)
# Here is an example where 'Tm' is sorted first, that will
# give the first row of the DataFrame df to contain TOT as 'Tm'
df['Tm_Rank'] = df['Tm'].map(sorterIndex)
## NOTE:
## Newer versions of pandas use 'sort_values' instead of 'sort'
df.sort_values(['Tm_Rank', 'Player', 'Year'],
ascending = [True , True, True], inplace = True)
df.drop('Tm_Rank', 1, inplace = True)
print(df)
The below answer is an old answer. It still works. Anyhow, another very elegant solution has been posted below , using the key
argument.
I just discovered that with pandas 15.1 it is possible to use categorical series (https://pandas.pydata.org/docs/user_guide/categorical.html)
As for your example, lets define the same data-frame and sorter:
import pandas as pd
data = {
'id': [2967, 5335, 13950, 6141, 6169],
'Player': ['Cedric Hunter', 'Maurice Baker',
'Ratko Varda' ,'Ryan Bowen' ,'Adrian Caldwell'],
'Year': [1991, 2004, 2001, 2009, 1997],
'Age': [27, 25, 22, 34, 31],
'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
'G': [6, 7, 60, 52, 81]
}
# Create DataFrame
df = pd.DataFrame(data)
# Define the sorter
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN', 'WAS', 'WSB']
With the data-frame and sorter, which is a category-order, we can do the following in pandas 15.1:
# Convert Tm-column to category and in set the sorter as categories hierarchy
# Youc could also do both lines in one just appending the cat.set_categories()
df.Tm = df.Tm.astype("category")
df.Tm = f.Tm.cat.set_categories(sorter)
print(df.Tm)
Out[48]:
0 CHH
1 VAN
2 TOT
3 OKC
4 DAL
Name: Tm, dtype: category
Categories (38, object): [TOT < ATL < BOS < BRK ... UTA < VAN < WAS < WSB]
df.sort_values(["Tm"]) ## 'sort' changed to 'sort_values'
Out[49]:
Age G Player Tm Year id
2 22 60 Ratko Varda TOT 2001 13950
0 27 6 Cedric Hunter CHH 1991 2967
4 31 81 Adrian Caldwell DAL 1997 6169
3 34 52 Ryan Bowen OKC 2009 6141
1 25 7 Maurice Baker VAN 2004 5335
My idea is generate sort number by index, then merge sort number into original dataframe
import pandas as pd
df = pd.DataFrame(
{'id':[2967, 5335, 13950, 6141, 6169],
'Player': ['Cedric Hunter', 'Maurice Baker' ,
'Ratko Varda' ,'Ryan Bowen' ,'Adrian Caldwell'],
'Year': [1991 ,2004 ,2001 ,2009 ,1997],
'Age': [27 ,25 ,22 ,34 ,31],
'Tm':['CHH' ,'VAN' ,'TOT' ,'OKC' ,'DAL'],
'G':[6 ,7 ,60 ,52 ,81]})
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
'WAS', 'WSB']
x = pd.DataFrame({'Tm': sorter})
x.index = x.index.set_names('number')
x = x.reset_index()
df = pd.merge(df, x, how='left', on='Tm')
df.sort_values(['Player', 'Year', 'number'],
ascending = [True, True, True], inplace = True)
df.drop('number', 1, inplace = True)
Setting the index then DataFrame.loc
is useful when you need to order by a single custom list. Because loc
will create NaN
rows for values in sorter
that aren’t in the DataFrame we’ll first find the intersection. This prevents any unwanted upcasting. Any rows with values not in the list are removed.
true_sort = [s for s in sorter if s in df.Tm.unique()]
df = df.set_index('Tm').loc[true_sort].reset_index()
Tm id Player Year Age G
0 TOT 13950 Ratko Varda 2001 22 60
1 CHH 2967 Cedric Hunter 1991 27 6
2 DAL 6169 Adrian Caldwell 1997 31 81
3 OKC 6141 Ryan Bowen 2009 34 52
4 VAN 5335 Maurice Baker 2004 25 7
Starting Data:
print(df)
id Player Year Age Tm G
0 2967 Cedric Hunter 1991 27 CHH 6
1 5335 Maurice Baker 2004 25 VAN 7
2 13950 Ratko Varda 2001 22 TOT 60
3 6141 Ryan Bowen 2009 34 OKC 52
4 6169 Adrian Caldwell 1997 31 DAL 81
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN', 'WAS', 'WSB']
According to pandas 1.1.0 documentation, it has become possible to sort with key
parameter like in sorted
function (finally!). Here how we can sort by Tm
import pandas as pd
data = {
'id': [2967, 5335, 13950, 6141, 6169],
'Player': ['Cedric Hunter', 'Maurice Baker',
'Ratko Varda' ,'Ryan Bowen' ,'Adrian Caldwell'],
'Year': [1991, 2004, 2001, 2009, 1997],
'Age': [27, 25, 22, 34, 31],
'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
'G': [6, 7, 60, 52, 81]
}
# Create DataFrame
df = pd.DataFrame(data)
def tm_sorter(column):
"""Sort function"""
teams = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
'WAS', 'WSB']
correspondence = {team: order for order, team in enumerate(teams)}
return column.map(correspondence)
df.sort_values(by='Tm', key=tm_sorter)
Sadly, it looks like we can use this feature only in sorting by 1 column (list with key
s is not acceptable). It can be circumvented by groupby
df.sort_values(['Player', 'Year'])
.groupby(['Player', 'Year'])
.apply(lambda x: x.sort_values(by='Tm', key=tm_sorter))
.reset_index(drop=True)
If you know how to use key
in sort_values
with multiple columns, tell me please
df1 = df.set_index('Tm')
df1.loc[sorter]
as @kstajer commented, after pandas 1.0.0, use reindex instead:
df1.reindex(sorter)
This does the job in just a couple of lines
# Create a dummy df with the required list and the col name to sort on
dummy = pd.Series(sort_list, name = col_name).to_frame()
# Use left merge on the dummy to return a sorted df
sorted_df = pd.merge(dummy, df, on = col_name, how = 'left')
Partial solution for those only interested in sorting by categorical columns:
You can do this with a helper function that creates a sort order mapper from a custom list.
This example only includes values from one column, however it could be extended to include other columns by creating a custom order list that includes values that occur in all columns. Naturally, since you must construct your custom list with all possible values in your sort field, this is good mostly for categorical sorting and would not be suitable for continuous variables (unless the possible values are known up front) and columns with a very high cardinality.
import pandas as pd
# set up a dummy dataframe
df = pd.DataFrame({'a':list('abcde'), 'b':range(5)})
# helper function
def make_sorter(l):
"""
Create a dict from the list to map to 0..len(l)
Returns a mapper to map a series to this custom sort order
"""
sort_order = {k:v for k,v in zip(l, range(len(l)))}
return lambda s: s.map(lambda x: sort_order[x])
# define a custom sort order
my_order = list('bdeca')
df.sort_values('a', key=make_sorter(my_order))
a b
1 b 1
3 d 3
4 e 4
2 c 2
0 a 0
With OP’s data:
df = pd.DataFrame({
'id':[2967, 5335, 13950, 6141, 6169],
'Player': ['Cedric Hunter', 'Maurice Baker',
'Ratko Varda' ,'Ryan Bowen' ,'Adrian Caldwell'],
'Year': [1991, 2004, 2001, 2009, 1997],
'Age': [27, 25, 22, 34, 31],
'Tm': ['CHH' ,'VAN' ,'TOT' ,'OKC', 'DAL'],
'G': [6, 7, 60, 52, 81]
})
# Define the sorter
sorter = [
'TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL',
'DEN', 'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA',
'MIL', 'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL',
'PHI', 'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
'WAS', 'WSB'
]
df.sort_values('Tm', key=make_sorter(sorter))
id Player Year Age Tm G
2 13950 Ratko Varda 2001 22 TOT 60
0 2967 Cedric Hunter 1991 27 CHH 6
4 6169 Adrian Caldwell 1997 31 DAL 81
3 6141 Ryan Bowen 2009 34 OKC 52
1 5335 Maurice Baker 2004 25 VAN 7
Since version 1.1.0 you can use the key
attribute to sort values:
df.sort_values(by="Tm", key=lambda column: column.map(lambda e: sorter.index(e)), inplace=True)