Sort a Pandas Dataframe by Multiple Columns Using Key Argument
Question:
I have a dataframe a pandas dataframe with the following columns:
df = pd.DataFrame([
['A2', 2],
['B1', 1],
['A1', 2],
['A2', 1],
['B1', 2],
['A1', 1]],
columns=['one','two'])
Which I am hoping to sort primarily by column ‘two’, then by column ‘one’. For the secondary sort, I would like to use a custom sorting rule that will sort column ‘one’ by the alphabetic character [A-Z]
and then the trailing numeric number [0-100]
. So, the outcome of the sort would be:
one two
A1 1
B1 1
A2 1
A1 2
B1 2
A2 2
I have sorted a list of strings similar to column ‘one’ before using a sorting rule like so:
def custom_sort(value):
return (value[0], int(value[1:]))
my_list.sort(key=custom_sort)
If I try to apply this rule via a pandas sort, I run into a number of issues including:
- The pandas
DataFrame.sort_values()
function accepts a key for sorting like the sort() function, but the key function should be vectorized (per the pandas documentation). If I try to apply the sorting key to only column ‘one’, I get the error "TypeError: cannot convert the series to <class ‘int’>"
- When you use the pandas
DataFrame.sort_values()
method, it applies the sort key to all columns you pass in. This will not work since I want to sort first by the column ‘two’ using a native numerical sort.
How would I go about sorting the DataFrame as mentioned above?
Answers:
You can split column one
into its constituent parts, add them as columns to the dataframe and then sort on them with column two
. Finally, remove the temporary columns.
>>> (df.assign(lhs=df['one'].str[0], rhs=df['one'].str[1:].astype(int))
.sort_values(['two', 'rhs', 'lhs'])
.drop(columns=['lhs', 'rhs']))
one two
5 A1 1
1 B1 1
3 A2 1
2 A1 2
4 B1 2
0 A2 2
use str.extract
to create some temp columns that are based off 1) alphabet (a-zA-Z]+)
and 2) Number (d+)
and then drop them:
df = pd.DataFrame([
['A2', 2],
['B1', 1],
['A1', 2],
['A2', 1],
['B1', 2],
['A1', 1]],
columns=['one','two'])
df['one-letter'] = df['one'].str.extract('([a-zA-Z]+)')
df['one-number'] = df['one'].str.extract('(d+)')
df = df.sort_values(['two', 'one-number', 'one-letter']).drop(['one-letter', 'one-number'], axis=1)
df
Out[38]:
one two
5 A1 1
1 B1 1
3 A2 1
2 A1 2
4 B1 2
One of the solutions is to make both columns pd.Categorical and pass the expected order as an argument "categories".
But I have some requirements where I cannot coerce unknownunexpected values and unfortunately that is what pd.Categorical is doing. Also None is not supported as a category and coerced automatically.
So my solution was to use a key to sort on multiple columns with a custom sorting order:
import pandas as pd
df = pd.DataFrame([
[A2, 2],
[B1, 1],
[A1, 2],
[A2, 1],
[B1, 2],
[A1, 1]],
columns=['one','two'])
def custom_sorting(col: pd.Series) -> pd.Series:
"""Series is input and ordered series is expected as output"""
to_ret = col
# apply custom sorting only to column one:
if col.name == "one":
custom_dict = {}
# for example ensure that A2 is first, pass items in sorted order here:
def custom_sort(value):
return (value[0], int(value[1:]))
ordered_items = list(col.unique())
ordered_items.sort(key=custom_sort)
# apply custom order first:
for index, item in enumerate(ordered_items):
custom_dict[item] = index
to_ret = col.map(custom_dict)
# default text sorting is about to be applied
return to_ret
# pass two columns to be sorted
df.sort_values(
by=["two", "one"],
ascending=True,
inplace=True,
key=custom_sorting,
)
print(df)
Output:
5 A1 1
3 A2 1
1 B1 1
2 A1 2
0 A2 2
4 B1 2
Be aware that this solution can be slow.
With pandas >= 1.1.0 and natsort, you can also do this now:
import natsort
sorted_df = df.sort_values(["one", "two"], key=natsort.natsort_keygen())
I have created a function to solve the issue of using key argument for multi-column, following the suggestion from @Alexander. Also it deals with not duplicating names when creating temporal columns. Furthermore, it can sort the whole dataframe including the index (using the index.names).
It can be improved, but using copy-paste should work:
https://github.com/DavidDB33/pandas_helpers/blob/main/pandas_helpers/helpers.py
I have a dataframe a pandas dataframe with the following columns:
df = pd.DataFrame([
['A2', 2],
['B1', 1],
['A1', 2],
['A2', 1],
['B1', 2],
['A1', 1]],
columns=['one','two'])
Which I am hoping to sort primarily by column ‘two’, then by column ‘one’. For the secondary sort, I would like to use a custom sorting rule that will sort column ‘one’ by the alphabetic character [A-Z]
and then the trailing numeric number [0-100]
. So, the outcome of the sort would be:
one two
A1 1
B1 1
A2 1
A1 2
B1 2
A2 2
I have sorted a list of strings similar to column ‘one’ before using a sorting rule like so:
def custom_sort(value):
return (value[0], int(value[1:]))
my_list.sort(key=custom_sort)
If I try to apply this rule via a pandas sort, I run into a number of issues including:
- The pandas
DataFrame.sort_values()
function accepts a key for sorting like the sort() function, but the key function should be vectorized (per the pandas documentation). If I try to apply the sorting key to only column ‘one’, I get the error "TypeError: cannot convert the series to <class ‘int’>" - When you use the pandas
DataFrame.sort_values()
method, it applies the sort key to all columns you pass in. This will not work since I want to sort first by the column ‘two’ using a native numerical sort.
How would I go about sorting the DataFrame as mentioned above?
You can split column one
into its constituent parts, add them as columns to the dataframe and then sort on them with column two
. Finally, remove the temporary columns.
>>> (df.assign(lhs=df['one'].str[0], rhs=df['one'].str[1:].astype(int))
.sort_values(['two', 'rhs', 'lhs'])
.drop(columns=['lhs', 'rhs']))
one two
5 A1 1
1 B1 1
3 A2 1
2 A1 2
4 B1 2
0 A2 2
use str.extract
to create some temp columns that are based off 1) alphabet (a-zA-Z]+)
and 2) Number (d+)
and then drop them:
df = pd.DataFrame([
['A2', 2],
['B1', 1],
['A1', 2],
['A2', 1],
['B1', 2],
['A1', 1]],
columns=['one','two'])
df['one-letter'] = df['one'].str.extract('([a-zA-Z]+)')
df['one-number'] = df['one'].str.extract('(d+)')
df = df.sort_values(['two', 'one-number', 'one-letter']).drop(['one-letter', 'one-number'], axis=1)
df
Out[38]:
one two
5 A1 1
1 B1 1
3 A2 1
2 A1 2
4 B1 2
One of the solutions is to make both columns pd.Categorical and pass the expected order as an argument "categories".
But I have some requirements where I cannot coerce unknownunexpected values and unfortunately that is what pd.Categorical is doing. Also None is not supported as a category and coerced automatically.
So my solution was to use a key to sort on multiple columns with a custom sorting order:
import pandas as pd
df = pd.DataFrame([
[A2, 2],
[B1, 1],
[A1, 2],
[A2, 1],
[B1, 2],
[A1, 1]],
columns=['one','two'])
def custom_sorting(col: pd.Series) -> pd.Series:
"""Series is input and ordered series is expected as output"""
to_ret = col
# apply custom sorting only to column one:
if col.name == "one":
custom_dict = {}
# for example ensure that A2 is first, pass items in sorted order here:
def custom_sort(value):
return (value[0], int(value[1:]))
ordered_items = list(col.unique())
ordered_items.sort(key=custom_sort)
# apply custom order first:
for index, item in enumerate(ordered_items):
custom_dict[item] = index
to_ret = col.map(custom_dict)
# default text sorting is about to be applied
return to_ret
# pass two columns to be sorted
df.sort_values(
by=["two", "one"],
ascending=True,
inplace=True,
key=custom_sorting,
)
print(df)
Output:
5 A1 1
3 A2 1
1 B1 1
2 A1 2
0 A2 2
4 B1 2
Be aware that this solution can be slow.
With pandas >= 1.1.0 and natsort, you can also do this now:
import natsort
sorted_df = df.sort_values(["one", "two"], key=natsort.natsort_keygen())
I have created a function to solve the issue of using key argument for multi-column, following the suggestion from @Alexander. Also it deals with not duplicating names when creating temporal columns. Furthermore, it can sort the whole dataframe including the index (using the index.names).
It can be improved, but using copy-paste should work:
https://github.com/DavidDB33/pandas_helpers/blob/main/pandas_helpers/helpers.py