Sort a Pandas Dataframe by Multiple Columns Using Key Argument

Question:

I have a pandas dataframe with the following columns:

df = pd.DataFrame([
    ['A2', 2],
    ['B1', 1],
    ['A1', 2],
    ['A2', 1],
    ['B1', 2],
    ['A1', 1]], 
  columns=['one','two'])

I am hoping to sort it primarily by column ‘two’, then by column ‘one’. For the secondary sort, I would like to use a custom sorting rule that sorts column ‘one’ by the alphabetic character [A-Z] and then by the trailing number [0-100]. The outcome of the sort would be:

one two
 A1   1
 B1   1
 A2   1
 A1   2
 B1   2
 A2   2

I have sorted a list of strings similar to column ‘one’ before using a sorting rule like so:

def custom_sort(value):
    return (value[0], int(value[1:]))

my_list.sort(key=custom_sort)

If I try to apply this rule via a pandas sort, I run into a number of issues including:

  1. The pandas DataFrame.sort_values() function accepts a key for sorting like the sort() function, but the key function should be vectorized (per the pandas documentation). If I try to apply the sorting key to only column ‘one’, I get the error "TypeError: cannot convert the series to <class 'int'>"
  2. When you use the pandas DataFrame.sort_values() method, it applies the sort key to all columns you pass in. This will not work since I want to sort first by the column ‘two’ using a native numerical sort.
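
Both points can be reproduced with a small sketch. The helper record_call and the list seen are made up for this illustration; the rest is the code from the question. It shows that the key receives a whole Series for each column named in by, which is why the element-wise custom_sort fails:

```python
import pandas as pd

df = pd.DataFrame(
    [['A2', 2], ['B1', 1], ['A1', 2], ['A2', 1], ['B1', 2], ['A1', 1]],
    columns=['one', 'two'])

def custom_sort(value):
    # Written for a single string such as 'A2', not for a Series
    return (value[0], int(value[1:]))

seen = []  # hypothetical log of what the key function receives

def record_call(col):
    # Identity key: records the argument and returns it unchanged
    seen.append((col.name, type(col).__name__))
    return col

df.sort_values(by=['two', 'one'], key=record_call)

error = None
try:
    # The key gets the whole column, so int(<Series slice>) fails
    df.sort_values(by='one', key=custom_sort)
except TypeError as exc:
    error = exc
```

After running this, seen contains an entry for each of the columns ‘two’ and ‘one’, each received as a Series, and error holds the TypeError described in point 1.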

How would I go about sorting the DataFrame as mentioned above?

Asked By: user11058068


Answers:

You can split column one into its constituent parts, add them as columns to the dataframe and then sort on them with column two. Finally, remove the temporary columns.

>>> (df.assign(lhs=df['one'].str[0], rhs=df['one'].str[1:].astype(int))
       .sort_values(['two', 'rhs', 'lhs'])
       .drop(columns=['lhs', 'rhs']))
  one  two
5  A1    1
1  B1    1
3  A2    1
2  A1    2
4  B1    2
0  A2    2
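
If you would rather not add and drop temporary columns, the same ordering can be sketched with np.lexsort. This is an alternative technique, not part of the answer above; the variable names are chosen for the example, and pd.factorize is used only to turn the letters into sortable integer codes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['A2', 2], ['B1', 1], ['A1', 2], ['A2', 1], ['B1', 2], ['A1', 1]],
    columns=['one', 'two'])

# Integer codes for the leading letter, assigned in sorted order (A=0, B=1, ...)
letters = pd.factorize(df['one'].str[0], sort=True)[0]
# Trailing number as integers
numbers = df['one'].str[1:].astype(int).to_numpy()

# np.lexsort uses the LAST key as the primary sort key,
# so this sorts by 'two', then number, then letter
order = np.lexsort((letters, numbers, df['two'].to_numpy()))
result = df.iloc[order]
```

Here result has the same row order as the output shown above (index 5, 1, 3, 2, 4, 0).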
Answered By: Alexander

Use str.extract to create temporary columns for 1) the letters ([a-zA-Z]+) and 2) the number (\d+), sort on them, and then drop them:

df = pd.DataFrame([
    ['A2', 2],
    ['B1', 1],
    ['A1', 2],
    ['A2', 1],
    ['B1', 2],
    ['A1', 1]], 
  columns=['one','two'])

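# expand=False returns a Series rather than a one-column DataFrame
df['one-letter'] = df['one'].str.extract(r'([a-zA-Z]+)', expand=False)
# convert to int so that, for example, 10 sorts after 2 rather than before it
df['one-number'] = df['one'].str.extract(r'(\d+)', expand=False).astype(int)
df = df.sort_values(['two', 'one-number', 'one-letter']).drop(['one-letter', 'one-number'], axis=1)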
df
Out[38]: 
  one  two
5  A1    1
1  B1    1
3  A2    1
2  A1    2
4  B1    2
0  A2    2
Answered By: David Erickson

One of the solutions is to make both columns pd.Categorical and pass the expected order as an argument "categories".

But I have requirements under which I cannot coerce unknown/unexpected values, and unfortunately that is what pd.Categorical does. Also, None is not supported as a category and is coerced automatically.
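
To illustrate the coercion, here is a minimal sketch. The value 'Z9' and the variable names are invented for the example:

```python
import pandas as pd

wanted_order = ['A1', 'B1', 'A2']          # the custom order, passed as categories
values = pd.Series(['A2', 'B1', 'Z9', 'A1'])

as_cat = pd.Series(
    pd.Categorical(values, categories=wanted_order, ordered=True))

# 'Z9' is not among the categories, so it has silently become NaN
missing_mask = as_cat.isna()

# Sorting follows the category order; the NaN entry ends up last
sorted_vals = as_cat.sort_values()
```

The unexpected value is lost rather than kept in its original form, which is the behaviour I needed to avoid.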

So my solution was to use a key to sort on multiple columns with a custom sorting order:

import pandas as pd


df = pd.DataFrame([
    ['A2', 2],
    ['B1', 1],
    ['A1', 2],
    ['A2', 1],
    ['B1', 2],
    ['A1', 1]], 
  columns=['one','two'])


def custom_sorting(col: pd.Series) -> pd.Series:
    """Series is input and ordered series is expected as output"""
    to_ret = col
    # apply custom sorting only to column one:
    if col.name == "one":
        custom_dict = {}
        # define the custom order here (letter first, then the trailing number):
        def custom_sort(value):
            return (value[0], int(value[1:]))

        ordered_items = list(col.unique())
        ordered_items.sort(key=custom_sort)
        # apply custom order first:
        for index, item in enumerate(ordered_items):
            custom_dict[item] = index
        to_ret = col.map(custom_dict)
    # default text sorting is about to be applied
    return to_ret


# pass two columns to be sorted
df.sort_values(
    by=["two", "one"],
    ascending=True,
    inplace=True,
    key=custom_sorting,
)

print(df)

Output:

  one  two
5  A1    1
3  A2    1
1  B1    1
2  A1    2
0  A2    2
4  B1    2

Be aware that this solution can be slow.
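
If speed matters, the per-item Python loop can probably be avoided. The following is only a sketch under my own assumptions: the function name vectorized_key is made up, and it ranks column "one" by trailing number first and letter second (the order requested in the question, not the letter-first order printed above). It relies on groupby(..., sort=True).ngroup() numbering the groups in sorted key order:

```python
import pandas as pd

df = pd.DataFrame(
    [['A2', 2], ['B1', 1], ['A1', 2], ['A2', 1], ['B1', 2], ['A1', 1]],
    columns=['one', 'two'])

def vectorized_key(col: pd.Series) -> pd.Series:
    """Return a sortable Series; only column 'one' gets the custom rule."""
    if col.name != "one":
        return col
    parts = pd.DataFrame(
        {"num": col.str[1:].astype(int), "let": col.str[0]},
        index=col.index)
    # ngroup() labels each (num, let) group with its position in sorted
    # group order, giving one integer rank per row
    return parts.groupby(["num", "let"], sort=True).ngroup()

result = df.sort_values(by=["two", "one"], key=vectorized_key)
```

With the sample data this puts the rows in index order 5, 1, 3, 2, 4, 0. Swapping "num" and "let" in the groupby call should give the letter-first order instead.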

Answered By: Ievgen

With pandas >= 1.1.0 and natsort, you can also do this now:

import natsort

sorted_df = df.sort_values(["one", "two"], key=natsort.natsort_keygen())
Answered By: Akaisteph7

I have created a function that solves the problem of using the key argument with multiple columns, following the suggestion from @Alexander. It also avoids name clashes when creating the temporary columns. Furthermore, it can sort the whole dataframe including the index (using index.names).

It can be improved, but using copy-paste should work:

https://github.com/DavidDB33/pandas_helpers/blob/main/pandas_helpers/helpers.py

Answered By: David Domínguez