Sorting a csv in Python – splitting first column with string delimiter then ordering via the second element then first element

Question:

Say I have a csv such as:

Field1,Field2,Field3,Field4
bitter-banana,yellow,1,10
tasty-banana,green,1,7
bad-banana,yellow,2,11
tasty-apple,green,10,5
bad-apple,red,9,4
bitter-apple,green,1,7

How could I sort the csv on the first column only, so that the part of the string after the - delimiter is ordered alphabetically first, and rows whose strings are identical after the - are then sub-ordered alphabetically by the part before the - delimiter?

Output would be:

Field1,Field2,Field3,Field4
bad-apple,red,9,4
bitter-apple,green,1,7
tasty-apple,green,10,5
bad-banana,yellow,2,11
bitter-banana,yellow,1,10
tasty-banana,green,1,7

I could use Pandas sort_values for the first sorting requirement such as:

df = df.sort_values(by=["Field1"], key=lambda x: x.str.split("-").str[1], ascending=True)

But I am not sure how to tackle the second requirement of my question.

Asked By: CB990


Answers:

Split and expand Field1 into two columns, then sort the dataframe using those columns:

df.loc[df['Field1'].str.split('-', expand=True).sort_values([1, 0]).index]

          Field1  Field2  Field3  Field4
4      bad-apple     red       9       4
5   bitter-apple   green       1       7
3    tasty-apple   green      10       5
2     bad-banana  yellow       2      11
0  bitter-banana  yellow       1      10
1   tasty-banana   green       1       7
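
For anyone starting from the csv itself, the one-liner above can be unpacked into steps. This is only a sketch: it loads the question's sample data from an in-memory string (the names CSV_TEXT, parts, order and sorted_df are made up for illustration), and it assumes every Field1 value contains exactly one hyphen.

```python
import io

import pandas as pd

# Sample data copied from the question; a real file path could be
# passed to pd.read_csv instead of the StringIO wrapper.
CSV_TEXT = """Field1,Field2,Field3,Field4
bitter-banana,yellow,1,10
tasty-banana,green,1,7
bad-banana,yellow,2,11
tasty-apple,green,10,5
bad-apple,red,9,4
bitter-apple,green,1,7
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Split "bad-apple" into two helper columns: 0 -> "bad", 1 -> "apple".
parts = df["Field1"].str.split("-", expand=True)

# Sort by the part after the hyphen (column 1), then by the part before it (column 0).
order = parts.sort_values([1, 0]).index

# Reorder the original rows; the helper columns are never added to df.
sorted_df = df.loc[order]

# Serialise back to csv text without the index column.
sorted_csv = sorted_df.to_csv(index=False)
print(sorted_csv)
```

Because the helper columns live in a separate frame, the result keeps the original four columns and the original index labels.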
Answered By: Shubham Sharma

Or, split, reverse elements, join, and sort on that key:

df.sort_values('Field1', key=lambda x: x.str.split('-').str[::-1].str.join('-'))

Output:

          Field1  Field2  Field3  Field4
4      bad-apple     red       9       4
5   bitter-apple   green       1       7
3    tasty-apple   green      10       5
2     bad-banana  yellow       2      11
0  bitter-banana  yellow       1      10
1   tasty-banana   green       1       7
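
The same idea of sorting on a key built from the swapped parts also works without pandas. The following is a rough sketch using only the standard library's csv module, not something from the answer above; field1_key and CSV_TEXT are illustrative names, and it assumes the first column always has the form "before-after".

```python
import csv
import io

# Sample data copied from the question; open("file.csv", newline="")
# could replace the StringIO object for a real file.
CSV_TEXT = """Field1,Field2,Field3,Field4
bitter-banana,yellow,1,10
tasty-banana,green,1,7
bad-banana,yellow,2,11
tasty-apple,green,10,5
bad-apple,red,9,4
bitter-apple,green,1,7
"""


def field1_key(row):
    """Return (part after the hyphen, part before the hyphen) for a row."""
    before, _, after = row[0].partition("-")
    return (after, before)


reader = csv.reader(io.StringIO(CSV_TEXT))
header = next(reader)                  # keep the header out of the sort
rows = sorted(reader, key=field1_key)  # tuples compare element by element

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
print(out.getvalue())
```

Returning a tuple avoids re-joining the parts into a string, so the comparison does not depend on how the delimiter character itself sorts.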

Note: strangely enough, this appears to be slightly more performant than the split and sort_values approach:

%timeit df.sort_values('Field1', key=lambda x: x.str.split('-').str[::-1].str.join('-'))

771 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.loc[df['Field1'].str.split('-', expand=True).sort_values([1, 0]).index]

1.39 ms ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

You’ll have to test on real data or much larger datasets to see which performs better.
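
One way to run that comparison outside IPython is sketched below. The synthetic data, the row count and the helper names by_expand and by_key are all assumptions made for illustration; timings will vary by machine and pandas version, so no numbers are claimed here.

```python
import random
import timeit

import pandas as pd

random.seed(0)  # make the synthetic data reproducible

# Hypothetical vocabulary; any "before-after" strings would do.
adjectives = ["bad", "bitter", "tasty", "sweet", "sour"]
fruits = ["apple", "banana", "cherry", "mango", "pear"]
n_rows = 20_000  # arbitrary size, increase to stress the two approaches

big = pd.DataFrame({
    "Field1": [
        f"{random.choice(adjectives)}-{random.choice(fruits)}"
        for _ in range(n_rows)
    ],
    "Field2": range(n_rows),
})


def by_expand(df):
    # First answer: expand into helper columns and sort on them.
    return df.loc[df["Field1"].str.split("-", expand=True).sort_values([1, 0]).index]


def by_key(df):
    # Second answer: sort on a key built from the reversed parts.
    return df.sort_values(
        "Field1", key=lambda x: x.str.split("-").str[::-1].str.join("-")
    )


for fn in (by_expand, by_key):
    seconds = timeit.timeit(lambda: fn(big), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per call")
```

Both functions should produce the same Field1 ordering, so it is worth checking that before trusting either timing.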

Answered By: Scott Boston