Sorting a csv in Python – splitting first column with string delimiter then ordering via the second element then first element
Question:
Say I have a csv such as:
Field1,Field2,Field3,Field4
bitter-banana,yellow,1,10
tasty-banana,green,1,7
bad-banana,yellow,2,11
tasty-apple,green,10,5
bad-apple,red,9,4
bitter-apple,green,1,7
How could I sort the csv on the first column only, so that the part of the string after the -
delimiter is ordered alphabetically first, and rows whose strings are identical after the -
are then sub-ordered alphabetically by the string before the -
delimiter?
Output would be:
Field1,Field2,Field3,Field4
bad-apple,red,9,4
bitter-apple,green,1,7
tasty-apple,green,10,5
bad-banana,yellow,2,11
bitter-banana,yellow,1,10
tasty-banana,green,1,7
I could use Pandas sort_values
for the first sorting requirement such as:
df = df.sort_values(by=["Field1"], key=lambda x: x.str.split("-").str[1], ascending=True)
But I am not sure how to tackle the second requirement of my question.
Answers:
Split and expand Field1
into two columns, then sort the dataframe using those columns:
df.loc[df['Field1'].str.split('-', expand=True).sort_values([1, 0]).index]
Field1 Field2 Field3 Field4
4 bad-apple red 9 4
5 bitter-apple green 1 7
3 tasty-apple green 10 5
2 bad-banana yellow 2 11
0 bitter-banana yellow 1 10
1 tasty-banana green 1 7
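For reference, this approach can be reproduced end-to-end with the sample data from the question read as an in-memory CSV (a self-contained sketch; the original answer assumes df already exists):

```python
import io
import pandas as pd

# Sample data from the question, read as an in-memory CSV
csv_text = """Field1,Field2,Field3,Field4
bitter-banana,yellow,1,10
tasty-banana,green,1,7
bad-banana,yellow,2,11
tasty-apple,green,10,5
bad-apple,red,9,4
bitter-apple,green,1,7
"""
df = pd.read_csv(io.StringIO(csv_text))

# Split "Field1" on '-' into two columns (0: prefix, 1: fruit),
# sort by fruit then prefix, and reindex the original frame with
# the resulting row order.
result = df.loc[df['Field1'].str.split('-', expand=True).sort_values([1, 0]).index]
print(result['Field1'].tolist())
```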
Or, split, reverse elements, join, and sort on that key:
df.sort_values('Field1', key=lambda x: x.str.split('-').str[::-1].str.join('-'))
Output:
Field1 Field2 Field3 Field4
4 bad-apple red 9 4
5 bitter-apple green 1 7
3 tasty-apple green 10 5
2 bad-banana yellow 2 11
0 bitter-banana yellow 1 10
1 tasty-banana green 1 7
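To see why this key works, note that it maps each value to its parts reversed and rejoined, so the fruit leads the string and plain lexicographic sorting compares fruit first, prefix second:

```python
import pandas as pd

s = pd.Series(['bitter-banana', 'bad-apple'])
# Split on '-', reverse each element list, and join back with '-':
# the part after the delimiter now comes first in the key string.
keys = s.str.split('-').str[::-1].str.join('-')
print(keys.tolist())
# ['banana-bitter', 'apple-bad']
```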
Note: strangely enough, this appears to be slightly more performant than split and sort_values:
%timeit df.sort_values('Field1', key=lambda x: x.str.split('-').str[::-1].str.join('-'))
771 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.loc[df['Field1'].str.split('-', expand=True).sort_values([1, 0]).index]
1.39 ms ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You’ll have to test on real data or much larger datasets to see which performs better.
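As an aside, if pandas isn't required, the same two-level ordering falls out of a plain tuple sort key over the rows of the csv module's reader (a sketch, not part of the answers above):

```python
import csv
import io

csv_text = """Field1,Field2,Field3,Field4
bitter-banana,yellow,1,10
tasty-banana,green,1,7
bad-banana,yellow,2,11
tasty-apple,green,10,5
bad-apple,red,9,4
bitter-apple,green,1,7
"""
rows = list(csv.reader(io.StringIO(csv_text)))
header, body = rows[0], rows[1:]
# Key: (part after '-', part before '-'), so rows sort by the
# suffix first and ties break on the prefix.
body.sort(key=lambda r: tuple(reversed(r[0].split('-'))))
print([r[0] for r in body])
```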