How can I change how Python sort deals with punctuation?

Question:

I’m currently trying to rewrite an R script in Python. I’ve been tripped up because it looks like R and Python sort some punctuation differently, specifically ‘&’ and ‘_’. At some point in my program I sort by an identifier column in a Pandas dataframe.

As an example in Python:

t = ["1&2","1_2"]
sorted(t)

results in

['1&2', '1_2']

Comparatively in R:

t <- c("1&2","1_2")
sort(t)

results in

[1] "1_2" "1&2"

According to various resources (https://www.dconc.gov/home/showpublisheddocument/1481/635548543552170000) Python is doing the correct thing, but unfortunately I need to do the wrong thing here (changing R is not in scope).

Is there a straightforward way to change how Python sorts these strings? Specifically, I’ll need to be able to do this on pandas dataframes when sorting by an ID column.

Asked By: wfirth


Answers:

Use a custom key for sorting. Here, we can just swap the & and _ characters in the key: the function walks through the string character by character, replaces each & with _ and each _ with &, leaves every other character unchanged, and then rebuilds the string with ''.join.

t = ["1&2", "1_2", "5&3"]

def swap_chars(s):
    # Swap '&' and '_' in the sort key; all other characters are unchanged.
    swap = {'&': '_', '_': '&'}
    return ''.join(swap.get(c, c) for c in s)

sorted(t, key=swap_chars)
# ['1_2', '1&2', '5&3']
Answered By: Michael Cao

You can skip straight to FINALLY and use the code provided there to sort Python lists of strings the way R would sort them, or read this answer from top to bottom and learn a bit about Python along the way:

As Rawson already mentioned in a comment on your question (with a helpful link), you can define the sort order yourself for any characters you want to take out of the usual order:

t = ['1&2', '1_2']
print(sorted(t))

alphabet = {"_":-2, "&":-1}
def sortkey(word):
    # `ch` is used instead of `chr` to avoid shadowing the built-in chr()
    return [ alphabet.get(ch, ord(ch)) for ch in word ]
    # which is equivalent to:
    # return [ alphabet[ch] if ch in alphabet else ord(ch) for ch in word ]

print(sortkey(t[0]), sortkey(t[1]))
print(sorted(t, key=sortkey))

gives:

['1&2', '1_2']
[49, -1, 50] [49, -2, 50]
['1_2', '1&2']

Use negative values for the redefined characters so that you can fall back on ord() for every character you have not redefined (advantage: this avoids possible problems with Unicode strings).
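To illustrate why the negative ranks work, here is a small sketch with made-up sample strings (the list `ids` is only for illustration): the two redefined characters sort before everything else because ord() is never negative, while every other character, including a non-ASCII one, keeps its code-point order.

```python
# Redefined characters get negative ranks, so they sort before any real
# code point (ord() never returns a negative number).
alphabet = {"_": -2, "&": -1}

def sortkey(word):
    # Fall back to the Unicode code point for characters not in `alphabet`.
    return [alphabet.get(ch, ord(ch)) for ch in word]

# Hypothetical sample data, including a non-ASCII character (U+00E9).
ids = ["1\u00e92", "1a2", "1&2", "1_2"]
result = sorted(ids, key=sortkey)
print(result)  # ['1_2', '1&2', '1a2', '1é2']
```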

If you want to redefine the order of many characters and only need the printable ones, you can also define your own alphabet string as follows:

#                                                                                v                    v
alphabet = """0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%_'()*+,-./:;<=>?@[]^&`{|}~"""

and then sort by it:

print(sorted(t, key=lambda s: [alphabet.index(c) for c in s]))

If you have a large amount of data to sort, consider turning the alphabet into a dictionary, which avoids a linear search through the alphabet string for every character:

dict_alphabet = { alphabet[i]:i for i in range(len(alphabet)) }
print(sorted(t, key=lambda s: [dict_alphabet[c] for c in s ]))
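Note that both the `alphabet.index` version and the dictionary version raise an error (`ValueError` and `KeyError` respectively) for any character that is not in `alphabet`, such as a space or a backslash. Below is a minimal sketch of one way to tolerate such characters, assuming it is acceptable to sort them after all listed characters by code point. The `rank` helper and the shortened alphabet are hypothetical and only serve the example.

```python
# Shortened custom alphabet for illustration: '_' deliberately ranks before '&'.
alphabet = "0123456789_&"
dict_alphabet = {ch: i for i, ch in enumerate(alphabet)}

def rank(ch):
    # Listed characters use their position in the alphabet; unlisted ones
    # are placed after the whole alphabet, ordered by code point.
    return dict_alphabet.get(ch, len(alphabet) + ord(ch))

def sortkey(s):
    return [rank(ch) for ch in s]

ids = ["1 2", "1&2", "1_2"]  # the space is not part of the alphabet
result = sorted(ids, key=sortkey)
print(result)  # ['1_2', '1&2', '1 2']
```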

or, best of all, use the character translation feature that Python provides for strings:

alphabet = """0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%_'()*+,-./:;<=>?@[]^&`{|}~"""
table = str.maketrans(alphabet, ''.join(sorted(alphabet)))
print(sorted(t, key=lambda s: s.translate(table)))
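To see what the translation table does, a tiny worked sketch with a three-character alphabet (chosen only for illustration) may help. Each character is mapped to the character that occupies the same position in the code-point-sorted alphabet, so comparing the translated strings by code point reproduces the custom order.

```python
alphabet = "_&a"                               # desired order: '_' < '&' < 'a'
code_point_order = "".join(sorted(alphabet))   # '&_a'
table = str.maketrans(alphabet, code_point_order)

# '_' is translated to '&' (the smallest code point of the three),
# '&' is translated to '_', and 'a' maps to itself.
translated = "a_&".translate(table)
print(translated)  # a&_

result = sorted(["1&2", "1_2"], key=lambda s: s.translate(table))
print(result)  # ['1_2', '1&2']
```

Since the question is about pandas, the same table should be usable there as well, for example with `df.sort_values("ID", key=lambda col: col.str.translate(table))`, where `df` and the column name `"ID"` are placeholders and the `key` argument requires pandas 1.1 or later.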

By the way: you can get a string containing the printable ASCII characters from the string module:

import string
print(string.printable) # includes Form-Feed, Tab, VT, ...

FINALLY

Below is ready-to-use Python code for sorting lists of strings exactly as they would be sorted in R:

# Raw strings (r"...") keep the backslashes literal and avoid invalid-escape warnings.
Rcode = r"""
s <- "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!#$%&()*+,-./:;<=>?@[\\]^_`{|}~"
paste(sort(unlist(strsplit(s, ""))), collapse = "")"""
RsortOrder = r"_-,;:!?.()[]{}@*/\&#%`^+<=>|~$0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"
# ^--- result of running the R code online (TIO)
# print(''.join(sorted(r"_-,;:!?.()[]{}@*/\&#%`^+<=>|~$0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ")))
PythonSort = r"!#$%&()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
# ===========================================
alphabet = RsortOrder
table = str.maketrans(alphabet, ''.join(sorted(alphabet)))
print(">>>",sorted(["1&2","1_2"], key=lambda s: s.translate(table)))

printing

>>> ['1_2', '1&2']

Run the R code online using TIO, or generate your own RsortOrder by running the provided R code with your specific locale setting in R, as suggested by juanpa.arrivillaga in the comments on your question.

Alternatively, you can use Python's locale module to apply the same locale setting that R uses (see https://stackoverflow.com/questions/1097908/how-do-i-sort-unicode-strings-alphabetically-in-python ). The resulting order depends on the locale configured in your environment:

import locale
# this reads the environment and inits the right locale
locale.setlocale(locale.LC_ALL, "")
# locale.strxfrm(string)
# Transforms a string to one that can be used in locale-aware comparisons. 
# For example, strxfrm(s1) < strxfrm(s2) is equivalent to strcoll(s1, s2) < 0. 
# This function can be used when the same string is compared repeatedly, 
# e.g. when collating a sequence of strings.
print("###",sorted(["1&2","1_2"], key=locale.strxfrm))

prints

### ['1_2', '1&2']
Answered By: Claudio

Actually, Python and R use different collation algorithms, depending on which sort method you are using in R. R’s sort is based either on the Unicode Collation Algorithm or on a libc locale. Python’s built-in sort compares strings by code point, and its locale module relies on libc. In this respect R is more flexible and can be made compatible with other languages.

As others have noted, you could set LC_COLLATE to the C locale for both R and Python to get consistent results across languages.
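A minimal sketch of the Python side, assuming it is acceptable to change the process-wide collation setting for the duration of the sort (in R, the corresponding call would be `Sys.setlocale("LC_COLLATE", "C")`). In the C locale, strings are compared by their raw character values, so `locale.strxfrm` gives the same order as a plain `sorted` call for these ASCII strings:

```python
import locale

# Remember the current collation setting, then switch only the collation
# category to the portable "C" locale.
previous = locale.setlocale(locale.LC_COLLATE)
locale.setlocale(locale.LC_COLLATE, "C")
try:
    t = ["1_2", "1&2"]
    result = sorted(t, key=locale.strxfrm)
    print(result)  # ['1&2', '1_2']
finally:
    # Restore whatever collation setting was active before.
    locale.setlocale(locale.LC_COLLATE, previous)
```

This makes R agree with Python's default order rather than the other way round, so it only helps if changing the locale on the R side is an option.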

Alternatively, if you have icu4c on your system, and PyICU installed, the following code illustrates the difference in sorting:

t = ["1&2","1_2"]
sorted(t)
# ['1&2', '1_2']

import icu
collator = icu.Collator.createInstance(icu.Locale.getRoot())
sorted(t, key=collator.getSortKey)
# ['1_2', '1&2']

The collator instance is using the root collation (i.e. the CLDR Collation Algorithm, a tailoring of the Unicode Collation Algorithm).

There are many differences between how R and Python sort. The most obvious one is how upper and lower case are ordered. Using PyICU:

l = ['a', 'Z', 'A']
sorted(l)
# ['A', 'Z', 'a']
sorted(l, key=collator.getSortKey)
# ['a', 'A', 'Z']

In R:

l <- c("a", "Z", "A")
sort(l)
#[1] "a" "A" "Z"

Alternatively, it’s possible to use DUCET (UCA) rather than CLDR’s root collation; they give the same results in this instance.

from pyuca import Collator as ducetCollator
coll = ducetCollator()
sorted(t, key=coll.sort_key)
# ['1_2', '1&2']

That said, I would use an updated allkeys file for DUCET.

Answered By: Andj