How can I change how Python sort deals with punctuation?
Question:
I’m currently trying to rewrite an R script in Python. I’ve been tripped up because it looks like R and Python sort some punctuation differently, specifically ‘&’ and ‘_’. At some point in my program I sort by an identifier column in a pandas DataFrame.
As an example in Python:
t = ["1&2","1_2"]
sorted(t)
results in
['1&2', '1_2']
Comparatively in R:
t <- c("1&2","1_2")
sort(t)
results in
[1] "1_2" "1&2"
According to various resources (https://www.dconc.gov/home/showpublisheddocument/1481/635548543552170000) Python is doing the correct thing, but unfortunately I need to do the wrong thing here (changing R is not in scope).
Is there a straightforward way to change how Python sorts this? Specifically, I’ll need to be able to do this on pandas DataFrames when sorting by an ID column.
Answers:
Use a custom key for sorting. Here, we can just swap the & and _ characters. We do the swap with a list comprehension that breaks the string into its individual characters and exchanges each & for _ and vice versa. Then we rebuild the string with ''.join.
t = ["1&2","1_2", "5&3"]
def swap_chars(s):
    # Exchange '&' and '_' so that '_' sorts first; leave all other characters unchanged.
    return ''.join(['_' if c == '&' else '&' if c == '_' else c for c in s])
sorted(t, key = swap_chars)
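The question asks specifically about pandas. The same swap idea can be sketched there as follows, assuming pandas 1.1 or later (where `sort_values` accepts a `key` argument). The DataFrame, its column names, and the `SWAP` table are made up for illustration:

```python
import pandas as pd

# Translation table that exchanges '&' and '_' (same effect as swap_chars).
SWAP = str.maketrans("&_", "_&")

# Hypothetical example data; only the "id" column matters for the sort.
df = pd.DataFrame({"id": ["1&2", "1_2", "5&3"], "value": [10, 20, 30]})

# key receives the whole column as a Series and must return a Series of sort keys.
result = df.sort_values(by="id", key=lambda col: col.map(lambda s: s.translate(SWAP)))
print(result["id"].tolist())
```

The original values stay in the column; only the keys used for comparison are swapped.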
You can skip ahead to FINALLY and use the code provided there for sorting Python lists of strings the way R would sort them, or read the answer from top to bottom to learn a bit more about Python:
As already mentioned by Rawson in a comment on your question (with a helpful link), you can define the sort order yourself for any characters you want to take out of the usual order:
t = ['1&2', '1_2']
print(sorted(t))
alphabet = {"_": -2, "&": -1}
def sortkey(word):
    # Overridden characters get their custom rank; all others keep their code point.
    # Equivalent to: [alphabet[ch] if ch in alphabet else ord(ch) for ch in word]
    return [alphabet.get(ch, ord(ch)) for ch in word]
print(sortkey(t[0]), sortkey(t[1]))
print(sorted(t, key=sortkey))
gives:
['1&2', '1_2']
[49, -1, 50] [49, -2, 50]
['1_2', '1&2']
Use negative values to define the alphabet order so you can use ord() for all other characters that are not redefined. Negative ranks cannot collide with any code point, which avoids possible problems with Unicode strings.
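To make the effect of the negative ranks concrete, here is a small sketch with a few made-up identifiers, including a non-ASCII one. The name `OVERRIDES` and the sample list are illustrative only:

```python
# Custom ranks for the two characters whose order should change.
OVERRIDES = {"_": -2, "&": -1}

def sortkey(word):
    # Overridden characters use their custom rank, everything else its code point.
    return [OVERRIDES.get(ch, ord(ch)) for ch in word]

ids = ["b_1", "a&1", "é1", "a_1", "a1"]
result = sorted(ids, key=sortkey)
print(result)  # ['a_1', 'a&1', 'a1', 'b_1', 'é1']
```

Note that with negative ranks '_' and '&' sort before every other character, digits included, which may or may not match what R does for your data.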
If you want to redefine the order of many characters and use only printable ones, you can also define your own alphabet string as follows:
# note that '_' and '&' have swapped places in the string below
alphabet = """0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%_'()*+,-./:;<=>?@[]^&`{|}~"""
and then sort by it:
print(sorted(t, key=lambda s: [alphabet.index(c) for c in s]))
For large amounts of data, consider turning the alphabet into a dictionary, since alphabet.index() scans the string on every lookup:
dict_alphabet = {c: i for i, c in enumerate(alphabet)}
print(sorted(t, key=lambda s: [dict_alphabet[c] for c in s]))
or, best of all, use the character translation feature that Python provides for strings:
alphabet = """0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%_'()*+,-./:;<=>?@[]^&`{|}~"""
table = str.maketrans(alphabet, ''.join(sorted(alphabet)))
print(sorted(t, key=lambda s: s.translate(table)))
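If only a handful of characters need to move, typing out a full alphabet by hand is error-prone. A minimal sketch of a helper that builds the table from a short list of priority characters; `make_sort_table` is a hypothetical name, and it assumes you want those characters ranked before all other printable characters:

```python
import string

def make_sort_table(priority_chars):
    """Build a translation table that ranks priority_chars first, in the given order."""
    # All remaining printable characters keep their usual relative order.
    rest = [c for c in sorted(string.printable) if c not in priority_chars]
    alphabet = priority_chars + "".join(rest)
    # Map the i-th character of the custom alphabet to the i-th smallest character.
    return str.maketrans(alphabet, "".join(sorted(alphabet)))

table = make_sort_table("_&")
ids = ["1 2", "1&2", "1_2", "1a2"]
result = sorted(ids, key=lambda s: s.translate(table))
print(result)  # ['1_2', '1&2', '1 2', '1a2']
```

Unlike the alphabet.index() variant, translate() does not raise an error for characters outside the alphabet; it leaves them untranslated.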
By the way: you can get a list of printable Python characters using the string module:
import string
print(string.printable) # includes Form-Feed, Tab, VT, ...
FINALLY
Below is ready-to-use Python code for sorting lists of strings in the same character order that R uses:
Rcode = r"""
s <- "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!#$%&()*+,-./:;<=>?@[\\]^_`{|}~"
paste(sort(unlist(strsplit(s, ""))), collapse = "")"""
RsortOrder = r"_-,;:!?.()[]{}@*/\&#%`^+<=>|~$0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"
# ^--- result of running the R code online (TIO)
# print(''.join(sorted(r"_-,;:!?.()[]{}@*/\&#%`^+<=>|~$0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ")))
PythonSort = r"!#$%&()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
# ===========================================
alphabet = RsortOrder
table = str.maketrans(alphabet, ''.join(sorted(alphabet)))
print(">>>",sorted(["1&2","1_2"], key=lambda s: s.translate(table)))
printing
>>> ['1_2', '1&2']
Run the R code online using TIO, or generate your own RsortOrder by running the provided R code with your specific locale setting in R, as suggested in the comments to your question by juanpa.arrivillaga.
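Characters that are missing from the alphabet passed to str.maketrans are left untranslated by translate(), so they keep their code point and may end up in unexpected places relative to the translated ones. A small sketch of a check to run before sorting; `uncovered_chars` is a hypothetical helper, and the short alphabet is a stand-in for RsortOrder:

```python
def uncovered_chars(strings, alphabet):
    """Return the sorted list of characters used in strings but absent from alphabet."""
    return sorted(set("".join(strings)) - set(alphabet))

# Shortened stand-in alphabet for illustration; in practice pass RsortOrder.
alphabet = "_&0123456789"
ids = ["1&2", "1_2", "3 4", "5-6"]
missing = uncovered_chars(ids, alphabet)
print(missing)  # [' ', '-']
```

If the list is not empty, extend the string passed to the R code with the missing characters and regenerate RsortOrder.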
Alternatively, you can use Python’s locale module to apply the same locale setting that R uses:
( https://stackoverflow.com/questions/1097908/how-do-i-sort-unicode-strings-alphabetically-in-python )
import locale
# this reads the environment and inits the right locale
locale.setlocale(locale.LC_ALL, "")
# locale.strxfrm(string)
# Transforms a string to one that can be used in locale-aware comparisons.
# For example, strxfrm(s1) < strxfrm(s2) is equivalent to strcoll(s1, s2) < 0.
# This function can be used when the same string is compared repeatedly,
# e.g. when collating a sequence of strings.
print("###",sorted(["1&2","1_2"], key=locale.strxfrm))
prints
### ['1_2', '1&2']
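The result of the locale approach depends on which locales are installed and active on the machine. A hedged sketch of a wrapper that sorts under a named locale and then restores the previous setting; `sort_with_locale` and the default locale name "en_US.UTF-8" are assumptions for illustration, not something the answer above prescribes:

```python
import locale

def sort_with_locale(strings, locale_name="en_US.UTF-8"):
    """Sort strings using the collation of locale_name, then restore the old locale."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # The requested locale is not installed: keep the current one.
        pass
    try:
        return sorted(strings, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)

ids = ["1_2", "1&2"]
print(sort_with_locale(ids))        # order depends on the installed locale
print(sort_with_locale(ids, "C"))   # the C locale compares by character code
```

Passing "C" reproduces the default order of sorted() for these ASCII strings, which is a quick way to see how much the locale changes the result.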
Actually, Python and R use different collation algorithms, and which one R uses depends on the sort method. R’s sort is based either on the Unicode Collation Algorithm or on a libc locale. Python’s built-in sort compares Unicode code points, and its locale module relies on libc. R in this instance is more flexible and can be made compatible with other languages.
As others have noted, you could set LC_COLLATE to the C locale for both R and Python to get consistent results across languages.
Alternatively, if you have icu4c on your system, and PyICU installed, the following code illustrates the difference in sorting:
t = ["1&2","1_2"]
sorted(t)
# ['1&2', '1_2']
import icu
collator = icu.Collator.createInstance(icu.Locale.getRoot())
sorted(t, key=collator.getSortKey)
# ['1_2', '1&2']
The collator instance is using the root collation (i.e. the CLDR Collation Algorithm, a tailoring of the Unicode Collation Algorithm).
There are many differences between R and Python sort. The obvious one is how upper and lower case are sorted. Using PyICU:
l = ['a', 'Z', 'A']
sorted(l)
# ['A', 'Z', 'a']
sorted(l, key=collator.getSortKey)
# ['a', 'A', 'Z']
In R:
l <- c("a", "Z", "A")
sort(l)
#[1] "a" "A" "Z"
Alternatively, it’s possible to use DUCET (UCA) rather than CLDR’s root collation; they give the same results in this instance.
from pyuca import Collator as ducetCollator
coll = ducetCollator()
sorted(t, key=coll.sort_key)
# ['1_2', '1&2']
Although, I would use an updated allkeys file for DUCET.