Is Python's sort function the same as Linux's sort with LC_ALL=C?

Question:

I’m porting a Bash script to Python. The script sets LC_ALL=C and uses the Linux sort command to ensure the native byte order instead of locale-specific sort orders (http://stackoverflow.com/questions/28881/why-doesnt-sort-sort-the-same-on-every-machine).

In Python, I want to use Python’s list sort() or sorted() functions (without the key= option). Will I always get the same results as Linux sort with LC_ALL=C?

Asked By: tahoar

Answers:

Since you can supply a comparison function, you can make sure that the sort is equivalent to LC_ALL=C. From the docs, though, it looks like if all the characters are 7-bit, it sorts this way by default; otherwise it uses locale-specific sorting.

If you have 8-bit or Unicode characters, locale-specific sorting makes a lot of sense.

Answered By: Petesh

In Python versions before 3, non-Unicode strings are actually bytes. The sort functions and methods do nothing to enforce a locale; you need a function from the locale module to get locale-aware sorting explicitly.

Unicode strings, and all strings in Python 3.x, are no longer bytes. Python 3 has a separate “bytes” type.
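
As a rough illustration of the point above: Python 3 compares str values code point by code point, and UTF-8 preserves code point order, so a plain sorted() on decoded text should agree with sorting the same strings as raw UTF-8 bytes. The sample words below are made up for the demonstration.

```python
# Hypothetical sample data mixing digits, upper case, punctuation,
# lower case, and one non-ASCII word.
words = ["zebra", "Apple", "éclair", "apple", "Zebra", "_x", "10", "9"]

# Default str ordering: compare by Unicode code point.
by_code_point = sorted(words)

# Byte ordering: compare the UTF-8 encoding of each string.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# UTF-8 keeps code point order, so both sorts agree.
assert by_code_point == by_utf8_bytes
print(by_code_point)
```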

Answered By: Roman Susi

Sorting should behave as you expect if you pass locale.strcoll as the cmp argument to list.sort() and sorted():

import locale
locale.setlocale(locale.LC_ALL, "C")
yourList.sort(cmp=locale.strcoll)

But in Python 3 (from this answer):

import locale
from functools import cmp_to_key
locale.setlocale(locale.LC_ALL, "C")
yourList.sort(key=cmp_to_key(locale.strcoll))
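
A small check of the Python 3 variant, assuming plain ASCII input (the sample list is invented): under the "C" locale, sorting with locale.strcoll should give the same order as a bare sorted() call.

```python
import locale
from functools import cmp_to_key

# Hypothetical sample data, ASCII only.
names = ["delta", "Alpha", "charlie", "Bravo", "alpha", "_tmp", "42"]

locale.setlocale(locale.LC_ALL, "C")

# Locale-aware comparison, with the locale pinned to "C".
with_strcoll = sorted(names, key=cmp_to_key(locale.strcoll))

# Default comparison, which never consults the locale.
plain = sorted(names)

# For ASCII input in the C locale the two orders should match.
assert with_strcoll == plain
print(plain)
```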

Answered By: Frédéric Hamidi

I have been using International Components for Unicode (ICU), along with the PyICU bindings, to sort things with sorted() in my own locale (Catalan in my case). For example, to order a list of user profiles by their name property:

import PyICU

collator = PyICU.Collator.createInstance(PyICU.Locale('ca_ES.UTF-8'))
# cmp= exists only in Python 2; sorted() dropped it in Python 3
sorted(user_profiles, key=lambda x: x.name, cmp=collator.compare)

Answered By: nabucosound

Yes, to answer your specific question:

Will I always get the same results as Linux sort with LC_ALL=C?

Yes! Python defaults to the C locale, so you can expect the same behavior as Linux sort with LC_ALL=C.

You can be more explicit about this behavior by setting it yourself and sorting with strxfrm:

import locale

locale.setlocale(locale.LC_ALL, 'C')  # same as you do in Linux
locale.setlocale(locale.LC_COLLATE, 'C')  # specific to sorting

mylist.sort(key=locale.strxfrm)
# To incorporate locale sorting with other uses of key=,
# wrap locale.strxfrm() around whatever else you're doing:
mylist.sort(key=lambda i: locale.strxfrm( mysortfunc(i) ))
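
If the goal is to reproduce the shell script's output byte for byte, one option is to skip the locale machinery and sort raw bytes, which is effectively what sort does under LC_ALL=C. The following is a minimal sketch under that assumption; c_locale_sort is a hypothetical helper name, and it ignores sort options such as -k, -n, or -u.

```python
def c_locale_sort(data: bytes) -> bytes:
    """Sort newline-separated lines by raw byte value."""
    lines = data.split(b"\n")
    # A trailing newline leaves one empty element after the split; drop it
    # so it is not sorted as an extra empty line.
    if lines and lines[-1] == b"":
        lines.pop()
    # bytes objects compare byte by byte, and a shorter prefix sorts first.
    return b"".join(line + b"\n" for line in sorted(lines))


# Hypothetical sample input, encoded as UTF-8.
sample = "zebra\nApple\néclair\napple\n_x\n".encode("utf-8")
print(c_locale_sort(sample).decode("utf-8"), end="")
```

Working on bytes avoids any decoding step, so input that is not valid UTF-8 is still ordered by its byte values.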

Documentation

From https://docs.python.org/3/library/locale.html:

Initially, when a program is started, the locale is the C locale, no matter what the user’s preferred locale is. … The program must explicitly say that it wants the user’s preferred locale settings for other categories by calling setlocale(LC_ALL, '').

According to POSIX, a program which has not called setlocale(LC_ALL, '') runs using the portable 'C' locale. Calling setlocale(LC_ALL, '') lets it use the default locale as defined by the LANG variable. Since we do not want to interfere with the current locale setting we thus emulate the behavior in the way described above.

Example

# What are the settings when Python first starts?
>>> import locale
>>> locale.setlocale(locale.LC_ALL, None)  # If locale is omitted or None, the current setting for category is returned.
'LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C'
                                           # ^^^^^^^^^^^^
>>> locale.getlocale(locale.LC_COLLATE)  # The 'C' setting is equivalent to:
(None, None)

# Set LC_COLLATE & use strcoll/strxfrm to sort according to user's locale
# (like linux sort(1) does by default):
>>> locale.setlocale(locale.LC_COLLATE, '')  # An empty string specifies the user’s default settings.
'en_US.UTF-8'
>>> locale.setlocale(locale.LC_ALL, None)
'LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C'
                                           # ^^^^^^^^^^^^^^^^^^^^^^
>>> mylist.sort(key=locale.strxfrm)
>>> mylist.sort(key=lambda i: locale.strxfrm( mysortfunc(i) ))

# Set LC_ALL (everything) to user's locale (common practice):
>>> locale.setlocale(locale.LC_ALL, '')
'en_US.UTF-8'
>>> locale.setlocale(locale.LC_ALL, None)
'en_US.UTF-8'
>>> locale.getlocale(locale.LC_COLLATE)
('en_US', 'UTF-8')

# Use portable/C locale, including byte-order sorting:
>>> locale.setlocale(locale.LC_ALL, 'C')
'C'
>>> locale.setlocale(locale.LC_ALL, None)
'C'
# The LC_ALL setting overrode our previous LC_COLLATE setting:
>>> locale.setlocale(locale.LC_COLLATE, None)
'C'
>>> locale.getlocale(locale.LC_COLLATE)
(None, None)
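
Because setlocale() changes process-wide state, a script that needs C collation for only one sort may want to restore the previous setting afterwards. One way to sketch that is a small context manager; collate_as is a hypothetical name, not part of the locale module, and the sample list is invented.

```python
import locale
from contextlib import contextmanager


@contextmanager
def collate_as(name):
    """Temporarily switch LC_COLLATE, then restore the earlier value."""
    # Calling setlocale() without a locale argument only queries the setting.
    saved = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


# Hypothetical sample data, ASCII only.
items = ["pear", "Fig", "apple", "Banana", "_misc"]

before = locale.setlocale(locale.LC_COLLATE)
with collate_as("C"):
    ordered = sorted(items, key=locale.strxfrm)

# The earlier LC_COLLATE value is back in place after the block.
assert locale.setlocale(locale.LC_COLLATE) == before
print(ordered)
```

The locale documentation notes that setlocale() is not thread-safe on most systems, so this pattern fits single-threaded scripts best.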

Many thanks to Frédéric Hamidi’s answer, which sent me in the right direction to understand this.

Answered By: Jacktose