icu: Sort strings based on 2 different locales

Question:

As you probably know, alphabetical order in some (maybe most) languages differs from the order of their characters in Unicode. That’s why we may want to use icu.Collator to sort, like this Python example:

from icu import Collator, Locale
collator = Collator.createInstance(Locale("fa_IR.UTF-8"))
mylist.sort(key=collator.getSortKey)

This works perfectly for Persian strings. But it also sorts all Persian strings before all ASCII / English strings (which is the opposite of Unicode sort).

What if we want to sort ASCII before this given locale?

Or ideally, I want to sort by 2 or multiple locales. (For example give multiple Locale arguments to Collator.createInstance)

If we could tell collator.getSortKey to return empty bytes for other locales, then I could create a tuple of 2 collator.getSortKey() results, for example:

from icu import Collator, Locale

collator1 = Collator.createInstance(Locale("en_US.UTF-8"))
collator2 = Collator.createInstance(Locale("fa_IR.UTF-8"))

def sortKey(s):
    return collator1.getSortKey(s), collator2.getSortKey(s)

mylist.sort(key=sortKey)

But it looks like getSortKey always returns non-empty bytes.

Asked By: saeedgnu

||

Answers:

For ASCII-before-locale sorting, you can just check whether the string is ASCII:

def sortKey(s):
    """ASCII strings first"""
    return (not s.isascii()), collator.getSortKey(s)
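
To show how the tuple key groups strings, here is a minimal sketch that runs without PyICU. The fallback_key function is a hypothetical stand-in for collator.getSortKey and only compares by code point, so the order inside each group is not locale-aware:

```python
# Hypothetical stand-in for collator.getSortKey: plain code point order,
# used only so this sketch runs without PyICU installed.
def fallback_key(s):
    return s


def ascii_first_key(s):
    # False sorts before True, so pure-ASCII strings form the first group.
    # str.isascii() requires Python 3.7 or later.
    return (not s.isascii(), fallback_key(s))


words = ["سلام", "banana", "آب", "apple"]
result = sorted(words, key=ascii_first_key)
print(result)
```

With a real collator, replace fallback_key with collator.getSortKey; the grouping by the first tuple element stays the same.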

For multiple languages, the language of a string is ambiguous for icu, e.g. is the string "Dobrý večer" Czech or Slovak? Also, many languages have plenty of ASCII-only words.


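For Python < 3.7, where str.isascii() is not available, use:

def is_not_ascii(s):
    # ASCII covers code points 0 to 127, so anything above 127 is non-ASCII
    return any(ord(c) > 127 for c in s)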
Answered By: Yevhen Kuzmovych

It’s not possible to tell collator.getSortKey() to return empty bytes for other locales, but you can control the sorting behavior using a function that returns a tuple of the desired sort keys in the desired order.

def sort_key(s):
    key1, key2 = collator1.getSortKey(s), collator2.getSortKey(s)
    # ASCII strings lead with collator1's key, all others with collator2's key
    return (key1, key2) if s.isascii() else (key2, key1)

mylist.sort(key=sort_key)
Answered By: Mikkel

Sorry for the vague question and thanks for the answers.

Here is the solution I have chosen:

enSortKey = Collator.createInstance(Locale("en_US.UTF-8")).getSortKey
faSortKey = Collator.createInstance(Locale("fa_IR.UTF-8")).getSortKey


def sortKey(pair: "Tuple[List[str], str]"):
    head = pair[0][0].strip()

    ws = getWritingSystemFromText(head, True)
    if ws and ws.name == "Arabic":
        return 1, faSortKey(head)

    return 0, enSortKey(head.lower().lstrip("'-"))

The function getWritingSystemFromText detects the name of the script or writing system (Latin, Arabic, Cyrillic, CJK, etc.). I had already implemented this, but didn’t think to use it for sorting.

I believe this would be the most flexible and standard approach.
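
For readers without that helper, here is a rough sketch of the same idea. The detect_script function is a hypothetical, much simpler stand-in for getWritingSystemFromText: it looks only at the Unicode name of the first letter. Plain string comparison stands in for enSortKey and faSortKey so the sketch runs without PyICU:

```python
import unicodedata


def detect_script(text):
    """Return "Arabic", "Latin", or None, judged by the first letter found.

    Hypothetical stand-in for getWritingSystemFromText; matching on Unicode
    character names is a rough heuristic, not a full script detector.
    """
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            return "Arabic"
        if name.startswith("LATIN"):
            return "Latin"
        return None
    return None


def grouped_key(head):
    head = head.strip()
    if detect_script(head) == "Arabic":
        # Group 1: Arabic-script entries (would use faSortKey with PyICU)
        return 1, head
    # Group 0: everything else (would use enSortKey with PyICU)
    return 0, head.lower().lstrip("'-")


heads = ["کتاب", "'Apple", "-zebra", "آب"]
ordered = sorted(heads, key=grouped_key)
print(ordered)
```

The first tuple element decides the group, so each collator only ever compares strings of its own script.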

Answered By: saeedgnu

A bit late to answer the question, but here it is for future reference.

ICU collation uses the CLDR Collation Algorithm, which is a tailoring of the Unicode Collation Algorithm. The default collation is referred to as the root collation. Don’t think of each locale as having its own complete set of collation rules; think of a locale as specifying only the differences between the rules it needs and the root collation. CLDR takes a minimalist approach: a locale includes only the minimal set of differences from the root collation.

English uses the root collation with no tailorings. Persian, on the other hand, has a few rules that override certain aspects of the root collation.

As the question indicates, the Persian collation rules order Arabic characters before Latin characters. In the collation rule set for Persian there is a rule [reorder Arab]. This rule is what you need to override.

There are a few ways to do this:

  1. Use icu.RuleBasedCollator with a custom set of rules for Persian.
  2. Create a standard Persian collation, retrieve the rules, strip out the reorder directive and then use the modified rules with icu.RuleBasedCollator.
  3. Create a collator instance using a BCP-47 language tag instead of a Locale identifier.
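
The second option could be sketched roughly as follows. This is an untested outline, not a verified recipe: it assumes that the collator returned by Collator.createInstance exposes getRules() and that the returned rule string contains the reorder directive literally. The sample string at the bottom is made up for illustration and is not the actual CLDR rule set for Persian:

```python
import re

# Matches a bracketed reorder directive such as "[reorder Arab]".
_REORDER = re.compile(r"\[\s*reorder\b[^\]]*\]")


def strip_reorder(rules):
    """Return the rule string with any [reorder ...] directive removed."""
    return _REORDER.sub("", rules)


def persian_collator_without_reorder():
    # Imported lazily so strip_reorder stays usable without PyICU.
    from icu import Collator, Locale, RuleBasedCollator

    base = Collator.createInstance(Locale("fa"))
    # Assumption: getRules() returns the tailoring rules as a string.
    return RuleBasedCollator(strip_reorder(base.getRules()))


# Made-up rule fragment, used only to show the string handling.
sample = "[reorder Arab]&a<b"
print(strip_reorder(sample))
```

Only strip_reorder is exercised here; the collator construction depends on PyICU being installed and on the assumptions stated above.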

There are other approaches as well, but the third is the simplest:

loc = Locale.forLanguageTag("fa-u-kr-latn-arab")
collator = Collator.createInstance(loc)
sorted(mylist, key=collator.getSortKey)

This will reorder the Persian collation rules, placing Latin script before Arabic script, then everything else afterwards.
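
The tag itself follows a simple pattern: the language, then the -u-kr- extension key, then the script codes in the desired order. A small sketch of building such a tag (reorder_tag is a hypothetical helper, not part of PyICU):

```python
def reorder_tag(language, scripts):
    """Build a BCP-47 tag with a "kr" (collation reorder) extension.

    Hypothetical helper: scripts are listed in the order they should sort.
    """
    if not scripts:
        raise ValueError("at least one script code is required")
    codes = "-".join(code.lower() for code in scripts)
    return f"{language}-u-kr-{codes}"


tag = reorder_tag("fa", ["Latn", "Arab"])
print(tag)
```

The resulting string can then be passed to Locale.forLanguageTag as in the example above.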

Answered By: Andj