How do I compare a Unicode string that has different bytes, but the same value?

Question:

I’m comparing Unicode strings between JSON objects.

They have the same value:

a = '人口じんこうに膾炙かいしゃする'
b = '人口じんこうに膾炙かいしゃする'

But they have different Unicode representations:

String a : u'\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\u7099\u304b\u3044\u3057\u3083\u3059\u308b'
String b : u'\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\uf9fb\u304b\u3044\u3057\u3083\u3059\u308b'

How can I compare two Unicode strings by their value?

Asked By: Seunghoon Baek

Answers:

Unicode normalization will get you there for this one:

>>> import unicodedata
>>> unicodedata.normalize("NFC", "\uf9fb") == "\u7099"
True

Use unicodedata.normalize on both of your strings before comparing them with == to check for canonical Unicode equivalence.

Character U+F9FB is a “CJK Compatibility” character. These characters decompose into one or more regular CJK characters when normalized.
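
As a minimal sketch of the advice above, the comparison can be wrapped in a small helper. The name canonically_equal is made up for this illustration and is not part of unicodedata. The two strings from the question are written with explicit escapes so that the differing code point (U+7099 vs. U+F9FB) survives copy and paste.

```python
import unicodedata


def canonically_equal(left, right, form="NFC"):
    """Return True if both strings are equal after normalizing to `form`.

    Hypothetical helper for illustration; only unicodedata.normalize
    comes from the standard library.
    """
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)


# String a from the question: contains the regular ideograph U+7099.
a = ("\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\u7099"
     "\u304b\u3044\u3057\u3083\u3059\u308b")
# String b from the question: contains the compatibility ideograph U+F9FB.
b = ("\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\uf9fb"
     "\u304b\u3044\u3057\u3083\u3059\u308b")

print(a == b)                   # False: the code points differ
print(canonically_equal(a, b))  # True: equal after NFC normalization
```

If the strings are compared many times, it is probably cheaper to normalize each one once when it is loaded from JSON and then compare the stored results with ==.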

Answered By: Ry-

Character U+F9FB (炙) is a CJK Compatibility Ideograph. These characters are distinct code points from the regular CJK characters, but they decompose into one or more regular CJK characters when normalized.

Unicode has an official string collation algorithm called UCA designed for exactly this purpose. Python does not come with UCA support as of 3.7,* but there are third-party libraries like pyuca:

>>> from pyuca import Collator
>>> ck = Collator().sort_key
>>> ck(a) == ck(b)
True

For this case—and many others, but definitely not all—picking the appropriate normalization to apply to both strings before comparing will work, and it has the advantage of support built into the stdlib.
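
To make "picking the appropriate normalization" a little more concrete, here is a rough sketch that contrasts a canonical form (NFC) with a compatibility form (NFKC). The helper name equal_under and the ligature example U+FB01 are assumptions added for illustration; they do not come from the question.

```python
import unicodedata


def equal_under(form, left, right):
    """Compare two strings after normalizing both to the given form.

    Hypothetical helper for illustration only.
    """
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)


# U+F9FB has a canonical decomposition (no <tag> prefix) to U+7099,
# so the canonical forms NFC/NFD already treat the two as equal.
print(unicodedata.decomposition("\uf9fb"))      # 7099
print(equal_under("NFC", "\uf9fb", "\u7099"))   # True

# U+FB01 (the "fi" ligature) only has a compatibility decomposition,
# so it needs a compatibility form such as NFKC or NFKD.
print(unicodedata.decomposition("\ufb01"))      # <compat> 0066 0069
print(equal_under("NFC", "\ufb01", "fi"))       # False
print(equal_under("NFKC", "\ufb01", "fi"))      # True
```

The trade-off is that the compatibility forms erase distinctions that may matter in some data (ligatures, width variants, superscripts), so the choice depends on what "same value" should mean for the strings being compared.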

* The idea has been accepted in principle since 3.4, but nobody has written an implementation—in part because most of the core devs who care are using pyuca or one of the two ICU bindings, which have the advantage of working in current and older versions of Python.

Answered By: abarnert

I would use PyICU and its Collator class. But first, decide at which strength level of the Unicode Collation Algorithm you want equality to hold.

#!/usr/bin/python
# -*- coding: utf-8 -*-

from icu import Collator

coll = Collator.createInstance()
coll.setStrength(Collator.IDENTICAL)

a = u'人口じんこうに膾炙かいしゃする'
b = u'人口じんこうに膾炙かいしゃする'
print repr(a)
print repr(b)
print ('%s == %s : %s' % (a, b, coll.equals(a,b)))

a = u'エレベーター'
b = u'エレベーター'
print ('%s == %s : %s' % (a, b, coll.equals(a,b)))

coll.setStrength(Collator.PRIMARY)
print ('%s == %s : %s' % (a, b, coll.equals(a,b)))

a = u'hello'
b = u'HELLO'
coll.setStrength(Collator.PRIMARY)
print ('%s == %s : %s' % (a, b, coll.equals(a,b)))

coll.setStrength(Collator.TERTIARY)
print ('%s == %s : %s' % (a, b, coll.equals(a,b)))

This outputs:

u'\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\u7099\u304b\u3044\u3057\u3083\u3059\u308b'
u'\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\uf9fb\u304b\u3044\u3057\u3083\u3059\u308b'
人口じんこうに膾炙かいしゃする == 人口じんこうに膾炙かいしゃする : True
エレベーター == エレベーター : False
エレベーター == エレベーター : True
hello == HELLO : True
hello == HELLO : False
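
If PyICU is not available, the idea of a stricter and a looser comparison level can be roughly imitated with the standard library. This is only an approximation under stated assumptions: strict_key and loose_key are hypothetical names, and NFKC plus casefold is not equivalent to ICU's PRIMARY strength (it does not ignore accents, for example). The width example uses a single full-width/half-width katakana pair chosen for this sketch.

```python
import unicodedata


def strict_key(text):
    """Key that only merges canonically equivalent strings (NFC)."""
    return unicodedata.normalize("NFC", text)


def loose_key(text):
    """Key that also folds compatibility variants and letter case.

    A rough stdlib stand-in for a weaker comparison level, not ICU.
    """
    return unicodedata.normalize("NFKC", text).casefold()


pairs = [
    ("hello", "HELLO"),    # differ only in case
    ("\u30a8", "\uff74"),  # full-width vs. half-width katakana E
]
for left, right in pairs:
    print(strict_key(left) == strict_key(right),  # False for both pairs
          loose_key(left) == loose_key(right))    # True for both pairs
```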

Answered By: wilx