Same word but different unicode characters

Question:

I built a question-answering application about restaurants in Vietnam using Python on Windows. To handle Vietnamese characters I need to use Unicode.
First, I copied data from the TripAdvisor website, which declares charset=utf-8 in its HTML, and built my MongoDB database from it. In the TripAdvisor data, the city name “đà nẵng” has this byte sequence:

>>> print repr("đà nẵng")     # from tripadvisor website
'\xc4\x91a\xcc\x80 n\xc4\x83\xcc\x83ng'

However, when I query from Firefox’s address bar, the city “đà nẵng” has a different byte sequence:

>>> print repr("đà nẵng")   # Firefox's address bar
'\xc4\x91\xc3\xa0 n\xe1\xba\xb5ng'

That is why I cannot find the city in my database. I also typed the city name in Notepad++ and got the same result as from Firefox’s address bar:

>>> print repr("đà nẵng")   # notepad++ using 'Encoding UTF-8'
'\xc4\x91\xc3\xa0 n\xe1\xba\xb5ng'

Is there any way to convert between the two representations?
Or is there any way to match the city name “đà nẵng” across these different byte sequences?

Asked By: Phong Khac Do

Answers:

The problem you have run into is that Unicode allows multiple ways to compose the same character. The Python module unicodedata provides a function, normalize, that converts a Unicode string to a fixed normalization form (e.g. NFC):

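from unicodedata import normalize

# ă (U+0103) followed by a combining tilde (U+0303)
S1 = b'\xc4\x83\xcc\x83'.decode('UTF-8')
# precomposed ẵ (U+1EB5)
S2 = b'\xe1\xba\xb5'.decode('UTF-8')

# After normalization to NFC, both lines print b'\xe1\xba\xb5'
print(normalize('NFC', S1).encode('UTF-8'))
print(normalize('NFC', S2).encode('UTF-8'))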

In your example, TripAdvisor served the name in a decomposed form that uses combining tone marks (not strictly NFD, since ă is still precomposed), while Notepad++ used NFC.
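
To match the city name regardless of how it was typed, one option is to normalize both the stored value and the query to the same form before comparing. The following is a minimal sketch, assuming Python 3; the helper name same_text is hypothetical and not part of any library, and the two byte strings are the ones quoted in the question.

```python
from unicodedata import normalize

def same_text(a, b, form="NFC"):
    """Return True if a and b are equal after normalization to `form`."""
    return normalize(form, a) == normalize(form, b)

# The two spellings from the question, decoded from their UTF-8 bytes.
from_site = b'\xc4\x91a\xcc\x80 n\xc4\x83\xcc\x83ng'.decode('UTF-8')
from_browser = b'\xc4\x91\xc3\xa0 n\xe1\xba\xb5ng'.decode('UTF-8')

print(from_site == from_browser)           # raw comparison: False
print(same_text(from_site, from_browser))  # normalized comparison: True
```

Applying the same normalization once when records are inserted into the database, and again to each incoming query, should keep lookups consistent.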

Answered By: Jul3k

Although this is an old question, I will add an answer for anyone who stumbles across this question.

The difference between the two byte sequences is a case of canonical equivalence. Some characters can be represented by more than one sequence of code points, and therefore by more than one byte sequence. For the letter ẵ, there are five canonically equivalent representations possible. In Python, it is possible to use PyICU to get a list of all the canonical equivalences of a particular string.

Two of the five equate to the normalised forms NFC and NFD. In the example in this question, Firefox uses NFC for nẵng; the byte sequence is b'n\xe1\xba\xb5ng'.

But the byte sequence given by TripAdvisor is b'n\xc4\x83\xcc\x83ng'. This is not NFC, nor is it NFD. The NFD equivalent would be b'na\xcc\x86\xcc\x83ng'.
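
One way to check this claim is with unicodedata.is_normalized, which is available from Python 3.8 onward. This is only a sketch; the variable names site and browser are made up for illustration, and the byte strings are the ones quoted above.

```python
from unicodedata import is_normalized

site = b'n\xc4\x83\xcc\x83ng'.decode('UTF-8')   # sequence from TripAdvisor
browser = b'n\xe1\xba\xb5ng'.decode('UTF-8')    # sequence from Firefox

# For each string, report whether it is already in NFC and in NFD.
for label, text in (("site", site), ("browser", browser)):
    print(label, is_normalized("NFC", text), is_normalized("NFD", text))
# site False False
# browser True False
```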

The string in TripAdvisor was typed using the Windows Vietnamese keyboard. When Microsoft first implemented a Vietnamese Unicode keyboard, they followed the character model of Windows-1258: vowels with a circumflex, breve or horn were single precomposed characters, and all tone markers were represented by combining diacritics. The resulting text at the time was therefore a mix of precomposed and decomposed sequences, neither NFC nor NFD.

So in this instance, ‘nẵng’ was:

n 006E LATIN SMALL LETTER N
ă 0103 LATIN SMALL LETTER A WITH BREVE
◌̃ 0303 COMBINING TILDE
n 006E LATIN SMALL LETTER N
g 0067 LATIN SMALL LETTER G

>>> s = 'n\u0103\u0303ng'   # ă (U+0103) + combining tilde (U+0303)
>>> s.encode('UTF-8')
b'n\xc4\x83\xcc\x83ng'
>>> from unicodedata import normalize
>>> normalize("NFC", s).encode('UTF-8')
b'n\xe1\xba\xb5ng'
>>> normalize("NFD", s).encode('UTF-8')
b'na\xcc\x86\xcc\x83ng'

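
A code-point listing like the one above can be produced with unicodedata.name. This is a small sketch, assuming Python 3; the function name describe is hypothetical, and the string is written with escapes so that the mixed sequence survives copy and paste.

```python
import unicodedata

def describe(text):
    """Return one 'HEX NAME' line per code point in text."""
    return [f"{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

s = 'n\u0103\u0303ng'  # n, ă, combining tilde, n, g
for line in describe(s):
    print(line)
```

Running describe on the NFC or NFD form of the same string shows how many code points each form uses, which makes it easier to spot mixed sequences in scraped data.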
Answered By: Andj