How to combine two code points to get one?

Question:

I know that the Unicode code point for Á is U+00C1. I have read in many forums and articles that I can also make an Á by combining the characters ´ (Unicode: U+00B4) and A (Unicode: U+0041).

My question is simple: how do I do it? I tried it in Go, but an answer in Python (or any other programming language) is perfectly fine.

Here is what I tried.

A in binary is: 01000001

´ in binary is: 10110100

Together they take 15 bits, so I need the 3-byte UTF-8 format (1110xxxx 10xxxxxx 10xxxxxx).

Filling the bits of A and ´ (A first) into the x positions gives: 11100100 10000110 10110100.

Then I converted the resulting three bytes back into hexadecimal values: E4 86 B4.

However, when I wrote it in code, I got a completely different character, so my approach does not work as I expected.

package main

import (
    "fmt"
)

func main() {
    r := "\xE4\x86\xB4"

    fmt.Println(r) // It wrote 䆴 instead of Á
}

Asked By: Bruzzi El Muerte


Answers:

It looks like the ´ (U+00B4) character you provided is not actually a combining character as Unicode defines it.

>>> "A\u00b4"
'A´'

If we use ◌́ (U+0301) instead, then we can just place it in sequence with a character like A and get the expected output:

>>> "A\u0301"
'Á'

Unless I’m misunderstanding what you mean, it doesn’t look like any binary manipulation or trickery is necessary here.

Answered By: StardustGogeta

As StardustGogeta explains in their answer, the correct combining Unicode character for an "acute" accent is U+0301 (Combining Acute Accent).

But in Go, a string consisting of the single character U+00C1 (Latin Capital Letter A with Acute) is not equal to a string consisting of U+0041 (Latin Capital Letter A) followed by U+0301 (Combining Acute Accent).

If you want to compare such strings, you need to normalise both to the same normalisation form. See the blog post "Text normalization in Go" for more details.

The following code snippet shows how to do that:

package main

import (
    "fmt"

    "golang.org/x/text/unicode/norm"
)

func main() {
    combined := "\u00c1"
    combining := "A\u0301"
    fmt.Printf("combined = %s, combining = %s\n", combined, combining)
    fmt.Printf("combined == combining: %t\n", combined == combining)
    combiningNormalised := string(norm.NFC.Bytes([]byte(combining)))
    fmt.Printf("combined == combiningNormalised: %t\n", combined == combiningNormalised)
}

Output:

combined = Á, combining = Á
combined == combining: false
combined == combiningNormalised: true
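To see why the first comparison is false, it may help to look at the underlying bytes. The following is a small sketch using only the standard library, so it does not normalise anything. The helper name hexBytes is made up for this illustration.

```go
package main

import "fmt"

// hexBytes is a hypothetical helper: it returns the UTF-8 bytes
// of s as space-separated uppercase hex.
func hexBytes(s string) string {
	return fmt.Sprintf("% X", s)
}

func main() {
	precomposed := "\u00c1" // single code point U+00C1
	decomposed := "A\u0301" // U+0041 followed by U+0301

	fmt.Println(hexBytes(precomposed)) // C3 81
	fmt.Println(hexBytes(decomposed))  // 41 CC 81

	// Go compares strings byte by byte, so these differ even
	// though both are rendered as the same glyph.
	fmt.Println(precomposed == decomposed) // false
}
```

The two strings have different lengths and different bytes, which is why a normalisation step is needed before comparing them.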
Answered By: Erwin Bolwidt