Can you safely read utf8 and latin1 files with a naïve try-except block?

Question:

I believe that any valid latin1 character will either be interpreted correctly by Python’s utf8 decoder or raise an error. I therefore claim that if you work only with utf8 files or latin1 files, you can safely write the following code to read them, without ending up with Mojibake:

from pathlib import Path

def read_utf8_or_latin1_text(path: Path) -> str:
    try:
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        return path.read_text(encoding="latin1")

I tested this hypothesis out on this large character data set and found that it holds up to scrutiny. Is this always the case?

Input:

import requests

insanely_many_characters = requests.get(
    "https://github.com/bits/UTF-8-Unicode-Test-Documents/raw/master/UTF-8_sequence_unseparated/utf8_sequence_0-0x10ffff_including-unassigned_including-unprintable-asis_unseparated.txt"
).text


print(
    f"n=== test {len(insanely_many_characters)} utf-8 characters for same-same misinterpretations ==="
)
for char in insanely_many_characters:
    if (x := char.encode("utf-8").decode("utf-8")) != char:
        print(char, x)


print(
    f"n=== test {len(insanely_many_characters)} latin1 characters for same-same misinterpretations ==="
)
latinable = []
nr = 0
for char in insanely_many_characters:
    try:
        if (x := char.encode("latin1").decode("latin1")) != char:
            print(char, x)
        latinable.append(char)
    except UnicodeEncodeError:
        nr += 1
if nr:
    print(f"{nr} characters not in latin1 set")

print('found the following valid latin1 characters: """\n' + "".join(latinable) + '\n"""')

print(
    f"n=== test {len(latinable)} latin1 characters for utf-8 Mojibake ==="
)
for char in latinable:
    try:
        if (x := char.encode("latin1").decode("utf-8")) != char:
            print(char, x)
    except UnicodeDecodeError:
        pass

Output:


=== test 1111998 utf-8 characters for same-same misinterpretations ===

=== test 1111998 latin1 characters for same-same misinterpretations ===
1111742 characters not in latin1 set
found the following valid latin1 characters: """
    
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
"""

=== test 256 latin1 characters for utf-8 Mojibake ===

Addendum:

I see I totally forgot to test for sequences of latin1 characters, and only tested for individual characters. By adding this test:

print(
    f"n=== test {len(latinable)} latin1 sequences wrongly interoperable by utf-8 ==="
)
for char1 in latinable:
    for char2 in latinable:
        try:
            if (x := (char1 + char2).encode("latin1").decode("utf-8")) != char1 + char2:
                print(char1 + char2, x)
        except UnicodeDecodeError:
            pass

I ended up generating plenty of utf-8 Mojibake (1920 instances in total), which is a counterexample to my hypothesis:

=== test 256 latin1 sequences wrongly interpretable by utf-8 ===
 ⋮ (33 lines pairing Â with the invisible characters U+0080 through U+00A0)
Â¡ ¡

 ⋮
Asked By: Simon Streicher


Answers:

Theoretically, there are valid Latin-1 sequences which are also valid UTF-8 sequences. In practice, these are very unlikely in any real-world textual data.

You can easily create a table of the possible combinations; here is a sample for tasting.

>>> for x in "u00a0u00a1u00a2u00ffu1234U00012345":
...   print(x.encode('utf-8').decode('latin-1'))
... 
 
¡
¢
ÿ
á´
ð

(There are some whitespace characters after ð.)
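
One way to build the full table (a brute-force sketch of my own, not part of the original answer) is to try every two-character Latin-1 string and keep the ones whose bytes also happen to be valid UTF-8:

# Enumerate every latin-1 byte pair; keep those that also decode as UTF-8
# to something different from their latin-1 reading.
table = {}
for hi in range(256):
    for lo in range(256):
        pair = bytes([hi, lo])
        try:
            as_utf8 = pair.decode("utf-8")
        except UnicodeDecodeError:
            continue  # most pairs are not valid UTF-8
        as_latin1 = pair.decode("latin-1")
        if as_utf8 != as_latin1:
            table[as_latin1] = as_utf8

print(len(table))   # 1920 colliding pairs
print(table["Ã¿"])  # 'ÿ', matching the sample above

The same idea extends to triples and quads for the 3- and 4-byte UTF-8 sequences.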

Answered By: tripleee

You are incorrect. The latin-1 encoding is a 1-1 mapping for every byte, so it is wholly possible for validly encoded latin-1 text to, by coincidence, contain byte sequences that are also valid UTF-8 and decode to different characters. The odds of such a thing occurring decrease the more non-ASCII characters are present in the data, but it can occur.

Your test isn’t finding these cases because, by definition, it takes at least two characters encoded to latin-1 to produce a valid, but mismatched, encoding of a single character in UTF-8 (non-ASCII encoded to UTF-8 is always 2-4 bytes in length, never just one byte; ASCII encodes the same in both latin-1 and UTF-8). Since you only tested encoding a single character as latin-1, it’s impossible for it to produce a legal UTF-8 representation, but many pairs (or triples, or quads) of latin-1 characters above the ASCII range will coincidentally produce legal UTF-8 bytes. They may make for total garbage, but they’ll decode validly.
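
To make that concrete (a sketch of my own, assuming only the standard UTF-8 byte layout): a two-byte UTF-8 sequence is 110xxxxx 10xxxxxx, i.e. a lead byte in 0xC2-0xDF (0xC0 and 0xC1 are rejected as overlong) followed by a continuation byte in 0x80-0xBF, which is exactly where the questioner’s 1920 pairs come from:

# Every lead/continuation combination is a two-character latin-1 string
# that decodes as a *single*, different character under UTF-8.
for lead in range(0xC2, 0xE0):      # 30 valid lead bytes
    for cont in range(0x80, 0xC0):  # 64 valid continuation bytes
        pair = bytes([lead, cont])
        assert len(pair.decode("utf-8")) == 1  # valid UTF-8, one character
        assert pair.decode("latin-1") != pair.decode("utf-8")
print(30 * 64)  # 1920, the count from the addendum

# A concrete instance: latin-1 "Ã©" has the bytes 0xC3 0xA9, which is the
# valid UTF-8 encoding of "é".
assert "Ã©".encode("latin-1").decode("utf-8") == "é"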

Answered By: ShadowRanger

Your initial assumption is right: if a file that can only be utf-8 or latin1 encoded cannot be read as utf-8, then it is latin1 encoded. The fact is that any sequence of bytes can be decoded as latin1, because the latin1 encoding is a bijection between the 256 possible byte values and the unicode characters with code points in the range [0, 256).

But your test is quite different: you load a valid utf-8 encoded file into unicode characters and then check which of those characters exist in latin1, finding that only the first 256 do.

Said differently, the question and the code are only loosely related…
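
A quick way to see the bijection in action (my illustration, not from the answer): latin1 maps every byte value straight to the code point with the same number, so any byte string whatsoever decodes without error and round-trips exactly.

import os

blob = os.urandom(4096)                   # arbitrary, even non-textual, bytes
text = blob.decode("latin-1")             # never raises
assert text.encode("latin-1") == blob     # exact round-trip
assert all(ord(ch) < 256 for ch in text)  # code points stay inside range(256)

This is why the except branch in the question’s reader can itself never fail; whether the resulting text is meaningful is the separate problem the other answers address.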

Answered By: Serge Ballesta