UTF-8 to ISO-8859-1 encoding: replace special characters with closest equivalent

Question:

Does anyone know of a Python library that can convert a UTF-8 string to ISO-8859-1 encoding in a smart way?

By smart, I mean replacing characters like “–” with “-”, and so on. Characters for which no reasonable equivalent exists should be replaced with “?” (as encode('iso-8859-1', errors='replace') does).

Asked By: mimo


Answers:

I am not aware of any existing library, but Unidecode has a GPL 2 license, meaning that it can be used as a base for another program. Its main function has special-case processing for all ASCII code points (below 128) that leaves them untouched. If you extend that processing to all Latin-1 characters (code points below 256), you get a variant that keeps Latin-1 characters and uses Unidecode for everything else.

As far as I know, no character beyond code point 255 should be mapped to a non-ASCII Latin-1 character, so that should do the trick.
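
The idea above can be sketched without touching Unidecode's source. This is a minimal sketch, not Unidecode's actual code: the names `to_latin1_text` and `ascii_fallback` are hypothetical, and `ascii_fallback` is only a standard-library stand-in (accent stripping via NFKD) so that the example runs without extra packages. It does not know about dashes or quotes; passing `unidecode.unidecode` as `transliterate` should give the behaviour described in this answer.

```python
import unicodedata

def ascii_fallback(ch):
    """Stdlib stand-in for a transliterator: strip accents via NFKD."""
    decomposed = unicodedata.normalize('NFKD', ch)
    # Keep only the ASCII part of the decomposition (drops combining marks).
    return ''.join(c for c in decomposed if ord(c) < 128)

def to_latin1_text(text, transliterate=ascii_fallback):
    """Keep code points below 256; transliterate everything else."""
    out = []
    for ch in text:
        if ord(ch) < 256:
            out.append(ch)  # already representable in ISO-8859-1
        else:
            # '?' when the transliterator returns an empty string
            out.append(transliterate(ch) or '?')
    return ''.join(out)

sample = u'caf\u00e9 \u0161\u00e1l \u4e2d'  # "café šál 中"
print(to_latin1_text(sample))
```

The returned string contains only code points below 256, so a plain .encode('iso-8859-1') on it will not raise.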

Answered By: Serge Ballesta

libiconv has a “TRANSLIT” feature which does what you want (append //TRANSLIT to the target encoding name).

Answered By: JoelFan

Since the first 256 code points of Unicode match ISO-8859-1, it is possible to attempt encoding to ISO-8859-1, which will take care of all characters 0 to 255 without errors. For the characters leading to encoding errors, unidecode can be used.

The following works on Python 2 and 3:

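from builtins import str  # on Python 2 this needs the 'future' package
import codecs
import unidecode

def unidecode_fallback(e):
    # Transliterate only the slice that ISO-8859-1 could not encode.
    part = e.object[e.start:e.end]
    # Fall back to '?' when unidecode returns an empty string.
    replacement = str(unidecode.unidecode(part) or '?')
    # Resume encoding right after the offending slice.
    return (replacement, e.start + len(part))

codecs.register_error('unidecode_fallback', unidecode_fallback)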

s = u'abcdé–fgh ijkl'.encode('iso-8859-1', errors='unidecode_fallback')
print(s.decode('iso-8859-1'))

Result:

abcdé-fgh?ijkl

However, this converts non-ISO-8859-1 characters into an ASCII equivalent, whereas sometimes a non-ASCII ISO-8859-1 equivalent would be better.
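
One way to get ISO-8859-1 targets rather than ASCII ones is to consult a hand-written table first. The following is only a sketch under assumptions: the table `LATIN1_PREFERRED` and the handler name `latin1_preferred` are hypothetical, the table entries are a matter of taste, and the accent-stripping fallback uses the standard library's unicodedata rather than unidecode so that the example is self-contained.

```python
import codecs
import unicodedata

# Hypothetical hand-picked table: targets are ISO-8859-1 characters,
# not necessarily ASCII ones.
LATIN1_PREFERRED = {
    u'\u0151': u'\u00f6',  # o with double acute -> o with diaeresis
    u'\u0171': u'\u00fc',  # u with double acute -> u with diaeresis
    u'\u2022': u'\u00b7',  # bullet -> middle dot
    u'\u2013': u'-',       # en dash -> hyphen-minus
}

def latin1_preferred(e):
    if not isinstance(e, UnicodeEncodeError):
        raise e
    out = []
    for ch in e.object[e.start:e.end]:
        if ch in LATIN1_PREFERRED:
            out.append(LATIN1_PREFERRED[ch])
        else:
            # Strip combining marks, keep whatever Latin-1 can represent.
            decomposed = unicodedata.normalize('NFKD', ch)
            kept = u''.join(c for c in decomposed if ord(c) < 256)
            out.append(kept or u'?')
    # Resume encoding right after the unencodable slice.
    return (u''.join(out), e.end)

codecs.register_error('latin1_preferred', latin1_preferred)

sample = u'f\u0151 \u2022 \u0161 \u2013 \u4e2d'
encoded = sample.encode('iso-8859-1', errors='latin1_preferred')
print(encoded.decode('iso-8859-1'))
```

The replacement string may contain any character below code point 256 because it is handed back to the ISO-8859-1 encoder; a replacement outside that range would raise the encoding error again.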

Answered By: mimo

Works with Tkinter and export to Excel:

def utf_2_iso(text: str) -> str:
    try:
        result: str = text.encode('ISO-8859-1', errors='ignore').decode('utf-8', errors='ignore')
    except UnicodeError:
        result = text
    return result

Answered By: Joe