UTF-8 to ISO-8859-1 encoding: replace special characters with closest equivalent
Question:
Does anyone know of a Python library that allows you to convert a UTF-8 string to ISO-8859-1 encoding in a smart way?
By smart, I mean replacing characters like “–” with “-”, and, for the many characters for which no equivalent can reasonably be found, replacing them with “?” (as encode('iso-8859-1', errors='replace') does).
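For reference, here is what the stock 'replace' handler does: every unencodable character becomes b'?', even when a close equivalent exists.

```python
# The en dash U+2013 is not in ISO-8859-1, so 'replace' turns it
# into b'?'; the é (U+00E9) is in ISO-8859-1 and survives as 0xE9.
s = 'abcdé–fgh'
print(s.encode('iso-8859-1', errors='replace'))  # b'abcd\xe9?fgh'
```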
Answers:
Well, I am not aware of an existing library, but Unidecode has a GPL 2 license, meaning it can be used as a base for another program. Its main function special-cases all ASCII code points (below 128), keeping them untouched. If you extend that special case to Latin-1 code points (below 256), you get a version that keeps Latin-1 characters and applies unidecode to everything else.
Since I know of no character above code point 255 that should map to a non-ASCII Latin-1 character, that should do the trick.
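The extension described above can be sketched with the standard library alone. Here NFKD decomposition with combining marks stripped stands in for unidecode (unidecode itself handles far more scripts); this is a sketch of the idea, not a drop-in replacement.

```python
import unicodedata

def smart_latin1(text: str) -> bytes:
    # Code points below U+0100 map 1:1 to Latin-1; keep them as-is.
    # For everything else, try an ASCII approximation (NFKD
    # decomposition with combining marks removed), else fall back
    # to '?'.
    out = []
    for ch in text:
        if ord(ch) < 256:
            out.append(ch)
            continue
        approx = ''.join(c for c in unicodedata.normalize('NFKD', ch)
                         if not unicodedata.combining(c))
        out.append(approx if approx and approx.isascii() else '?')
    return ''.join(out).encode('iso-8859-1')

print(smart_latin1('Āé–'))  # b'A\xe9?': Ā -> A, é kept, en dash -> ?
```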
libiconv has a “TRANSLIT” feature (e.g. iconv -f UTF-8 -t ISO-8859-1//TRANSLIT) which does what you want.
Since the first 256 code points of Unicode match ISO-8859-1, it is possible to attempt encoding to ISO-8859-1, which will take care of all characters 0 to 255 without errors. For the characters leading to encoding errors, unidecode can be used.
The following works on Python 3, and on Python 2 with the future package (which provides builtins):

from builtins import str  # no-op on Python 3; needs `future` on Python 2

import codecs
import unidecode

def unidecode_fallback(e):
    part = e.object[e.start:e.end]
    replacement = str(unidecode.unidecode(part) or '?')
    return replacement, e.start + len(part)

codecs.register_error('unidecode_fallback', unidecode_fallback)

s = u'abcdé–fgh ijkl'.encode('iso-8859-1', errors='unidecode_fallback')
print(s.decode('iso-8859-1'))
Result:
abcdé-fgh ijkl
This, however, converts non-ISO-8859-1 characters into an ASCII equivalent, whereas it may sometimes be better to substitute a non-ASCII, ISO-8859-1 equivalent.
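One direction for that: register a second error handler that consults a hand-made table of Latin-1 (not merely ASCII) equivalents before giving up. The table entries below are illustrative choices, not a standard mapping.

```python
import codecs

# Illustrative map from non-Latin-1 characters to Latin-1 equivalents.
LATIN1_EQUIVALENTS = {
    '\u212b': '\xc5',  # ANGSTROM SIGN      -> Å (Latin-1 0xC5)
    '\u1e9e': '\xdf',  # CAPITAL SHARP S    -> ß (Latin-1 0xDF)
    '\u2212': '-',     # MINUS SIGN         -> hyphen-minus
}

def latin1_fallback(e):
    # Replace each unencodable character with its table entry,
    # or '?' when the table has no equivalent.
    part = e.object[e.start:e.end]
    replacement = ''.join(LATIN1_EQUIVALENTS.get(ch, '?') for ch in part)
    return replacement, e.end

codecs.register_error('latin1_fallback', latin1_fallback)

print('1 \u212b'.encode('iso-8859-1', errors='latin1_fallback'))  # b'1 \xc5'
```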
This works with Tkinter and for exporting to Excel:

def utf_2_iso(text: str) -> str:
    # Re-encode as ISO-8859-1, then decode as UTF-8: this repairs
    # mojibake, i.e. UTF-8 bytes that were wrongly decoded as Latin-1.
    try:
        result: str = text.encode('ISO-8859-1', errors='ignore').decode('utf-8', errors='ignore')
    except UnicodeError:
        result = text
    return result
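A quick check of what this helper does, assuming the input is a string produced by decoding UTF-8 bytes as Latin-1 (the helper is repeated so the snippet runs standalone):

```python
def utf_2_iso(text: str) -> str:
    # Undo a wrong Latin-1 decode of UTF-8 bytes (mojibake repair).
    try:
        return text.encode('ISO-8859-1', errors='ignore').decode('utf-8', errors='ignore')
    except UnicodeError:
        return text

# 'café' encoded as UTF-8 and mis-decoded as Latin-1 reads as 'cafÃ©'.
print(utf_2_iso('cafÃ©'))  # café
```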