Unicode error handling with Python 3's readlines()
Question:
I keep getting this error while reading a text file. Is it possible to handle/ignore it and proceed?
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7827: character maps to <undefined>
Answers:
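Yes, you could wrap the read in a try block and catch UnicodeDecodeError, the exception a failed decode raises:
try:
    ...  # open and read the file here
except UnicodeDecodeError:
    pass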
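As a rough sketch of that approach (the helper name read_lines_or_none is made up for illustration), the try block has to cover the read itself, since that is where the decoding happens:

```python
def read_lines_or_none(path, encoding="utf-8"):
    """Return the file's lines, or None if it cannot be decoded as `encoding`.

    Hypothetical helper for illustration; not part of the original answer.
    """
    try:
        with open(path, encoding=encoding) as f:
            return f.readlines()
    except UnicodeDecodeError:
        # Decoding failed somewhere in the file; the caller decides what to do.
        return None
```

Note that catching the error this way discards the whole read, not just the offending byte.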
In Python 3, pass an appropriate errors= value (such as errors='ignore' or errors='replace') when creating your file object (presuming it to be an instance of io.TextIOWrapper; if it isn’t, consider wrapping it in one!). Also consider passing a more likely encoding than charmap (when you aren’t sure, utf-8 is always a good place to start).
For instance:
f = open('misc-notes.txt', encoding='utf-8', errors='ignore')
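To see what the two handlers do at the file level, here is a small self-contained sketch. It writes a throwaway file containing the stray byte 0x81 (the file name and contents are invented for the demonstration) and reads it back with readlines():

```python
import os
import tempfile

# Create a throwaway file containing one byte (0x81) that is invalid UTF-8.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "wb") as f:
    f.write(b"first line\nbad \x81 byte\n")

# errors='replace' substitutes U+FFFD for each undecodable byte.
with open(path, encoding="utf-8", errors="replace") as f:
    replaced = f.readlines()

# errors='ignore' silently drops the undecodable byte.
with open(path, encoding="utf-8", errors="ignore") as f:
    ignored = f.readlines()

# replaced[1] == 'bad \ufffd byte\n'
# ignored[1]  == 'bad  byte\n'
```

With 'replace' the damage stays visible in the text; with 'ignore' it disappears without a trace, which may or may not be what you want.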
In Python 2, the read() operation simply returns bytes; the trick, then, is decoding them to get them into a string (if you do, in fact, want characters as opposed to bytes). If you don’t have a better guess for their real encoding:
your_string.decode('utf-8', 'replace')
…to replace unhandled characters, or
your_string.decode('utf-8', 'ignore')
to simply ignore them.
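The same decode() call exists on bytes objects in Python 3, so the difference between the two handlers can be checked there on a short byte string (the sample bytes are made up for illustration):

```python
raw = b"caf\x81 au lait"  # 0x81 is not valid UTF-8 on its own

replaced = raw.decode("utf-8", "replace")  # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", "ignore")    # bad byte is dropped

assert replaced == "caf\ufffd au lait"
assert ignored == "caf au lait"
```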
That said, finding and using their real encoding (rather than guessing utf-8) would be preferred.
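One hedged way to act on that advice is to try a short list of candidate encodings in order and keep the first that decodes cleanly. The helper name and the default candidate list below are assumptions chosen for illustration; a clean decode does not prove the encoding is the right one, only that no byte was rejected.

```python
def read_with_first_matching(path, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes `path`.

    Hypothetical helper; the candidate list is a guess, so adjust it to the
    encodings your files are likely to use.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate rejected some byte; try the next one
    raise ValueError("none of the candidate encodings could decode " + path)
```

Order matters: strict encodings such as utf-8 belong first, because latin-1 accepts every byte and would act as a catch-all if listed.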
You should open the file with the codecs module to make sure that it gets interpreted as UTF-8.
import codecs
fd = codecs.open(filename,'r',encoding='utf-8')
data = fd.read()