Python 3 CSV file giving UnicodeDecodeError: 'utf-8' codec can't decode byte error when I print
Question:
I have the following code in Python 3, which is meant to print out each line in a csv file.
import csv

with open('my_file.csv', 'r', newline='') as csvfile:
    lines = csv.reader(csvfile, delimiter = ',', quotechar = '|')
    for line in lines:
        print(' '.join(line))
But when I run it, it gives me this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 7386: invalid start byte
I looked through the csv file, and it turns out that if I take out a single ñ (little n with a tilde on top), every line prints out fine.
My problem is that I’ve looked through a bunch of different solutions to similar problems, but I still have no idea how to fix this, what to decode/encode, etc. Simply taking out the ñ character in the data is NOT an option.
Answers:
with open('my_file.csv', 'r', newline='', encoding='utf-8') as csvfile:
Try opening the file as shown above, with the encoding specified explicitly.
We know the file contains the byte b'\x96', since it is mentioned in the error message:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 7386: invalid start byte

Now we can write a little script to find out if there are any encodings where b'\x96' decodes to ñ:
import pkgutil
import encodings
import os

def all_encodings():
    modnames = set([modname for importer, modname, ispkg in pkgutil.walk_packages(
        path=[os.path.dirname(encodings.__file__)], prefix='')])
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)

text = b'\x96'
for enc in all_encodings():
    try:
        msg = text.decode(enc)
    except Exception:
        continue
    if msg == 'ñ':
        print('Decoding {t} with {enc} is {m}'.format(t=text, enc=enc, m=msg))
which yields:

Decoding b'\x96' with mac_roman is ñ
Decoding b'\x96' with mac_farsi is ñ
Decoding b'\x96' with mac_croatian is ñ
Decoding b'\x96' with mac_arabic is ñ
Decoding b'\x96' with mac_romanian is ñ
Decoding b'\x96' with mac_iceland is ñ
Decoding b'\x96' with mac_turkish is ñ
Therefore, try changing
with open('my_file.csv', 'r', newline='') as csvfile:
to one of those encodings, such as:
with open('my_file.csv', 'r', encoding='mac_roman', newline='') as csvfile:
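If brute-forcing candidate encodings by hand feels tedious, a third-party detector can make an educated guess instead. Here is a minimal sketch, assuming the chardet package is installed (pip install chardet) and my_file.csv is the file from the question; keep in mind the result is only a heuristic guess:

import chardet

# Read the raw bytes and let chardet guess the most likely encoding.
with open('my_file.csv', 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)
print(guess)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}

Whatever encoding it reports can then be passed to open(..., encoding=...) as above.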
For others who hit the same error shown in the title: watch out for the encoding of your CSV file. It's possible that it is not UTF-8. I noticed that LibreOffice created a UTF-16 encoded file for me today without prompting me, although I could not reproduce this.
If you try to open a UTF-16 encoded document using open(... encoding='utf-8'), you will get the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

To fix this, either specify 'utf-16' as the encoding or change the encoding of the CSV file itself.
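The 0xff in position 0 is typically the first byte of a UTF-16 byte-order mark (BOM). As a quick check, here is a minimal sketch (assuming my_file.csv is the file in question) that inspects the first two bytes:

import codecs

with open('my_file.csv', 'rb') as f:
    head = f.read(2)

# b'\xff\xfe' is the UTF-16 little-endian BOM, b'\xfe\xff' the big-endian one.
if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
    print('The file appears to be UTF-16 encoded')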
with open('my_file.csv', 'r', newline='', encoding='ISO-8859-1') as csvfile:
The byte used for the ñ character in this file is not valid UTF-8, which is why decoding fails. As a workaround, you can open the file with ISO-8859-1 encoding instead; it maps every byte value to a character, so it never raises a UnicodeDecodeError (although the decoded character may not always be the one you expect). For more details about this encoding, you may refer to the link below:
https://www.ic.unicamp.br/~stolfi/EXPORT/www/ISO-8859-1-Encoding.html
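To see why the choice of codec matters, here is a small illustration (assuming 0x96 is the offending byte from the error message): the same byte decodes to different characters under different encodings.

raw = b'\x96'
print(raw.decode('mac_roman'))          # 'ñ', matching the brute-force result above
print(repr(raw.decode('iso-8859-1')))   # '\x96', a C1 control character, not 'ñ'

So ISO-8859-1 makes the error go away, but it only recovers the intended character if the file really was written in ISO-8859-1.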
I also faced this issue with Python 3, and it was resolved by using utf-16 as the encoding:

with open('data.csv', newline='', encoding='utf-16') as csvfile:
A much simpler solution is to open the CSV file in Notepad and choose "Save As" from the "File" menu. Set "Save as type" to "All Files", select "UTF-8" in the "Encoding" dropdown list, and keep the ".csv" extension on the file name.
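If you would rather do the conversion from Python than from Notepad, here is a minimal sketch (assuming the source file really is mac_roman, as the brute-force search above suggested, and that writing a new file called my_file_utf8.csv is acceptable):

# Read with the original encoding and write back out as UTF-8 once;
# after that, the script from the question works unchanged on the new file.
with open('my_file.csv', 'r', encoding='mac_roman', newline='') as src, \
        open('my_file_utf8.csv', 'w', encoding='utf-8', newline='') as dst:
    dst.write(src.read())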
Easy: just open it in Excel or OpenOffice Calc, use "Text to Columns" with , as the delimiter, and then save the file as .csv again. It took me a day and several hours of searching on Google, but in the end I figured it out.
Try utf-16 for a file that may include non-English characters, since some tools save such files as UTF-16. Note that UTF-8 and UTF-16 both cover the full Unicode character set; they differ only in how each character is encoded as bytes. So specifying utf-16 helps when the file really was saved as UTF-16, not because UTF-8 is limited to a-zA-Z0-9 characters.
with open('my_file.csv', 'r', newline='', encoding='UTF-16') as csvfile:
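Finally, if none of the suggested encodings turns out to be right and you just need to read the rest of the data, open() also accepts an errors argument. Here is a minimal sketch; note that this silently replaces undecodable bytes with U+FFFD, so only use it when losing the odd character is acceptable:

import csv

# errors='replace' substitutes the replacement character for invalid bytes,
# so the reader never raises UnicodeDecodeError.
with open('my_file.csv', 'r', newline='', encoding='utf-8', errors='replace') as csvfile:
    for line in csv.reader(csvfile, delimiter=',', quotechar='|'):
        print(' '.join(line))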