'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte
Question:
I am trying to read and print the following file: txt.tsv (https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2017q3_notes.zip)
According to the SEC the data set is provided in a single encoding, as follows:
Tab Delimited Value (.txt): utf-8, tab-delimited, \n-terminated lines, with the first line containing the field names in lowercase.
My current code:
import csv
with open('txt.tsv') as tsvfile:
    reader = csv.DictReader(tsvfile, dialect='excel-tab')
    for row in reader:
        print(row)
All attempts ended with the following error message:
'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte
I am a bit lost. Can anyone help me?
Answers:
The file is encoded as windows-1252. Use:
open('txt.tsv', encoding='windows-1252')
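Applied to the code in the question, that could look roughly like the sketch below. The helper name `read_tsv` is made up for illustration, and windows-1252 is assumed to be the right encoding, as this answer states. Passing `newline=''` follows the csv module's recommendation for files handed to a reader.

```python
import csv


def read_tsv(path, encoding="windows-1252"):
    """Return the rows of a tab-delimited file as a list of dicts."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, encoding=encoding, newline="") as tsvfile:
        reader = csv.DictReader(tsvfile, dialect="excel-tab")
        return list(reader)


# Usage, assuming txt.tsv is in the current directory:
# for row in read_tsv("txt.tsv"):
#     print(row)
```

Note that this loads every row into memory; for a very large file, iterating over the reader inside the `with` block, as in the question, is the lighter option.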
If you are working with Turkish data, I suggest this line:
import pandas as pd
df = pd.read_csv("text.txt", encoding='windows-1254')
I had the same error message with a .csv file, and this worked for me (on Windows; as far as I know, Python only recognizes 'ANSI' as an encoding name there):
df = pd.read_csv('Text.csv',encoding='ANSI')
ds = pd.read_csv('/Dataset/test.csv', encoding='windows-1252')
Works fine for me, thanks.
If the input has a stray '\xa0', then it’s not in UTF-8, full stop.
Yes, you have to either recode it to UTF-8 (see the iconv and recode commands; a lot of text editors and IDEs can do it too), or read it using an 8-bit encoding (as all the other answers suggest).
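If you would rather do the recoding in Python than with iconv or recode, a minimal sketch could look like this. The function name `recode_to_utf8` and its parameters are hypothetical, and the source encoding is assumed to be windows-1252; substitute whatever 8-bit encoding your file really uses.

```python
def recode_to_utf8(src_path, dst_path, source_encoding="windows-1252"):
    """Rewrite src_path as UTF-8 at dst_path, decoding it as source_encoding."""
    # newline="" on both sides keeps the original line endings untouched.
    with open(src_path, encoding=source_encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Usage, with hypothetical file names:
# recode_to_utf8("txt.tsv", "txt_utf8.tsv")
```

Writing to a separate output file keeps the original intact in case the guessed source encoding turns out to be wrong.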
What you should ask yourself is – what is this character after all (0xa0, or 160)?
Well, in many 8-bit encodings it’s a non-breaking space (like &nbsp; in HTML). For at least one DOS encoding it’s an accented "a" character. That’s why you need to look at the result of decoding it from the 8-bit encoding.
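To see this for yourself, you can decode the single byte under a few candidate encodings and print the Unicode name of the result. This is only a sketch: `describe_byte` is a made-up helper, and cp437 is picked here as an example of a DOS code page.

```python
import unicodedata


def describe_byte(byte_value, encodings=("cp1252", "latin-1", "cp437")):
    """Map each encoding name to the Unicode name of the decoded byte."""
    raw = bytes([byte_value])
    return {enc: unicodedata.name(raw.decode(enc)) for enc in encodings}


for enc, name in describe_byte(0xA0).items():
    print(f"{enc}: {name}")
# cp1252: NO-BREAK SPACE
# latin-1: NO-BREAK SPACE
# cp437: LATIN SMALL LETTER A WITH ACUTE
```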
BTW, sometimes people say "UTF-8" when they mean "mostly ASCII, I guess". And if it was a non-breaking space, they weren’t that far off:
In [1]: '\xa0'.encode()
Out[1]: b'\xc2\xa0'
One extra preceding '\xc2' byte would do the trick.
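The "invalid start byte" wording in the error follows from the same fact: 0xA0 is 10100000 in binary, which UTF-8 reserves for continuation bytes, so it cannot begin a character on its own. A small sketch (the helper name `utf8_decode_error` is made up) shows both cases:

```python
def utf8_decode_error(data):
    """Return the reason UTF-8 decoding fails for data, or None if it succeeds."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


print(utf8_decode_error(b"\xa0"))      # invalid start byte
print(utf8_decode_error(b"\xc2\xa0"))  # None
```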
I also encountered the same issue, and reading the file with the latin1 encoding worked. See the sample code below and adapt it to your codebase. Give it a try if the resolutions above don’t work.
df = pd.read_csv("../CSV_FILE.csv", na_values=missing, encoding='latin1')
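One caveat: latin1 maps every possible byte to some character, so it never raises a decode error, and the absence of an error does not prove it was the right choice. The sketch below (with a made-up helper, `decode_both`) compares it with windows-1252 on bytes in the 0x80-0x9F range, where the two encodings differ.

```python
def decode_both(raw):
    """Decode raw as latin-1 and as cp1252 and return both results."""
    return raw.decode("latin-1"), raw.decode("cp1252")


latin1_text, cp1252_text = decode_both(b"\x93quoted\x94")
print(repr(latin1_text))  # control characters around the word
print(repr(cp1252_text))  # curly quotation marks around the word
```

If text that should contain curly quotes or dashes comes out with invisible control characters instead, windows-1252 is probably the better guess.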
I was able to open a CSV file that gave me this error by recoding it: I opened it in Notepad and saved it as UTF-8, after which it could be read without problems.