UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte
Question:
I am using Python 2.6 CGI scripts, but I found this error in the server log while doing json.dumps():
Traceback (most recent call last):
File "/etc/mongodb/server/cgi-bin/getstats.py", line 135, in <module>
print json.dumps(__getdata())
File "/usr/lib/python2.7/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/usr/lib/python2.7/json/encoder.py", line 201, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib/python2.7/json/encoder.py", line 264, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte
Here, the __getdata() function returns a dictionary.
Before posting this question, I referred to this question on SO.
UPDATES
The following line is hurting the JSON encoder:
now = datetime.datetime.now()
now = datetime.datetime.strftime(now, '%Y-%m-%dT%H:%M:%S.%fZ')
print json.dumps({'current_time': now}) # this is the culprit
I got a temporary fix for it:
print json.dumps( {'old_time': now.encode('ISO-8859-1').strip() })
But I am not sure whether that is the correct way to do it.
Answers:
The error occurs because there is some non-ASCII character in the dictionary that can't be encoded/decoded. One simple way to avoid it is to encode such strings with the encode() function, as follows (if a is the string with the non-ASCII character):
a.encode('utf-8').strip()
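In Python 3 terms, the encode/decode distinction behind this error looks like the following minimal sketch (the byte values shown are simply the UTF-8 encoding of 'é'):

```python
# str.encode() turns text into bytes; bytes.decode() turns bytes back into text.
text = u'caf\u00e9'            # 'café'
raw = text.encode('utf-8')     # bytes: b'caf\xc3\xa9'
assert raw == b'caf\xc3\xa9'
assert raw.decode('utf-8') == text

# Decoding bytes with the wrong codec fails, which is exactly the error above:
try:
    b'\xa5'.decode('utf-8')    # 0xa5 is not a valid UTF-8 start byte
except UnicodeDecodeError as e:
    print(e)
```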
The following line is hurting the JSON encoder:
now = datetime.datetime.now()
now = datetime.datetime.strftime(now, '%Y-%m-%dT%H:%M:%S.%fZ')
print json.dumps({'current_time': now})  # this is the culprit
I got a temporary fix for it
print json.dumps( {'old_time': now.encode('ISO-8859-1').strip() })
Marking this as correct as a temporary fix (not entirely sure about it).
Set the default encoder at the top of your code (a Python 2-only workaround; sys.setdefaultencoding() is removed from the sys module at start-up, hence the reload, and it does not exist in Python 3):
import sys
reload(sys)
sys.setdefaultencoding("ISO-8859-1")
Your string has a non-ASCII character encoded in it.
Not being able to decode with utf-8 may happen if you've needed to use other encodings in your code. For example:
>>> 'my weird character \x96'.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 19: invalid start byte
In this case, the encoding is windows-1252, so you have to do:
>>> 'my weird character \x96'.decode('windows-1252')
u'my weird character \u2013'
Now that you have Unicode, you can safely encode it into utf-8.
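Putting the two steps together, decoding from the legacy encoding and then encoding to UTF-8, can be sketched as:

```python
raw = b'my weird character \x96'    # 0x96 is an en dash in windows-1252
text = raw.decode('windows-1252')   # now a proper unicode string
assert text == u'my weird character \u2013'

utf8_bytes = text.encode('utf-8')   # safe to encode as UTF-8 from here on
assert utf8_bytes.decode('utf-8') == text
```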
After trying all the aforementioned workarounds, if it still throws the same error, you can try exporting the file as CSV (a second time, if you already have). Especially if you're using scikit-learn, it is best to import the dataset as a CSV file. I spent hours on this, whereas the solution was this simple: export the file as a CSV to the directory where Anaconda or your classifier tools are installed, and try again.
Try the below code snippet:
with open(path, 'rb') as f:
    text = f.read()
As of 2018-05 this is handled directly with decode, at least for Python 3.
I'm using the below snippet for invalid start byte and invalid continuation byte type errors. Adding errors='ignore' fixed it for me.
with open(out_file, 'rb') as f:
    for line in f:
        print(line.decode(errors='ignore'))
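For reference, errors='ignore' silently drops any undecodable bytes, which can lose data; a small illustration:

```python
# Bytes that are not valid UTF-8 (0xe9 and 0xa5 come from a legacy encoding).
data = b'caf\xe9 price: \xa5100'

# The bad bytes are simply removed from the result.
print(data.decode('utf-8', errors='ignore'))  # → 'caf price: 100'
```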
I fixed this simply by specifying a different codec in the read_csv() call: encoding='unicode_escape'. E.g.:
import pandas as pd
data = pd.read_csv(filename, encoding='unicode_escape')
Inspired by @aaronpenne and @Soumyaansh:
f = open("file.txt", "rb")
text = f.read().decode(errors='replace')
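Unlike errors='ignore', errors='replace' substitutes U+FFFD (�) for each undecodable byte instead of dropping it, which keeps the damage visible:

```python
data = b'total: \xa5100'   # 0xa5 is not valid UTF-8

# Each bad byte becomes the replacement character U+FFFD.
print(data.decode('utf-8', errors='replace'))  # → 'total: \ufffd100'
```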
If the above methods are not working for you, you may want to look into changing the encoding of the csv file itself.
Using Excel:
- Open the csv file using Excel
- Navigate to the File menu option and click Save As
- Click Browse to select a location to save the file
- Enter the intended filename
- Select the CSV (Comma delimited) (*.csv) option
- Click the Tools drop-down box and click Web Options
- Under the Encoding tab, select the option Unicode (UTF-8) from the Save this document as drop-down list
- Save the file
Using Notepad:
- Open the csv file using Notepad
- Navigate to the File > Save As option
- Next, select the location for the file
- Select the Save as type option as All Files (*.*)
- Specify the file name with the .csv extension
- From the Encoding drop-down list, select UTF-8
- Click Save to save the file
By doing this, you should be able to import csv files without encountering the UnicodeDecodeError.
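If you'd rather not open an editor, the same re-encoding can be done in Python; a minimal sketch, assuming the file is actually latin-1 (adjust source_encoding to match your data; the file names here are just placeholders):

```python
# Create a sample file with latin-1 bytes (stand-in for the problematic CSV).
with open('input.csv', 'wb') as f:
    f.write(u'name;price\ncaf\u00e9;\u00a5100\n'.encode('latin-1'))

source_encoding = 'latin-1'  # assumption: the real encoding of the broken file

# Re-encode to UTF-8 line by line.
with open('input.csv', 'r', encoding=source_encoding) as src, \
     open('input_utf8.csv', 'w', encoding='utf-8', newline='') as dst:
    for line in src:
        dst.write(line)

# The new file now decodes cleanly as UTF-8.
with open('input_utf8.csv', 'rb') as f:
    print(f.read().decode('utf-8'))
```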
When reading the CSV, I added an encoding argument:
import pandas as pd
dataset = pd.read_csv('sample_data.csv', header=0,
                      encoding='unicode_escape')
You may use any standard encoding appropriate to your specific usage and input. utf-8 is the default. iso8859-1 is also popular for Western Europe.
e.g.: bytes_obj.decode('iso8859-1')
See: docs
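For example, the 0xa5 byte from the original traceback decodes cleanly as ISO-8859-1 (where it is the yen sign), even though it is an invalid UTF-8 start byte:

```python
raw = b'\xa5 5000'

try:
    raw.decode('utf-8')            # fails: invalid start byte
except UnicodeDecodeError as e:
    print(e)

print(raw.decode('iso8859-1'))     # → '¥ 5000'
```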
Simple Solution:
import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')
This solution worked for me:
import pandas as pd
data = pd.read_csv("training.csv", encoding = 'unicode_escape')
Instead of looking for ways to decode 0xa5 (yen sign ¥) or 0x96 (en dash –), tell MySQL that your client is encoded "latin1", but that you want "utf8" in the database.
See details in Trouble with UTF-8 characters; what I see is not what I stored
In my case, I had to save the file as UTF-8 with BOM, not just as UTF-8; then this error was gone.
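In Python, saving with a BOM corresponds to the 'utf-8-sig' codec, which writes the three-byte signature EF BB BF at the start of the file (the file name is just a placeholder):

```python
# Write a file as UTF-8 with BOM.
with open('with_bom.txt', 'w', encoding='utf-8-sig') as f:
    f.write(u'caf\u00e9')

# The raw bytes start with the BOM.
with open('with_bom.txt', 'rb') as f:
    data = f.read()
print(data[:3])  # → b'\xef\xbb\xbf'

# Reading back with 'utf-8-sig' strips the BOM again.
with open('with_bom.txt', 'r', encoding='utf-8-sig') as f:
    assert f.read() == u'caf\u00e9'
```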
import pandas as pd
from io import BytesIO
df = pd.read_excel(BytesIO(bytes_content), engine='openpyxl')
worked for me.
The following snippet worked for me.
import pandas as pd
df = pd.read_csv(filename, sep=';', encoding='latin1', error_bad_lines=False)  # error_bad_lines=False skips lines that would otherwise raise a parse error
I encountered the same error while trying to import an Excel sheet on SharePoint into a pandas dataframe. My solution was using engine='openpyxl'. I'm also using requests_negotiate_sspi to avoid storing passwords in plain text.
import requests
import pandas as pd
from io import BytesIO
from requests_negotiate_sspi import HttpNegotiateAuth

cert = r'c:\path_to\saved_certificate.cer'
target_file_url = r'https://share.companydomain.com/sites/Sitename/folder/excel_file.xlsx'
response = requests.get(target_file_url, auth=HttpNegotiateAuth(), verify=cert)
df = pd.read_excel(BytesIO(response.content), engine='openpyxl', sheet_name='Sheet1')
Simple solution:
import pandas as pd
df = pd.read_csv('file_name.csv', engine='python-fwf')
If that's not working, try changing the engine to 'python' or 'c' (note that recent pandas versions accept only 'c', 'python', or 'pyarrow' as read_csv engines).
I know this doesn't fit the question directly, but I repeatedly get directed here when I google the error message.
I got the error when I mistakenly tried to install a Python package the way I would install requirements from a file, i.e., with -r:
# wrong: leads to the error above
pip install -r my_package.whl
# correct: without -r
pip install my_package.whl
I hope this helps others who made the same little mistake as I did without noticing.