Python load json file with UTF-8 BOM header
Question:
I needed to parse files generated by another tool, which unconditionally outputs JSON files with a UTF-8 BOM header (EFBBBF). I soon found that the BOM was the problem, as the Python 2.7 json module can’t seem to parse it:
>>> import json
>>> data = json.load(open('sample.json'))
ValueError: No JSON object could be decoded
Removing the BOM solves it, but I wonder whether there is another way to parse a JSON file with a BOM header?
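For readers on Python 3, the same failure can be reproduced with a short sketch. The temporary file and its contents are made up for illustration. The point is only that decoding as plain utf-8 leaves the BOM in the text as U+FEFF, which the json module rejects.

```python
import codecs
import json
import os
import tempfile

# Hypothetical sample: a small JSON document prefixed with the UTF-8 BOM (EF BB BF).
fd, path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + b'{"key": "value"}')

try:
    # Plain utf-8 decoding keeps the BOM as the character U+FEFF.
    with open(path, encoding="utf-8") as f:
        json.load(f)
    error = None
except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
    error = exc
finally:
    os.remove(path)

print(type(error).__name__)
```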
Answers:
You can open the file with codecs:
import json
import codecs
json.load(codecs.open('sample.json', 'r', 'utf-8-sig'))
or decode with utf-8-sig yourself and pass the result to loads:
json.loads(open('sample.json').read().decode('utf-8-sig'))
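The one-liner above relies on Python 2, where read() returns a byte string that has a decode method. A rough Python 3 counterpart, assuming the file is read in binary mode, might look like the sketch below. The helper name loads_utf8_sig is hypothetical.

```python
import codecs
import json

def loads_utf8_sig(raw_bytes):
    """Decode bytes as UTF-8, dropping a leading BOM if present, then parse as JSON."""
    return json.loads(raw_bytes.decode("utf-8-sig"))

# Illustrative inputs: the same document with and without a BOM.
with_bom = codecs.BOM_UTF8 + b'{"name": "sample"}'
without_bom = b'{"name": "sample"}'

result_with_bom = loads_utf8_sig(with_bom)
result_without_bom = loads_utf8_sig(without_bom)
```

For a file on disk, the bytes would come from something like open('sample.json', 'rb').read().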
Since json.load(stream) uses json.loads(stream.read()) under the hood, it won’t be that bad to write a small helper function that lstrips the BOM:
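import json
from codecs import BOM_UTF8

def lstrip_bom(str_, bom=BOM_UTF8):
    # Drop the BOM only if the data actually starts with it.
    if str_.startswith(bom):
        return str_[len(bom):]
    else:
        return str_

# Read in binary mode so the comparison with BOM_UTF8 (a byte string) is valid.
json.loads(lstrip_bom(open('sample.json', 'rb').read()))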
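If the data has already been decoded to text (so the BOM shows up as the single character U+FEFF rather than three bytes), the same idea can be applied to a str. This is a minimal sketch and lstrip_bom_text is a hypothetical name.

```python
import json

def lstrip_bom_text(text, bom="\ufeff"):
    """Return text without a leading U+FEFF, the decoded form of the UTF-8 BOM."""
    if text.startswith(bom):
        return text[len(bom):]
    return text

# Decoding with plain utf-8 leaves the BOM in place as U+FEFF.
decoded = b'\xef\xbb\xbf{"id": 1}'.decode("utf-8")
data = json.loads(lstrip_bom_text(decoded))
```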
In other situations where you need to wrap a stream and fix it up somehow, you may look at inheriting from codecs.StreamReader.
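For the BOM case specifically, subclassing is probably not needed: codecs.getreader returns the ready-made StreamReader class for a codec, which can wrap any binary stream. The sketch below uses an in-memory io.BytesIO as a stand-in for a real file or socket.

```python
import codecs
import io
import json

# Stand-in for any binary stream, such as a file opened with 'rb'.
binary_stream = io.BytesIO(codecs.BOM_UTF8 + b'{"items": [1, 2, 3]}')

# codecs.getreader returns the StreamReader class for the named codec;
# instantiating it wraps the binary stream in a decoding reader.
reader = codecs.getreader("utf-8-sig")(binary_stream)
data = json.load(reader)
```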
If this is a one-off, a very simple super high-tech solution that worked for me…
- Open the JSON file in your favorite text editor.
- Select-all
- Create a new file
- Paste
- Save.
BOOM, BOM header gone!
You can also do it with the with keyword:
import codecs
import json
with codecs.open('samples.json', 'r', 'utf-8-sig') as json_file:
    data = json.load(json_file)
or better:
import io
with io.open('samples.json', 'r', encoding='utf-8-sig') as json_file:
    data = json.load(json_file)
Simple! In Python 3 you don’t even need to import codecs, because the built-in open accepts an encoding argument:
import json
with open('sample.json', encoding='utf-8-sig') as f:
    data = json.load(f)
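A useful property of utf-8-sig is that it also reads files that have no BOM, so the same call can be used whether or not the producing tool adds one. The sketch below checks both cases with temporary files. The helper name load_json and the sample payload are assumptions for illustration.

```python
import codecs
import json
import os
import tempfile

def load_json(path):
    """Load JSON from a UTF-8 file, ignoring a leading BOM if there is one."""
    with open(path, encoding="utf-8-sig") as f:
        return json.load(f)

payload = b'{"ok": true}'
results = []
# Write the same payload once with a BOM and once without, then load both.
for prefix in (codecs.BOM_UTF8, b""):
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(prefix + payload)
        results.append(load_json(path))
    finally:
        os.remove(path)
```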
I removed the BOM manually with a Linux command.
First I check whether the file starts with the bytes efbb bf, using head i_have_BOM | xxd.
Then I run dd bs=1 skip=3 if=i_have_BOM.json of=I_dont_have_BOM.json.
bs=1 processes 1 byte at a time, and skip=3 skips the first 3 bytes.
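The same byte-level cleanup can be sketched in Python. Unlike the dd command, this version checks for the BOM before dropping anything, so it is harmless on files that do not have one. The function name strip_bom_file is hypothetical, and the sketch assumes the file is small enough to read into memory.

```python
import codecs
import os
import tempfile

def strip_bom_file(src, dst):
    """Copy src to dst, dropping a leading UTF-8 BOM. Returns True if one was removed."""
    with open(src, "rb") as fin:
        data = fin.read()
    had_bom = data.startswith(codecs.BOM_UTF8)
    if had_bom:
        data = data[len(codecs.BOM_UTF8):]  # skip the first 3 bytes
    with open(dst, "wb") as fout:
        fout.write(data)
    return had_bom

# Demonstration with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "i_have_BOM.json")
    dst = os.path.join(tmp, "I_dont_have_BOM.json")
    with open(src, "wb") as f:
        f.write(codecs.BOM_UTF8 + b'{"a": 1}')
    removed = strip_bom_file(src, dst)
    with open(dst, "rb") as f:
        cleaned = f.read()
```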
I’m using utf-8-sig with just import json:
import json
with open('estados.json', encoding='utf-8-sig') as json_file:
    data = json.load(json_file)
print(data)