Pandas: reading indented JSON created by to_json
Question:
I’m writing JSON to a file using DataFrame.to_json() with the indent option:
df.to_json(path_or_buf=file_json, orient="records", lines=True, indent=2)
The important part here is indent=2; without it, reading the file back works.
How do I read this file back using pd.read_json()?
I’m trying the code below, but it expects the file to be a JSON object per line, so the indentation messes things up:
df = pd.read_json(file_json, lines=True)
I didn’t find any options in read_json to make it handle the indentation.
How else could I read this file created by to_json, possibly avoiding writing my own reader?
Answers:
The combination of lines=True, orient='records', and indent=2 doesn’t actually produce valid JSON.
lines=True is meant to create line-delimited JSON, but indent=2 adds extra lines. You can’t have your delimiter be line breaks AND have extra line breaks!
If you use just orient='records' and indent=2, then it does produce valid JSON.
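As a minimal sketch of that valid combination, the example below round-trips a small DataFrame through an in-memory string. The column names and values are made up for illustration, and it assumes a pandas version whose to_json accepts indent.

```python
import io

import pandas as pd

# Illustrative data; any DataFrame should behave the same way.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Without lines=True the output is a single indented JSON array of records.
text = df.to_json(orient="records", indent=2)

# A plain read_json call (no lines=True) parses that array back.
restored = pd.read_json(io.StringIO(text), orient="records")
print(restored)
```

The same calls work with a file path in place of the in-memory string.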
The current read_json(lines=True) implementation combines the lines like this:
def _combine_lines(self, lines) -> str:
    """
    Combines a list of JSON objects into one JSON object.
    """
    return (
        f'[{",".join([line for line in (line.strip() for line in lines) if line])}]'
    )
You can see that it expects to read the file line by line, which isn’t possible when indent has been used.
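To make the failure concrete, here is a small standalone sketch of the same joining logic applied to compact and to indented input. The function name combine_lines and the sample records are illustrative, and the indented input is assumed to be pretty-printed objects placed one after another.

```python
import json


def combine_lines(lines):
    # Standalone copy of the joining logic quoted above, for illustration only.
    stripped = (line.strip() for line in lines)
    return "[" + ",".join(line for line in stripped if line) + "]"


records = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]

# One compact object per line: joining the lines with commas gives a valid array.
compact = "\n".join(json.dumps(r) for r in records)
parsed = json.loads(combine_lines(compact.split("\n")))

# Indented objects: commas end up between fragments such as '{' and '"a": 1,'.
indented = "\n".join(json.dumps(r, indent=2) for r in records)
try:
    json.loads(combine_lines(indented.split("\n")))
    indented_ok = True
except json.JSONDecodeError:
    indented_ok = False

print(parsed, indented_ok)
```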
The other answer is good, but it turned out that approach requires reading the entire file into memory. I ended up writing a simple lazy parser, which I include below. It requires removing the lines=True argument from df.to_json.
Usage is as follows:
for obj, pos, length in lazy_read_json('file.json'):
    print(obj['field'])  # access json object
Besides each object, it yields pos, the start position of the object in the file, and length, the length of the object's text in the file. This allows some extra functionality for me, like being able to index objects and load them into memory on demand.
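As a rough sketch of that on-demand loading, the helper below seeks to a stored offset and parses only that slice. The name load_at, the temporary file, and the way the stand-in index is built are all assumptions for illustration; the file is opened in binary mode and the content is assumed to be ASCII, so character counts and byte counts coincide.

```python
import json
import os
import tempfile


def load_at(filename, pos, length):
    """Parse the single JSON object stored at offset pos (length bytes long)."""
    with open(filename, "rb") as fh:
        fh.seek(pos)
        return json.loads(fh.read(length))


# Write a small indented file (ASCII only, so characters and bytes coincide).
text = json.dumps([{"a": 1}, {"a": 2}], indent=4) + "\n"
path = os.path.join(tempfile.mkdtemp(), "demo.json")
with open(path, "w", newline="\n") as fh:
    fh.write(text)

# Stand-in index of (pos, length) pairs; in practice these would be the
# values yielded alongside each object by the lazy parser.
decoder = json.JSONDecoder()
index = []
cursor = text.find("{")
while cursor != -1:
    _, end = decoder.raw_decode(text, cursor)
    index.append((cursor, end - cursor))
    cursor = text.find("{", end)

second = load_at(path, *index[1])
print(second)
```

Only the index of small integer pairs stays in memory; each object is read from disk when it is actually needed.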
The parser is below:
import json


def lazy_read_json(filename: str):
    """
    :return: generator yielding (json_obj, pos, length)

    >>> test_objs = [{'a': 11, 'b': 22, 'c': {'abc': 'z', 'zzz': {}}},
    ...              {'a': 31, 'b': 42, 'c': [{'abc': 'z', 'zzz': {}}]},
    ...              {'a': 55, 'b': 66, 'c': [{'abc': 'z'}, {'z': 3}, {'y': 3}]},
    ...              {'a': 71, 'b': 62, 'c': 63}]
    >>> json_str = json.dumps(test_objs, indent=4, sort_keys=True)
    >>> _create_file("/tmp/test.json", [json_str])
    >>> g = lazy_read_json("/tmp/test.json")
    >>> next(g)
    ({'a': 11, 'b': 22, 'c': {'abc': 'z', 'zzz': {}}}, 2, 116)
    >>> next(g)
    ({'a': 31, 'b': 42, 'c': [{'abc': 'z', 'zzz': {}}]}, 120, 152)
    >>> next(g)
    ({'a': 55, 'b': 66, 'c': [{'abc': 'z'}, {'z': 3}, {'y': 3}]}, 274, 229)
    >>> next(g)
    ({'a': 71, 'b': 62, 'c': 63}, 505, 62)
    >>> next(g)
    Traceback (most recent call last):
    ...
    StopIteration
    """
    with open(filename) as fh:
        state = 0  # 0: before '[', 1: between objects, 2: inside an object
        json_str = ''
        cb_depth = 0  # curly brace depth
        line = fh.readline()
        while line:
            if line[-1] == "\n":
                line = line[:-1]
            line_strip = line.strip()
            if state == 0 and line == '[':
                state = 1
                pos = fh.tell()
            elif state == 1 and line_strip == '{':
                state = 2
                json_str += line + "\n"
            elif state == 2:
                if len(line_strip) > 0 and line_strip[-1] == '{':  # count nested objects
                    cb_depth += 1
                json_str += line + "\n"
                if cb_depth == 0 and (line_strip == '},' or line_strip == '}'):
                    # end of parsing an object
                    if json_str[-2:] == ",\n":
                        json_str = json_str[:-2]  # remove trailing comma
                    state = 1
                    obj = json.loads(json_str)
                    yield obj, pos, len(json_str)
                    pos = fh.tell()
                    json_str = ""
                elif line_strip == '}' or line_strip == '},':
                    cb_depth -= 1
            line = fh.readline()
# this function is for the doctest
def _create_file(filename, lines):
    # because the doctest can't input newline characters :(
    with open(filename, "w") as f:
        for line in lines:
            f.write(line)
            f.write("\n")
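If the file was already written with lines=True and indent=2 and cannot be regenerated, one possible alternative is the standard library's json.JSONDecoder.raw_decode, which parses one value at a time from a string. This is only a sketch: the function name iter_concatenated_json is made up, it assumes the file consists of indented objects placed one after another (worth checking against the actual file), and unlike the lazy parser it holds the whole text in memory.

```python
import json


def iter_concatenated_json(text):
    """Yield each top-level JSON value from whitespace-separated values in text."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos == end:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Illustrative input: two indented objects, one after the other.
records = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
sample = "\n".join(json.dumps(r, indent=2) for r in records) + "\n"
decoded = list(iter_concatenated_json(sample))
print(decoded)
```

The resulting list of dicts can then be passed to pd.DataFrame to rebuild the frame.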