Pandas: reading indented JSON created by to_json

Question:

I’m writing JSON to a file using DataFrame.to_json() with the indent option:

df.to_json(path_or_buf=file_json, orient="records", lines=True, indent=2)

The important part is indent=2; without it, reading the file back works.
How do I read this file back using pd.read_json()?
I’m trying the code below, but it expects one JSON object per line, so the indentation breaks it:

df = pd.read_json(file_json, lines=True)

I didn’t find any options in read_json to make it handle the indentation.
How else could I read this file created by to_json, possibly avoiding writing my own reader?

Asked By: Soid


Answers:

The combination of lines=True, orient='records', and indent=2 doesn’t actually produce valid line-delimited JSON.

lines=True is meant to create line-delimited JSON (one complete object per line), but indent=2 spreads each object over several lines. You can’t have your delimiter be line breaks, AND have extra line breaks!

If you use just orient='records' and indent=2, dropping lines=True, the output is a single valid JSON array, which pd.read_json can read back without lines=True.
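A minimal sketch of that round trip, assuming pandas 1.0 or later (where to_json accepts indent). The column names and values are made up for illustration, and an in-memory string stands in for the file:

```python
import io

import pandas as pd

# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# No lines=True: the output is one indented JSON array of records.
json_text = df.to_json(orient="records", indent=2)

# Read it back as an ordinary JSON document, again without lines=True.
df_back = pd.read_json(io.StringIO(json_text), orient="records")

# Compare plain Python records so dtype details do not matter.
records_match = df_back.to_dict(orient="records") == df.to_dict(orient="records")
```

Note that this reads the whole document at once, so it only suits files that fit in memory.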

The current read_json(lines=True) implementation combines the lines with this helper:

def _combine_lines(self, lines) -> str:
    """
    Combines a list of JSON objects into one JSON object.
    """
    return (
        f'[{",".join([line for line in (line.strip() for line in lines) if line])}]'
    )

You can see that it expects to read the file line by line, which isn’t possible when indent has been used.
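To make the failure concrete, here is a small stdlib-only sketch that applies the same join logic to compact and to indented records. The function combine_lines is a standalone copy of the logic quoted above, and the indented text only mimics the shape of such a file; it is not actual pandas output:

```python
import json


def combine_lines(lines):
    """Strip each line, drop empty ones, and join the rest with commas inside [...]."""
    stripped = (line.strip() for line in lines)
    return "[" + ",".join(line for line in stripped if line) + "]"


# Hypothetical sample records.
records = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]

# One compact object per line: joining the lines yields a valid JSON array.
compact = "\n".join(json.dumps(r) for r in records)
parsed = json.loads(combine_lines(compact.split("\n")))

# Indented objects span several lines, so every fragment is joined as if it were a record.
indented = "\n".join(json.dumps(r, indent=2) for r in records)
try:
    json.loads(combine_lines(indented.split("\n")))
    indented_ok = True
except json.JSONDecodeError:
    indented_ok = False
```

With the indented input, the joined string starts with `[{,"a": 1,,` and so on, which no JSON parser accepts.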

Answered By: BeRT2me

The other answer is good, but it turned out that reading the file that way requires loading the entire file into memory. I ended up writing a simple lazy parser, which I include below. It requires removing the lines=True argument from df.to_json.

Usage is as follows:

for obj, pos, length in lazy_read_json('file.json'):
    print(obj['field'])  # access json object

Besides the object itself, it yields pos, the start position of the object in the file, and length, the length of the object in the file. This gives me some extra functionality, such as indexing objects and loading them into memory on demand.

The parser is below:

import json


def lazy_read_json(filename: str):
    """
    :return: generator yielding (json_obj, pos, length)

    >>> test_objs = [{'a': 11, 'b': 22, 'c': {'abc': 'z', 'zzz': {}}}, 
                {'a': 31, 'b': 42, 'c': [{'abc': 'z', 'zzz': {}}]}, 
                {'a': 55, 'b': 66, 'c': [{'abc': 'z'}, {'z': 3}, {'y': 3}]}, 
                {'a': 71, 'b': 62, 'c': 63}]
    >>> json_str = json.dumps(test_objs, indent=4, sort_keys=True)
    >>> _create_file("/tmp/test.json", [json_str])
    >>> g = lazy_read_json("/tmp/test.json")
    >>> next(g)
    ({'a': 11, 'b': 22, 'c': {'abc': 'z', 'zzz': {}}}, 120, 116)
    >>> next(g)
    ({'a': 31, 'b': 42, 'c': [{'abc': 'z', 'zzz': {}}]}, 274, 152)
    >>> next(g)
    ({'a': 55, 'b': 66, 'c': [{'abc': 'z'}, {'z': 3}, {'y': 3}]}, 505, 229)
    >>> next(g)
    ({'a': 71, 'b': 62, 'c': 63}, 567, 62)
    >>> next(g)
    Traceback (most recent call last):
      ...
    StopIteration
    """
    with open(filename) as fh:
        state = 0
        json_str = ''
        cb_depth = 0  # curly brace depth
        line = fh.readline()
        while line:
            if line[-1] == "\n":
                line = line[:-1]
            line_strip = line.strip()
            if state == 0 and line == '[':
                state = 1
                pos = fh.tell()
            elif state == 1 and line_strip == '{':
                state = 2
                json_str += line + "\n"
            elif state == 2:
                if len(line_strip) > 0 and line_strip[-1] == '{':  # count nested objects
                    cb_depth += 1

                json_str += line + "\n"
                if cb_depth == 0 and (line_strip == '},' or line_strip == '}'):
                    # end of parsing an object
                    if json_str[-2:] == ",\n":
                        json_str = json_str[:-2]  # remove trailing comma
                    state = 1
                    obj = json.loads(json_str)
                    yield obj, pos, len(json_str)
                    pos = fh.tell()
                    json_str = ""
                elif line_strip == '}' or line_strip == '},':
                    cb_depth -= 1

            line = fh.readline()


# this function is for doctest
def _create_file(filename, lines):
    # cause doctest can't input new line characters :(
    f = open(filename, "w")
    for line in lines:
        f.write(line)
        f.write("\n")
    f.close()
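
The on-demand loading that (pos, length) pairs make possible can be sketched as follows. This is an illustration under stated assumptions, not part of the parser above: load_at and build_index are hypothetical helper names, the offsets are byte offsets into a file opened in binary mode, and the index is built from an in-memory string with json.JSONDecoder.raw_decode only to keep the sketch self-contained. In practice the pairs would come from a lazy reader.

```python
import json
import os
import tempfile


def load_at(filename, pos, length):
    """Parse the single JSON object stored at byte offset pos, spanning length bytes."""
    with open(filename, "rb") as fh:
        fh.seek(pos)
        return json.loads(fh.read(length).decode("utf-8"))


def build_index(text):
    """Return (pos, length) for each element of a top-level JSON array.

    Offsets are character offsets, which equal byte offsets only for ASCII text.
    """
    decoder = json.JSONDecoder()
    index = []
    i = text.index("[") + 1
    while True:
        # Skip whitespace and the commas between array elements.
        while i < len(text) and text[i] in " \t\r\n,":
            i += 1
        if i >= len(text) or text[i] == "]":
            return index
        _, end = decoder.raw_decode(text, i)
        index.append((i, end - i))
        i = end


# Hypothetical sample records; json.dumps escapes non-ASCII by default, so the text is ASCII.
records = [{"a": 11, "b": 22}, {"a": 31, "b": {"c": [1, 2]}}, {"a": 55, "b": "z"}]
text = json.dumps(records, indent=2)

path = os.path.join(tempfile.mkdtemp(), "records.json")
with open(path, "wb") as fh:
    fh.write(text.encode("utf-8"))

index = build_index(text)
second = load_at(path, *index[1])  # only this object is read from disk and parsed
```

Keeping only the index in memory means each object costs two integers until it is actually needed.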
Answered By: Soid