How I can I lazily read multiple JSON values from a file/stream in Python?

Question:

I’d like to read multiple JSON objects from a file/stream in Python, one at a time. Unfortunately json.load() just .read()s until end-of-file; there doesn’t seem to be any way to use it to read a single object or to lazily iterate over the objects.

Is there any way to do this? Using the standard library would be ideal, but if there’s a third-party library I’d use that instead.

At the moment I’m putting each object on a separate line and using json.loads(f.readline()), but I would really prefer not to need to do this.

Example Use

example.py

import my_json as json
import sys

for o in json.iterload(sys.stdin):
    print("Working on a", type(o))

in.txt

{"foo": ["bar", "baz"]} 1 2 [] 4 5 6

example session

$ python3.2 example.py < in.txt
Working on a dict
Working on a int
Working on a int
Working on a list
Working on a int
Working on a int
Working on a int
Asked By: Jeremy

||

Answers:

JSON generally isn’t very good for this sort of incremental use; there’s no standard way to serialise multiple objects so that they can easily be loaded one at a time, without parsing the whole lot.

The object per line solution that you’re using is seen elsewhere too. Scrapy calls it ‘JSON lines’:

You can do it slightly more Pythonically:

for jsonline in f:
    yield json.loads(jsonline)   # or do the processing in this loop

I think this is about the best way – it doesn’t rely on any third party libraries, and it’s easy to understand what’s going on. I’ve used it in some of my own code as well.

Answered By: Thomas K

Sure you can do this. You just have to take to raw_decode directly. This implementation loads the whole file into memory and operates on that string (much as json.load does); if you have large files you can modify it to only read from the file as necessary without much difficulty.

import json
from json.decoder import WHITESPACE

def iterload(string_or_fp, cls=json.JSONDecoder, **kwargs):
    if isinstance(string_or_fp, file):
        string = string_or_fp.read()
    else:
        string = str(string_or_fp)

    decoder = cls(**kwargs)
    idx = WHITESPACE.match(string, 0).end()
    while idx < len(string):
        obj, end = decoder.raw_decode(string, idx)
        yield obj
        idx = WHITESPACE.match(string, end).end()

Usage: just as you requested, it’s a generator.

Answered By: Jeremy Roman

This is a pretty nasty problem actually because you have to stream in lines, but pattern match across multiple lines against braces, but also pattern match json. It’s a sort of json-preparse followed by a json parse. Json is, in comparison to other formats, easy to parse so it’s not always necessary to go for a parsing library, nevertheless, how to should we solve these conflicting issues?

Generators to the rescue!

The beauty of generators for a problem like this is you can stack them on top of each other gradually abstracting away the difficulty of the problem whilst maintaining laziness. I also considered using the mechanism for passing back values into a generator (send()) but fortunately found I didn’t need to use that.

To solve the first of the problems you need some sort of streamingfinditer, as a streaming version of re.finditer. My attempt at this below pulls in lines as needed (uncomment the debug statement to see) whilst still returning matches. I actually then modified it slightly to yield non-matched lines as well as matches (marked as 0 or 1 in the first part of the yielded tuple).

import re

def streamingfinditer(pat,stream):
  for s in stream:
#    print "Read next line: " + s
    while 1:
      m = re.search(pat,s)
      if not m:
        yield (0,s)
        break
      yield (1,m.group())
      s = re.split(pat,s,1)[1]

With that, it’s then possible to match up until braces, account each time for whether the braces are balanced, and then return either simple or compound objects as appropriate.

braces='{}[]'
whitespaceesc=' t'
bracesesc='\'+'\'.join(braces)
balancemap=dict(zip(braces,[1,-1,1,-1]))
bracespat='['+bracesesc+']'
nobracespat='[^'+bracesesc+']*'
untilbracespat=nobracespat+bracespat

def simpleorcompoundobjects(stream):
  obj = ""
  unbalanced = 0
  for (c,m) in streamingfinditer(re.compile(untilbracespat),stream):
    if (c == 0): # remainder of line returned, nothing interesting
      if (unbalanced == 0):
        yield (0,m)
      else:
        obj += m
    if (c == 1): # match returned
      if (unbalanced == 0):
        yield (0,m[:-1])
        obj += m[-1]
      else:
        obj += m
      unbalanced += balancemap[m[-1]]
      if (unbalanced == 0):
        yield (1,obj)
        obj="" 

This returns tuples as follows:

(0,"String of simple non-braced objects easy to parse")
(1,"{ 'Compound' : 'objects' }")

Basically that’s the nasty part done. We now just have to do the final level of parsing as we see fit. For example we can use Jeremy Roman’s iterload function (Thanks!) to do parsing for a single line:

def streamingiterload(stream):
  for c,o in simpleorcompoundobjects(stream):
    for x in iterload(o):
      yield x 

Test it:

of = open("test.json","w") 
of.write("""[ "hello" ] { "goodbye" : 1 } 1 2 {
} 2
9 78
 4 5 { "animals" : [ "dog" , "lots of mice" ,
 "cat" ] }
""")
of.close()
// open & stream the json
f = open("test.json","r")
for o in streamingiterload(f.readlines()):
  print o
f.close()

I get these results (and if you turn on that debug line, you’ll see it pulls in the lines as needed):

[u'hello']
{u'goodbye': 1}
1
2
{}
2
9
78
4
5
{u'animals': [u'dog', u'lots of mice', u'cat']}

This won’t work for all situations. Due to the implementation of the json library, it is impossible to work entirely correctly without reimplementing the parser yourself.

Answered By: Benedict

I’d like to provide a solution. The key thought is to “try” to decode: if it fails, give it more feed, otherwise use the offset information to prepare next decoding.

However the current json module can’t tolerate SPACE in head of string to be decoded, so I have to strip them off.

import sys
import json

def iterload(file):
    buffer = ""
    dec = json.JSONDecoder()
    for line in file:         
        buffer = buffer.strip(" nrt") + line.strip(" nrt")
        while(True):
            try:
                r = dec.raw_decode(buffer)
            except:
                break
            yield r[0]
            buffer = buffer[r[1]:].strip(" nrt")


for o in iterload(sys.stdin):
    print("Working on a", type(o),  o)

=========================
I have tested for several txt files, and it works fine.
(in1.txt)

{"foo": ["bar", "baz"]
}
 1 2 [
  ]  4
{"foo1": ["bar1", {"foo2":{"A":1, "B":3}, "DDD":4}]
}
 5   6

(in2.txt)

{"foo"
: ["bar",
  "baz"]
  } 
1 2 [
] 4 5 6

(in.txt, your initial)

{"foo": ["bar", "baz"]} 1 2 [] 4 5 6

(output for Benedict’s testcase)

python test.py < in.txt
('Working on a', <type 'list'>, [u'hello'])
('Working on a', <type 'dict'>, {u'goodbye': 1})
('Working on a', <type 'int'>, 1)
('Working on a', <type 'int'>, 2)
('Working on a', <type 'dict'>, {})
('Working on a', <type 'int'>, 2)
('Working on a', <type 'int'>, 9)
('Working on a', <type 'int'>, 78)
('Working on a', <type 'int'>, 4)
('Working on a', <type 'int'>, 5)
('Working on a', <type 'dict'>, {u'animals': [u'dog', u'lots of mice', u'cat']})
Answered By: wuliang

Here’s mine:

import simplejson as json
from simplejson import JSONDecodeError
class StreamJsonListLoader():
    """
    When you have a big JSON file containint a list, such as

    [{
        ...
    },
    {
        ...
    },
    {
        ...
    },
    ...
    ]

    And it's too big to be practically loaded into memory and parsed by json.load,
    This class comes to the rescue. It lets you lazy-load the large json list.
    """

    def __init__(self, filename_or_stream):
        if type(filename_or_stream) == str:
            self.stream = open(filename_or_stream)
        else:
            self.stream = filename_or_stream

        if not self.stream.read(1) == '[':
            raise NotImplementedError('Only JSON-streams of lists (that start with a [) are supported.')

    def __iter__(self):
        return self

    def next(self):
        read_buffer = self.stream.read(1)
        while True:
            try:
                json_obj = json.loads(read_buffer)

                if not self.stream.read(1) in [',',']']:
                    raise Exception('JSON seems to be malformed: object is not followed by comma (,) or end of list (]).')
                return json_obj
            except JSONDecodeError:
                next_char = self.stream.read(1)
                read_buffer += next_char
                while next_char != '}':
                    next_char = self.stream.read(1)
                    if next_char == '':
                        raise StopIteration
                    read_buffer += next_char
Answered By: user3542882

A little late maybe, but I had this exact problem (well, more or less). My standard solution for these problems is usually to just do a regex split on some well-known root object, but in my case it was impossible. The only feasible way to do this generically is to implement a proper tokenizer.

After not finding a generic-enough and reasonably well-performing solution, I ended doing this myself, writing the splitstream module. It is a pre-tokenizer that understands JSON and XML and splits a continuous stream into multiple chunks for parsing (it leaves the actual parsing up to you though). To get some kind of performance out of it, it is written as a C module.

Example:

from splitstream import splitfile

for jsonstr in splitfile(sys.stdin, format="json")):
    yield json.loads(jsonstr)
Answered By: Krumelur

I used @wuilang’s elegant solution. The simple approach — read a byte, try to decode, read a byte, try to decode, … — worked, but unfortunately it was very slow.

In my case, I was trying to read “pretty-printed” JSON objects of the same object type from a file. This allowed me to optimize the approach; I could read the file line-by-line, only decoding when I found a line that contained exactly “}”:

def iterload(stream):
    buf = ""
    dec = json.JSONDecoder()
    for line in stream:
        line = line.rstrip()
        buf = buf + line
        if line == "}":
            yield dec.raw_decode(buf)
            buf = ""

If you happen to be working with one-per-line compact JSON that escapes newlines in string literals, then you can safely simplify this approach even more:

def iterload(stream):
    dec = json.JSONDecoder()
    for line in stream:
        yield dec.raw_decode(line)

Obviously, these simple approaches only work for very specific kinds of JSON. However, if these assumptions hold, these solutions work correctly and quickly.

Answered By: sigpwned

Here’s a much, much simpler solution. The secret is to try, fail, and use the information in the exception to parse correctly. The only limitation is the file must be seekable.

def stream_read_json(fn):
    import json
    start_pos = 0
    with open(fn, 'r') as f:
        while True:
            try:
                obj = json.load(f)
                yield obj
                return
            except json.JSONDecodeError as e:
                f.seek(start_pos)
                json_str = f.read(e.pos)
                obj = json.loads(json_str)
                start_pos += e.pos
                yield obj

Edit: just noticed that this will only work for Python >=3.5. For earlier, failures return a ValueError, and you have to parse out the position from the string, e.g.

def stream_read_json(fn):
    import json
    import re
    start_pos = 0
    with open(fn, 'r') as f:
        while True:
            try:
                obj = json.load(f)
                yield obj
                return
            except ValueError as e:
                f.seek(start_pos)
                end_pos = int(re.match('Extra data: line d+ column d+ .*(char (d+).*)',
                                    e.args[0]).groups()[0])
                json_str = f.read(end_pos)
                obj = json.loads(json_str)
                start_pos += end_pos
                yield obj
Answered By: Nic Watson

I believe a better way of doing it would be to use a state machine. Below is a sample code that I worked out by converting a NodeJS code on below link to Python 3 (used nonlocal keyword only available in Python 3, code won’t work on Python 2)

Edit-1: Updated and made code compatible with Python 2

Edit-2: Updated and added a Python3 only version as well

https://gist.github.com/creationix/5992451

Python 3 only version

# A streaming byte oriented JSON parser.  Feed it a single byte at a time and
# it will emit complete objects as it comes across them.  Whitespace within and
# between objects is ignored.  This means it can parse newline delimited JSON.
import math


def json_machine(emit, next_func=None):
    def _value(byte_data):
        if not byte_data:
            return

        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _value  # Ignore whitespace

        if byte_data == 0x22:  # "
            return string_machine(on_value)

        if byte_data == 0x2d or (0x30 <= byte_data < 0x40):  # - or 0-9
            return number_machine(byte_data, on_number)

        if byte_data == 0x7b:  #:
            return object_machine(on_value)

        if byte_data == 0x5b:  # [
            return array_machine(on_value)

        if byte_data == 0x74:  # t
            return constant_machine(TRUE, True, on_value)

        if byte_data == 0x66:  # f
            return constant_machine(FALSE, False, on_value)

        if byte_data == 0x6e:  # n
            return constant_machine(NULL, None, on_value)

        if next_func == _value:
            raise Exception("Unexpected 0x" + str(byte_data))

        return next_func(byte_data)

    def on_value(value):
        emit(value)
        return next_func

    def on_number(number, byte):
        emit(number)
        return _value(byte)

    next_func = next_func or _value
    return _value


TRUE = [0x72, 0x75, 0x65]
FALSE = [0x61, 0x6c, 0x73, 0x65]
NULL = [0x75, 0x6c, 0x6c]


def constant_machine(bytes_data, value, emit):
    i = 0
    length = len(bytes_data)

    def _constant(byte_data):
        nonlocal i
        if byte_data != bytes_data[i]:
            i += 1
            raise Exception("Unexpected 0x" + str(byte_data))

        i += 1
        if i < length:
            return _constant
        return emit(value)

    return _constant


def string_machine(emit):
    string = ""

    def _string(byte_data):
        nonlocal string

        if byte_data == 0x22:  # "
            return emit(string)

        if byte_data == 0x5c:  # 
            return _escaped_string

        if byte_data & 0x80:  # UTF-8 handling
            return utf8_machine(byte_data, on_char_code)

        if byte_data < 0x20:  # ASCII control character
            raise Exception("Unexpected control character: 0x" + str(byte_data))

        string += chr(byte_data)
        return _string

    def _escaped_string(byte_data):
        nonlocal string

        if byte_data == 0x22 or byte_data == 0x5c or byte_data == 0x2f:  # "  /
            string += chr(byte_data)
            return _string

        if byte_data == 0x62:  # b
            string += "b"
            return _string

        if byte_data == 0x66:  # f
            string += "f"
            return _string

        if byte_data == 0x6e:  # n
            string += "n"
            return _string

        if byte_data == 0x72:  # r
            string += "r"
            return _string

        if byte_data == 0x74:  # t
            string += "t"
            return _string

        if byte_data == 0x75:  # u
            return hex_machine(on_char_code)

    def on_char_code(char_code):
        nonlocal string
        string += chr(char_code)
        return _string

    return _string


# Nestable state machine for UTF-8 Decoding.
def utf8_machine(byte_data, emit):
    left = 0
    num = 0

    def _utf8(byte_data):
        nonlocal num, left
        if (byte_data & 0xc0) != 0x80:
            raise Exception("Invalid byte in UTF-8 character: 0x" + byte_data.toString(16))

        left = left - 1

        num |= (byte_data & 0x3f) << (left * 6)
        if left:
            return _utf8
        return emit(num)

    if 0xc0 <= byte_data < 0xe0:  # 2-byte UTF-8 Character
        left = 1
        num = (byte_data & 0x1f) << 6
        return _utf8

    if 0xe0 <= byte_data < 0xf0:  # 3-byte UTF-8 Character
        left = 2
        num = (byte_data & 0xf) << 12
        return _utf8

    if 0xf0 <= byte_data < 0xf8:  # 4-byte UTF-8 Character
        left = 3
        num = (byte_data & 0x07) << 18
        return _utf8

    raise Exception("Invalid byte in UTF-8 string: 0x" + str(byte_data))


# Nestable state machine for hex escaped characters
def hex_machine(emit):
    left = 4
    num = 0

    def _hex(byte_data):
        nonlocal num, left

        if 0x30 <= byte_data < 0x40:
            i = byte_data - 0x30
        elif 0x61 <= byte_data <= 0x66:
            i = byte_data - 0x57
        elif 0x41 <= byte_data <= 0x46:
            i = byte_data - 0x37
        else:
            raise Exception("Expected hex char in string hex escape")

        left -= 1
        num |= i << (left * 4)

        if left:
            return _hex
        return emit(num)

    return _hex


def number_machine(byte_data, emit):
    sign = 1
    number = 0
    decimal = 0
    esign = 1
    exponent = 0

    def _mid(byte_data):
        if byte_data == 0x2e:  # .
            return _decimal

        return _later(byte_data)

    def _number(byte_data):
        nonlocal number
        if 0x30 <= byte_data < 0x40:
            number = number * 10 + (byte_data - 0x30)
            return _number

        return _mid(byte_data)

    def _start(byte_data):
        if byte_data == 0x30:
            return _mid

        if 0x30 < byte_data < 0x40:
            return _number(byte_data)

        raise Exception("Invalid number: 0x" + str(byte_data))

    if byte_data == 0x2d:  # -
        sign = -1
        return _start

    def _decimal(byte_data):
        nonlocal decimal
        if 0x30 <= byte_data < 0x40:
            decimal = (decimal + byte_data - 0x30) / 10
            return _decimal

        return _later(byte_data)

    def _later(byte_data):
        if byte_data == 0x45 or byte_data == 0x65:  # E e
            return _esign

        return _done(byte_data)

    def _esign(byte_data):
        nonlocal esign
        if byte_data == 0x2b:  # +
            return _exponent

        if byte_data == 0x2d:  # -
            esign = -1
            return _exponent

        return _exponent(byte_data)

    def _exponent(byte_data):
        nonlocal exponent
        if 0x30 <= byte_data < 0x40:
            exponent = exponent * 10 + (byte_data - 0x30)
            return _exponent

        return _done(byte_data)

    def _done(byte_data):
        value = sign * (number + decimal)
        if exponent:
            value *= math.pow(10, esign * exponent)

        return emit(value, byte_data)

    return _start(byte_data)


def array_machine(emit):
    array_data = []

    def _array(byte_data):
        if byte_data == 0x5d:  # ]
            return emit(array_data)

        return json_machine(on_value, _comma)(byte_data)

    def on_value(value):
        array_data.append(value)

    def _comma(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _comma  # Ignore whitespace

        if byte_data == 0x2c:  # ,
            return json_machine(on_value, _comma)

        if byte_data == 0x5d:  # ]
            return emit(array_data)

        raise Exception("Unexpected byte: 0x" + str(byte_data) + " in array body")

    return _array


def object_machine(emit):
    object_data = {}
    key = None

    def _object(byte_data):
        if byte_data == 0x7d:  #
            return emit(object_data)

        return _key(byte_data)

    def _key(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _object  # Ignore whitespace

        if byte_data == 0x22:
            return string_machine(on_key)

        raise Exception("Unexpected byte: 0x" + str(byte_data))

    def on_key(result):
        nonlocal key
        key = result
        return _colon

    def _colon(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _colon  # Ignore whitespace

        if byte_data == 0x3a:  # :
            return json_machine(on_value, _comma)

        raise Exception("Unexpected byte: 0x" + str(byte_data))

    def on_value(value):
        object_data[key] = value

    def _comma(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _comma  # Ignore whitespace

        if byte_data == 0x2c:  # ,
            return _key

        if byte_data == 0x7d:  #
            return emit(object_data)

        raise Exception("Unexpected byte: 0x" + str(byte_data))

    return _object

Python 2 compatible version

# A streaming byte oriented JSON parser.  Feed it a single byte at a time and
# it will emit complete objects as it comes across them.  Whitespace within and
# between objects is ignored.  This means it can parse newline delimited JSON.
import math


def json_machine(emit, next_func=None):
    def _value(byte_data):
        if not byte_data:
            return

        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _value  # Ignore whitespace

        if byte_data == 0x22:  # "
            return string_machine(on_value)

        if byte_data == 0x2d or (0x30 <= byte_data < 0x40):  # - or 0-9
            return number_machine(byte_data, on_number)

        if byte_data == 0x7b:  #:
            return object_machine(on_value)

        if byte_data == 0x5b:  # [
            return array_machine(on_value)

        if byte_data == 0x74:  # t
            return constant_machine(TRUE, True, on_value)

        if byte_data == 0x66:  # f
            return constant_machine(FALSE, False, on_value)

        if byte_data == 0x6e:  # n
            return constant_machine(NULL, None, on_value)

        if next_func == _value:
            raise Exception("Unexpected 0x" + str(byte_data))

        return next_func(byte_data)

    def on_value(value):
        emit(value)
        return next_func

    def on_number(number, byte):
        emit(number)
        return _value(byte)

    next_func = next_func or _value
    return _value


TRUE = [0x72, 0x75, 0x65]
FALSE = [0x61, 0x6c, 0x73, 0x65]
NULL = [0x75, 0x6c, 0x6c]


def constant_machine(bytes_data, value, emit):
    local_data = {"i": 0, "length": len(bytes_data)}

    def _constant(byte_data):
        # nonlocal i, length
        if byte_data != bytes_data[local_data["i"]]:
            local_data["i"] += 1
            raise Exception("Unexpected 0x" + byte_data.toString(16))

        local_data["i"] += 1

        if local_data["i"] < local_data["length"]:
            return _constant
        return emit(value)

    return _constant


def string_machine(emit):
    local_data = {"string": ""}

    def _string(byte_data):
        # nonlocal string

        if byte_data == 0x22:  # "
            return emit(local_data["string"])

        if byte_data == 0x5c:  # 
            return _escaped_string

        if byte_data & 0x80:  # UTF-8 handling
            return utf8_machine(byte_data, on_char_code)

        if byte_data < 0x20:  # ASCII control character
            raise Exception("Unexpected control character: 0x" + byte_data.toString(16))

        local_data["string"] += chr(byte_data)
        return _string

    def _escaped_string(byte_data):
        # nonlocal string

        if byte_data == 0x22 or byte_data == 0x5c or byte_data == 0x2f:  # "  /
            local_data["string"] += chr(byte_data)
            return _string

        if byte_data == 0x62:  # b
            local_data["string"] += "b"
            return _string

        if byte_data == 0x66:  # f
            local_data["string"] += "f"
            return _string

        if byte_data == 0x6e:  # n
            local_data["string"] += "n"
            return _string

        if byte_data == 0x72:  # r
            local_data["string"] += "r"
            return _string

        if byte_data == 0x74:  # t
            local_data["string"] += "t"
            return _string

        if byte_data == 0x75:  # u
            return hex_machine(on_char_code)

    def on_char_code(char_code):
        # nonlocal string
        local_data["string"] += chr(char_code)
        return _string

    return _string


# Nestable state machine for UTF-8 Decoding.
def utf8_machine(byte_data, emit):
    local_data = {"left": 0, "num": 0}

    def _utf8(byte_data):
        # nonlocal num, left
        if (byte_data & 0xc0) != 0x80:
            raise Exception("Invalid byte in UTF-8 character: 0x" + byte_data.toString(16))

        local_data["left"] -= 1

        local_data["num"] |= (byte_data & 0x3f) << (local_data["left"] * 6)
        if local_data["left"]:
            return _utf8
        return emit(local_data["num"])

    if 0xc0 <= byte_data < 0xe0:  # 2-byte UTF-8 Character
        local_data["left"] = 1
        local_data["num"] = (byte_data & 0x1f) << 6
        return _utf8

    if 0xe0 <= byte_data < 0xf0:  # 3-byte UTF-8 Character
        local_data["left"] = 2
        local_data["num"] = (byte_data & 0xf) << 12
        return _utf8

    if 0xf0 <= byte_data < 0xf8:  # 4-byte UTF-8 Character
        local_data["left"] = 3
        local_data["num"] = (byte_data & 0x07) << 18
        return _utf8

    raise Exception("Invalid byte in UTF-8 string: 0x" + str(byte_data))


# Nestable state machine for hex escaped characters
def hex_machine(emit):
    local_data = {"left": 4, "num": 0}

    def _hex(byte_data):
        # nonlocal num, left
        i = 0  # Parse the hex byte
        if 0x30 <= byte_data < 0x40:
            i = byte_data - 0x30
        elif 0x61 <= byte_data <= 0x66:
            i = byte_data - 0x57
        elif 0x41 <= byte_data <= 0x46:
            i = byte_data - 0x37
        else:
            raise Exception("Expected hex char in string hex escape")

        local_data["left"] -= 1
        local_data["num"] |= i << (local_data["left"] * 4)

        if local_data["left"]:
            return _hex
        return emit(local_data["num"])

    return _hex


def number_machine(byte_data, emit):
    local_data = {"sign": 1, "number": 0, "decimal": 0, "esign": 1, "exponent": 0}

    def _mid(byte_data):
        if byte_data == 0x2e:  # .
            return _decimal

        return _later(byte_data)

    def _number(byte_data):
        # nonlocal number
        if 0x30 <= byte_data < 0x40:
            local_data["number"] = local_data["number"] * 10 + (byte_data - 0x30)
            return _number

        return _mid(byte_data)

    def _start(byte_data):
        if byte_data == 0x30:
            return _mid

        if 0x30 < byte_data < 0x40:
            return _number(byte_data)

        raise Exception("Invalid number: 0x" + byte_data.toString(16))

    if byte_data == 0x2d:  # -
        local_data["sign"] = -1
        return _start

    def _decimal(byte_data):
        # nonlocal decimal
        if 0x30 <= byte_data < 0x40:
            local_data["decimal"] = (local_data["decimal"] + byte_data - 0x30) / 10
            return _decimal

        return _later(byte_data)

    def _later(byte_data):
        if byte_data == 0x45 or byte_data == 0x65:  # E e
            return _esign

        return _done(byte_data)

    def _esign(byte_data):
        # nonlocal esign
        if byte_data == 0x2b:  # +
            return _exponent

        if byte_data == 0x2d:  # -
            local_data["esign"] = -1
            return _exponent

        return _exponent(byte_data)

    def _exponent(byte_data):
        # nonlocal exponent
        if 0x30 <= byte_data < 0x40:
            local_data["exponent"] = local_data["exponent"] * 10 + (byte_data - 0x30)
            return _exponent

        return _done(byte_data)

    def _done(byte_data):
        value = local_data["sign"] * (local_data["number"] + local_data["decimal"])
        if local_data["exponent"]:
            value *= math.pow(10, local_data["esign"] * local_data["exponent"])

        return emit(value, byte_data)

    return _start(byte_data)


def array_machine(emit):
    local_data = {"array_data": []}

    def _array(byte_data):
        if byte_data == 0x5d:  # ]
            return emit(local_data["array_data"])

        return json_machine(on_value, _comma)(byte_data)

    def on_value(value):
        # nonlocal array_data
        local_data["array_data"].append(value)

    def _comma(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _comma  # Ignore whitespace

        if byte_data == 0x2c:  # ,
            return json_machine(on_value, _comma)

        if byte_data == 0x5d:  # ]
            return emit(local_data["array_data"])

        raise Exception("Unexpected byte: 0x" + str(byte_data) + " in array body")

    return _array


def object_machine(emit):
    local_data = {"object_data": {}, "key": ""}

    def _object(byte_data):
        # nonlocal object_data, key
        if byte_data == 0x7d:  #
            return emit(local_data["object_data"])

        return _key(byte_data)

    def _key(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _object  # Ignore whitespace

        if byte_data == 0x22:
            return string_machine(on_key)

        raise Exception("Unexpected byte: 0x" + byte_data.toString(16))

    def on_key(result):
        # nonlocal object_data, key
        local_data["key"] = result
        return _colon

    def _colon(byte_data):
        # nonlocal object_data, key
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _colon  # Ignore whitespace

        if byte_data == 0x3a:  # :
            return json_machine(on_value, _comma)

        raise Exception("Unexpected byte: 0x" + str(byte_data))

    def on_value(value):
        # nonlocal object_data, key
        local_data["object_data"][local_data["key"]] = value

    def _comma(byte_data):
        # nonlocal object_data
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _comma  # Ignore whitespace

        if byte_data == 0x2c:  # ,
            return _key

        if byte_data == 0x7d:  #
            return emit(local_data["object_data"])

        raise Exception("Unexpected byte: 0x" + str(byte_data))

    return _object

Testing it

if __name__ == "__main__":
    test_json = """[1,2,"3"] {"name": 
    "tarun"} 1 2 
    3 [{"name":"a", 
    "data": [1,
    null,2]}]
"""
    def found_json(data):
        print(data)

    state = json_machine(found_json)

    for char in test_json:
        state = state(ord(char))

The output of the same is

[1, 2, '3']
{'name': 'tarun'}
1
2
3
[{'name': 'a', 'data': [1, None, 2]}]
Answered By: Tarun Lalwani

If you use a json.JSONDecoder instance you can use raw_decode member function. It returns a tuple of python representation of the JSON value and an index to where the parsing stopped. This makes it easy to slice (or seek in a stream object) the remaining JSON values. I’m not so happy about the extra while loop to skip over the white space between the different JSON values in the input but it gets the job done in my opinion.

import json

def yield_multiple_value(f):
    '''
    parses multiple JSON values from a file.
    '''
    vals_str = f.read()
    decoder = json.JSONDecoder()
    try:
        nread = 0
        while nread < len(vals_str):
            val, n = decoder.raw_decode(vals_str[nread:])
            nread += n
            # Skip over whitespace because of bug, below.
            while nread < len(vals_str) and vals_str[nread].isspace():
                nread += 1
            yield val
    except json.JSONDecodeError as e:
        pass
    return

The next version is much shorter and eats the part of the string that is already parsed. It seems that for some reason a second call json.JSONDecoder.raw_decode() seems to fail when the first character in the string is a whitespace, that is also the reason why I skip over the whitespace in the whileloop above …

def yield_multiple_value(f):
    '''
    parses multiple JSON values from a file.
    '''
    vals_str = f.read()
    decoder = json.JSONDecoder()
    while vals_str:
        val, n = decoder.raw_decode(vals_str)
        #remove the read characters from the start.
        vals_str = vals_str[n:]
        # remove leading white space because a second call to decoder.raw_decode()
        # fails when the string starts with whitespace, and
        # I don't understand why...
        vals_str = vals_str.lstrip()
        yield val
    return

In the documentation about the json.JSONDecoder class the method raw_decode https://docs.python.org/3/library/json.html#encoders-and-decoders contains the following:

This can be used to decode a JSON document from a string that may have
extraneous data at the end.

And this extraneous data can easily be another JSON value. In other words the method might be written with this purpose in mind.

With the input.txt using the upper function I obtain the example output as presented in the original question.

Answered By: hetepeperfan

You can use https://pypi.org/project/json-stream-parser/ for exactly that purpose.

import sys
from json_stream_parser import load_iter
for obj in load_iter(sys.stdin):
    print(obj)

output

{'foo': ['bar', 'baz']}
1
2
[]
4
5
6
Answered By: user5203

If you have control over how the data is generated, you may want to switch to an alternate format like ndjson, which stands for Newline Delimited JSON and allows to stream incremental JSON-formatted data. Each line is valid JSON on its own. There are two Python packages available: ndjson and jsonlines.

There is also json-stream which allows you to process JSON as you read it, thus avoiding having to load the entire JSON upfront. You should be able to use it to read JSON data from a stream, but also use the same stream before and after for any other I/O operations, or to read several JSON objects from the same stream.

Answered By: jlh

I did go over the answers here and did some more research a few years back, and didn’t find any solution satisfactory. So I extended jq.py to expose jq‘s native JSON stream parsing. Unfortunately, me and Michael (the upstream author and maintainer) couldn’t come to an agreement on how to best implement it (PRs are still open). So I had to keep maintaining a fork. Nevertheless it worked well in production (processing lots of KernelCI build and test result submissions) for quite a while now.

Here’s the latest rebase of the fork: https://github.com/kernelci/jq.py/tree/1.2.3.post1 (diff)

Provided you have the requisite build tools installed, you can install it with e.g.:

pip3 install jq@git+https://github.com/kernelci/[email protected]

(but maybe check if I have a fresher rebase by that time)

Here’s example usage:

#!/usr/bin/env python3
import sys
import jq
import json

def read_chunk(stream, size):
    while True:
        chunk = stream.read(size)
        if chunk:
            yield chunk
        else:
            break

for obj in jq.parse_json(text_iter=read_chunk(sys.stdin, 4096)):
    # Do something with the parsed object
    print(json.dumps(obj, indent=4))

Supposing you saved the file above as jq_stream_test, you could test it with e.g.:

./jq_stream_test <<<'{"a":1}{"b":2}{"c":3}'

The output should be:

{
    "a": 1
}
{
    "b": 2
}
{
    "c": 3
}

If you like this approach, give Michael a shout in one of my PRs. Maybe he’ll want to pick this back up, and I’ll be happy to work with him to get this merged in one shape or another.

Answered By: spbnick
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.