How can I diagnose common errors in JSON data?

Question:

I have to deal with putative JSON from a lot of different sources, and a lot of the time it seems that there is a problem with the data itself. I suspect that it sometimes isn’t intended to be JSON at all; but a lot of the time it comes from a buggy tool, or it was written by hand for a quick test and has some typo in it.

Rather than ask about a specific error, I’m looking for a checklist: based on the error message, what is the most likely cause? What information is present in these error messages, and how can I use it to locate the problem in the data? Assume for these purposes that I can save the data to a temporary file for analysis, if it didn’t already come from a file.

Asked By: Karl Knechtel


Answers:

Foreword

The only exception explicitly raised by the decoding code is json.JSONDecodeError, so the exception type does not help diagnose problems. What’s interesting is the associated message. However, it is possible that decoding bytes to text fails, before JSON decoding can be attempted. That is a separate issue beyond the scope of this post.

It’s worth noting here that the JSON format documentation uses different terminology from Python. In particular, a portion of valid JSON data enclosed in {} is an object (not "dict") in JSON parlance, and a portion enclosed in [] is an array (not "list"). I will use JSON terminology when talking about the file contents, and Python terminology when talking about the parsed result or about data created directly by Python code.

As a general hint: use a dedicated JSON viewer to examine the file, or at least use a text editor that has some functionality to "balance" brackets (i.e., given that the insertion pointer is currently at a {, it will automatically find the matching }).
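Every json.JSONDecodeError carries the message and the position as attributes (msg, lineno, colno and pos), so the location can be extracted programmatically rather than read off the traceback. The following is a minimal sketch; the helper name describe_json_error and the context parameter are illustrative assumptions, not part of the json module.

```python
import json

def describe_json_error(text, context=20):
    """Return None if text parses as JSON, else a short report locating the error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # exc.pos is a 0-based offset into text; exc.lineno and exc.colno are 1-based.
        start = max(exc.pos - context, 0)
        snippet = text[start:exc.pos + context]
        return (
            f"{exc.msg} at line {exc.lineno}, column {exc.colno} "
            f"(char {exc.pos}); nearby text: {snippet!r}"
        )
    return None

report = describe_json_error('{"a": 1, "b": }')
print(report)
```

Showing a slice of the text around exc.pos is often enough to spot the problem without opening the file in an editor.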

Not JSON

An error message saying Expecting value is a strong indication that the data is not intended to be JSON formatted at all. Carefully note the line and column position of the error for more information:

  • if the error occurs at line 1, column 1, it will be necessary to inspect the beginning of the file. It could be that the data is actually empty. If it starts with <, then that of course suggests XML rather than JSON.
    Otherwise, there could be some padding preceding actual JSON content. Sometimes, this is to implement a security restriction in a web environment; in other cases it’s to work around a different restriction. The latter case is called JSONP (JSON with Padding). Either way, it will be necessary to inspect the data to figure out how much should be trimmed from the beginning (and possibly also the end) before parsing.

  • other positions might be because the data is actually the repr of some native Python data structure. Data like this can often be parsed using ast.literal_eval, but it should not be considered a practical serialization format – it doesn’t interoperate well with code not written in Python, and using repr can easily produce data that can’t be recovered this way (or in any practical way).
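For the padding case described above, the trimming can be sketched as follows. This assumes the padding has the usual JSONP shape of a function call wrapping the JSON, such as callback(...); the helper name strip_jsonp is hypothetical, and other kinds of prefix need their own trimming rule.

```python
import json

def strip_jsonp(text):
    """Return the text between the first '(' and the last ')' (hypothetical helper)."""
    start = text.find("(")
    end = text.rfind(")")
    if start == -1 or end <= start:
        raise ValueError("no JSONP-style wrapper found")
    return text[start + 1:end]

padded = 'callback({"key": "value"});'
parsed = json.loads(strip_jsonp(padded))
print(parsed)  # {'key': 'value'}
```

Always inspect the data first: this approach would silently mangle JSON that merely contains parentheses inside a string and has no wrapper at all.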

Note some common differences between Python’s native object representations and the JSON format, to help diagnose the problem:

  • JSON uses only double quotes to surround strings; Python may also use single quotes, as well as triple-single ('''example''') or triple-double ("""example""") quotes.

  • JSON uses lowercase true and false rather than True and False to represent booleans. It uses null rather than None as a special "there is nothing here" value. Standard JSON has no representation for special floating-point values, but Python’s json module accepts Infinity, -Infinity and NaN for them, rather than inf and nan.
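When the data turns out to be a Python repr rather than JSON, one possible recovery path is to fall back to ast.literal_eval and, if needed, re-serialize the result as real JSON. This is a sketch only; the function name load_json_or_python_literal is an illustrative assumption.

```python
import ast
import json

def load_json_or_python_literal(text):
    """Try JSON first; fall back to ast.literal_eval for repr-style data."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # literal_eval only accepts Python literals; it does not execute arbitrary code.
        return ast.literal_eval(text)

data = load_json_or_python_literal("{'name': 'example', 'ok': True, 'missing': None}")
print(json.dumps(data))  # {"name": "example", "ok": true, "missing": null}
```

If the text is neither JSON nor a Python literal, ast.literal_eval raises its own exception (typically ValueError or SyntaxError), which the caller still has to handle.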

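One subtlety: Expecting value can also indicate a trailing comma in an array. JSON syntax does not allow a trailing comma after listing elements or key-value pairs, although Python does. Although the comma is "extra", this will be reported as something missing (the next element) rather than something extraneous (the comma). A trailing comma in an object is reported the same way, but with the message Expecting property name enclosed in double quotes, because the parser is looking for the next key. (Recent Python versions may report trailing commas with a more specific message.)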


An error message saying Extra data indicates that there is more text after the end of the JSON data.

  • If the error occurs at line 2 column 1, this strongly suggests that the data is in fact in JSONL ("JSON Lines") format – a related format wherein each line of the input is a separate JSON entity (typically an object). Handling this is trivial: just iterate over lines of the input and parse each separately, and put the results in a list. For example, use a list comprehension: [json.loads(line) for line in open_json_file]. See Loading JSONL file as JSON objects for more.

  • Otherwise, the extra data could be part of JSONP padding. It can be removed before parsing; or else use the .raw_decode method of the JSONDecoder class:

    >>> import json
    >>> example = '{"key": "value"} extra'
    >>> json.loads(example) # breaks because of the extra data:
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
        return _default_decoder.decode(s)
      File "/usr/lib/python3.8/json/decoder.py", line 340, in decode
        raise JSONDecodeError("Extra data", s, end)
    json.decoder.JSONDecodeError: Extra data: line 1 column 18 (char 17)
    >>> parsed, size = json.JSONDecoder().raw_decode(example)
    >>> parsed
    {'key': 'value'}
    >>> size # amount of text that was parsed.
    16
    
  • Another possibility, especially likely if the error is on line 1 at a low column position (e.g. line 1, column 10), is that the data is in CSV format. For example, this was the case in Requesting raising "JSONDecodeError: Extra data".

    This can happen because the value in the "top-left cell of the spreadsheet" represented by the CSV file contains a comma. When that happens, the CSV format needs to surround that string in quotes (so that comma isn’t confused for a separator); that makes it look like valid JSON (for a JSON that only contains one string) followed by "extra data" (the comma separating that from the next "cell", along with the rest of the CSV data).

    For example, a valid CSV file could look like

    "x,y",z
    "(1, 2)",3
    

    The "x,y" is valid JSON by itself (representing exactly what one might expect), but parsing the whole thing causes an error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
        return _default_decoder.decode(s)
      File "/usr/lib/python3.8/json/decoder.py", line 340, in decode
        raise JSONDecodeError("Extra data", s, end)
    json.decoder.JSONDecodeError: Extra data: line 1 column 6 (char 5)
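For the JSON Lines case in the first bullet above, the list comprehension can be expanded into a slightly more defensive loop. This is a sketch under the assumption that blank lines should be ignored; the function name load_json_lines is hypothetical.

```python
import json

def load_json_lines(lines):
    """Parse an iterable of lines as JSON Lines, skipping blank lines."""
    results = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. an empty line at the end of the file
        try:
            results.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # exc.lineno is relative to this single line, so report the input line instead.
            raise ValueError(f"line {number}: {exc.msg} (column {exc.colno})") from exc
    return results

sample = '{"id": 1}\n{"id": 2}\n\n{"id": 3}\n'
records = load_json_lines(sample.splitlines())
print(records)  # [{'id': 1}, {'id': 2}, {'id': 3}]
```

An open file object can be passed directly in place of sample.splitlines(), since iterating over a text file yields its lines.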
    

Invalid string literals

Error messages saying any of:

  • Invalid \uXXXX escape
  • Invalid \escape
  • Unterminated string starting at
  • Invalid control character

suggest that a string in the data isn’t properly formatted, most likely due to a badly written escape code.

JSON strings can’t contain control codes in strict mode (the default for parsing), so e.g. a newline must be encoded with \n. Note that the data must actually contain a backslash; when viewing a representation of the JSON data as a string, that backslash would then be doubled up (but not when, say, printing the string).
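If the data really does contain raw control characters inside strings and cannot be fixed at the source, the strict parameter of json.loads can be turned off. A small sketch of the difference (the variable names are illustrative):

```python
import json

data = '"line one\nline two"'  # the Python \n puts a real newline inside the JSON string

try:
    json.loads(data)
    strict_message = None
except json.JSONDecodeError as exc:
    strict_message = exc.msg
print(strict_message)

lenient = json.loads(data, strict=False)  # strict=False tolerates raw control characters
print(repr(lenient))
```

This only relaxes the rule about control characters; other malformed escapes are still rejected.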

JSON doesn’t accept Python’s \x or \U escapes, only \u. To represent characters outside the BMP, use a surrogate pair:

>>> json.loads('"\\ud808\\udf45"') # encodes Unicode code point 0x12345 as a surrogate pair
'𒍅'

Unlike in Python string literals, a single backslash followed by something that doesn’t make a valid escape sequence (such as a space) will not be accepted:

>>> json.loads('"\\ "') # the input string has only one backslash
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 2 (char 1)

Similarly, single-quotes must not be escaped within JSON strings, although double-quotes must be.

When debugging or testing an issue like this at the REPL, it’s important not to get confused between JSON’s escaping and Python’s.

Wrong brackets

Expecting ',' delimiter and Expecting ':' delimiter imply a mismatch between the brackets used for an object or array and the contents. For example, JSON like ["foo": "bar"] was almost certainly intended to represent an object, so it should have enclosing {} rather than []. Look at the line and character position where the error was reported, and then scan backwards to the enclosing bracket.

However, these errors can also mean exactly what they say: there might simply be a comma missing between array elements or key-value pairs, or a colon missing between a key and its value.
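Both readings of these messages can be reproduced with small inputs. The helper error_message below is a hypothetical convenience for this illustration only:

```python
import json

def error_message(text):
    """Return the decoder's message for invalid JSON text, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return exc.msg
    return None

samples = [
    '["foo": "bar"]',   # object contents inside array brackets
    '{"foo", "bar"}',   # array contents inside object brackets
    '[1 2]',            # comma genuinely missing
    '{"foo" "bar"}',    # colon genuinely missing
]
for sample in samples:
    print(f"{sample!r}: {error_message(sample)}")
```

The first and third samples produce the same message, as do the second and fourth, which is why the surrounding brackets need to be checked before concluding that a delimiter is simply missing.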

Invalid key

While Python allows anything hashable as a dict key, JSON requires strings for its object keys. This problem is indicated by Expecting property name enclosed in double quotes. While it could occur in hand-written JSON, it more likely suggests data that was inappropriately created by using repr on a Python object. (This is especially likely if, upon checking the indicated location in the file, it appears that there is an attempt at a string key in single quotes.)

The error message Expecting property name enclosed in double quotes could also indicate a "wrong brackets" problem. In particular, if the data should be an array that contains integers, but was enclosed in {} instead of [], the parser would be expecting a double-quoted string key before anything else, and complain about the first integer in the list.
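The checklist in this answer can be condensed into a small lookup from message prefix to likely cause. This is only a sketch: the HINTS table and the diagnose function are illustrative assumptions, the hints are summaries of the sections above, and the exact messages may differ between Python versions.

```python
import json

# Message prefixes from the sections above, mapped to the likely causes they describe.
HINTS = [
    ("Expecting value", "not JSON at all, Python repr data, or a trailing comma"),
    ("Extra data", "JSON Lines, JSONP padding, or CSV"),
    ("Invalid", "badly written escape or raw control character in a string"),
    ("Unterminated string", "badly formed string literal"),
    ("Expecting ',' delimiter", "wrong brackets or a missing comma"),
    ("Expecting ':' delimiter", "wrong brackets or a missing colon"),
    ("Expecting property name", "non-string or single-quoted key, or {} used for an array"),
]

def diagnose(text):
    """Return a one-line diagnosis for text, based on the decoder's error message."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        where = f"line {exc.lineno}, column {exc.colno}"
        for prefix, hint in HINTS:
            if exc.msg.startswith(prefix):
                return f"{exc.msg} ({where}): {hint}"
        return f"{exc.msg} ({where}): no hint available"
    return "valid JSON"

for sample in ['', "{'a': 1}", '{1, 2, 3}', '{"a": 1} extra', '{"a": 1}']:
    print(f"{sample!r} -> {diagnose(sample)}")
```

The hint is a starting point only; the reported line and column should still be checked against the actual data.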

Answered By: Karl Knechtel