Python conversion from JSON to JSONL

Question:

I wish to manipulate a standard JSON object to an object where each line must contain a separate, self-contained valid JSON object. See JSON Lines

JSON_file =

[{u'index': 1,
  u'no': 'A',
  u'met': u'1043205'},
 {u'index': 2,
  u'no': 'B',
  u'met': u'000031043206'},
 {u'index': 3,
  u'no': 'C',
  u'met': u'0031043207'}]

To JSONL:

{u'index': 1, u'no': 'A', u'met': u'1043205'}
{u'index': 2, u'no': 'B', u'met': u'031043206'}
{u'index': 3, u'no': 'C', u'met': u'0031043207'}

My current solution is to read the JSON file as a text file and remove the [ from the beginning and the ] from the end. Thus, creating a valid JSON object on each line, rather than a nested object containing lines.

I wonder if there is a more elegant solution? I suspect something could go wrong using string manipulation on the file.

The motivation is to read json files into RDD on Spark. See related question – Reading JSON with Apache Spark – `corrupt_record`

Asked By: LearningSlowly

||

Answers:

Your input appears to be a sequence of Python objects; it certainly is not valid a JSON document.

If you have a list of Python dictionaries, then all you have to do is dump each entry into a file separately, followed by a newline:

import json

with open('output.jsonl', 'w') as outfile:
    for entry in JSON_file:
        json.dump(entry, outfile)
        outfile.write('n')

The default configuration for the json module is to output JSON without newlines embedded.

Assuming your A, B and C names are really strings, that would produce:

{"index": 1, "met": "1043205", "no": "A"}
{"index": 2, "met": "000031043206", "no": "B"}
{"index": 3, "met": "0031043207", "no": "C"}

If you started with a JSON document containing a list of entries, just parse that document first with json.load()/json.loads().

Answered By: Martijn Pieters

the jsonlines package is made exactly for your use case:

import jsonlines

items = [
    {'a': 1, 'b': 2},
    {'a', 123, 'b': 456},
]
with jsonlines.open('output.jsonl', 'w') as writer:
    writer.write_all(items)

(yes, i wrote it years after you posted your original question.)

Answered By: wouter bolsterlee

A simple way to do this is with jq command in your terminal.

To install jq on Debian and derivatives:

$ sudo apt-get install jq

CentOS/RHEL users should run:

$ sudo yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
$ sudo yum install jq -y

Basic usage:

$ jq -c '.[]' some_json.json >> output.jsonl

If you need to handle with huge files, i strongly recommend to use --stream flag. This will make jq parse your json in streaming mode.

$ jq -c --stream '.[]' some_json.json >> output.json

But, if you need to do this operation into a python file, you can use bigjson , a useful library that parses the JSON in streaming mode:

$ pip3 install bigjson

To read a huge json (In my case, it was 40 GB):

import bigjson

# Reads json file in streaming mode
with open('input_file.json', 'rb') as f:
    json_data = bigjson.load(f)

    # Open output file  
    with open('output_file.jsonl', 'w') as outfile:
        # Iterates over input json
        for data in json_data:
            # Converts json to a Python dict  
            dict_data = data.to_python()
            
            # Saves the output to output file
            outfile.write(json.dumps(dict_data)+"n")

If you want, try to parallelize this code aiming to improve performance. Post the result here 🙂

Documentation and source code: https://github.com/henu/bigjson

Answered By: Jonas Ferreira

If you don’t want a library, it’s easy enough to do using json directly.

source

[
    {"index": 1,"no": "A","met": "1043205"},
    {"index": 2,"no": "B","met": "000031043206"},
    {"index": 3,"no": "C","met": "0031043207"}
]

code

import json

with open("test.json", 'r') as infile:
    data = json.load(infile)
    if len(data) > 0:
        print(json.dumps([t for t in data[0]]))
        for record in data:
            print(json.dumps([v for (k,v) in record.items()]))

result

["index", "no", "met"]
[1, "A", "1043205"]
[2, "B", "000031043206"]
[3, "C", "0031043207"]
Answered By: Konchog

Note that jsonl is a compacted json. You may need to pass separators without spaces:

with open(output_file_jsonl, 'a', encoding ='utf8') as json_file:
    for elem in rs:
        json_file.write(json.dumps(dict(elem), separators=(',', ':'), cls=DateTimeEncoder))
        json_file.write('n')
Answered By: Laya

This is an edit to this answer which takes into account the possibility of special symbols or using a different alphabet in the JSONL file. For example, I use Cyrillic and without the encoding and ensure_ascii parameters edited, I get really ugly results. I thought it could be useful:

with open('output.jsonl', 'w', encoding='utf8') as outfile:
    for entry in dataset_donut:
        json.dump(entry, outfile, ensure_ascii=False)
        outfile.write('n')
Answered By: Yana
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.