How to extract selected keys from JSON lines with non-fixed keys and merge them into one JSON object

Question:

I have a large JSON file with 30 million entries like this:

{"id":3,"price":"231","type":"Y","location":"NY"}
{"id":4,"price":"321","type":"N","city":"BR"}
{"id":5,"price":"354","type":"Y","city":"XE","location":"CP"}
--snip--
{"id":30373779,"price":"121","type":"N","city":"SR","location":"IU"}
{"id":30373780,"price":"432","type":"Y","location":"TB"}
{"id":30373780,"price":"562","type":"N","city":"CQ"}

How can I extract only the location and the city from each entry and combine them into a single JSON object like this, in Python?

{
    "orders":{
        3:{
            "location":"NY"
        },
        4:{
            "city":"BR"
        },
        5:{
            "city":"XE",
            "location":"CP"
        },
        30373779:{
            "city":"SR",
            "location":"IU"
        },
        30373780:{
            "location":"TB"
        },
        30373780:{
            "city":"CQ"
        }
    }
}

P.S.: Beautifying the output is not necessary.

Asked By: Int Ver


Answers:

Assuming your input file is actually in JSON Lines format (one JSON object per line), you can read it line by line, extract the city and location keys from each object, and add them to a new dict keyed by id:

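import json
from collections import defaultdict

# defaultdict(dict) creates the per-id dict the first time an id is seen
orders = {'orders': defaultdict(dict)}
with open('orders.txt', 'r') as f:
    for line in f:
        o = json.loads(line)
        order_id = o['id']  # avoid shadowing the built-in id()
        # copy only the keys of interest; lines with the same id are merged
        if 'location' in o:
            orders['orders'][order_id]['location'] = o['location']
        if 'city' in o:
            orders['orders'][order_id]['city'] = o['city']

# json.dumps turns the integer ids into string keys, as JSON requires
print(json.dumps(orders, indent=4))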

Output for your sample data (note it has two 30373780 id values, so the values get merged into one dict):

{
    "orders": {
        "3": {
            "location": "NY"
        },
        "4": {
            "city": "BR"
        },
        "5": {
            "location": "CP",
            "city": "XE"
        },
        "30373779": {
            "location": "IU",
            "city": "SR"
        },
        "30373780": {
            "location": "TB",
            "city": "CQ"
        }
    }
}
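
The same idea can be generalized to any set of keys. What follows is only a minimal sketch under stated assumptions: the function name `extract_keys` and its `wanted` parameter are hypothetical (they do not appear in the code above), every non-blank line is assumed to be a JSON object with an `id` field, and an in-memory `io.StringIO` stands in for the real file.

```python
import io
import json
from collections import defaultdict


def extract_keys(lines, wanted=("location", "city")):
    """Collect only the wanted keys from each JSON line, grouped by id.

    `extract_keys` and `wanted` are illustrative names, not an existing API.
    """
    orders = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        picked = {key: record[key] for key in wanted if key in record}
        if picked:
            # Entries sharing an id are merged into one dict; later values win.
            orders[record["id"]].update(picked)
    return {"orders": dict(orders)}


# Small in-memory stand-in for the real file (rows taken from the question).
sample = io.StringIO(
    '{"id":3,"price":"231","type":"Y","location":"NY"}\n'
    '{"id":30373780,"price":"432","type":"Y","location":"TB"}\n'
    '{"id":30373780,"price":"562","type":"N","city":"CQ"}\n'
)
result = extract_keys(sample)
print(json.dumps(result))
```

Passing an open file object instead of `sample` works the same way, since both are iterated line by line. The whole result is still held in memory, as in the code above.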
Answered By: Nick

Since you've said that your file is pretty big, you probably don't want to keep all entries in memory. Here is a way to consume the source file line by line and write the output immediately:

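import json

# Read the source line by line and write each entry out immediately,
# so only one record is held in memory at a time.
with open(r"in.jsonp") as i_f, open(r"out.json", "w") as o_f:
    o_f.write('{"orders":{')
    for i in i_f:
        i_obj = json.loads(i)
        # the id is written as a bare integer key, as in the question's example
        o_f.write(f'{i_obj["id"]}:')
        o_obj = {}
        # keep only the keys that are present and non-empty
        if location := i_obj.get("location"):
            o_obj["location"] = location
        if city := i_obj.get("city"):
            o_obj["city"] = city
        json.dump(o_obj, o_f)
        o_f.write(",")  # note: this also leaves a comma after the last entry
    o_f.write('}}')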

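It will generate a semi-valid JSON object in the same format you've provided in your question: the ids are written as unquoted integer keys, a trailing comma follows the last entry, and repeated ids appear as separate keys instead of being merged. The := operator requires Python 3.8 or later.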
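
If the output has to be strictly valid JSON, the streaming approach can be adjusted by quoting the keys and writing the comma before each entry except the first. This is a hedged sketch rather than the answer's own code: the function name `stream_orders` and the `wanted` parameter are hypothetical, and `io.StringIO` objects stand in for the input and output files.

```python
import io
import json


def stream_orders(src, dst, wanted=("location", "city")):
    """Write {"orders": {...}} to dst one entry at a time, as valid JSON.

    `stream_orders` and `wanted` are illustrative names, not an existing API.
    """
    dst.write('{"orders":{')
    first = True
    for line in src:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        picked = {key: record[key] for key in wanted if key in record}
        if not first:
            dst.write(",")  # separator goes before every entry but the first
        first = False
        # json.dumps(str(...)) yields a double-quoted key, as JSON requires.
        dst.write(f"{json.dumps(str(record['id']))}:{json.dumps(picked)}")
    dst.write("}}")


# In-memory stand-ins for the real files (rows taken from the question).
src = io.StringIO(
    '{"id":3,"price":"231","type":"Y","location":"NY"}\n'
    '{"id":4,"price":"321","type":"N","city":"BR"}\n'
)
dst = io.StringIO()
stream_orders(src, dst)
print(dst.getvalue())
```

Repeated ids are still written as repeated keys, because merging them would mean keeping earlier entries in memory. Python's `json.loads` keeps only the last value for a repeated key, so a consumer that needs both values should merge them in memory instead.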

Answered By: Olvin Roght