Reading JSON in Python separated by newlines

Question:

I am trying to read some JSON with the following format. A plain pd.read_json() raises ValueError: Trailing data. Adding lines=True raises ValueError: Expected object or value. I've tried various combinations of readlines() and load()/loads() so far without success.

Any ideas how I could get this into a dataframe?

{
    "content": "kdjfsfkjlffsdkj",
    "source": {
        "name": "jfkldsjf"
    },
    "title": "dsldkjfslj",
    "url": "vkljfklgjkdlgj"
}

{
    "content": "djlskgfdklgjkfgj",
    "source": {
        "name": "ldfjkdfjs"
    },
    "title": "lfsjdfklfldsjf",
    "url": "lkjlfggdflkjgdlf"
}
Asked By: Henry Dashwood

Answers:

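The sample you have above isn't valid JSON as a whole file: each object is valid on its own, but a JSON document can hold only one top-level value. To be valid JSON, these objects need to be inside a JSON array ([]) and be comma separated, as follows: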

[{
    "content": "kdjfsfkjlffsdkj",
    "source": {
        "name": "jfkldsjf"
    },
    "title": "dsldkjfslj",
    "url": "vkljfklgjkdlgj"
},

{
    "content": "djlskgfdklgjkfgj",
    "source": {
        "name": "ldfjkdfjs"
    },
    "title": "lfsjdfklfldsjf",
    "url": "lkjlfggdflkjgdlf"
}]

I just tried this on my machine. When formatted correctly, it works:

>>> pd.read_json('data.json')
            content                 source           title               url
0   kdjfsfkjlffsdkj   {'name': 'jfkldsjf'}      dsldkjfslj    vkljfklgjkdlgj
1  djlskgfdklgjkfgj  {'name': 'ldfjkdfjs'}  lfsjdfklfldsjf  lkjlfggdflkjgdlf
Answered By: SarahJessica

Another solution, if you do not want to reformat your files:
assuming your JSON is in a string called my_json, you could split it on the blank lines between the objects:

import json
import pandas as pd

splitted = my_json.split('\n\n')  # objects are separated by a blank line
my_list = [json.loads(e) for e in splitted]
df = pd.DataFrame(my_list)
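
If the separator is not exactly two newline characters (for instance Windows line endings, or stray spaces on the blank line), a regex split may be more forgiving. This is only a minimal sketch under that assumption; sample_text is a made-up stand-in for my_json, and it has not been tried on the asker's real file:

```python
import json
import re

# Hypothetical sample standing in for my_json: Windows line endings and
# stray spaces on the "blank" separator line.
sample_text = (
    '{"title": "a",\r\n "url": "x"}\r\n'
    '  \r\n'
    '{"title": "b",\r\n "url": "y"}\r\n'
)

# Split wherever a line break is followed by a whitespace-only line.
chunks = re.split(r"\r?\n\s*\n", sample_text.strip())

# Each chunk is now one complete JSON object.
records = [json.loads(chunk) for chunk in chunks]
```

The resulting records list can be passed to pd.DataFrame() just like my_list above.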

Answered By: Silveris

Thanks for the ideas, internet. None quite solved the problem in the way I needed (I had lots of newline characters in the strings themselves, which meant I couldn't split on them), but they helped point the way. In case anyone has a similar problem, this is what worked for me:

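import json
import pandas as pd

with open('path/to/original.json', 'r') as f:
    data = f.read()

# Split after every closing brace that ends a line, then restore the brace.
data = data.split("}\n")
data = [d.strip() + "}" for d in data]
# Drop pieces that were empty before the brace was re-added
# (e.g. the piece after the final object).
data = list(filter(("}").__ne__, data))
data = [json.loads(d) for d in data]

with open('path/to/reformatted.json', 'w') as f:
    json.dump(data, f)

df = pd.read_json('path/to/reformatted.json')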
Answered By: Henry Dashwood

If you can use jq, the solution is simpler. The -s (slurp) option reads every top-level value in the file into a single array:

jq -s '.' path/to/original.json > path/to/reformatted.json
Answered By: win_wave
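
For anyone without jq, the same "slurp every top-level value into one list" idea can be sketched in pure Python with json.JSONDecoder.raw_decode, which parses one value from a given position and reports where it stopped. Because it uses the JSON parser rather than string splitting, braces or escaped newlines inside string values should not confuse it. The helper name iter_json_values and the sample text are illustrative assumptions, not something taken from the answers above:

```python
import json


def iter_json_values(text):
    """Yield each top-level JSON value found in text, in order."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        # Parse one value starting at pos; pos becomes the index just past it.
        value, pos = decoder.raw_decode(text, pos)
        yield value


# Hypothetical sample: two objects separated by a blank line, the first
# containing an escaped newline and a closing brace inside a string.
sample_text = '''{
    "content": "line one\\nline two }",
    "source": {
        "name": "a"
    }
}

{
    "content": "second",
    "source": {
        "name": "b"
    }
}
'''

records = list(iter_json_values(sample_text))
```

As with the other answers, records is a plain list of dicts, so pd.DataFrame(records) should give one row per object.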