Convert pandas dataframe to JSON schema

Question:

I have a dataframe

import pandas as pd

data = {
  "ID": [123123, 222222, 333333],
  "Main Authors": ["[Jim Allen, Tim H]", "[Rob Garder, Harry S, Tim H]", "[Wo Shu, Tee Ru, Fuu Wan, Gee Han]"],
  "Abstract": ["This is paper about hehe", "This paper is very nice", "Hello there paper from kellogs"],
  "paper IDs": ["[123768, 123123]", "[123432, 34345, 353545, 454545]", "[123123, 3433434, 55656655, 988899]"],
}
df = pd.DataFrame(data)


and I am trying to export it to a JSON schema. I do so via

df.to_json(orient='records')

'[{"ID":123123,"Main Authors":"[Jim Allen, Tim H]","Abstract":"This is paper about hehe","paper IDs":"[123768, 123123]"},
{"ID":222222,"Main Authors":"[Rob Garder, Harry S, Tim H]","Abstract":"This paper is very nice","paper IDs":"[123432, 34345, 353545, 454545]"},
{"ID":333333,"Main Authors":"[Wo Shu, Tee Ru, Fuu Wan, Gee Han]","Abstract":"Hello there paper from kellogs","paper IDs":"[123123, 3433434, 55656655, 988899]"}]'

but this is not in the right format for JSON. How can I get my output to look like this

{"ID": "123123", "Main Authors": ["Jim Allen", "Tim H"], "Abstract": "This is paper about hehe", "paper IDs": ["123768", "123123"]}
{and so on for paper 2...}

I can’t find an easy way to achieve this schema with the basic functions.

Asked By: keeran_q789

||

Answers:

to_json returns a proper JSON document. What you want is not a JSON document.

Add lines=True to the call:

df.to_json(orient='records', lines=True)

The output you want is not a single valid JSON document, because it has no enclosing array and no commas between the objects. It is a very common way to stream JSON objects though: write one unindented JSON object per line.
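As a minimal sketch of what lines=True produces, the snippet below uses a shortened two-row, two-column version of the question's DataFrame (the shortening is an assumption made for brevity). It shows that every output line parses as a JSON document on its own, and that pd.read_json with lines=True reads the same text back.

```python
import io
import json

import pandas as pd

# Shortened stand-in for the DataFrame in the question.
df = pd.DataFrame({
    "ID": [123123, 222222],
    "Abstract": ["This is paper about hehe", "This paper is very nice"],
})

# One unindented JSON object per line, with no enclosing array.
text = df.to_json(orient="records", lines=True)
print(text)

# Each non-empty line is a complete JSON document by itself.
records = [json.loads(line) for line in text.splitlines() if line]

# pandas can read the line-delimited format back into a DataFrame.
round_trip = pd.read_json(io.StringIO(text), lines=True)
```

Note that lines=True only changes the framing of the records. Columns that hold strings such as "[Jim Allen, Tim H]" are still written as JSON strings, not as JSON arrays.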

Streaming JSON is an old technique, used to write JSON records to logs, send them over the network and so on. There’s no single official specification for the one-object-per-line form, although a lot of people have tried to claim it, even creating sites that mirror Douglas Crockford’s original JSON site or mimic the language of RFCs.

Streaming JSON formats are used a lot in IoT and event processing applications, where events will arrive over a long period of time.

PS: I remembered seeing a question about json-seq a few months ago. It seems there was an attempt to standardize streaming JSON in RFC 7464 as JSON Text Sequences, using the media type application/json-seq.
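For comparison, here is a rough sketch of the RFC 7464 framing, in which each JSON text is preceded by an ASCII record separator (0x1E) and followed by a line feed. The helper names to_json_seq and from_json_seq are made up for this illustration and are not part of pandas or the standard library.

```python
import json

RS = "\x1e"  # ASCII record separator that starts each record in RFC 7464


def to_json_seq(records):
    """Hypothetical helper: encode objects as a JSON text sequence."""
    return "".join(f"{RS}{json.dumps(rec)}\n" for rec in records)


def from_json_seq(text):
    """Hypothetical helper: decode a JSON text sequence into a list."""
    # Splitting on RS leaves an empty first chunk, which is skipped.
    return [json.loads(chunk) for chunk in text.split(RS) if chunk.strip()]


records = [{"ID": 123123}, {"ID": 222222}]
encoded = to_json_seq(records)
decoded = from_json_seq(encoded)
```

Unlike the one-object-per-line form, this framing is not what to_json(lines=True) emits, so it would only matter if the consumer explicitly expects application/json-seq.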

Answered By: Panagiotis Kanavos

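You can convert the DataFrame to a list of dictionaries first. Note that the list columns below hold real Python lists rather than strings such as "[Jim Allen, Tim H]".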

import pandas as pd

data = {
  "ID": [123123, 222222, 333333],
  "Main Authors": [["Jim Allen", "Tim H"], ["Rob Garder", "Harry S", "Tim H"], ["Wo Shu", "Tee Ru", "Fuu Wan", "Gee Han"]],
  "Abstract": ["This is paper about hehe", "This paper is very nice", "Hello there paper from kellogs"],
  "paper IDs": [[123768, 123123], [123432, 34345, 353545, 454545], [123123, 3433434, 55656655, 988899]],
}
df = pd.DataFrame(data)

df.to_dict('records')

The result:

[{'ID': 123123,
  'Main Authors': ['Jim Allen', 'Tim H'],
  'Abstract': 'This is paper about hehe',
  'paper IDs': [123768, 123123]},
 {'ID': 222222,
  'Main Authors': ['Rob Garder', 'Harry S', 'Tim H'],
  'Abstract': 'This paper is very nice',
  'paper IDs': [123432, 34345, 353545, 454545]},
 {'ID': 333333,
  'Main Authors': ['Wo Shu', 'Tee Ru', 'Fuu Wan', 'Gee Han'],
  'Abstract': 'Hello there paper from kellogs',
  'paper IDs': [123123, 3433434, 55656655, 988899]}]

Is that what you are looking for?
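If the columns still contain bracketed strings as in the question, one possible approach is to split them into lists before exporting. The sketch below assumes the items are comma-separated and contain no commas themselves. split_bracketed is a hypothetical helper, and the DataFrame is a shortened version of the one in the question.

```python
import json

import pandas as pd

# Shortened stand-in for the question's data, with lists stored as strings.
df = pd.DataFrame({
    "ID": [123123, 222222],
    "Main Authors": ["[Jim Allen, Tim H]", "[Rob Garder, Harry S, Tim H]"],
    "paper IDs": ["[123768, 123123]", "[123432, 34345, 353545, 454545]"],
})


def split_bracketed(value):
    """Hypothetical helper: turn "[a, b]" into ["a", "b"]."""
    inner = value.strip().strip("[]")
    return [item.strip() for item in inner.split(",") if item.strip()]


# Replace the string columns with real lists of strings.
for col in ["Main Authors", "paper IDs"]:
    df[col] = df[col].apply(split_bracketed)

# The desired output in the question shows IDs as strings.
df["ID"] = df["ID"].astype(str)

# Serialize each record on its own line.
lines = [json.dumps(rec) for rec in df.to_dict("records")]
print("\n".join(lines))
```

Each printed line then has the shape asked for in the question, with authors and paper IDs as JSON arrays of strings.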

Answered By: zalevskiaa