How to read a JSON file and convert it to CSV, keeping the headings, using DataFrames

Question:

I have a file that is of type .gz and inside I have JSON objects like:

input:

{ "name":"John", "age":21, "gender":"male" }
{ "name":"Mike", "age":29, "gender":"male" }
{ "name":"Tim", "age":20, "gender":"male" }
{ "name":"Kim", "age":39, "gender":"female" }

Note: there are no commas at the end of each JSON object.

I use the following to save it to a dataframe:

import pandas as pd
data_location = 's3://myBucket/myFolder'
raw_json_data = pd.read_json(data_location, lines=True)
raw_json_data.head(2)

I want to convert it to CSV, like this:

expected output:

name, age, gender
John, 21, male
Mike, 29, male
Tim, 20, male
Kim, 39, female

I tried the following, but it did not give the expected output. Am I missing something?

df=pd.read_json(raw_json_data)
df.to_csv('results.csv')
Asked By: Saffik


Answers:

  • The statement "I have a file that is of type .gz and inside I have JSON objects" is assumed to mean there is a .gz file with a .json file inside.
  • Use pathlib methods to read the file in, and then split the rows into a list of strings.
    • Path('test.json'): 'test.json' can be the full path to the file if it’s in a different directory.
  • Convert the strings to dicts with ast.literal_eval.
import pandas as pd
from pathlib import Path
from ast import literal_eval

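# read the file in using the pathlib methods and split it on newlines;
# strip() drops a trailing newline so no empty string reaches literal_eval
text = Path('test.json').read_text().strip().split('\n')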

# map the strings to dicts
text = map(literal_eval, text)

# load the list of dicts into a dataframe
df = pd.DataFrame(text)

# save to a csv
df.to_csv('results.csv', index=False)
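
One caveat with the steps above: ast.literal_eval parses Python literals, so it works for the sample data but would fail on JSON values such as true, false or null. Since each line is a valid JSON object on its own, a minimal sketch of the same idea with json.loads per line could look like this. The helper name json_lines_to_frame is hypothetical, and the file is assumed to be UTF-8 with one object per line.

```python
import json
from pathlib import Path

import pandas as pd


def json_lines_to_frame(path):
    """Parse a file with one JSON object per line into a DataFrame.

    Hypothetical helper for illustration; assumes UTF-8 text.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # skip blank lines, e.g. one left by a trailing newline
    records = [json.loads(line) for line in lines if line.strip()]
    # the dict keys become the column headings
    return pd.DataFrame(records)
```

Calling json_lines_to_frame('test.json').to_csv('results.csv', index=False) would then write the CSV with the headings and without the index column.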

Read from the .gz file

  • Loading the whole file at once with the json module is problematic because the file is not a single well-formed JSON document: it holds one JSON object per line, so each line has to be parsed separately.
import gzip
import pandas as pd
from ast import literal_eval

# open the gzip file
with gzip.open('testing.json.gz', 'rt', encoding='UTF-8') as zipfile:
    data = [literal_eval(v.strip()) for v in zipfile]

# create the dataframe
df = pd.DataFrame(data)

# save to a csv
df.to_csv('results.csv', index=False)
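
As an alternative to parsing the lines by hand, the read_json call from the question already returns a DataFrame, so a second pd.read_json on that DataFrame is not needed; the missing piece is most likely just to_csv with index=False. A minimal sketch, assuming a gzip-compressed file with one JSON object per line (the function name gz_json_lines_to_csv is hypothetical):

```python
import pandas as pd


def gz_json_lines_to_csv(src, dst, compression="infer"):
    """Read gzip-compressed, line-delimited JSON and write it as CSV.

    Hypothetical helper for illustration.
    """
    # lines=True parses one JSON object per line;
    # compression="infer" picks gzip from a ".gz" file extension
    df = pd.read_json(src, lines=True, compression=compression)
    # index=False keeps the headings but drops the row-number column
    df.to_csv(dst, index=False)
    return df
```

For example, gz_json_lines_to_csv('testing.json.gz', 'results.csv'). If the path does not end in .gz, passing compression="gzip" explicitly should be needed.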
Answered By: Trenton McKinney

Firstly, you can create a dataframe with a single column that holds each dictionary as a string

import json
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO("""
{ "name":"John", "age":21, "gender":"male" }
{ "name":"Mike", "age":29, "gender":"male" }
{ "name":"Tim", "age":20, "gender":"male" }
{ "name":"Kim", "age":39, "gender":"female" } 
"""), delimiter='|', header=None)  # instead of StringIO part, you can have the path of input file

df    
                 0
0   { "name":"John", "age":21, "gender":"male" }
1   { "name":"Mike", "age":29, "gender":"male" }
2   { "name":"Tim", "age":20, "gender":"male" }
3   { "name":"Kim", "age":39, "gender":"female" }

You can use json_normalize to convert each individual dictionary to a dataframe

def func(x):
    result = pd.json_normalize(json.loads(x.iloc[0]))
    return result

result = df.apply(func, axis=1)
result
0       name  age gender
0  John  21   male 
1       name  age gender
0  Mike  29   male 
2      name  age gender
0  Tim  20   male   
3      name  age  gender
0  Kim  39   female
dtype: object

The output above is a Series of dataframes; to combine them into a single dataframe, you can do the following

pd.concat([r for r in result], ignore_index=True)

    name    age gender
0   John    21  male
1   Mike    29  male
2   Tim     20  male
3   Kim     39  female
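
The apply and concat steps above could probably be collapsed, since json_normalize also accepts a list of dictionaries. A minimal sketch under that assumption, using the same sample rows (the variable names raw, records, flat and csv_text are only for illustration):

```python
import json
from io import StringIO

import pandas as pd

# stand-in for the input file; an open file handle can be iterated the same way
raw = StringIO("""
{ "name":"John", "age":21, "gender":"male" }
{ "name":"Mike", "age":29, "gender":"male" }
{ "name":"Tim", "age":20, "gender":"male" }
{ "name":"Kim", "age":39, "gender":"female" }
""")

# parse each non-blank line into a dict
records = [json.loads(line) for line in raw if line.strip()]

# normalize all records at once into a single dataframe
flat = pd.json_normalize(records)

# with no path given, to_csv returns the CSV as a string
csv_text = flat.to_csv(index=False)
print(csv_text)
```

Passing a file name to to_csv instead, e.g. flat.to_csv('results.csv', index=False), writes the same content to disk.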
Answered By: ggaurav