Nested JSON to pandas DataFrame with a specific format

Question:

I need to load the contents of a JSON file into a pandas DataFrame in a specific layout so that I can run pandasql to transform the data and then run it through a scoring model.

file = C:scoring_modeljson.js (contents of ‘file’ are below)

{
"response":{
  "version":"1.1",
  "token":"dsfgf",
   "body":{
     "customer":{
         "customer_id":"1234567",
         "verified":"true"
       },
     "contact":{
         "email":"[email protected]",
         "mobile_number":"0123456789"
      },
     "personal":{
         "gender": "m",
         "title":"Dr.",
         "last_name":"Muster",
         "first_name":"Max",
         "family_status":"single",
         "dob":"1985-12-23"
     }
   }
 }
}

I need the DataFrame to look like this (all values belong on a single row; I have wrapped them onto two lines only to fit this question):

version | token | customer_id | verified | email              | mobile_number | gender |
1.1     | dsfgf | 1234567     | true     | [email protected] | 0123456789    | m      |

title | last_name | first_name | family_status | dob
Dr.   | Muster    | Max        | single        | 23.12.1985

I have looked at the other questions on this topic and have tried various ways to load the JSON file into pandas:

with open(r'C:scoring_modeljson.js', 'r') as f:
    c = pd.read_json(f.read())

with open(r'C:scoring_modeljson.js', 'r') as f:
    c = f.readlines()

I also tried pd.Panel() as suggested in the solution to "Python Pandas: How to split a sorted dictionary in a column of a dataframe", using the result of [yo = f.readlines()]. I thought about splitting the contents of each cell on the double quotes ("") and putting the pieces into separate columns, but no luck so far.

Asked By: figgy

Answers:

If you load in the entire json as a dict (or list) e.g. using json.load, you can use json_normalize:

In [11]: d = {"response": {"body": {"contact": {"email": "[email protected]", "mobile_number": "0123456789"}, "personal": {"last_name": "Muster", "gender": "m", "first_name": "Max", "dob": "1985-12-23", "family_status": "single", "title": "Dr."}, "customer": {"verified": "true", "customer_id": "1234567"}}, "token": "dsfgf", "version": "1.1"}}

In [12]: df = pd.json_normalize(d)

In [13]: df.columns = df.columns.map(lambda x: x.split(".")[-1])

In [14]: df
Out[14]:
        email mobile_number customer_id verified         dob family_status first_name gender last_name title  token version
0  [email protected]    0123456789     1234567     true  1985-12-23        single        Max      m    Muster   Dr.  dsfgf     1.1
Answered By: Andy Hayden

It’s much easier if you deserialize the JSON using the built-in json module first (instead of pd.read_json()) and then flatten it using pd.json_normalize().

import json
import pandas as pd

# deserialize
with open(r'C:scoring_modeljson.js', 'r') as f:
    data = json.load(f)

# flatten
df = pd.json_normalize(data)
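
For anyone who wants to try this without the original file, here is a minimal end-to-end sketch. It assumes a shortened placeholder payload with the same nesting as in the question and writes it to a temporary file first. Stripping the dotted prefixes only works if the leaf keys are unique, and the dob conversion is just one way to get the DD.MM.YYYY format asked for in the question.

```python
import json
import os
import tempfile

import pandas as pd

# Placeholder payload with the same nesting as the file in the question.
payload = {
    "response": {
        "version": "1.1",
        "token": "dsfgf",
        "body": {
            "customer": {"customer_id": "1234567", "verified": "true"},
            "personal": {"last_name": "Muster", "dob": "1985-12-23"},
        },
    }
}

# Write it to a temporary file so the sketch does not depend on a real path.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(payload, tmp)
    path = tmp.name

# deserialize
with open(path, "r") as f:
    data = json.load(f)
os.remove(path)

# flatten: nested keys become dotted names such as "response.body.personal.dob"
df = pd.json_normalize(data)

# keep only the last part of each dotted name (assumes leaf keys are unique)
df.columns = [col.split(".")[-1] for col in df.columns]

# reformat the date of birth as DD.MM.YYYY, as in the desired output
df["dob"] = pd.to_datetime(df["dob"]).dt.strftime("%d.%m.%Y")

print(df)
```

If two branches of the JSON share a leaf key (say, an "id" under both "customer" and "contact"), keep the full dotted names instead, otherwise you end up with duplicate column names.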

If a dictionary is passed to json_normalize(), it’s flattened into a single row, but if a list is passed to it, it’s flattened into multiple rows. So if the nested structure contains only key-value pairs, pd.json_normalize() with no parameters suffices to flatten it.
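
A quick way to see the difference between the two cases is the sketch below. The two records are made-up placeholders, not data from the question.

```python
import pandas as pd

# Two placeholder records that contain only nested key-value pairs.
record = {"customer": {"customer_id": "1234567", "verified": "true"}, "token": "dsfgf"}
other = {"customer": {"customer_id": "7654321", "verified": "false"}, "token": "abcde"}

# a single dict is flattened into one row
one_row = pd.json_normalize(record)

# a list of dicts is flattened into one row per element
many_rows = pd.json_normalize([record, other])

print(one_row.shape)            # (1, 3)
print(many_rows.shape)          # (2, 3)
print(sorted(one_row.columns))  # ['customer.customer_id', 'customer.verified', 'token']
```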


However, if the data contains a list (a JSON array somewhere in the nesting), pass the record_path= argument so that pandas can find the path to the records. For example, suppose the data looks like the following (notice how the value under "body" is a list, i.e. a list of records):

data = {
    "response":[
        {
            "version":"1.1",
            "customer": {"id": "1234567", "verified":"true"},
            "body":[
                {"email":"[email protected]", "mobile_number":"0123456789"},
                {"email":"[email protected]", "mobile_number":"9876543210"}
            ]
        }, 
        {
            "version":"1.2",
            "customer": {"id": "0987654", "verified":"true"},
            "body":[
                {"email":"[email protected]", "mobile_number":"9999999999"}
            ]
        }
    ]
}

then you can pass record_path= to tell pandas that the records are under "body" and pass meta= to set the paths to the metadata. Note that "body", "version" and "customer" are at the same level in the data, but "id" is nested one level deeper, so you need to pass a list to get the value under "id".

df = pd.json_normalize(data['response'], record_path=['body'], meta=['version', ['customer', 'id']])
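
Here is a self-contained version of the call above that you can run to inspect the result. The e-mail addresses are placeholders I made up; everything else follows the sample data.

```python
import pandas as pd

# Same shape as the sample above; the e-mail addresses are placeholders.
data = {
    "response": [
        {
            "version": "1.1",
            "customer": {"id": "1234567", "verified": "true"},
            "body": [
                {"email": "a@example.com", "mobile_number": "0123456789"},
                {"email": "b@example.com", "mobile_number": "9876543210"},
            ],
        },
        {
            "version": "1.2",
            "customer": {"id": "0987654", "verified": "true"},
            "body": [
                {"email": "c@example.com", "mobile_number": "9999999999"},
            ],
        },
    ]
}

# One row per element of "body"; the meta values are repeated on each row.
df = pd.json_normalize(
    data["response"],
    record_path=["body"],
    meta=["version", ["customer", "id"]],
)

print(df.columns.tolist())  # ['email', 'mobile_number', 'version', 'customer.id']
print(len(df))              # 3
```

The nested meta path ['customer', 'id'] shows up as a dotted column name, "customer.id", and each of the three records carries the "version" and "customer.id" of the response it came from.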

Answered By: cottontail