How to deal with nested JSON

Question:

This is an example of one of the JSON objects given to me by the API. There are 100 of these.

[{"id": "133248644",
"associations": {"deals": {"results": [{"id": "2762673039",
                                          "type": "line_item_to_deal"}]}},
  "properties": {
            "createdate": "2020-08-06T15:05:23.253Z",
            "description": null,
            "hs_lastmodifieddate": "2020-08-06T15:05:23.253Z",
            "hs_object_id": "133248644",
            "name": "test product",
            "price": "100"},
 "createdAt": "2020-08-06T15:05:23.253Z",
 "updatedAt": "2020-08-06T15:05:23.253Z",
 "archived": false}]

I want to create a pandas dataframe that has a column for id as well as all of the properties associated with it, in addition to the id nested under "associations". Essentially I want to remove the properties from being nested under properties and the id from being nested under associations (as well as renamed). How would I go about this?

Here is a reproducable example of my attempt to solve the problem:

import json
import pandas as pd

response = """[{"id": "133248644",
"associations": {"deals": {"results": [{"id": "2762673039",
                                          "type": "line_item_to_deal"}]}},
  "properties": {
            "createdate": "2020-08-06T15:05:23.253Z",
            "description": null,
            "hs_lastmodifieddate": "2020-08-06T15:05:23.253Z",
            "hs_object_id": "133248644",
            "name": "test product",
            "price": "100"},
 "createdAt": "2020-08-06T15:05:23.253Z",
 "updatedAt": "2020-08-06T15:05:23.253Z",
 "archived": false}, 
{"id": "133345685",
 "associations": {"deals": {"results": [{"id": "2762673038",
                                          "type": "line_item_to_deal"}]}},
 "properties": {
             "createdate": 
             "2020-08-06T18:29:06.773Z", 
             "description": null,
             "hs_lastmodifieddate": "2020-08-06T18:29:06.773Z",
             "hs_object_id": "133345685",
             "name": "TEST PRODUCT 2",
             "price": "2222"},
 "createdAt": "2020-08-06T18:29:06.773Z", 
 "updatedAt": "2020-08-06T18:29:06.773Z",
 "archived": false}]"""


data = json.loads(response)
data_flat = [dict(id=x["id"], **x["properties"]) for x in data]

And this is a better solution, but still isn’t quite perfect.

data_flat = [dict(lineid=x["id"],dealid=x["associations"]["deals"]["results"][0]["id"], **x["properties"]) for x in data]

Finally, this is very useful but still requires me to extract the id from the associations column in kind of a convoluted way.

normal_data = pd.normalize_data(data)
Asked By: Jacob Bayer

||

Answers:

  • Dealing with lists of nest dicts is convoluted. There is not a readable one-liner for extracting the data.
  • Read data with pandas.json_normalize
  • associations.deals.results is a list of dicts, use pandas.DataFrame.explode to separate each dict in the list to a separate row
  • Use .json_normalize on 'associations.deals.results' to convert the dicts to columns.
  • pandas.DataFrame.join df to the normalize columns.
    • id already exists in the dataframe so id in the dict will get a right suffix, but type does not require a suffix, since it doesn’t exist in df.
  • Use pandas.DataFrame.rename to rename any desired columns.
  • Tested in python 3.10, pandas 1.4.3
import pandas as pd
import json

# convert response from a string to a list of dicts
data = json.loads(response)

# create a pandas dataframe
df = pd.json_normalize(data)

# associations.deals.results is a list of dicts, explode them
df = df.explode('associations.deals.results', ignore_index=True)

# normalize the dicts in associations.deals.results and join them back to df
df = df.join(pd.json_normalize(df.pop('associations.deals.results')), rsuffix='.associations.deals.results')

# display(df)
          id                 createdAt                 updatedAt  archived     properties.createdate properties.description properties.hs_lastmodifieddate properties.hs_object_id properties.name properties.price id.associations.deals.results               type
0  133248644  2020-08-06T15:05:23.253Z  2020-08-06T15:05:23.253Z     False  2020-08-06T15:05:23.253Z                   None       2020-08-06T15:05:23.253Z               133248644    test product              100                    2762673039  line_item_to_deal
1  133345685  2020-08-06T18:29:06.773Z  2020-08-06T18:29:06.773Z     False  2020-08-06T18:29:06.773Z                   None       2020-08-06T18:29:06.773Z               133345685  TEST PRODUCT 2             2222                    2762673038  line_item_to_deal

response

response = """[{"id": "133248644",
"associations": {"deals": {"results": [{"id": "2762673039",
                                          "type": "line_item_to_deal"}]}},
  "properties": {
            "createdate": "2020-08-06T15:05:23.253Z",
            "description": null,
            "hs_lastmodifieddate": "2020-08-06T15:05:23.253Z",
            "hs_object_id": "133248644",
            "name": "test product",
            "price": "100"},
 "createdAt": "2020-08-06T15:05:23.253Z",
 "updatedAt": "2020-08-06T15:05:23.253Z",
 "archived": false}, 
{"id": "133345685",
 "associations": {"deals": {"results": [{"id": "2762673038",
                                          "type": "line_item_to_deal"}]}},
 "properties": {
             "createdate": 
             "2020-08-06T18:29:06.773Z", 
             "description": null,
             "hs_lastmodifieddate": "2020-08-06T18:29:06.773Z",
             "hs_object_id": "133345685",
             "name": "TEST PRODUCT 2",
             "price": "2222"},
 "createdAt": "2020-08-06T18:29:06.773Z", 
 "updatedAt": "2020-08-06T18:29:06.773Z",
 "archived": false}]"""
Answered By: Trenton McKinney
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.