How to convert JSON data into a DataFrame

Question:

I need help converting JSON data into a DataFrame. Could you please show me how to do this?

Example:

JSON DATA

{
    "user_id": "vmani4",
    "password": "*****",
    "api_name": "KOL",
    "body": {
      "api_name": "KOL",
      "columns": [
        "kol_id",
        "jnj_id",
        "kol_full_nm",
        "thrc_cd"
      ],
      "filter": {
        "kol_id": "101152",
        "jnj_id": "7124166",
        "thrc_nm": "VIR"
        
      }
    }
}

Desired output:

user_id     password     api_name     columns       filter      filter_value
vmani4      *****        KOL          kol_id        kol_id      101152
                                      jnj_id        jnj_id      7124166
                                      kol_full_nm   thrc_nm     VIR
                                      thrc_cd
Asked By: shivam patel

Answers:

  • data is the JSON from the question, loaded as a Python dict (for example with json.loads).
  • Use pandas.json_normalize to load the JSON into a DataFrame, and drop the unneeded columns.
  • Use pandas.DataFrame.explode to expand the 'body.columns' list into separate rows.
  • Create a separate DataFrame for data['body']['filter']
  • Use pandas.DataFrame.join to combine the two DataFrames.
  • There isn’t a way to map all of 'filter' to all 'body.columns'.
    • 'thrc_nm' doesn’t map to anything in 'body.columns'.
    • 'filter' and 'filter_value' are added as separate columns, ordered by their order in the JSON, and not associated with the 'body.columns'.
  • Tested in python 3.10, pandas 1.4.3
import pandas as pd

# load the json data
df = pd.json_normalize(data).drop(columns=['body.filter.kol_id', 'body.filter.jnj_id', 'body.filter.thrc_nm'])

# explode the column
df = df.explode('body.columns', ignore_index=True)

# load and clean data[body][filter]
df_filter = pd.DataFrame.from_dict(data['body']['filter'], orient='index').reset_index().rename(columns={'index': 'filter', 0: 'filter_value'})

# join the dataframes
dfj = df.join(df_filter)

# display(dfj)
  user_id password api_name body.api_name body.columns   filter filter_value
0  vmani4    *****      KOL           KOL       kol_id   kol_id       101152
1  vmani4    *****      KOL           KOL       jnj_id   jnj_id      7124166
2  vmani4    *****      KOL           KOL  kol_full_nm  thrc_nm          VIR
3  vmani4    *****      KOL           KOL      thrc_cd      NaN          NaN
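
The snippet above assumes that data already exists as a Python dict. A minimal self-contained sketch of the same steps follows, assuming the JSON arrives as a string (the json_text name is made up for illustration). Instead of listing the filter columns by name, it drops every column whose name starts with 'body.filter.', which is one possible variation rather than the tested code above.

```python
import json

import pandas as pd

# the JSON from the question as a string (json_text is a hypothetical name)
json_text = """
{
    "user_id": "vmani4",
    "password": "*****",
    "api_name": "KOL",
    "body": {
        "api_name": "KOL",
        "columns": ["kol_id", "jnj_id", "kol_full_nm", "thrc_cd"],
        "filter": {"kol_id": "101152", "jnj_id": "7124166", "thrc_nm": "VIR"}
    }
}
"""
data = json.loads(json_text)

# flatten the nested JSON, then drop every flattened 'body.filter.*' column
df = pd.json_normalize(data)
filter_cols = [c for c in df.columns if c.startswith('body.filter.')]
df = df.drop(columns=filter_cols)

# one row per entry in the 'body.columns' list
df = df.explode('body.columns', ignore_index=True)

# filter keys and values as a two-column DataFrame
df_filter = pd.DataFrame(
    list(data['body']['filter'].items()),
    columns=['filter', 'filter_value'],
)

# join on the row index; rows without a matching filter entry get NaN
dfj = df.join(df_filter)
print(dfj)
```

Because the join is purely positional, the filter rows line up with 'body.columns' only by order, as noted in the bullets above.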

Option

  • I think it’s easier to have each filter as a column, with the value below it
# load data into a dataframe
df = pd.json_normalize(data)

# explode the column
df = df.explode('body.columns', ignore_index=True)

# display(df)
  user_id password api_name body.api_name body.columns body.filter.kol_id body.filter.jnj_id body.filter.thrc_nm
0  vmani4    *****      KOL           KOL       kol_id             101152            7124166                 VIR
1  vmani4    *****      KOL           KOL       jnj_id             101152            7124166                 VIR
2  vmani4    *****      KOL           KOL  kol_full_nm             101152            7124166                 VIR
3  vmani4    *****      KOL           KOL      thrc_cd             101152            7124166                 VIR
Answered By: Trenton McKinney

I’m not very familiar with DataFrames, but I tried my best to come up with a solution that produces your desired output.

Code

import pandas as pd
import json

json_data = """ {
    "user_id": "vmani4",
    "password": "*****",
    "api_name": "KOL",
    "body": {
      "api_name": "KOL",
      "columns": [
        "kol_id",
        "jnj_id",
        "kol_full_nm",
        "thrc_cd"
      ],
      "filter": {
        "kol_id": "101152",
        "jnj_id": "7124166",
        "thrc_nm": "VIR"
        
      }
    }
}"""

python_data = json.loads(json_data)

# collect the pieces needed for the output columns
first_level = {}      # top-level scalar values, e.g. user_id
columns = {}          # the 'columns' list from the nested dict
filter_keys = []      # keys of the nested 'filter' dict
filter_values = []    # values of the nested 'filter' dict

for key, value in python_data.items():
    if isinstance(value, dict):
        for sub_key, sub_value in value.items():
            if sub_key == 'columns':
                columns[sub_key] = sub_value
            if isinstance(sub_value, dict):
                filter_keys.extend(sub_value.keys())
                filter_values.extend(sub_value.values())
        break  # stop after the first nested dict ('body')
    first_level[key] = [value]

res = {**first_level, **columns, 'filter': filter_keys, 'filter_value': filter_values}

# the lists differ in length, so build one Series per column and concatenate
df = pd.concat([pd.Series(v, name=k) for k, v in res.items()], axis=1)
print(df)

Output

  user_id password api_name      columns   filter filter_value
0  vmani4    *****      KOL       kol_id   kol_id       101152
1     NaN      NaN      NaN       jnj_id   jnj_id      7124166
2     NaN      NaN      NaN  kol_full_nm  thrc_nm          VIR
3     NaN      NaN      NaN      thrc_cd      NaN          NaN

Here is a short explanation of my code. I first created several lists and dicts because your desired output contains columns, such as filter_value, that are not actually keys in your data.

I also loop through the dict items to build another dict that matches the desired output.

Finally, because the lists are not all the same length, I used pd.concat with one pd.Series per column to build the DataFrame.
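
The point about unequal lengths can be seen in isolation with a small sketch. It assumes a trimmed-down dict shaped like res above. The final fillna('') step is an optional addition, not part of the code above, that shows the padding as empty cells, as in the desired output.

```python
import pandas as pd

# columns of unequal length, mirroring the structure of `res` above
res = {
    'user_id': ['vmani4'],
    'columns': ['kol_id', 'jnj_id', 'kol_full_nm', 'thrc_cd'],
    'filter': ['kol_id', 'jnj_id', 'thrc_nm'],
}

# pd.DataFrame(res) raises ValueError here because the lists differ in length;
# one Series per column lets concat align on the index and pad with NaN
df = pd.concat([pd.Series(v, name=k) for k, v in res.items()], axis=1)

# optional: display the padding as empty strings instead of NaN
df_display = df.fillna('')
print(df_display)
```

Keeping NaN is usually better for further processing. The empty-string version is only meant for display.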

Answered By: Umutambyi Gad