Flatten JSON with columns having values as a list in dataframe

Question:

I have the following data frame:

| Category | idlist                                                           |
| -------- | -----------------------------------------------------------------|
| new      | {100: ['1A', '2B'], 200: ['2A', '3B'], 300: ['3A', '3B', '4B']}  |
| old      |  ['99A', '99B']                                                  |

I want it to be converted into this:

| Category | id  | list                 |
| -------- | --- | -------------------- |
| new      | 100 | ['1A', '2B']         |     
| new      | 200 | ['2A', '3B']         |
| new      | 300 | ['3A', '3B', '4B']   |
| old      | -1  | ['99A', '99B']       |

The column ‘idlist’ is a json for category new with id as key and value a list of items. Then there is category that is old and it just has a list of items and since it doesn’t have any id we can default it to -1 or 0.

I was trying some examples on internet but I was not able to flatten the json for category new because the key is a number. Can this be achieved? I can handle the old separately as a subset by putting id as ‘-1’ because I really don’t need to flatten the cell corresponding to old. But is there a way to handle all this together?

As requested, I am sharing the original JSON format before I converted it into a df:

{'new' : {100: ['1-A', '2-B'], 200: ['2-A', '3-B'], 300: ['3-A', '3-B', '4-B']}, 'old' :  ['99-A', '99-B']}
Asked By: trojan horse

||

Answers:

click here

   def function1(s:pd.Series):
    global df2
    list1 = []
    var1=eval(s.idlist)
    if isinstance(var1,dict):
        for i in var1:
            list1.append({'Category':s.Category,'id':i,'list':var1[i]})
    else:
        list1.append({'Category':s.Category,'id':-1,'list':var1})
    df2=pd.concat([df2,pd.DataFrame(list1)])

df2=pd.DataFrame()
df1.apply(function1,axis=1)
df2.set_index('Category')
Answered By: G.G

For your original JSON of this:

jj = '''
{
 "new": {"100": ["1A", "2B"], "200": ["2A", "3B"], "300": ["3A", "3B", "4B"]},
 "old": ["99A", "99B"]
}
'''

I think it’s easier to pre-process the JSON rather than post-process the dataframe:

import json
import pandas as pd

dd = json.loads(jj)
ml = []
for k, v in dd.items():
    if isinstance(v, dict):
        ml += [{ 'Category' : k, 'id' : k2, 'list' : v2 } for k2, v2 in v.items()]
    else:
        ml += [{ 'Category' : k, 'id' : -1, 'list' : v }]

df = pd.DataFrame(ml)

Output:

  Category   id          list
0      new  100      [1A, 2B]
1      new  200      [2A, 3B]
2      new  300  [3A, 3B, 4B]
3      old   -1    [99A, 99B]
Answered By: Nick
  1. First isolate new and old in separate dataframes.
  2. Create new columns corresponding to each key in the idlist dictionary for df_new.
  3. Use pd.melt to "unpivot" df_new
  4. Modify df_old to match column for column with df_new,
  5. Use pd.concat to rejoin the two dataframes on axis 0 (rows).

See below:

# Recreated before you json file was attached    
import pandas as pd 
df = pd.DataFrame({'Category': ['new', 'old'], 
               'idlist': [{100: ['1A', '2B'], 200: ['2A', '3B'], 300: ['3A', '3B', '4B']}, ['99A', '99B']]})

# isolate new and old in separate dataframes
df_new = df[df['Category'] == "new"]
df_old = df[df['Category'] == "old"]

# Create new columns corresponding to each key in the idlist dictionary for df_new
my_dict = df_new['idlist'][0]
keys = list(my_dict.keys())
values = list(my_dict.values())

for k,v in zip(keys, values):
    df_new[k] = f"{v}" # F-string to format the values in output as "string lists"

# Drop idlist because you don't need it anymore
df_new = df_new.drop('idlist', 1)


# Use pd.melt to "unpivot" df_new
df_new = pd.melt(df_new, 
        id_vars='Category', 
        value_vars=list(df_new.columns[1:]),
        var_name='id', 
        value_name='list')

# Modify df_old to match column for column with df_new
df_old['list'] = df_old['idlist']
df_old = df_old.drop('idlist', axis = 1)
df_old['id'] = -1

# Use pd.concat to rejoin the two dataframes on axis 0 (rows)
clean_df = pd.concat([df_new, df_old], axis = 0).reset_index(drop = True)
Answered By: Benny06
  • Using pd.json_normalize, this can be accomplished without using loops.
  • The data in the OP is a dict, not a JSON string.
    • A dict or JSON from an API, that isn’t a str, can be loaded directly into pandas.
      • data = {"new": {"100": ["1-A", "2-B"], "200": ["2-A", "3-B"], "300": ["3-A", "3-B", "4-B"]}, "old": ["99-A", "99-B"]} – This is a dict
    • If the data is a str, then use data = json.loads(json_str), where json_str is your data.
      • json_str = '{"new": {"100": ["1-A", "2-B"], "200": ["2-A", "3-B"], "300": ["3-A", "3-B", "4-B"]}, "old": ["99-A", "99-B"]}' – This is a JSON string because it’s surrounded by quotes.
import json
import pandas

# 1. load data into pandas
df = pd.json_normalize(data)

# 2. melt the dataframe from wide to lone form
df = df.melt(var_name='Category', value_name='list')

# 3. split the numbers from the Category column and add -1
df[['Category', 'id']] = df.Category.str.split('.', expand=True).fillna(-1)

df views:

1.

            old     new.100     new.200          new.300
0  [99-A, 99-B]  [1-A, 2-B]  [2-A, 3-B]  [3-A, 3-B, 4-B]

2.

  Category             list
0      old     [99-A, 99-B]
1  new.100       [1-A, 2-B]
2  new.200       [2-A, 3-B]
3  new.300  [3-A, 3-B, 4-B]

3.

  Category             list   id
0      old     [99-A, 99-B]   -1
1      new       [1-A, 2-B]  100
2      new       [2-A, 3-B]  200
3      new  [3-A, 3-B, 4-B]  300
Answered By: Trenton McKinney