Flatten JSON with columns having values as a list in dataframe
Question:
I have the following data frame:
| Category | idlist |
| -------- | -----------------------------------------------------------------|
| new | {100: ['1A', '2B'], 200: ['2A', '3B'], 300: ['3A', '3B', '4B']} |
| old | ['99A', '99B'] |
I want it to be converted into this:
| Category | id | list |
| -------- | --- | -------------------- |
| new | 100 | ['1A', '2B'] |
| new | 200 | ['2A', '3B'] |
| new | 300 | ['3A', '3B', '4B'] |
| old | -1 | ['99A', '99B'] |
The column ‘idlist’ is a json for category new with id as key and value a list of items. Then there is category that is old and it just has a list of items and since it doesn’t have any id we can default it to -1 or 0.
I was trying some examples on internet but I was not able to flatten the json for category new because the key is a number. Can this be achieved? I can handle the old separately as a subset by putting id as ‘-1’ because I really don’t need to flatten the cell corresponding to old. But is there a way to handle all this together?
As requested, I am sharing the original JSON format before I converted it into a df:
{'new' : {100: ['1-A', '2-B'], 200: ['2-A', '3-B'], 300: ['3-A', '3-B', '4-B']}, 'old' : ['99-A', '99-B']}
Answers:
def function1(s:pd.Series):
global df2
list1 = []
var1=eval(s.idlist)
if isinstance(var1,dict):
for i in var1:
list1.append({'Category':s.Category,'id':i,'list':var1[i]})
else:
list1.append({'Category':s.Category,'id':-1,'list':var1})
df2=pd.concat([df2,pd.DataFrame(list1)])
df2=pd.DataFrame()
df1.apply(function1,axis=1)
df2.set_index('Category')
For your original JSON of this:
jj = '''
{
"new": {"100": ["1A", "2B"], "200": ["2A", "3B"], "300": ["3A", "3B", "4B"]},
"old": ["99A", "99B"]
}
'''
I think it’s easier to pre-process the JSON rather than post-process the dataframe:
import json
import pandas as pd
dd = json.loads(jj)
ml = []
for k, v in dd.items():
if isinstance(v, dict):
ml += [{ 'Category' : k, 'id' : k2, 'list' : v2 } for k2, v2 in v.items()]
else:
ml += [{ 'Category' : k, 'id' : -1, 'list' : v }]
df = pd.DataFrame(ml)
Output:
Category id list
0 new 100 [1A, 2B]
1 new 200 [2A, 3B]
2 new 300 [3A, 3B, 4B]
3 old -1 [99A, 99B]
- First isolate new and old in separate dataframes.
- Create new columns corresponding to each key in the idlist dictionary for df_new.
- Use pd.melt to "unpivot" df_new
- Modify df_old to match column for column with df_new,
- Use pd.concat to rejoin the two dataframes on axis 0 (rows).
See below:
# Recreated before you json file was attached
import pandas as pd
df = pd.DataFrame({'Category': ['new', 'old'],
'idlist': [{100: ['1A', '2B'], 200: ['2A', '3B'], 300: ['3A', '3B', '4B']}, ['99A', '99B']]})
# isolate new and old in separate dataframes
df_new = df[df['Category'] == "new"]
df_old = df[df['Category'] == "old"]
# Create new columns corresponding to each key in the idlist dictionary for df_new
my_dict = df_new['idlist'][0]
keys = list(my_dict.keys())
values = list(my_dict.values())
for k,v in zip(keys, values):
df_new[k] = f"{v}" # F-string to format the values in output as "string lists"
# Drop idlist because you don't need it anymore
df_new = df_new.drop('idlist', 1)
# Use pd.melt to "unpivot" df_new
df_new = pd.melt(df_new,
id_vars='Category',
value_vars=list(df_new.columns[1:]),
var_name='id',
value_name='list')
# Modify df_old to match column for column with df_new
df_old['list'] = df_old['idlist']
df_old = df_old.drop('idlist', axis = 1)
df_old['id'] = -1
# Use pd.concat to rejoin the two dataframes on axis 0 (rows)
clean_df = pd.concat([df_new, df_old], axis = 0).reset_index(drop = True)
- Using
pd.json_normalize
, this can be accomplished without using loops.
- The data in the OP is a
dict
, not a JSON string.
- A
dict
or JSON
from an API, that isn’t a str
, can be loaded directly into pandas.
data = {"new": {"100": ["1-A", "2-B"], "200": ["2-A", "3-B"], "300": ["3-A", "3-B", "4-B"]}, "old": ["99-A", "99-B"]}
– This is a dict
- If the data is a
str
, then use data = json.loads(json_str)
, where json_str
is your data.
json_str = '{"new": {"100": ["1-A", "2-B"], "200": ["2-A", "3-B"], "300": ["3-A", "3-B", "4-B"]}, "old": ["99-A", "99-B"]}'
– This is a JSON string because it’s surrounded by quotes.
import json
import pandas
# 1. load data into pandas
df = pd.json_normalize(data)
# 2. melt the dataframe from wide to lone form
df = df.melt(var_name='Category', value_name='list')
# 3. split the numbers from the Category column and add -1
df[['Category', 'id']] = df.Category.str.split('.', expand=True).fillna(-1)
df
views:
1.
old new.100 new.200 new.300
0 [99-A, 99-B] [1-A, 2-B] [2-A, 3-B] [3-A, 3-B, 4-B]
2.
Category list
0 old [99-A, 99-B]
1 new.100 [1-A, 2-B]
2 new.200 [2-A, 3-B]
3 new.300 [3-A, 3-B, 4-B]
3.
Category list id
0 old [99-A, 99-B] -1
1 new [1-A, 2-B] 100
2 new [2-A, 3-B] 200
3 new [3-A, 3-B, 4-B] 300
I have the following data frame:
| Category | idlist |
| -------- | -----------------------------------------------------------------|
| new | {100: ['1A', '2B'], 200: ['2A', '3B'], 300: ['3A', '3B', '4B']} |
| old | ['99A', '99B'] |
I want it to be converted into this:
| Category | id | list |
| -------- | --- | -------------------- |
| new | 100 | ['1A', '2B'] |
| new | 200 | ['2A', '3B'] |
| new | 300 | ['3A', '3B', '4B'] |
| old | -1 | ['99A', '99B'] |
The column ‘idlist’ is a json for category new with id as key and value a list of items. Then there is category that is old and it just has a list of items and since it doesn’t have any id we can default it to -1 or 0.
I was trying some examples on internet but I was not able to flatten the json for category new because the key is a number. Can this be achieved? I can handle the old separately as a subset by putting id as ‘-1’ because I really don’t need to flatten the cell corresponding to old. But is there a way to handle all this together?
As requested, I am sharing the original JSON format before I converted it into a df:
{'new' : {100: ['1-A', '2-B'], 200: ['2-A', '3-B'], 300: ['3-A', '3-B', '4-B']}, 'old' : ['99-A', '99-B']}
def function1(s:pd.Series):
global df2
list1 = []
var1=eval(s.idlist)
if isinstance(var1,dict):
for i in var1:
list1.append({'Category':s.Category,'id':i,'list':var1[i]})
else:
list1.append({'Category':s.Category,'id':-1,'list':var1})
df2=pd.concat([df2,pd.DataFrame(list1)])
df2=pd.DataFrame()
df1.apply(function1,axis=1)
df2.set_index('Category')
For your original JSON of this:
jj = '''
{
"new": {"100": ["1A", "2B"], "200": ["2A", "3B"], "300": ["3A", "3B", "4B"]},
"old": ["99A", "99B"]
}
'''
I think it’s easier to pre-process the JSON rather than post-process the dataframe:
import json
import pandas as pd
dd = json.loads(jj)
ml = []
for k, v in dd.items():
if isinstance(v, dict):
ml += [{ 'Category' : k, 'id' : k2, 'list' : v2 } for k2, v2 in v.items()]
else:
ml += [{ 'Category' : k, 'id' : -1, 'list' : v }]
df = pd.DataFrame(ml)
Output:
Category id list
0 new 100 [1A, 2B]
1 new 200 [2A, 3B]
2 new 300 [3A, 3B, 4B]
3 old -1 [99A, 99B]
- First isolate new and old in separate dataframes.
- Create new columns corresponding to each key in the idlist dictionary for df_new.
- Use pd.melt to "unpivot" df_new
- Modify df_old to match column for column with df_new,
- Use pd.concat to rejoin the two dataframes on axis 0 (rows).
See below:
# Recreated before you json file was attached
import pandas as pd
df = pd.DataFrame({'Category': ['new', 'old'],
'idlist': [{100: ['1A', '2B'], 200: ['2A', '3B'], 300: ['3A', '3B', '4B']}, ['99A', '99B']]})
# isolate new and old in separate dataframes
df_new = df[df['Category'] == "new"]
df_old = df[df['Category'] == "old"]
# Create new columns corresponding to each key in the idlist dictionary for df_new
my_dict = df_new['idlist'][0]
keys = list(my_dict.keys())
values = list(my_dict.values())
for k,v in zip(keys, values):
df_new[k] = f"{v}" # F-string to format the values in output as "string lists"
# Drop idlist because you don't need it anymore
df_new = df_new.drop('idlist', 1)
# Use pd.melt to "unpivot" df_new
df_new = pd.melt(df_new,
id_vars='Category',
value_vars=list(df_new.columns[1:]),
var_name='id',
value_name='list')
# Modify df_old to match column for column with df_new
df_old['list'] = df_old['idlist']
df_old = df_old.drop('idlist', axis = 1)
df_old['id'] = -1
# Use pd.concat to rejoin the two dataframes on axis 0 (rows)
clean_df = pd.concat([df_new, df_old], axis = 0).reset_index(drop = True)
- Using
pd.json_normalize
, this can be accomplished without using loops. - The data in the OP is a
dict
, not a JSON string.- A
dict
orJSON
from an API, that isn’t astr
, can be loaded directly into pandas.data = {"new": {"100": ["1-A", "2-B"], "200": ["2-A", "3-B"], "300": ["3-A", "3-B", "4-B"]}, "old": ["99-A", "99-B"]}
– This is adict
- If the data is a
str
, then usedata = json.loads(json_str)
, wherejson_str
is your data.json_str = '{"new": {"100": ["1-A", "2-B"], "200": ["2-A", "3-B"], "300": ["3-A", "3-B", "4-B"]}, "old": ["99-A", "99-B"]}'
– This is a JSON string because it’s surrounded by quotes.
- A
import json
import pandas
# 1. load data into pandas
df = pd.json_normalize(data)
# 2. melt the dataframe from wide to lone form
df = df.melt(var_name='Category', value_name='list')
# 3. split the numbers from the Category column and add -1
df[['Category', 'id']] = df.Category.str.split('.', expand=True).fillna(-1)
df
views:
1.
old new.100 new.200 new.300
0 [99-A, 99-B] [1-A, 2-B] [2-A, 3-B] [3-A, 3-B, 4-B]
2.
Category list
0 old [99-A, 99-B]
1 new.100 [1-A, 2-B]
2 new.200 [2-A, 3-B]
3 new.300 [3-A, 3-B, 4-B]
3.
Category list id
0 old [99-A, 99-B] -1
1 new [1-A, 2-B] 100
2 new [2-A, 3-B] 200
3 new [3-A, 3-B, 4-B] 300