How to expand JSON and nested JSON stored in DataFrame columns into new columns in Python Pandas?
Question:
I have a DataFrame like the one below:
Data types:
- COL1 – float
- COL2 – int
- COL3 – int
- COL4 – float
- COL5 – float
- COL6 – object
- COL7 – object
Source code:
import numpy as np
import pandas as pd
a = pd.DataFrame()
a["COL1"] = [0.0, 800.0]
a["COL2"] = [2, 3]
a["COL3"] = [123, 444]
a["COL4"] = [1500.0, 1600.0]
a["COL5"] = [700.0, 850.0]
a["COL6"] = ['{"account": {"sector": 2, "other": 15}}', np.nan]
a["COL7"] = ['{"value": "ab"}', np.nan]
- COL6 and COL7 contain JSON; the JSON in COL6 is nested.
- Both COL6 and COL7 may contain missing values.
- I need to convert the values from COL6 and COL7 into "normal" columns, but I cannot work out how to turn COL6 (nested JSON) into ordinary DataFrame columns with values.
Desired output:
For COL7 the output should look like the table below, but I am not sure what the output for COL6 should look like.
COL1 | COL2 | COL3 | COL4 | COL5 | value |
------|------|------|--------|-------|-------|
0.0 | 2 | 123 | 1500.0 | 700.0 | ab |
800.0 | 3 | 444 | 1600.0 | 850.0 | NaN |
How can I do that in Python Pandas?
The following attempt does not work: pd.json_normalize(df['COL7'].apply(ast.literal_eval))
It raises: ValueError: malformed node or string: nan
Source data (note that when I read it into Pandas there are also NaN values):
{'COL1': [0.0, 0.0, 0.0],
'COL2': [2, 0, 33],
'COL3': [2162561990, 2167912785, 599119703],
'COL4': [1500.0, 500.0, 3500.0],
'COL5': [750.0, 0.0, 3500.0],
'COL6': ['{"account": {"sector": 4, "other": 10}, "account_2": {"sector": 0, "other": 0}, "account_3": {"sector": 6, "other": 8}}'],
'COL7': ['{"value": "cc", "value_2": 15.58, "value_3": 646}']}
Answers:
You can try something like the following. First, flatten the nested JSON into a one-level dict.
The error you were getting is caused by the NaN values, so I added an if/else condition that skips them.
Code:
import ast
import pandas as pd
for col in ['COL6', 'COL7']:
    # Replace missing values with '' and flatten each parsed JSON object into a one-level dict
    a[col] = a[col].apply(lambda x: '' if pd.isnull(x) else list(pd.json_normalize(ast.literal_eval(x)).T.to_dict().values())[0])
a
#output
COL1 COL2 COL3 COL4 COL5 COL6 COL7
0 0.0 2 123 1500.0 700.0 {'account.sector': 2, 'account.other': 15} ab
1 800.0 3 444 1600.0 850.0
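After flattening, split that column into separate columns and concatenate them with the original data.
# The '' placeholder rows produce a stray column named 0, which is dropped here
a = pd.concat([a, a['COL6'].apply(pd.Series).drop(0, axis=1)], axis=1)
# Keep only the last part of dotted names, e.g. 'account.sector' -> 'sector'
a.columns = a.columns.str.split('.').str[-1]
Output: you get all of the columns; drop the ones you do not need.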
sector other
0 2.0 15.0
1 NaN NaN
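The same idea can also be sketched without the intermediate dict column. This is only a minimal sketch, not the answer's own method: it parses with json.loads instead of ast.literal_eval, and the helper name expand_json_column is made up for illustration. Missing values are replaced by empty dicts, so they end up as NaN in every new column.

```python
import json

import numpy as np
import pandas as pd

# Sample frame from the question.
a = pd.DataFrame({
    "COL1": [0.0, 800.0],
    "COL2": [2, 3],
    "COL3": [123, 444],
    "COL4": [1500.0, 1600.0],
    "COL5": [700.0, 850.0],
    "COL6": ['{"account": {"sector": 2, "other": 15}}', np.nan],
    "COL7": ['{"value": "ab"}', np.nan],
})


def expand_json_column(series):
    """Hypothetical helper: parse each JSON string and flatten it into columns.

    Missing values become empty dicts, so they yield all-NaN rows.
    """
    parsed = [{} if pd.isna(x) else json.loads(x) for x in series]
    flat = pd.json_normalize(parsed)  # nested keys become 'outer.inner'
    flat.index = series.index         # keep row alignment with the source frame
    return flat


json_cols = ["COL6", "COL7"]
expanded = [expand_json_column(a[col]) for col in json_cols]
result = pd.concat([a.drop(columns=json_cols)] + expanded, axis=1)
print(result)
```

The new columns keep the dotted names produced by pd.json_normalize (account.sector, account.other, value); they can be shortened afterwards with the same a.columns.str.split('.').str[-1] step if no names collide.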
Just for the fun of it, here is another possible solution: restructure the data into one dictionary per row and normalize those.
import pandas as pd
import json
data = '''{
"COL1": [0.0, 0.0, 0.0],
"COL2": [2, 0, 33],
"COL3": [2162561990, 2167912785, 599119703],
"COL4": [1500.0, 500.0, 3500.0],
"COL5": [750.0, 0.0, 3500.0],
"COL6": [
{
"account": {"sector": 4, "other": 10},
"account_2": {"sector": 0, "other": 0},
"account_3": {"sector": 6, "other": 8}
}
],
"COL7": [
{
"value": "cc",
"value_2": 15.58,
"value_3": 646}
]
}
'''
d = json.loads(data)
# One dict per output row
d1, d2, d3 = {}, {}, {}
cols = []
for k in list(d.keys()):
    if not isinstance(d[k][0], dict):
        # Plain columns: the i-th value goes to the i-th row
        d1[k] = d[k][0]
        d2[k] = d[k][1]
        d3[k] = d[k][2]
    else:
        # JSON columns: the i-th key of the object goes to the i-th row
        cols = list(d[k][0].keys())
        d1[cols[0]] = d[k][0][cols[0]]
        d2[cols[1]] = d[k][0][cols[1]]
        d3[cols[2]] = d[k][0][cols[2]]
df = pd.concat([pd.json_normalize(d1), pd.json_normalize(d2), pd.json_normalize(d3)], ignore_index=True)
yields:
COL1 COL2 COL3 COL4 COL5 value account.sector account.other value_2 account_2.sector account_2.other value_3 account_3.sector account_3.other
0 0.0 2 2162561990 1500.0 750.0 cc 4.0 10.0 NaN NaN NaN NaN NaN NaN
1 0.0 0 2167912785 500.0 0.0 NaN NaN NaN 15.58 0.0 0.0 NaN NaN NaN
2 0.0 33 599119703 3500.0 3500.0 NaN NaN NaN NaN NaN NaN 646.0 6.0 8.0
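The three hard-coded dictionaries can be replaced by a loop. The following is a rough sketch under the same assumption the code above makes, namely that each JSON object has exactly as many keys as there are rows and that its i-th key belongs to the i-th row. The names records and n_rows are only illustrative.

```python
import pandas as pd

# Same data as above, written directly as a Python dict.
d = {
    "COL1": [0.0, 0.0, 0.0],
    "COL2": [2, 0, 33],
    "COL3": [2162561990, 2167912785, 599119703],
    "COL4": [1500.0, 500.0, 3500.0],
    "COL5": [750.0, 0.0, 3500.0],
    "COL6": [{
        "account": {"sector": 4, "other": 10},
        "account_2": {"sector": 0, "other": 0},
        "account_3": {"sector": 6, "other": 8},
    }],
    "COL7": [{"value": "cc", "value_2": 15.58, "value_3": 646}],
}

n_rows = len(d["COL1"])
records = [{} for _ in range(n_rows)]  # one dict per output row

for key, values in d.items():
    if isinstance(values[0], dict):
        # Assumption: the i-th key of the JSON object belongs to the i-th row.
        for i, (sub_key, sub_value) in enumerate(values[0].items()):
            records[i][sub_key] = sub_value
    else:
        # Plain column: the i-th value goes to the i-th row.
        for i, value in enumerate(values):
            records[i][key] = value

# json_normalize flattens the nested 'account*' dicts into dotted columns.
df = pd.json_normalize(records)
print(df)
```

If a JSON object has more keys than there are rows, the loop raises an IndexError, which at least makes a violation of the assumption visible.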