Nested JSON to pandas DataFrame with specific format
Question:
I need to load the contents of a JSON file into a pandas DataFrame in a specific layout so that I can run pandasql to transform the data and pass it through a scoring model.
file = C:scoring_modeljson.js
(contents of ‘file’ are below)
{
"response":{
"version":"1.1",
"token":"dsfgf",
"body":{
"customer":{
"customer_id":"1234567",
"verified":"true"
},
"contact":{
"email":"[email protected]",
"mobile_number":"0123456789"
},
"personal":{
"gender": "m",
"title":"Dr.",
"last_name":"Muster",
"first_name":"Max",
"family_status":"single",
"dob":"1985-12-23"
}
}
}
}
I need the DataFrame to look like this (all values belong on a single row; it is split in two here only to fit the question):
version | token | customer_id | verified | email | mobile_number | gender |
1.1 | dsfgf | 1234567 | true | [email protected] | 0123456789 | m |
title | last_name | first_name |family_status | dob
Dr. | Muster | Max | single | 23.12.1985
I have looked at all the other questions on this topic and have tried various ways to load the JSON file into pandas:
with open(r'C:scoring_modeljson.js', 'r') as f:
    c = pd.read_json(f.read())
with open(r'C:scoring_modeljson.js', 'r') as f:
    c = f.readlines()
I also tried pd.Panel() as in this solution, Python Pandas: How to split a sorted dictionary in a column of a dataframe, with the DataFrame results from [yo = f.readlines()]. I thought about splitting the contents of each cell on ("") and finding a way to put the split contents into different columns, but no luck so far.
Answers:
If you load the entire JSON as a dict (or list), e.g. using json.load, you can use json_normalize:
In [11]: d = {"response": {"body": {"contact": {"email": "[email protected]", "mobile_number": "0123456789"}, "personal": {"last_name": "Muster", "gender": "m", "first_name": "Max", "dob": "1985-12-23", "family_status": "single", "title": "Dr."}, "customer": {"verified": "true", "customer_id": "1234567"}}, "token": "dsfgf", "version": "1.1"}}
In [12]: df = pd.json_normalize(d)
In [13]: df.columns = df.columns.map(lambda x: x.split(".")[-1])
In [14]: df
Out[14]:
email mobile_number customer_id verified dob family_status first_name gender last_name title token version
0 [email protected] 0123456789 1234567 true 1985-12-23 single Max m Muster Dr. dsfgf 1.1
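The session above leaves the columns in whatever order pandas produces and keeps dob in ISO format, whereas the question asks for a specific column order and a date like 23.12.1985. A minimal sketch of those two extra steps follows, assuming the file content is valid JSON; the email address is a placeholder (the original value is not recoverable here) and the name wanted is introduced only for illustration.

```python
import json
import pandas as pd

# Inline copy of the question's data; the email value is a placeholder.
raw = """
{"response": {"version": "1.1", "token": "dsfgf",
  "body": {"customer": {"customer_id": "1234567", "verified": "true"},
           "contact": {"email": "max@example.com", "mobile_number": "0123456789"},
           "personal": {"gender": "m", "title": "Dr.", "last_name": "Muster",
                        "first_name": "Max", "family_status": "single",
                        "dob": "1985-12-23"}}}}
"""

d = json.loads(raw)
df = pd.json_normalize(d)

# Keep only the innermost key of each dotted column name,
# e.g. "response.body.personal.dob" becomes "dob".
df.columns = [col.split(".")[-1] for col in df.columns]

# Reformat the ISO date as dd.mm.yyyy, as shown in the question.
df["dob"] = pd.to_datetime(df["dob"]).dt.strftime("%d.%m.%Y")

# Column order requested in the question (hypothetical helper list).
wanted = ["version", "token", "customer_id", "verified", "email",
          "mobile_number", "gender", "title", "last_name", "first_name",
          "family_status", "dob"]
df = df.reindex(columns=wanted)
```

Note that taking only the last part of each dotted name is safe here because every leaf key is unique; if two nested objects shared a key name, the resulting columns would collide.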
It’s much easier if you deserialize the JSON using the built-in json module first (instead of pd.read_json()) and then flatten it using pd.json_normalize().
import json
import pandas as pd

# deserialize
with open(r'C:scoring_modeljson.js', 'r') as f:
    data = json.load(f)
# flatten (pass the dict loaded above)
df = pd.json_normalize(data)
If a dictionary is passed to json_normalize(), it’s flattened into a single row, but if a list is passed to it, it’s flattened into multiple rows. So if the nested structure contains only key-value pairs, pd.json_normalize() with no parameters suffices to flatten it.
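To make the dict-versus-list behaviour concrete, here is a small sketch. The second record is made up purely for illustration and is not part of the question's data.

```python
import pandas as pd

record = {"customer": {"customer_id": "1234567", "verified": "true"}}
# Hypothetical second record, only here to show the multi-row case.
other = {"customer": {"customer_id": "0987654", "verified": "false"}}

# A single dict is flattened into one row.
one_row = pd.json_normalize(record)

# A list of dicts is flattened into one row per element.
two_rows = pd.json_normalize([record, other])

print(one_row.shape)   # (1, 2)
print(two_rows.shape)  # (2, 2)
```

In both cases the nested keys are joined with a dot, so the columns are customer.customer_id and customer.verified.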
However, if the data contains a list (a JSON array somewhere in the nesting), you need to pass the record_path= argument so that pandas can find the path to the records. For example, suppose the data looks like the following (notice how the value under "body" is a list, i.e. a list of records):
data = {
"response":[
{
"version":"1.1",
"customer": {"id": "1234567", "verified":"true"},
"body":[
{"email":"[email protected]", "mobile_number":"0123456789"},
{"email":"[email protected]", "mobile_number":"9876543210"}
]
},
{
"version":"1.2",
"customer": {"id": "0987654", "verified":"true"},
"body":[
{"email":"[email protected]", "mobile_number":"9999999999"}
]
}
]
}
then you can pass record_path= to tell pandas that the records are under "body" and pass meta= to set the path to the metadata. Note how "body", "version" and "customer" are at the same level in the data, but "id" is nested one level deeper, so you need to pass a list to get the value under "id".
df = pd.json_normalize(data['response'], record_path=['body'], meta=['version', ['customer', 'id']])
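For reference, here is a self-contained run of that call, so the shape of the result is visible. The email addresses are placeholders, since the originals are not recoverable here.

```python
import pandas as pd

# Same structure as the example above; email values are placeholders.
data = {
    "response": [
        {
            "version": "1.1",
            "customer": {"id": "1234567", "verified": "true"},
            "body": [
                {"email": "a@example.com", "mobile_number": "0123456789"},
                {"email": "b@example.com", "mobile_number": "9876543210"},
            ],
        },
        {
            "version": "1.2",
            "customer": {"id": "0987654", "verified": "true"},
            "body": [
                {"email": "c@example.com", "mobile_number": "9999999999"},
            ],
        },
    ]
}

df = pd.json_normalize(
    data["response"],
    record_path=["body"],                   # one output row per item in "body"
    meta=["version", ["customer", "id"]],   # parent values repeated on each row
)

print(list(df.columns))  # ['email', 'mobile_number', 'version', 'customer.id']
print(len(df))           # 3
```

Each entry under "body" becomes its own row, and the parent's version and customer id are repeated alongside it, so the first response contributes two rows and the second contributes one.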