Pandas read nested json
Question:
I am curious how I can use pandas to read nested json of the following structure:
{
"number": "",
"date": "01.10.2016",
"name": "R 3932",
"locations": [
{
"depTimeDiffMin": "0",
"name": "Spital am Pyhrn Bahnhof",
"arrTime": "",
"depTime": "06:32",
"platform": "2",
"stationIdx": "0",
"arrTimeDiffMin": "",
"track": "R 3932"
},
{
"depTimeDiffMin": "0",
"name": "Windischgarsten Bahnhof",
"arrTime": "06:37",
"depTime": "06:40",
"platform": "2",
"stationIdx": "1",
"arrTimeDiffMin": "1",
"track": ""
},
{
"depTimeDiffMin": "",
"name": "Linz/Donau Hbf",
"arrTime": "08:24",
"depTime": "",
"platform": "1A-B",
"stationIdx": "22",
"arrTimeDiffMin": "1",
"track": ""
}
]
}
This here keeps the array as json. I would rather prefer it to be expanded into columns.
pd.read_json("/myJson.json", orient='records')
edit
Thanks for the first answers.
I should refine my question:
A flattening of the nested attributes in the array is not mandatory.
It would be ok to just [A, B, C] concatenate the df.locations[‘name’].
My file contains multiple JSON objects (1 per line) I would like to keep number, date, name, and locations column. However, I would need to join the locations.
allLocations = ""
isFirst = True
for location in result.locations:
if isFirst:
isFirst = False
allLocations = location['name']
else:
allLocations += "; " + location['name']
allLocations
My approach here does not seem to be efficient / pandas style.
Answers:
You can use json_normalize
:
import json
with open('myJson.json') as data_file:
data = json.load(data_file)
df = pd.json_normalize(data, 'locations', ['date', 'number', 'name'],
record_prefix='locations_')
print (df)
locations_arrTime locations_arrTimeDiffMin locations_depTime
0 06:32
1 06:37 1 06:40
2 08:24 1
locations_depTimeDiffMin locations_name locations_platform
0 0 Spital am Pyhrn Bahnhof 2
1 0 Windischgarsten Bahnhof 2
2 Linz/Donau Hbf 1A-B
locations_stationIdx locations_track number name date
0 0 R 3932 R 3932 01.10.2016
1 1 R 3932 01.10.2016
2 22 R 3932 01.10.2016
EDIT:
You can use read_json
with parsing name
by DataFrame
constructor and last groupby
with apply join
:
df = pd.read_json("myJson.json")
df.locations = pd.DataFrame(df.locations.values.tolist())['name']
df = df.groupby(['date','name','number'])['locations'].apply(','.join).reset_index()
print (df)
date name number locations
0 2016-01-10 R 3932 Spital am Pyhrn Bahnhof,Windischgarsten Bahnho...
A possible alternative to pandas.json_normalize
is to build your own dataframe by extracting only the selected keys and values from the nested dictionary. The main reason for doing this is because json_normalize gets slow for very large json file (and might not always produce the output you want).
So, here is an alternative way to flatten the nested dictionary in pandas using glom
. The aim is to extract selected keys and value from the nested dictionary and save them in a separate column of the pandas dataframe (:
Here is a step by step guide: https://medium.com/@enrico.alemani/flatten-nested-dictionaries-in-pandas-using-glom-7948345c88f5
import pandas as pd
from glom import glom
from ast import literal_eval
target = {
"number": "",
"date": "01.10.2016",
"name": "R 3932",
"locations":
{
"depTimeDiffMin": "0",
"name": "Spital am Pyhrn Bahnhof",
"arrTime": "",
"depTime": "06:32",
"platform": "2",
"stationIdx": "0",
"arrTimeDiffMin": "",
"track": "R 3932"
}
}
# Import data
df = pd.DataFrame([str(target)], columns=['target'])
# Extract id keys and save value into a separate pandas column
df['id'] = df['target'].apply(lambda row: glom(literal_eval(row), 'locations.name'))
Another option if anyone finds this, as I was working through a notebook.
Read the file in as a df with
df = pd.read_json('filename.json')
df2 = pd.DataFrame.from_records(df['nest_level_1']['nest_level_2'])
Happy coding
I have a mutiline Json having one json object every line
{‘a’:’b’,’scope’:{‘eid’:123213}}
{‘a’:’d’,’scope’:{‘eid’:1343213}}
NO comma seperated. Each line is indepoendent
i used following logic to read nested structure
threshold = pd.read_json(r"/content/data.json",lines=True)
threshold = pd.read_json(r"/content/data.json",lines=True)
threshold['entityId'] = pd.DataFrame.from_records(threshold['scope'])['entityId']
threshold.head()
I am curious how I can use pandas to read nested json of the following structure:
{
"number": "",
"date": "01.10.2016",
"name": "R 3932",
"locations": [
{
"depTimeDiffMin": "0",
"name": "Spital am Pyhrn Bahnhof",
"arrTime": "",
"depTime": "06:32",
"platform": "2",
"stationIdx": "0",
"arrTimeDiffMin": "",
"track": "R 3932"
},
{
"depTimeDiffMin": "0",
"name": "Windischgarsten Bahnhof",
"arrTime": "06:37",
"depTime": "06:40",
"platform": "2",
"stationIdx": "1",
"arrTimeDiffMin": "1",
"track": ""
},
{
"depTimeDiffMin": "",
"name": "Linz/Donau Hbf",
"arrTime": "08:24",
"depTime": "",
"platform": "1A-B",
"stationIdx": "22",
"arrTimeDiffMin": "1",
"track": ""
}
]
}
This here keeps the array as json. I would rather prefer it to be expanded into columns.
pd.read_json("/myJson.json", orient='records')
edit
Thanks for the first answers.
I should refine my question:
A flattening of the nested attributes in the array is not mandatory.
It would be ok to just [A, B, C] concatenate the df.locations[‘name’].
My file contains multiple JSON objects (1 per line) I would like to keep number, date, name, and locations column. However, I would need to join the locations.
allLocations = ""
isFirst = True
for location in result.locations:
if isFirst:
isFirst = False
allLocations = location['name']
else:
allLocations += "; " + location['name']
allLocations
My approach here does not seem to be efficient / pandas style.
You can use json_normalize
:
import json
with open('myJson.json') as data_file:
data = json.load(data_file)
df = pd.json_normalize(data, 'locations', ['date', 'number', 'name'],
record_prefix='locations_')
print (df)
locations_arrTime locations_arrTimeDiffMin locations_depTime
0 06:32
1 06:37 1 06:40
2 08:24 1
locations_depTimeDiffMin locations_name locations_platform
0 0 Spital am Pyhrn Bahnhof 2
1 0 Windischgarsten Bahnhof 2
2 Linz/Donau Hbf 1A-B
locations_stationIdx locations_track number name date
0 0 R 3932 R 3932 01.10.2016
1 1 R 3932 01.10.2016
2 22 R 3932 01.10.2016
EDIT:
You can use read_json
with parsing name
by DataFrame
constructor and last groupby
with apply join
:
df = pd.read_json("myJson.json")
df.locations = pd.DataFrame(df.locations.values.tolist())['name']
df = df.groupby(['date','name','number'])['locations'].apply(','.join).reset_index()
print (df)
date name number locations
0 2016-01-10 R 3932 Spital am Pyhrn Bahnhof,Windischgarsten Bahnho...
A possible alternative to pandas.json_normalize
is to build your own dataframe by extracting only the selected keys and values from the nested dictionary. The main reason for doing this is because json_normalize gets slow for very large json file (and might not always produce the output you want).
So, here is an alternative way to flatten the nested dictionary in pandas using glom
. The aim is to extract selected keys and value from the nested dictionary and save them in a separate column of the pandas dataframe (:
Here is a step by step guide: https://medium.com/@enrico.alemani/flatten-nested-dictionaries-in-pandas-using-glom-7948345c88f5
import pandas as pd
from glom import glom
from ast import literal_eval
target = {
"number": "",
"date": "01.10.2016",
"name": "R 3932",
"locations":
{
"depTimeDiffMin": "0",
"name": "Spital am Pyhrn Bahnhof",
"arrTime": "",
"depTime": "06:32",
"platform": "2",
"stationIdx": "0",
"arrTimeDiffMin": "",
"track": "R 3932"
}
}
# Import data
df = pd.DataFrame([str(target)], columns=['target'])
# Extract id keys and save value into a separate pandas column
df['id'] = df['target'].apply(lambda row: glom(literal_eval(row), 'locations.name'))
Another option if anyone finds this, as I was working through a notebook.
Read the file in as a df with
df = pd.read_json('filename.json')
df2 = pd.DataFrame.from_records(df['nest_level_1']['nest_level_2'])
Happy coding
I have a mutiline Json having one json object every line
{‘a’:’b’,’scope’:{‘eid’:123213}}
{‘a’:’d’,’scope’:{‘eid’:1343213}}
NO comma seperated. Each line is indepoendent
i used following logic to read nested structure
threshold = pd.read_json(r"/content/data.json",lines=True)
threshold = pd.read_json(r"/content/data.json",lines=True)
threshold['entityId'] = pd.DataFrame.from_records(threshold['scope'])['entityId']
threshold.head()