How to flatten nested json data using pandas
Question:
So I wanted to flatten nested json data in this pandas data frame as additional columns. I am using requests to get json data from an api and create a pandas data frame.
import requests
import pandas as pd
import json
url = 'https://api.openaq.org/v2/locations?coordinates=38.7437396,-9.2302436&radius=5000'
api_request = requests.get(url)
api_content = api_request.content
api_json = json.loads(api_content)
df = pd.json_normalize(api_json)
df
This outputs the following:
results meta.name meta.license meta.website meta.page meta.limit meta.found
0 [{'id': 8209, 'city': 'Lisboa', 'name': 'PT031... openaq-api CC BY 4.0d api.openaq.org 1 100 4
so, I did
df = pd.json_normalize(api_json, record_path=["results"])
Which gives a nicer result:
id city name entity country sources isMobile isAnalysis parameters sensorType lastUpdated firstUpdated measurements bounds manufacturers coordinates.latitude coordinates.longitude
0 8209 Lisboa PT03100 government PT [{'url': 'http://www.eea.europa.eu/themes/air/... False False [{'id': 25414, 'unit': 'µg/m³', 'count': 10005... reference grade 2022-08-27T16:00:00+00:00 2017-09-22T11:00:00+00:00 246216 None None 38.748056 -9.202500
1 8211 Lisboa PT03084 government PT [{'url': 'http://www.eea.europa.eu/themes/air/... False False [{'id': 25397, 'unit': 'µg/m³', 'count': 93281... reference grade 2022-08-27T16:00:00+00:00 2017-09-22T10:00:00+00:00 239733 None None 38.754167 -9.230833
2 7199 Lisboa PT03087 government PT [{'url': 'http://www.eea.europa.eu/themes/air/... False False [{'id': 24800, 'unit': 'µg/m³', 'count': 97702... reference grade 2022-08-27T16:00:00+00:00 2017-09-22T13:00:00+00:00 235512 None None 38.705000 -9.210278
3 7767 Lisboa PT03082 government PT [{'url': 'http://www.eea.europa.eu/themes/air/... False False [{'id': 25393, 'unit': 'µg/m³', 'count': 83437... reference grade 2022-08-27T16:00:00+00:00 2018-02-27T01:00:00+00:00 304693 None None 38.738889 -9.207500
However, the ‘sources’ and ‘parameter’ columns are still in json format. How can I unpack these into new columns?
Also, can I for instance just pull the ‘url’ key out of the ‘sources’ json data and make it its own column?
Answers:
You can do like this. Although there are so many columns you might want to prune things up a bit before merging into main df.
df_sources = pd.json_normalize(df.sources.explode())
df_params = pd.json_normalize(df.parameters.explode())
df = pd.concat([df, df_sources, df_params], axis=1)
del df['sources'], df['parameters']
This json is quite complicated, and is more so multiple dataframes in one.
We can differentiate between the inner frames and the metadata like so:
result1 = api_json['results'][0]
# It's a frame if it's a list
frames = [x for x, y in result1.items() if isinstance(y, list)]
# Otherwise it's just metadata
metadata = [x for x in result1 if x not in frames]
Now, let’s make a list of dataframes, I’ll remove null columns as well:
dfs = [pd.json_normalize(api_json['results'], frame, metadata, record_prefix=frame + '_').dropna(how='all', axis=1) for frame in frames]
Fix up that coordinates column:
dfs = [pd.concat([df, df.coordinates.apply(pd.Series)], axis=1).drop('coordinates', axis=1) for df in dfs]
metadata += ['latitude', 'longitude']
And concat the dfs together, using the metadata as the index:
df = pd.concat([df.set_index([x for x in df.columns if x in metadata]) for df in dfs], axis=1)
print(df)
# Output:
sources_url sources_name sources_id parameters_id parameters_unit parameters_count parameters_average parameters_lastValue parameters_parameter parameters_displayName parameters_lastUpdated parameters_parameterId parameters_firstUpdated
id city name entity country isMobile isAnalysis sensorType lastUpdated firstUpdated measurements latitude longitude
7199 Lisboa PT03087 government PT False False reference grade 2022-08-27 18:00:00+00:00 2017-09-22 13:00:00+00:00 235524 38.705000 -9.210278 http://www.eea.europa.eu/themes/air/air-quality EEA eea 20728 µg/m³ 38829 0.000000 0.0 pm10 PM10 2019-12-24 16:00:00+00:00 1 2018-03-22 16:00:00+00:00
-9.210278 http://www.eea.europa.eu/themes/air/air-quality EEA eea 24802 µg/m³ 98987 53.124865 58.0 o3 O₃ mass 2022-08-27 18:00:00+00:00 3 2017-09-22 13:00:00+00:00
-9.210278 http://www.eea.europa.eu/themes/air/air-quality EEA eea 24800 µg/m³ 97708 18.870253 5.0 no2 NO₂ mass 2022-08-27 18:00:00+00:00 5 2017-09-22 13:00:00+00:00
7767 Lisboa PT03082 government PT False False reference grade 2022-08-27 18:00:00+00:00 2018-02-27 01:00:00+00:00 304711 38.738889 -9.207500 http://www.eea.europa.eu/themes/air/air-quality EEA eea 25392 µg/m³ 85382 24.400488 12.7 no2 NO₂ mass 2022-08-27 18:00:00+00:00 5 2018-03-22 12:00:00+00:00
-9.207500 http://www.eea.europa.eu/themes/air/air-quality EEA eea 22595 µg/m³ 68071 3.050738 5.0 so2 SO₂ mass 2022-08-27 18:00:00+00:00 6 2018-03-22 12:00:00+00:00
-9.207500 http://www.eea.europa.eu/themes/air/air-quality EEA eea 25391 µg/m³ 67821 3528.784949 1221.0 co CO mass 2022-08-27 18:00:00+00:00 4 2018-02-27 01:00:00+00:00
-9.207500 http://www.eea.europa.eu/themes/air/air-quality EEA eea 25393 µg/m³ 83437 44.678977 67.4 o3 O₃ mass 2022-07-25 14:00:00+00:00 3 2018-03-06 18:00:00+00:00
8209 Lisboa PT03100 government PT False False reference grade 2022-08-27 18:00:00+00:00 2017-09-22 11:00:00+00:00 246228 38.748056 -9.202500 http://www.eea.europa.eu/themes/air/air-quality EEA eea 23940 µg/m³ 47279 6.259864 8.8 pm10 PM10 2022-08-27 12:00:00+00:00 1 2017-09-22 11:00:00+00:00
-9.202500 http://www.eea.europa.eu/themes/air/air-quality EEA eea 24099 µg/m³ 98890 298.816190 197.0 co CO mass 2022-08-27 18:00:00+00:00 4 2017-09-22 11:00:00+00:00
-9.202500 http://www.eea.europa.eu/themes/air/air-quality EEA eea 25414 µg/m³ 100059 32.176792 29.6 no2 NO₂ mass 2022-08-27 18:00:00+00:00 5 2017-09-22 11:00:00+00:00
8211 Lisboa PT03084 government PT False False reference grade 2022-08-27 18:00:00+00:00 2017-09-22 10:00:00+00:00 239745 38.754167 -9.230833 http://www.eea.europa.eu/themes/air/air-quality EEA eea 25398 µg/m³ 99881 57.904319 94.0 o3 O₃ mass 2022-08-27 18:00:00+00:00 3 2017-09-22 11:00:00+00:00
-9.230833 http://www.eea.europa.eu/themes/air/air-quality EEA eea 23942 µg/m³ 46577 3.805825 10.0 pm10 PM10 2022-08-27 14:00:00+00:00 1 2017-09-22 10:00:00+00:00
-9.230833 http://www.eea.europa.eu/themes/air/air-quality EEA eea 25397 µg/m³ 93287 16.333777 10.9 no2 NO₂ mass 2022-08-27 18:00:00+00:00 5 2017-09-22 11:00:00+00:00
So I wanted to flatten nested json data in this pandas data frame as additional columns. I am using requests to get json data from an api and create a pandas data frame.
import requests
import pandas as pd
import json
url = 'https://api.openaq.org/v2/locations?coordinates=38.7437396,-9.2302436&radius=5000'
api_request = requests.get(url)
api_content = api_request.content
api_json = json.loads(api_content)
df = pd.json_normalize(api_json)
df
This outputs the following:
results meta.name meta.license meta.website meta.page meta.limit meta.found
0 [{'id': 8209, 'city': 'Lisboa', 'name': 'PT031... openaq-api CC BY 4.0d api.openaq.org 1 100 4
so, I did
df = pd.json_normalize(api_json, record_path=["results"])
Which gives a nicer result:
id city name entity country sources isMobile isAnalysis parameters sensorType lastUpdated firstUpdated measurements bounds manufacturers coordinates.latitude coordinates.longitude
0 8209 Lisboa PT03100 government PT [{'url': 'http://www.eea.europa.eu/themes/air/... False False [{'id': 25414, 'unit': 'µg/m³', 'count': 10005... reference grade 2022-08-27T16:00:00+00:00 2017-09-22T11:00:00+00:00 246216 None None 38.748056 -9.202500
1 8211 Lisboa PT03084 government PT [{'url': 'http://www.eea.europa.eu/themes/air/... False False [{'id': 25397, 'unit': 'µg/m³', 'count': 93281... reference grade 2022-08-27T16:00:00+00:00 2017-09-22T10:00:00+00:00 239733 None None 38.754167 -9.230833
2 7199 Lisboa PT03087 government PT [{'url': 'http://www.eea.europa.eu/themes/air/... False False [{'id': 24800, 'unit': 'µg/m³', 'count': 97702... reference grade 2022-08-27T16:00:00+00:00 2017-09-22T13:00:00+00:00 235512 None None 38.705000 -9.210278
3 7767 Lisboa PT03082 government PT [{'url': 'http://www.eea.europa.eu/themes/air/... False False [{'id': 25393, 'unit': 'µg/m³', 'count': 83437... reference grade 2022-08-27T16:00:00+00:00 2018-02-27T01:00:00+00:00 304693 None None 38.738889 -9.207500
However, the ‘sources’ and ‘parameter’ columns are still in json format. How can I unpack these into new columns?
Also, can I for instance just pull the ‘url’ key out of the ‘sources’ json data and make it its own column?
You can do like this. Although there are so many columns you might want to prune things up a bit before merging into main df.
df_sources = pd.json_normalize(df.sources.explode())
df_params = pd.json_normalize(df.parameters.explode())
df = pd.concat([df, df_sources, df_params], axis=1)
del df['sources'], df['parameters']
This json is quite complicated, and is more so multiple dataframes in one.
We can differentiate between the inner frames and the metadata like so:
result1 = api_json['results'][0]
# It's a frame if it's a list
frames = [x for x, y in result1.items() if isinstance(y, list)]
# Otherwise it's just metadata
metadata = [x for x in result1 if x not in frames]
Now, let’s make a list of dataframes, I’ll remove null columns as well:
dfs = [pd.json_normalize(api_json['results'], frame, metadata, record_prefix=frame + '_').dropna(how='all', axis=1) for frame in frames]
Fix up that coordinates column:
dfs = [pd.concat([df, df.coordinates.apply(pd.Series)], axis=1).drop('coordinates', axis=1) for df in dfs]
metadata += ['latitude', 'longitude']
And concat the dfs together, using the metadata as the index:
df = pd.concat([df.set_index([x for x in df.columns if x in metadata]) for df in dfs], axis=1)
print(df)
# Output:
sources_url sources_name sources_id parameters_id parameters_unit parameters_count parameters_average parameters_lastValue parameters_parameter parameters_displayName parameters_lastUpdated parameters_parameterId parameters_firstUpdated
id city name entity country isMobile isAnalysis sensorType lastUpdated firstUpdated measurements latitude longitude
7199 Lisboa PT03087 government PT False False reference grade 2022-08-27 18:00:00+00:00 2017-09-22 13:00:00+00:00 235524 38.705000 -9.210278 http://www.eea.europa.eu/themes/air/air-quality EEA eea 20728 µg/m³ 38829 0.000000 0.0 pm10 PM10 2019-12-24 16:00:00+00:00 1 2018-03-22 16:00:00+00:00
-9.210278 http://www.eea.europa.eu/themes/air/air-quality EEA eea 24802 µg/m³ 98987 53.124865 58.0 o3 O₃ mass 2022-08-27 18:00:00+00:00 3 2017-09-22 13:00:00+00:00
-9.210278 http://www.eea.europa.eu/themes/air/air-quality EEA eea 24800 µg/m³ 97708 18.870253 5.0 no2 NO₂ mass 2022-08-27 18:00:00+00:00 5 2017-09-22 13:00:00+00:00
7767 Lisboa PT03082 government PT False False reference grade 2022-08-27 18:00:00+00:00 2018-02-27 01:00:00+00:00 304711 38.738889 -9.207500 http://www.eea.europa.eu/themes/air/air-quality EEA eea 25392 µg/m³ 85382 24.400488 12.7 no2 NO₂ mass 2022-08-27 18:00:00+00:00 5 2018-03-22 12:00:00+00:00
-9.207500 http://www.eea.europa.eu/themes/air/air-quality EEA eea 22595 µg/m³ 68071 3.050738 5.0 so2 SO₂ mass 2022-08-27 18:00:00+00:00 6 2018-03-22 12:00:00+00:00
-9.207500 http://www.eea.europa.eu/themes/air/air-quality EEA eea 25391 µg/m³ 67821 3528.784949 1221.0 co CO mass 2022-08-27 18:00:00+00:00 4 2018-02-27 01:00:00+00:00
-9.207500 http://www.eea.europa.eu/themes/air/air-quality EEA eea 25393 µg/m³ 83437 44.678977 67.4 o3 O₃ mass 2022-07-25 14:00:00+00:00 3 2018-03-06 18:00:00+00:00
8209 Lisboa PT03100 government PT False False reference grade 2022-08-27 18:00:00+00:00 2017-09-22 11:00:00+00:00 246228 38.748056 -9.202500 http://www.eea.europa.eu/themes/air/air-quality EEA eea 23940 µg/m³ 47279 6.259864 8.8 pm10 PM10 2022-08-27 12:00:00+00:00 1 2017-09-22 11:00:00+00:00
-9.202500 http://www.eea.europa.eu/themes/air/air-quality EEA eea 24099 µg/m³ 98890 298.816190 197.0 co CO mass 2022-08-27 18:00:00+00:00 4 2017-09-22 11:00:00+00:00
-9.202500 http://www.eea.europa.eu/themes/air/air-quality EEA eea 25414 µg/m³ 100059 32.176792 29.6 no2 NO₂ mass 2022-08-27 18:00:00+00:00 5 2017-09-22 11:00:00+00:00
8211 Lisboa PT03084 government PT False False reference grade 2022-08-27 18:00:00+00:00 2017-09-22 10:00:00+00:00 239745 38.754167 -9.230833 http://www.eea.europa.eu/themes/air/air-quality EEA eea 25398 µg/m³ 99881 57.904319 94.0 o3 O₃ mass 2022-08-27 18:00:00+00:00 3 2017-09-22 11:00:00+00:00
-9.230833 http://www.eea.europa.eu/themes/air/air-quality EEA eea 23942 µg/m³ 46577 3.805825 10.0 pm10 PM10 2022-08-27 14:00:00+00:00 1 2017-09-22 10:00:00+00:00
-9.230833 http://www.eea.europa.eu/themes/air/air-quality EEA eea 25397 µg/m³ 93287 16.333777 10.9 no2 NO₂ mass 2022-08-27 18:00:00+00:00 5 2017-09-22 11:00:00+00:00