Read nested JSON in a cell with Pandas

Question:

I recovered a list of ISO 3166-2 countries and regions in this Github repository.
I managed to have a first look of the regions using the following code:

import pandas as pd
import json
data = "/content/data.json"
df = pd.read_json(data)
df = df.T

Which gives the following output:

name divisions
Afghanistan {‘AF-BDS’: ‘Badakhshān’, ‘AF-BDG’: ‘Bādghīs’, ‘AF-BGL’: ‘Baghlān’, ‘AF-BAL’: ‘Balkh’, ‘AF-BAM’: ‘Bāmīān’, ‘AF-FRA’: ‘Farāh’, ‘AF-FYB’: ‘Fāryāb’, ‘AF-GHA’: ‘Ghaznī’, ‘AF-GHO’: ‘Ghowr’, ‘AF-HEL’: ‘Helmand’, ‘AF-HER’: ‘Herāt’, ‘AF-JOW’: ‘Jowzjān’, ‘AF-KAB’: ‘Kabul (Kābol)’, ‘AF-KAN’: ‘Kandahār’, ‘AF-KAP’: ‘Kāpīsā’, ‘AF-KNR’: ‘Konar (Kunar)’, ‘AF-KDZ’: ‘Kondoz (Kunduz)’, ‘AF-LAG’: ‘Laghmān’, ‘AF-LOW’: ‘Lowgar’, ‘AF-NAN’: ‘Nangrahār (Nangarhār)’, ‘AF-NIM’: ‘Nīmrūz’, ‘AF-ORU’: ‘Orūzgān (Urūzgā’, ‘AF-PIA’: ‘Paktīā’, ‘AF-PKA’: ‘Paktīkā’, ‘AF-PAR’: ‘Parwān’, ‘AF-SAM’: ‘Samangān’, ‘AF-SAR’: ‘Sar-e Pol’, ‘AF-TAK’: ‘Takhār’, ‘AF-WAR’: ‘Wardak (Wardag)’, ‘AF-ZAB’: ‘Zābol (Zābul)’}
Albania {‘AL-BR’: ‘Berat’, ‘AL-BU’: ‘Bulqizë’, ‘AL-DL’: ‘Delvinë’, ‘AL-DV’: ‘Devoll’, ‘AL-DI’: ‘Dibër’, ‘AL-DR’: ‘Durrës’, ‘AL-EL’: ‘Elbasan’, ‘AL-FR’: ‘Fier’, ‘AL-GR’: ‘Gramsh’, ‘AL-GJ’: ‘Gjirokastër’, ‘AL-HA’: ‘Has’, ‘AL-KA’: ‘Kavajë’, ‘AL-ER’: ‘Kolonjë’, ‘AL-KO’: ‘Korcë’, ‘AL-KR’: ‘Krujë’, ‘AL-KC’: ‘Kucovë’, ‘AL-KU’: ‘Kukës’, ‘AL-LA’: ‘Laç’, ‘AL-LE’: ‘Lezhë’, ‘AL-LB’: ‘Librazhd’, ‘AL-LU’: ‘Lushnjë’, ‘AL-MM’: ‘Malësia e Madhe’, ‘AL-MK’: ‘Mallakastër’, ‘AL-MT’: ‘Mat’, ‘AL-MR’: ‘Mirditë’, ‘AL-PQ’: ‘Peqin’, ‘AL-PR’: ‘Përmet’, ‘AL-PG’: ‘Pogradec’, ‘AL-PU’: ‘Pukë’, ‘AL-SR’: ‘Sarandë’, ‘AL-SK’: ‘Skrapar’, ‘AL-SH’: ‘Shkodër’, ‘AL-TE’: ‘Tepelenë’, ‘AL-TR’: ‘Tiranë’, ‘AL-TP’: ‘Tropojë’, ‘AL-VL’: ‘Vlorë’}

But I can’t manage to achieve the following output because of the nested JSON.

country code country name region code region name
AF Afghanistan AF-BDS Badakhshān
AF Afghanistan AF-BDG Bādghīs

I tried to loop inside the DataFrame with :

df = json_normalize(df['divisions']).unstack().apply(pd.Series)

But I’m not getting any satisfying result.

Asked By: Vincent

||

Answers:

This should work:

df1 = (
  pd.DataFrame(data)
  .transpose()
  .reset_index(names="country code")
  .rename(columns={"name": "country name"})
  )

divisions = [(k1, v1) for k, v in df1["divisions"].to_dict().items() for k1, v1 in v.items()]
df2 = pd.DataFrame(divisions, columns=["region code", "region name"])

final_df = (
  pd
  .merge(df1.explode("divisions"), df2, left_on="divisions", right_on="region code")
  .drop(columns="divisions")
)

print(final_df.head(10))

  country code country name region code region name
0           AF  Afghanistan      AF-BDS  Badakhshān
1           AF  Afghanistan      AF-BDG     Bādghīs
2           AF  Afghanistan      AF-BGL     Baghlān
3           AF  Afghanistan      AF-BAL       Balkh
4           AF  Afghanistan      AF-BAM      Bāmīān
5           AF  Afghanistan      AF-FRA       Farāh
6           AF  Afghanistan      AF-FYB      Fāryāb
7           AF  Afghanistan      AF-GHA      Ghaznī
8           AF  Afghanistan      AF-GHO       Ghowr
9           AF  Afghanistan      AF-HEL     Helmand
Answered By: Jason Baker

you can simply read in the data one country at a time

J = json.load(open("iso-3166-2.json","r"))
dfs = []
for country_code in J:
    df = pd.DataFrame(J[country_code])
    df.index.name="region_code"
    df['country_code'] = country_code
    dfs.append(df)
df = pd.concat(dfs).reset_index()

#     region_code         name           divisions country_code
#0         AF-BAL  Afghanistan               Balkh           AF
#1         AF-BAM  Afghanistan              Bāmīān           AF
#2         AF-BDG  Afghanistan             Bādghīs           AF
#3         AF-BDS  Afghanistan          Badakhshān           AF
#4         AF-BGL  Afghanistan             Baghlān           AF
#...          ...          ...                 ...          ...
#3802       ZW-MI     Zimbabwe            Midlands           ZW
#3803       ZW-MN     Zimbabwe  Matabeleland North           ZW
#3804       ZW-MS     Zimbabwe  Matabeleland South           ZW
#3805       ZW-MV     Zimbabwe            Masvingo           ZW
#3806       ZW-MW     Zimbabwe    Mashonaland West           ZW
Answered By: dermen

Let’s do it in the logic of the original post:

(
    pd.read_json('iso-3166-2.json', orient='index')
    .set_index('name', append=True)
    .squeeze()
    .apply(pd.Series)
    .stack()
    .rename_axis(['country code','country name','region code'])
    .rename('region name')
    .reset_index()
)

Some notes:

  • orient='index' – read data with country codes as index, so transposition is not required
  • set_index('name', append=True) – save country codes and names together as a multy index
  • instead of squeeze we could use ['divisions'].apply
  • .apply(pd.Series) – transform dictionaries in divisions into records with the region codes as column names
  • .stack() – unpivot the table with the region codes in a columns to long format
  • .rename_axis(...) – at this stage contry codes, names and region codes make up a multyindex of a series with region names as values
Answered By: Vitalizzare
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.