flatten a json using json_normalize

Question:

I have the following nested dictionary with lists inside:

[{
    "id": 467,
    "status": 2,
    "leavePeriod": {
        "owner": {
            "employeeNumber": "2620",
            "firstName": "fn_467",
            "lastName": "ln_467"
        },
        "ownerId": 46,
        "leaves": [
            {
                "leaveAccount": {
                    "id": 1121,
                    "name": "Vacation days 2021/2022",
                    "url": "https://some_link/1121"
                },
                "date": "2023-04-06T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1121,
                    "name": "Vacation days 2021/2022",
                    "url": "https://some_link/1121"
                },
                "date": "2023-04-06T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1121,
                    "name": "Vacation days 2021/2022",
                    "url": "https://some_link/1121"
                },
                "date": "2023-04-07T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1121,
                    "name": "Vacation days 2021/2022",
                    "url": "https://some_link/1121"
                },
                "date": "2023-04-07T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1121,
                    "name": "Vacation days 2021/2022",
                    "url": "https://some_link/1121"
                },
                "date": "2023-04-11T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1121,
                    "name": "Vacation days 2021/2022",
                    "url": "https://some_link/1121"
                },
                "date": "2023-04-11T00:00:00"
            }
        ]
    }
},
{
    "id": 477,
    "status": 2,
    "leavePeriod": {
        "owner": {
            "employeeNumber": "2522",
            "firstName": "fn_477",
            "lastName": "lm_477"
        },
        "ownerId": 41,
        "leaves": [
            {
                "leaveAccount": {
                    "id": 1121,
                    "name": "Vacation days 2021/2022",
                    "url": "https://some_link/1121"
                },
                "date": "2023-03-13T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1121,
                    "name": "Vacation days 2021/2022",
                    "url": "https://some_link/1121"
                },
                "date": "2023-03-13T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1323,
                    "name": "RTT 2023",
                    "url": "https://some_link/1323"
                },
                "date": "2023-03-14T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1323,
                    "name": "RTT 2023",
                    "url": "https://some_link/1323"
                },
                "date": "2023-03-14T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1323,
                    "name": "RTT 2023",
                    "url": "https://some_link/1323"
                },
                "date": "2023-03-15T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1323,
                    "name": "RTT 2023",
                    "url": "https://some_link/1323"
                },
                "date": "2023-03-15T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1323,
                    "name": "RTT 2023",
                    "url": "https://some_link/1323"
                },
                "date": "2023-03-16T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1323,
                    "name": "RTT 2023",
                    "url": "https://some_link/1323"
                },
                "date": "2023-03-16T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1323,
                    "name": "RTT 2023",
                    "url": "https://some_link/1323"
                },
                "date": "2023-03-17T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1323,
                    "name": "RTT 2023",
                    "url": "https://some_link/1323"
                },
                "date": "2023-03-17T00:00:00"
            }
        ]
    }
}]

How can I flatten it and convert it to a dataframe using this library?

from pandas.io.json import json_normalize

I’d like this to be done in one step if possible, instead of flattening to a dataframe twice and then merging them
Thanks in advance.

Asked By: nimi1234

||

Answers:

Just use pd.json_normalize(<put_json_here>).

Answered By: paradocslover

Using pandas.json_normalize

Note: from pandas.io.json import json_normalize results in the same output on my machine but raises a FutureWarning.

Data

import pandas as pd  # 1.5.1


data = [{
    "id": 467,
    "status": 2,
    "leavePeriod": {
        "owner": {
            "employeeNumber": "2620",
            "firstName": "fn_467",
            "lastName": "ln_467"
        },
        "ownerId": 46,
        "leaves": [
            {
                "leaveAccount": {
                    "id": 1121,
                    "name": "Vacation days 2021/2022",
                    "url": "https://some_link/1121"
                },
                "date": "2023-04-06T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1121,
                    "name": "Vacation days 2021/2022",
                    "url": "https://some_link/1121"
                },
                "date": "2023-04-06T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1121,
                    "name": "Vacation days 2021/2022",
                    "url": "https://some_link/1121"
                },
                "date": "2023-04-07T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1121,
                    "name": "Vacation days 2021/2022",
                    "url": "https://some_link/1121"
                },
                "date": "2023-04-07T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1121,
                    "name": "Vacation days 2021/2022",
                    "url": "https://some_link/1121"
                },
                "date": "2023-04-11T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1121,
                    "name": "Vacation days 2021/2022",
                    "url": "https://some_link/1121"
                },
                "date": "2023-04-11T00:00:00"
            }
        ]
    }
},
{
    "id": 477,
    "status": 2,
    "leavePeriod": {
        "owner": {
            "employeeNumber": "2522",
            "firstName": "fn_477",
            "lastName": "lm_477"
        },
        "ownerId": 41,
        "leaves": [
            {
                "leaveAccount": {
                    "id": 1121,
                    "name": "Vacation days 2021/2022",
                    "url": "https://some_link/1121"
                },
                "date": "2023-03-13T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1121,
                    "name": "Vacation days 2021/2022",
                    "url": "https://some_link/1121"
                },
                "date": "2023-03-13T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1323,
                    "name": "RTT 2023",
                    "url": "https://some_link/1323"
                },
                "date": "2023-03-14T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1323,
                    "name": "RTT 2023",
                    "url": "https://some_link/1323"
                },
                "date": "2023-03-14T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1323,
                    "name": "RTT 2023",
                    "url": "https://some_link/1323"
                },
                "date": "2023-03-15T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1323,
                    "name": "RTT 2023",
                    "url": "https://some_link/1323"
                },
                "date": "2023-03-15T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1323,
                    "name": "RTT 2023",
                    "url": "https://some_link/1323"
                },
                "date": "2023-03-16T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1323,
                    "name": "RTT 2023",
                    "url": "https://some_link/1323"
                },
                "date": "2023-03-16T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1323,
                    "name": "RTT 2023",
                    "url": "https://some_link/1323"
                },
                "date": "2023-03-17T00:00:00"
            },
            {
                "leaveAccount": {
                    "id": 1323,
                    "name": "RTT 2023",
                    "url": "https://some_link/1323"
                },
                "date": "2023-03-17T00:00:00"
            }
        ]
    }
}]

Option 1

# write out your meta data columns and record path
df = pd.json_normalize(
    data=data,
    meta=[
        "id",
        "status",
        ["leavePeriod", "owner", "employeeNumber"],
        ["leavePeriod", "owner", "firstName"],
        ["leavePeriod", "owner", "lastName"],
        ["leavePeriod", "ownerId"]
    ],
    record_path=["leavePeriod", "leaves"]
)

Or if you’re lazy and don’t wanna split all the meta columns up by hand…

df = pd.json_normalize(
    data=data,
    meta=[
        "id",
        "status",
        "leavePeriod.owner.employeeNumber".split("."),
        "leavePeriod.owner.firstName".split("."),
        "leavePeriod.owner.lastName".split("."),
        "leavePeriod.ownerId".split("."),
    ],
    record_path="leavePeriod.leaves".split(".")
)

df.sample(5)

json_normalize out 1
json_normalize out 2

Parameter Descriptions + Explanations

  • record_path: str or list of str, default None

    Path in each object to list of records. If not passed, data will be assumed to be an array of records.

I set record_path to "leavePeriod.leaves".split(".") because that was the (nested) key(s) in the dictionary that held another layer of records, i.e. array/list of dictionaries. Leaving it blank leads pandas to assume that data is an array of records and has no further nesting.

  • meta: list of paths (str or list of str), default None

    Fields to use as metadata for each record in resulting table.

If you set record_path without setting meta pandas will only normalize and extract the record_path. To keep the upper-level (i.e. not-so-nested) keys, we need to specify them as meta data.

Option 2

df = pd.json_normalize(data)

df

output1

To "flatten" the leavePeriod.leaves even further…

# explode the column and ignore index
df = df.explode(column="leavePeriod.leaves", ignore_index=True)

df.sample(5)  # taking a sample to highlight what happened

exploded output

# convert df["leavePeriod.leaves"] to a list and use `pd.json_normalize` again
leavePeriod = pd.json_normalize(df["leavePeriod.leaves"].to_list())

leavePeriod.sample(5)

leavePeriod output

# join `leavePeriod` with `df` 
# and drop old "leavePeriod.leaves" columns
df = (
    df.join(leavePeriod)
      .drop(columns="leavePeriod.leaves")
)

df.sample(5)

Note: you can’t see all the columns in the image because it’s too wide so I have two images

joined output 1
joined output 2

Answered By: Ian Thompson
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.