flatten a json using json_normalize
Question:
I have the following nested dictionary with lists inside:
[{
"id": 467,
"status": 2,
"leavePeriod": {
"owner": {
"employeeNumber": "2620",
"firstName": "fn_467",
"lastName": "ln_467"
},
"ownerId": 46,
"leaves": [
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-06T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-06T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-07T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-07T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-11T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-11T00:00:00"
}
]
}
},
{
"id": 477,
"status": 2,
"leavePeriod": {
"owner": {
"employeeNumber": "2522",
"firstName": "fn_477",
"lastName": "lm_477"
},
"ownerId": 41,
"leaves": [
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-03-13T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-03-13T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-14T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-14T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-15T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-15T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-16T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-16T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-17T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-17T00:00:00"
}
]
}
}]
How can I flatten it and convert it to a dataframe using this library?
from pandas.io.json import json_normalize
I’d like this to be done in one step if possible, instead of flattening to a dataframe twice and then merging them
Thanks in advance.
Answers:
Just use pd.json_normalize(<put_json_here>)
.
Using pandas.json_normalize
Note: from pandas.io.json import json_normalize
results in the same output on my machine but raises a FutureWarning
.
Data
import pandas as pd # 1.5.1
data = [{
"id": 467,
"status": 2,
"leavePeriod": {
"owner": {
"employeeNumber": "2620",
"firstName": "fn_467",
"lastName": "ln_467"
},
"ownerId": 46,
"leaves": [
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-06T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-06T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-07T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-07T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-11T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-11T00:00:00"
}
]
}
},
{
"id": 477,
"status": 2,
"leavePeriod": {
"owner": {
"employeeNumber": "2522",
"firstName": "fn_477",
"lastName": "lm_477"
},
"ownerId": 41,
"leaves": [
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-03-13T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-03-13T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-14T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-14T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-15T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-15T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-16T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-16T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-17T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-17T00:00:00"
}
]
}
}]
Option 1
# write out your meta data columns and record path
df = pd.json_normalize(
data=data,
meta=[
"id",
"status",
["leavePeriod", "owner", "employeeNumber"],
["leavePeriod", "owner", "firstName"],
["leavePeriod", "owner", "lastName"],
["leavePeriod", "ownerId"]
],
record_path=["leavePeriod", "leaves"]
)
Or if you’re lazy and don’t wanna split all the meta columns up by hand…
df = pd.json_normalize(
data=data,
meta=[
"id",
"status",
"leavePeriod.owner.employeeNumber".split("."),
"leavePeriod.owner.firstName".split("."),
"leavePeriod.owner.lastName".split("."),
"leavePeriod.ownerId".split("."),
],
record_path="leavePeriod.leaves".split(".")
)
df.sample(5)
Parameter Descriptions + Explanations
record_path
: str or list of str, default None
Path in each object to list of records. If not passed, data will be assumed to be an array of records.
I set record_path
to "leavePeriod.leaves".split(".")
because that was the (nested) key(s) in the dictionary that held another layer of records, i.e. array/list of dictionaries. Leaving it blank leads pandas
to assume that data
is an array of records and has no further nesting.
meta
: list of paths (str or list of str), default None
Fields to use as metadata for each record in resulting table.
If you set record_path
without setting meta
pandas
will only normalize and extract the record_path
. To keep the upper-level (i.e. not-so-nested) keys, we need to specify them as meta
data.
Option 2
df = pd.json_normalize(data)
df
To "flatten" the leavePeriod.leaves
even further…
# explode the column and ignore index
df = df.explode(column="leavePeriod.leaves", ignore_index=True)
df.sample(5) # taking a sample to highlight what happened
# convert df["leavePeriod.leaves"] to a list and use `pd.json_normalize` again
leavePeriod = pd.json_normalize(df["leavePeriod.leaves"].to_list())
leavePeriod.sample(5)
# join `leavePeriod` with `df`
# and drop old "leavePeriod.leaves" columns
df = (
df.join(leavePeriod)
.drop(columns="leavePeriod.leaves")
)
df.sample(5)
Note: you can’t see all the columns in the image because it’s too wide so I have two images
I have the following nested dictionary with lists inside:
[{
"id": 467,
"status": 2,
"leavePeriod": {
"owner": {
"employeeNumber": "2620",
"firstName": "fn_467",
"lastName": "ln_467"
},
"ownerId": 46,
"leaves": [
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-06T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-06T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-07T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-07T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-11T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-11T00:00:00"
}
]
}
},
{
"id": 477,
"status": 2,
"leavePeriod": {
"owner": {
"employeeNumber": "2522",
"firstName": "fn_477",
"lastName": "lm_477"
},
"ownerId": 41,
"leaves": [
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-03-13T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-03-13T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-14T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-14T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-15T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-15T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-16T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-16T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-17T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-17T00:00:00"
}
]
}
}]
How can I flatten it and convert it to a dataframe using this library?
from pandas.io.json import json_normalize
I’d like this to be done in one step if possible, instead of flattening to a dataframe twice and then merging them
Thanks in advance.
Just use pd.json_normalize(<put_json_here>)
.
Using pandas.json_normalize
Note:
from pandas.io.json import json_normalize
results in the same output on my machine but raises aFutureWarning
.
Data
import pandas as pd # 1.5.1
data = [{
"id": 467,
"status": 2,
"leavePeriod": {
"owner": {
"employeeNumber": "2620",
"firstName": "fn_467",
"lastName": "ln_467"
},
"ownerId": 46,
"leaves": [
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-06T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-06T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-07T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-07T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-11T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-04-11T00:00:00"
}
]
}
},
{
"id": 477,
"status": 2,
"leavePeriod": {
"owner": {
"employeeNumber": "2522",
"firstName": "fn_477",
"lastName": "lm_477"
},
"ownerId": 41,
"leaves": [
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-03-13T00:00:00"
},
{
"leaveAccount": {
"id": 1121,
"name": "Vacation days 2021/2022",
"url": "https://some_link/1121"
},
"date": "2023-03-13T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-14T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-14T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-15T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-15T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-16T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-16T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-17T00:00:00"
},
{
"leaveAccount": {
"id": 1323,
"name": "RTT 2023",
"url": "https://some_link/1323"
},
"date": "2023-03-17T00:00:00"
}
]
}
}]
Option 1
# write out your meta data columns and record path
df = pd.json_normalize(
data=data,
meta=[
"id",
"status",
["leavePeriod", "owner", "employeeNumber"],
["leavePeriod", "owner", "firstName"],
["leavePeriod", "owner", "lastName"],
["leavePeriod", "ownerId"]
],
record_path=["leavePeriod", "leaves"]
)
Or if you’re lazy and don’t wanna split all the meta columns up by hand…
df = pd.json_normalize(
data=data,
meta=[
"id",
"status",
"leavePeriod.owner.employeeNumber".split("."),
"leavePeriod.owner.firstName".split("."),
"leavePeriod.owner.lastName".split("."),
"leavePeriod.ownerId".split("."),
],
record_path="leavePeriod.leaves".split(".")
)
df.sample(5)
Parameter Descriptions + Explanations
record_path
: str or list of str, default None
Path in each object to list of records. If not passed, data will be assumed to be an array of records.
I set
record_path
to"leavePeriod.leaves".split(".")
because that was the (nested) key(s) in the dictionary that held another layer of records, i.e. array/list of dictionaries. Leaving it blank leadspandas
to assume thatdata
is an array of records and has no further nesting.
meta
: list of paths (str or list of str), default None
Fields to use as metadata for each record in resulting table.
If you set
record_path
without settingmeta
pandas
will only normalize and extract therecord_path
. To keep the upper-level (i.e. not-so-nested) keys, we need to specify them asmeta
data.
Option 2
df = pd.json_normalize(data)
df
To "flatten" the leavePeriod.leaves
even further…
# explode the column and ignore index
df = df.explode(column="leavePeriod.leaves", ignore_index=True)
df.sample(5) # taking a sample to highlight what happened
# convert df["leavePeriod.leaves"] to a list and use `pd.json_normalize` again
leavePeriod = pd.json_normalize(df["leavePeriod.leaves"].to_list())
leavePeriod.sample(5)
# join `leavePeriod` with `df`
# and drop old "leavePeriod.leaves" columns
df = (
df.join(leavePeriod)
.drop(columns="leavePeriod.leaves")
)
df.sample(5)
Note: you can’t see all the columns in the image because it’s too wide so I have two images