Flattening nested JSON with pandas

Question

I’m trying to read a json file and put it into a dataframe:

#read JSON from Web Service...
raw_data = pd.read_json('https://ergast.com/api/f1/1996/1/laps/1.json')

Just after reading the complex file, we normalize it:

#normalize JSON
data = pd.json_normalize(raw_data['MRData']['RaceTable']['Races'])

This give us the following result:
(Not quite what we want, as we’d like to have each value of column Laps as a separated column and at the same level of the others, like Circuit.columns):

index	season	round	url	raceName	date	Laps	Circuit.circuitId	Circuit.url	Circuit.circuitName	Circuit.Location.lat	Circuit.Location.long	Circuit.Location.locality	Circuit.Location.country
0	1996	1	http://en.wikipedia.org/wiki/1996_Australian_Grand_Prix	Australian Grand Prix	1996-03-10	{‘number’: ‘1’, ‘Timings’: [{‘driverId’: ‘villeneuve’, ‘position’: ‘1’, ‘time’: ‘1:43.702’}, {‘driverId’: ‘damon_hill’, ‘position’: ‘2’, ‘time’: ‘1:44.243’}, {‘driverId’: ‘irvine’, ‘position’: ‘3’, ‘time’: ‘1:44.981’}, {‘driverId’: ‘michael_schumacher’, ‘position’: ‘4’, ‘time’: ‘1:45.188’}, {‘driverId’: ‘alesi’, ‘position’: ‘5’, ‘time’: ‘1:46.506’}, {‘driverId’: ‘hakkinen’, ‘position’: ‘6’, ‘time’: ‘1:48.430’}, {‘driverId’: ‘barrichello’, ‘position’: ‘7’, ‘time’: ‘1:49.058’}, {‘driverId’: ‘frentzen’, ‘position’: ‘8’, ‘time’: ‘1:49.468’}, {‘driverId’: ‘berger’, ‘position’: ‘9’, ‘time’: ‘1:49.726’}, {‘driverId’: ‘salo’, ‘position’: ’10’, ‘time’: ‘1:50.436’}, {‘driverId’: ‘verstappen’, ‘position’: ’11’, ‘time’: ‘1:50.930’}, {‘driverId’: ‘panis’, ‘position’: ’12’, ‘time’: ‘1:51.529’}, {‘driverId’: ‘fisichella’, ‘position’: ’13’, ‘time’: ‘1:52.690’}, {‘driverId’: ‘katayama’, ‘position’: ’14’, ‘time’: ‘1:53.435’}, {‘driverId’: ‘lamy’, ‘position’: ’15’, ‘time’: ‘1:54.047’}, {‘driverId’: ‘rosset’, ‘position’: ’16’, ‘time’: ‘1:54.831’}, {‘driverId’: ‘diniz’, ‘position’: ’17’, ‘time’: ‘1:56.143’}, {‘driverId’: ‘brundle’, ‘position’: ’18’, ‘time’: ‘1:56.481’}, {‘driverId’: ‘coulthard’, ‘position’: ’19’, ‘time’: ‘1:56.926’}]}	albert_park	http://en.wikipedia.org/wiki/Melbourne_Grand_Prix_Circuit	Albert Park Grand Prix Circuit	-37.8497	144.968	Melbourne	Australia

I was able to extract the values of the column Laps using:

data = pd.json_normalize(raw_data['MRData']['RaceTable']['Races'][0]['Laps'], record_path = 'Timings', meta = ['number'])

The result is:

index	driverId	position	time	number
0	villeneuve	1	1:43.702	1
1	damon_hill	2	1:44.243	1
2	irvine	3	1:44.981	1
3	michael_schumacher	4	1:45.188	1
4	alesi	5	1:46.506	1
5	hakkinen	6	1:48.430	1
6	barrichello	7	1:49.058	1
7	frentzen	8	1:49.468	1
8	berger	9	1:49.726	1
9	salo	10	1:50.436	1
10	verstappen	11	1:50.930	1
11	panis	12	1:51.529	1
12	fisichella	13	1:52.690	1
13	katayama	14	1:53.435	1
14	lamy	15	1:54.047	1
15	rosset	16	1:54.831	1
16	diniz	17	1:56.143	1
17	brundle	18	1:56.481	1
18	coulthard	19	1:56.926	1

Would it be possible to transform the raw_data in a away that in the end we have at least the following example:

season | round | number |   driverId | position |     time |  
  1996 |     1 |      1 | villeneuve |        1 | 1:43.702 |  
  1996 |     1 |      1 | damon_hill |        2 | 1:44.243 |  
  1996 |     1 |      1 |     irvine |        3 | 1:44.981 |

at the same level?

### EDIT ###

Testing and tweaking the code of the answer I discovered that if we put 0 (zero) on the raw data link we have all results for all laps of a single race. So, based on @Tranbi’s clever answer the answer would be:

raw_data = pd.read_json(f'https://ergast.com/api/f1/1996/1/laps/0.json?limit=1000')
df = pd.json_normalize(raw_data['MRData']['RaceTable']['Races'])

df_lap = pd.json_normalize(df['Laps'].values[0], record_path='Timings', meta='number')
df_lap[['season', 'round']] = df.loc[0, ['season', 'round']]
df_lap = df_lap[['season', 'round', 'number', 'driverId', 'position', 'time']]

df_lap

Asked By: Aloysia de Argenteuil

||

Source

Answer 1

From your URL, it seems you are retrieving laps one by one.
I would therefore create a lap dataframe from the ‘Laps’ column and add ‘season’ and ’round’ from the main df:

laps = []
# enclose the code below in loop(s) to retrieve the desired years/championship/laps
raw_data = pd.read_json('https://ergast.com/api/f1/1996/1/laps/1.json')
df = pd.json_normalize(raw_data['MRData']['RaceTable']['Races'])

df_lap = pd.json_normalize(df['Laps'].values[0], record_path='Timings', meta='number')
df_lap[['season', 'round']] = df.loc[0, ['season', 'round']]
df_lap = df_lap[['season', 'round', 'number', 'driverId', 'position', 'time']]
print(df_lap)
laps.append(df_lap)
# end of your loop

final_df = pd.concat(laps)

Answered By: Tranbi

Flattening nested JSON with pandas

Question:

Answers: