json_normalize produces a KeyError when trying to extract certain attributes
Question:
Here is a subset of my json file:
d = {'data': {'questions': [{'id': 6574,
'text': 'Question #1',
'instructionalText': '',
'minimumResponses': 0,
'maximumResponses': None,
'sortOrder': 1,
'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
{'id': 362950, 'text': 'Answer #2', 'parentId': None},
{'id': 362951, 'text': 'Answer #3', 'parentId': None},
{'id': 362952, 'text': 'Answer #4', 'parentId': None}]}]}}
I want to place this into a dataframe with each question and a row for each answer.
Python code:
from pandas import json_normalize
import json
fields = ['text','answers.text']
with open(R'response.json') as f:
d = json.load(f)
data = json_normalize(d['data'],['questions'],errors='ignore')
data = data[fields]
print(data)
Which produces a KeyError:
KeyError: "['answers.text'] not in index"
Been at this for a few hours and absolutely cannot figure this out. I feel like it should be quite simple, but it never is.
Answers:
This is the technique I generally use
json_normalize()
top level list
explode()
child list
, reset_index()
for step #3
- expand
dict
within child list
with apply(pd.Series)
d = {'data': {'questions': [{'id': 6574,
'text': 'Question #1',
'instructionalText': '',
'minimumResponses': 0,
'maximumResponses': None,
'sortOrder': 1,
'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
{'id': 362950, 'text': 'Answer #2', 'parentId': None},
{'id': 362951, 'text': 'Answer #3', 'parentId': None},
{'id': 362952, 'text': 'Answer #4', 'parentId': None}]}]}}
df = pd.json_normalize(d["data"]["questions"]).explode("answers").reset_index(drop=True)
df = df.join(df["answers"].apply(pd.Series), rsuffix="_ans").drop(columns="answers")
id
text
instructionalText
minimumResponses
maximumResponses
sortOrder
id_ans
text_ans
parentId
0
6574
Question #1
0
1
362949
Answer #1
1
6574
Question #1
0
1
362950
Answer #2
2
6574
Question #1
0
1
362951
Answer #3
3
6574
Question #1
0
1
362952
Answer #4
- Use
record_prefix
, with record_path
and meta
, so d
can be normalized all at once
pd.json_normalize
will result in a ValueError
when there are overlapping key
names between the record_path
and meta
, and 'id'
and 'text'
are in both.
ValueError: Conflicting metadata name id, need distinguishing prefix
occurs without using record_path
.
- The
KeyError
occurs because 'answers.text'
is not in d
, it’s created by .json_normalize()
- If there are any top level
keys
that aren’t required in df
, remove them from meta
.
import pandas as pd
# normalize d
df = pd.json_normalize(data=d['data']['questions'],
record_path= ['answers'],
meta=['id', 'text', 'instructionalText', 'minimumResponses', 'maximumResponses', 'sortOrder'],
record_prefix='answers_')
# display(df)
answers_id answers_text answers_parentId id text instructionalText minimumResponses maximumResponses sortOrder
0 362949 Answer #1 None 6574 Question #1 0 None 1
1 362950 Answer #2 None 6574 Question #1 0 None 1
2 362951 Answer #3 None 6574 Question #1 0 None 1
3 362952 Answer #4 None 6574 Question #1 0 None 1
4 262949 Answer #1 None 4756 Question #2 No cheating, cheater 0 None 1
5 262950 Answer #2 None 4756 Question #2 No cheating, cheater 0 None 1
6 262951 Answer #3 None 4756 Question #2 No cheating, cheater 0 None 1
7 262952 Answer #4 None 4756 Question #2 No cheating, cheater 0 None 1
Expanded Test Data
d = {'data': {'questions': [{'id': 6574,
'text': 'Question #1',
'instructionalText': '',
'minimumResponses': 0,
'maximumResponses': None,
'sortOrder': 1,
'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
{'id': 362950, 'text': 'Answer #2', 'parentId': None},
{'id': 362951, 'text': 'Answer #3', 'parentId': None},
{'id': 362952, 'text': 'Answer #4', 'parentId': None}]},
{'id': 4756,
'text': 'Question #2',
'instructionalText': 'No cheating, cheater',
'minimumResponses': 0,
'maximumResponses': None,
'sortOrder': 1,
'answers': [{'id': 262949, 'text': 'Answer #1', 'parentId': None},
{'id': 262950, 'text': 'Answer #2', 'parentId': None},
{'id': 262951, 'text': 'Answer #3', 'parentId': None},
{'id': 262952, 'text': 'Answer #4', 'parentId': None}]}]}}
- In regards to the other answer, using
.apply(pd.Series)
is not recommended, because it is incredibly slow.
- See the timing analysis in SO: Splitting dictionary/list inside a Pandas Column into Separate Columns
- 53 minutes for 10M rows
Here is a subset of my json file:
d = {'data': {'questions': [{'id': 6574,
'text': 'Question #1',
'instructionalText': '',
'minimumResponses': 0,
'maximumResponses': None,
'sortOrder': 1,
'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
{'id': 362950, 'text': 'Answer #2', 'parentId': None},
{'id': 362951, 'text': 'Answer #3', 'parentId': None},
{'id': 362952, 'text': 'Answer #4', 'parentId': None}]}]}}
I want to place this into a dataframe with each question and a row for each answer.
Python code:
from pandas import json_normalize
import json
fields = ['text','answers.text']
with open(R'response.json') as f:
d = json.load(f)
data = json_normalize(d['data'],['questions'],errors='ignore')
data = data[fields]
print(data)
Which produces a KeyError:
KeyError: "['answers.text'] not in index"
Been at this for a few hours and absolutely cannot figure this out. I feel like it should be quite simple, but it never is.
This is the technique I generally use
json_normalize()
top level listexplode()
childlist
,reset_index()
for step #3- expand
dict
within childlist
withapply(pd.Series)
d = {'data': {'questions': [{'id': 6574,
'text': 'Question #1',
'instructionalText': '',
'minimumResponses': 0,
'maximumResponses': None,
'sortOrder': 1,
'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
{'id': 362950, 'text': 'Answer #2', 'parentId': None},
{'id': 362951, 'text': 'Answer #3', 'parentId': None},
{'id': 362952, 'text': 'Answer #4', 'parentId': None}]}]}}
df = pd.json_normalize(d["data"]["questions"]).explode("answers").reset_index(drop=True)
df = df.join(df["answers"].apply(pd.Series), rsuffix="_ans").drop(columns="answers")
id | text | instructionalText | minimumResponses | maximumResponses | sortOrder | id_ans | text_ans | parentId | |
---|---|---|---|---|---|---|---|---|---|
0 | 6574 | Question #1 | 0 | 1 | 362949 | Answer #1 | |||
1 | 6574 | Question #1 | 0 | 1 | 362950 | Answer #2 | |||
2 | 6574 | Question #1 | 0 | 1 | 362951 | Answer #3 | |||
3 | 6574 | Question #1 | 0 | 1 | 362952 | Answer #4 |
- Use
record_prefix
, withrecord_path
andmeta
, sod
can be normalized all at oncepd.json_normalize
will result in aValueError
when there are overlappingkey
names between therecord_path
andmeta
, and'id'
and'text'
are in both.ValueError: Conflicting metadata name id, need distinguishing prefix
occurs without usingrecord_path
.
- The
KeyError
occurs because'answers.text'
is not ind
, it’s created by.json_normalize()
- If there are any top level
keys
that aren’t required indf
, remove them frommeta
.
import pandas as pd
# normalize d
df = pd.json_normalize(data=d['data']['questions'],
record_path= ['answers'],
meta=['id', 'text', 'instructionalText', 'minimumResponses', 'maximumResponses', 'sortOrder'],
record_prefix='answers_')
# display(df)
answers_id answers_text answers_parentId id text instructionalText minimumResponses maximumResponses sortOrder
0 362949 Answer #1 None 6574 Question #1 0 None 1
1 362950 Answer #2 None 6574 Question #1 0 None 1
2 362951 Answer #3 None 6574 Question #1 0 None 1
3 362952 Answer #4 None 6574 Question #1 0 None 1
4 262949 Answer #1 None 4756 Question #2 No cheating, cheater 0 None 1
5 262950 Answer #2 None 4756 Question #2 No cheating, cheater 0 None 1
6 262951 Answer #3 None 4756 Question #2 No cheating, cheater 0 None 1
7 262952 Answer #4 None 4756 Question #2 No cheating, cheater 0 None 1
Expanded Test Data
d = {'data': {'questions': [{'id': 6574,
'text': 'Question #1',
'instructionalText': '',
'minimumResponses': 0,
'maximumResponses': None,
'sortOrder': 1,
'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
{'id': 362950, 'text': 'Answer #2', 'parentId': None},
{'id': 362951, 'text': 'Answer #3', 'parentId': None},
{'id': 362952, 'text': 'Answer #4', 'parentId': None}]},
{'id': 4756,
'text': 'Question #2',
'instructionalText': 'No cheating, cheater',
'minimumResponses': 0,
'maximumResponses': None,
'sortOrder': 1,
'answers': [{'id': 262949, 'text': 'Answer #1', 'parentId': None},
{'id': 262950, 'text': 'Answer #2', 'parentId': None},
{'id': 262951, 'text': 'Answer #3', 'parentId': None},
{'id': 262952, 'text': 'Answer #4', 'parentId': None}]}]}}
- In regards to the other answer, using
.apply(pd.Series)
is not recommended, because it is incredibly slow.- See the timing analysis in SO: Splitting dictionary/list inside a Pandas Column into Separate Columns
- 53 minutes for 10M rows