Collect multiple values from a JSON API response in Python when some values can be None / []
Question:
I want to extract values for scientific publications from the OpenAlex API. Because the API does not hold complete data for every publication, the returned JSON is not always complete. When it is complete, my code runs without issues. When information is missing, the response can contain "institutions":[] instead of "institutions":[{"id":"https://openalex.org/I2057…}{…}], and my code cannot handle that. As a result, I get an "IndexError: list index out of range".
After an extensive search, I tried to solve the problem with try / except and with if checks (I can provide those attempts if required). Unfortunately, I did not succeed.
My goal is for charlist to contain None (or null) wherever no information is available ([]). The code should also be as performant as possible, since I will make a high six-digit number of requests. This has, of course, already been cleared with the API operator.
The code below works for complete JSON responses (the upper, commented-out magid_list) but not for incomplete entries such as 2301544176 in the lower, active magid_list.
import requests
import json
import pandas as pd

baseurl = 'https://api.openalex.org/works?filter=ids.mag:'

# The upper magid_list works without problems:
# magid_list = [2301543590, 2301543835]
# The lower magid_list raises the error; see
# "https://api.openalex.org/works?filter=ids.mag:2301544176" (no institution information given)
magid_list = [2301543590, 2301543835, 2301544176]

def main_request(baseurl, endpoint):
    r = requests.get(baseurl + endpoint)
    return r.json()

def parse_json(response):
    charlist = []
    pupdate = data['results'][0]['publication_date']
    display_name = data['results'][0]['display_name']
    for item in response['results'][0]['authorships']:
        char = {
            'magid': str(x),
            'display_name': display_name,
            'pupdate': pupdate,
            'author': item['author']['display_name'],
            'institution_id': item['institutions'][0]['id']  # IndexError when 'institutions' is []
        }
        charlist.append(char)
    return charlist

finallist = []
for x in magid_list:
    print(x)
    data = main_request(baseurl, str(x))
    finallist.extend(parse_json(main_request(baseurl, str(x))))

df = pd.DataFrame(finallist)
print(df.head(), df.tail())
If I can provide further information or clarification, let me know.
The full IndexError traceback follows:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
f:AlexPE__programmingMasterarbeit.ipynb Cell 153 in <cell line: 37>()
37 for x in list:
38 print(x)
---> 39 finallist.extend(parse_json(main_request(baseurl, str(x))))
41 df = pd.DataFrame(finallist)
43 #data = main_request(baseurl, endpoint)
44 #print(get_pages(data))
45 #print(parse_json(data))
f:AlexPE__programmingMasterarbeit.ipynb Cell 153 in parse_json(response)
20 display_name = data['results'][0]['display_name']
23 for item in response['results'][0]['authorships']:
24 char = {
25 'magid': str(x),
26 'display_name': display_name,
27 'pupdate': pupdate,
28 'author': item['author']['display_name'],
---> 29 'institution_id': item['institutions'][0]['id']
30 }
32 charlist.append(char)
33 return charlist
IndexError: list index out of range
Answers:
Check that a value exists before attempting to access it. An empty list is falsy, so a plain if test is enough, and .get() returns None for keys that are missing:
def parse_json(response):
    charlist = []
    # Nothing to parse if the API returned no results for this ID.
    if not response['results']:
        return charlist
    work = response['results'][0]
    pupdate = work.get('publication_date')
    display_name = work.get('display_name')
    for item in work['authorships']:
        # 'institutions' can be an empty list; fall back to None in that case.
        institution_id = None
        if item['institutions']:
            institution_id = item['institutions'][0].get('id')
        char = {
            'magid': str(x),  # x is the loop variable from the calling code
            'display_name': display_name,
            'pupdate': pupdate,
            'author': item['author']['display_name'],
            'institution_id': institution_id
        }
        charlist.append(char)
    return charlist
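For reference, the same existence checks can be written as a self-contained function that takes the MAG ID as an argument instead of reading the global x. This is only a sketch: parse_work and first_institution_id are hypothetical names, and the sample dictionary is hand-written to mimic the response shape shown in the question (the institution URL is a placeholder, not real API output).

```python
def first_institution_id(authorship):
    """Return the id of the first listed institution, or None if there is none."""
    institutions = authorship.get('institutions') or []
    return institutions[0].get('id') if institutions else None


def parse_work(response, magid):
    """Build one row per authorship; values that are missing become None."""
    results = response.get('results') or []
    if not results:
        return []
    work = results[0]
    rows = []
    for authorship in work.get('authorships') or []:
        author = authorship.get('author') or {}
        rows.append({
            'magid': str(magid),
            'display_name': work.get('display_name'),
            'pupdate': work.get('publication_date'),
            'author': author.get('display_name'),
            'institution_id': first_institution_id(authorship),
        })
    return rows


# Hand-written sample (assumption): one author with an institution, one without.
sample = {
    'results': [{
        'display_name': 'Example work',
        'publication_date': '2016-01-01',
        'authorships': [
            {'author': {'display_name': 'A. Author'},
             'institutions': [{'id': 'https://openalex.org/I0000'}]},
            {'author': {'display_name': 'B. Author'}, 'institutions': []},
        ],
    }]
}

rows = parse_work(sample, 2301544176)
print([row['institution_id'] for row in rows])
# → ['https://openalex.org/I0000', None]
```

If this fits, the loop from the question could call finallist.extend(parse_work(main_request(baseurl, str(x)), x)), which would also drop the second, duplicate request that is currently made for every ID.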