Python requests arguments/dealing with api pagination
Question:
I’m playing around with the Angel List (AL) API and want to pull all jobs in San Francisco.
Since I couldn’t find an active Python wrapper for the API (if I make any headway, I think I’d like to make my own), I’m using the requests library.
The AL API’s results are paginated, and I can’t figure out how to move beyond the first page of the results.
Here is my code:
import requests
r_sanfran = requests.get("https://api.angel.co/1/tags/1664/jobs").json()
r_sanfran.keys()
# returns [u'per_page', u'last_page', u'total', u'jobs', u'page']
r_sanfran['last_page']
#returns 16
r_sanfran['page']
# returns 1
I tried adding arguments to requests.get, but that didn’t work. I also tried something really dumb: changing the value of the ‘page’ key, as if that would magically paginate for me.
e.g. r_sanfran['page'] = 2
I’m guessing it’s something relatively simple, but I can’t seem to figure it out so any help would be awesome.
Thanks as always.
Angel List API documentation if it’s helpful.
Answers:
Read last_page and make a GET request for each page in the range:
import requests

r_sanfran = requests.get("https://api.angel.co/1/tags/1664/jobs").json()
num_pages = r_sanfran['last_page']

for page in range(2, num_pages + 1):
    r_sanfran = requests.get("https://api.angel.co/1/tags/1664/jobs", params={'page': page}).json()
    print(r_sanfran['page'])
    # TODO: extract the data
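If the goal is to gather every listing rather than just print the page number, you can accumulate the records as you go. A small sketch, assuming each response stores its records under the 'jobs' key (as the keys from the first call suggest):
import requests

base_url = "https://api.angel.co/1/tags/1664/jobs"
first_page = requests.get(base_url).json()
all_jobs = list(first_page['jobs'])  # records from page 1

for page in range(2, first_page['last_page'] + 1):
    resp = requests.get(base_url, params={'page': page}).json()
    all_jobs.extend(resp['jobs'])  # append this page's records

print(len(all_jobs))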
Improving on @alecxe’s answer: using a Python generator and a requests HTTP session can improve performance and resource usage if you are querying lots of pages or very large pages.
import requests

session = requests.Session()

def get_jobs():
    url = "https://api.angel.co/1/tags/1664/jobs"
    first_page = session.get(url).json()
    yield first_page

    num_pages = first_page['last_page']
    for page in range(2, num_pages + 1):
        next_page = session.get(url, params={'page': page}).json()
        yield next_page

for page in get_jobs():
    # TODO: process the page
    print(page['page'])
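If you want individual job records rather than whole pages, the generator can be flattened, for example with itertools.chain (again assuming the records live under the 'jobs' key):
from itertools import chain

# iterate over every job across all pages, one record at a time
for job in chain.from_iterable(page['jobs'] for page in get_jobs()):
    print(job)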
I came across a scenario where the API didn’t return page numbers but rather a min/max value. I created this, and I think it will work for both situations. It automatically advances the value until it reaches the end, and then it stops the while loop.
import requests

# url and headers are assumed to be defined earlier
max_version = [1]

while len(max_version) > 0:
    r = requests.get(url, headers=headers, params={"page": max_version[0]}).json()
    next_page = r['page']
    if next_page is not None:
        max_version[0] = next_page
        # Process data...
    else:
        max_version.clear()  # Stop the while loop
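For reference, the same idea reads a bit more naturally as a generator driven by a cursor variable. This is only a sketch for a hypothetical API whose response carries the next page identifier under a 'page' key and returns null when nothing is left:
import requests

def iter_pages(url, headers=None):
    cursor = 1
    while cursor is not None:
        data = requests.get(url, headers=headers, params={"page": cursor}).json()
        yield data  # hand the caller this page's payload
        cursor = data.get('page')  # hypothetical field: next page id, or None at the end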
Further improving on @dh762’s answer, you can use a while loop and have all the requests done inside it, without needing two yield statements. E.g.:
import requests

session = requests.Session()

def get_jobs():
    url = "https://api.angel.co/1/tags/1664/jobs"
    currP = 1
    totalP = 2  # assume there's a 2nd page; it'll get overwritten if not
    while currP <= totalP:
        page = session.get(url, params={'page': currP}).json()
        totalP = page['last_page']
        currP += 1
        yield page

for page in get_jobs():
    # TODO: process the page
    print(page['page'])
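Because get_jobs() is lazy, you can also cap how many pages are actually fetched, for example with itertools.islice:
from itertools import islice

# fetch at most the first 3 pages, issuing requests only as they are consumed
for page in islice(get_jobs(), 3):
    print(page['page'])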
I got pagination working in Python, although I’m not sure how similar the situation is, since I was working with a crypto exchange API:
import json

pages = 3
fl = client.get_fills(ord['product_id'])  # fl is the paginated response from the exchange client
fil = list(fl)
# you can skip the last 2 lines with: fil = list(client.get_fills(ord['product_id']))
# they're just for clarification
print(json.dumps(fil[0:pages], indent=2, sort_keys=True))
Here is what worked for me, using **extraArgs to unpack the request arguments:
import requests

# base_url, api_endpoint and headers are assumed to be defined earlier

# our initial url
url = f'{base_url}/{api_endpoint}'

# we set a next token to start our while loop
NextToken = True

# we specify our extra args object
extraArgs = {
    "url": url,
    "headers": headers
}

while NextToken is not None:
    # call the API
    r = requests.get(**extraArgs)
    result = r.json()
    # if a next url exists, swap it into the arguments and make the next call with it
    if 'next' in result['_links']:
        next_link = result['_links']['next']['href']
        print(f'found next link: {next_link}')
        extraArgs['url'] = next_link
    else:
        break
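The same link-following pattern can be wrapped in a generator so the paging details stay out of the processing code. A sketch under the same assumption that each response exposes the next URL at _links.next.href:
import requests

def follow_links(start_url, headers=None):
    url = start_url
    while url is not None:
        result = requests.get(url, headers=headers).json()
        yield result  # current page's payload
        # None when there is no next link, which ends the loop
        url = result.get('_links', {}).get('next', {}).get('href')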
If you have to pull data from an HTTP API with an endpoint that accepts a parameter
pageNumber=1,2,...
and returns JSON like:
{
    "pageCount": 5,
    "entities": [
        {"key": "val1", ...},
        {"key": "val2", ...},
        ...
    ]
}
Then you can iterate over all pages with the following code (after running pip3 install bezalel):
import requests
from bezalel import PaginatedApiIterator

for page in PaginatedApiIterator(requests.Session(), url="https://your/api",
                                 request_page_number_param_name="pageNumber",
                                 response_page_count_field_name="pageCount",
                                 response_records_field_name="entities"):
    print(f"Page: {page}")
It will print:
Page: [{"key": "val1", ...}, {"key": "val2", ...}, ...]
Page: [{"key": "val100", ...}, {"key": "val101", ...}, ...]
Page: [{"key": "val200", ...}, {"key": "val201", ...}, ...]
...
More docs: https://pypi.org/project/bezalel/
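If you would rather not add a dependency, the same pageNumber/pageCount contract is easy to hand-roll. A minimal sketch, assuming the field names shown above:
import requests

def iter_records(session, url, params=None):
    params = dict(params or {})
    page_number = 1
    while True:
        params["pageNumber"] = page_number
        data = session.get(url, params=params).json()
        yield data["entities"]  # the records on this page
        if page_number >= data["pageCount"]:  # stop after the last page
            break
        page_number += 1

for records in iter_records(requests.Session(), "https://your/api"):
    print(f"Page: {records}")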