Python requests arguments/dealing with api pagination

Question:

I'm playing around with the Angel List (AL) API and want to pull all jobs in San Francisco.
Since I couldn't find an active Python wrapper for the API (if I make any headway, I think I'd like to make my own), I'm using the requests library.

The AL API’s results are paginated, and I can’t figure out how to move beyond the first page of the results.

Here is my code:

import requests
r_sanfran = requests.get("https://api.angel.co/1/tags/1664/jobs").json()
r_sanfran.keys()
# returns [u'per_page', u'last_page', u'total', u'jobs', u'page']
r_sanfran['last_page']
# returns 16
r_sanfran['page']
# returns 1

I tried adding arguments to requests.get, but that didn't work. I also tried something really dumb: changing the value of the 'page' key, as if that was magically going to paginate for me.

e.g. r_sanfran['page'] = 2

I’m guessing it’s something relatively simple, but I can’t seem to figure it out so any help would be awesome.

Thanks as always.

Angel List API documentation if it’s helpful.

Asked By: crock1255


Answers:

Read last_page and make a GET request for each page in the range:

import requests

r_sanfran = requests.get("https://api.angel.co/1/tags/1664/jobs").json()
num_pages = r_sanfran['last_page']

for page in range(2, num_pages + 1):
    r_sanfran = requests.get("https://api.angel.co/1/tags/1664/jobs", params={'page': page}).json()
    print(r_sanfran['page'])
    # TODO: extract the data
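
If you want everything in one list, the same loop can accumulate the records as it goes. A minimal sketch, assuming each page's 'jobs' key holds the job records (the keys listed in the question suggest it does):

import requests

url = "https://api.angel.co/1/tags/1664/jobs"
first_page = requests.get(url).json()
all_jobs = first_page['jobs']  # jobs from page 1

for page in range(2, first_page['last_page'] + 1):
    resp = requests.get(url, params={'page': page}).json()
    all_jobs.extend(resp['jobs'])  # add this page's jobs

print(len(all_jobs))  # should match first_page['total']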
Answered By: alecxe

Improving on @alecxe's answer: if you use a Python generator and a requests HTTP Session, you can improve performance and resource usage when querying many pages or very large pages.

import requests

session = requests.Session()

def get_jobs():
    url = "https://api.angel.co/1/tags/1664/jobs" 
    first_page = session.get(url).json()
    yield first_page
    num_pages = first_page['last_page']

    for page in range(2, num_pages + 1):
        next_page = session.get(url, params={'page': page}).json()
        yield next_page

for page in get_jobs():
    # TODO: process the page
    pass
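
If you want individual job records rather than whole pages, the generator chains naturally. A small usage sketch, again assuming each page holds its records under a 'jobs' key:

import itertools

# flatten the stream of pages into a stream of individual job records
jobs = itertools.chain.from_iterable(page['jobs'] for page in get_jobs())
for job in jobs:
    print(job)  # TODO: process a single job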
Answered By: dh762

I came across a scenario where the API didn't return page numbers but rather a min/max value. I created this, and I think it will work for both situations. It automatically advances the value until it reaches the end, and then stops the while loop.

import requests

# url and headers are assumed to be defined elsewhere
max_version = [1]
while len(max_version) > 0:
    r = requests.get(url, headers=headers, params={"page": max_version[0]}).json()
    next_page = r['page']
    if next_page is not None:
        max_version[0] = next_page
        # TODO: process the data
    else:
        max_version.clear()  # emptying the list stops the while loop
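
The single-element list works, but a plain variable reads more simply. A sketch of the same cursor pattern, under the same assumptions (url and headers already defined, and the response's 'page' field holding the next value, or None on the last page):

import requests

cursor = 1
while cursor is not None:
    r = requests.get(url, headers=headers, params={"page": cursor}).json()
    # TODO: process r here
    cursor = r['page']  # becomes None once the last page is reached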
Answered By: joshlsullivan

Further improving on @dh762's answer, you can use a while loop and make all the requests inside it, without needing two yield statements.

For example:

import requests

session = requests.Session()

def get_jobs():
    url = "https://api.angel.co/1/tags/1664/jobs"
    currP = 1
    totalP = 2  # assume there's going to be a 2nd page; the real value overwrites this below
    while currP <= totalP:
        page = session.get(url, params={'page': currP}).json()
        totalP = page['last_page']
        currP += 1
        yield page

for page in get_jobs():
    # TODO: process the page
    pass
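
Usage is the same as with the two-yield version. A quick sketch that just counts records, assuming each page carries a 'jobs' list:

total = 0
for page in get_jobs():
    total += len(page['jobs'])  # tally the records on each page
print(total)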
Answered By: Dragunov

I got pagination working in Python, although I'm not sure how similar my situation is to yours, since I was working with a crypto API:

import json

pages = 3
fl = client.get_fills(ord['product_id'])  # fl is the paginated response
fil = list(fl)
# you can collapse the last two lines into: fil = list(client.get_fills(ord['product_id']))
# they are split here just for clarity
print(json.dumps(fil[0:pages], indent=2, sort_keys=True))
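
The reason list() works there is that the client library appears to handle pagination itself, returning a lazy generator that fetches the next page on demand. You can build the same convenience on top of requests; a hypothetical sketch of such a generator for the Angel List endpoint from the question:

import requests

def iter_jobs(url="https://api.angel.co/1/tags/1664/jobs"):
    """Yield individual job records, fetching pages lazily as needed."""
    page, last_page = 1, 1
    while page <= last_page:
        data = requests.get(url, params={'page': page}).json()
        last_page = data['last_page']  # refresh the total on every page
        for job in data['jobs']:
            yield job
        page += 1

# list() drains every page, just like list(client.get_fills(...)) above
all_jobs = list(iter_jobs())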
Answered By: Sam_Carmichael

Here is what worked for me, using **extraArgs to unpack the request arguments:

import requests

# our initial url (base_url, api_endpoint and headers are assumed to be defined already)
url = f'{base_url}/{api_endpoint}'

# keyword arguments for requests.get, unpacked with ** on each call
extraArgs = {
    "url": url,
    "headers": headers
}

while True:
    # call the api
    r = requests.get(**extraArgs)
    result = r.json()

    # TODO: process result here

    # if a next url exists, swap it into the arguments and make the next call with it
    if 'next' in result['_links']:
        next_link = result['_links']['next']['href']
        print(f'found next link: {next_link}')
        extraArgs['url'] = next_link
    else:
        break
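
This is the link-following style of pagination (common in HAL/HATEOAS APIs): each response carries the URL of the next page under _links.next.href, and the loop ends when that link disappears. The same idea fits a generator; a sketch under the same assumptions (headers defined, responses shaped like the one above):

import requests

def iter_pages(start_url, headers):
    """Follow _links.next.href from page to page, yielding each page."""
    url = start_url
    while url is not None:
        result = requests.get(url, headers=headers).json()
        yield result
        # the 'next' link is absent on the last page
        url = result['_links'].get('next', {}).get('href')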

Answered By: user2965205

If you have to pull data from an HTTP API that has an endpoint accepting a page parameter:

pageNumber=1,2,...

And returning JSON:

{
    "pageCount": 5,
    "entities": [
      {"key":  "val1", ...},
      {"key":  "val2", ...},
      ...
    ]
}

Then you can iterate over all pages with the following code (after running pip3 install bezalel):

import requests
from bezalel import PaginatedApiIterator

for page in PaginatedApiIterator(requests.Session(), url="https://your/api",
                                 request_page_number_param_name="pageNumber",
                                 response_page_count_field_name="pageCount",
                                 response_records_field_name="entities"):
    print(f"Page: {page}")

It will print:

Page: [{"key":  "val1", ...}, {"key":  "val2", ...}, ...]
Page: [{"key":  "val100", ...}, {"key":  "val101", ...}, ...]
Page: [{"key":  "val200", ...}, {"key":  "val201", ...}, ...]
...

More docs: https://pypi.org/project/bezalel/
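
If you would rather not add a dependency, the same pageNumber/pageCount contract takes only a few lines with plain requests. A sketch assuming the response shape described above:

import requests

def iter_entity_pages(url):
    """Yield each page's 'entities' list until pageCount is exhausted."""
    session = requests.Session()
    page_number, page_count = 1, 1
    while page_number <= page_count:
        data = session.get(url, params={"pageNumber": page_number}).json()
        page_count = data["pageCount"]  # total number of pages, per the response
        yield data["entities"]
        page_number += 1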

Answered By: mateo7