MediaWiki API pagination using Python

Question:

I am trying out the MediaWiki API and want to get 5000 articles, but the limit (rclimit) is 500. I am new to pagination and do not know how to go about it. I tried passing in the continue parameter, but I get an error (badcontinue).

The request:

while('continue' in response):
    params.update(response['continue'])
    
    session = mwapi.Session(
            host="https://en.wikipedia.org",
            user_agent="2022"
    )

    params = {
            "action": "query",
            "list": "recentchanges",
            "rcprop": "title|ids|user|tags|timestamp",  # information that we want for every change
            "rclimit": 500,  # no of changes we want
            "continue": idx
    }

    response = session.get(
        params
    )

    idx = response['continue']
Asked By: Brie Tasi


Answers:

You didn’t specify in your question, but it looks like you’re using the mwapi module. If that’s the case, then by looking at the documentation it seems that you should simply be setting continuation=True on your session.get request. Something like this:

import mwapi

session = mwapi.Session(
        host="https://en.wikipedia.org",
        user_agent="Outreachy round fall 2022"
)

params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|user|tags|timestamp",
        "rclimit": 100,
}

# When continuation=True, session.get returns a generator of
# responses
response = session.get(params, continuation=True)

for batch in response:
    for item in batch['query']['recentchanges']:
        print(item['type'], item.get('title', '<unknown>'))

Note that rclimit is not "the number of changes we want"; it is the maximum number of changes returned in each batch. In the nested loops above, with rclimit set to 100, the inner loop (for item in batch['query']['recentchanges']) runs up to 100 times for every iteration of the outer loop.
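Since rclimit only sets the batch size, a fixed total such as the 5000 in the question has to be enforced by the caller. Below is a minimal sketch of one way to do that, assuming the object returned by session.get(params, continuation=True) is a lazy generator of response dicts shaped like the ones above. The helper names iter_changes and take_changes are hypothetical and not part of mwapi.

```python
from itertools import islice


def iter_changes(batches):
    """Flatten an iterable of API responses into individual change records."""
    for batch in batches:
        # Each response holds one page of results under query -> recentchanges.
        yield from batch["query"]["recentchanges"]


def take_changes(batches, total):
    """Return at most `total` changes, pulling only as many batches as needed."""
    # islice stops consuming once `total` items have been produced, so no
    # further batches are requested from a lazy generator after that point.
    return list(islice(iter_changes(batches), total))


# Hypothetical usage with the session and params defined above:
#   batches = session.get(params, continuation=True)
#   changes = take_changes(batches, 5000)
```

With rclimit set to 500, this should need about ten requests to collect 5000 changes, and it returns fewer if the API runs out of results first.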

Answered By: larsks