MediaWiki API pagination using Python
Question:
I am trying out the MediaWiki API and I want to get 5000 articles, but the limit (rclimit) is 500. I am new to pagination and I do not know how to go about it. I tried passing in the continue parameter, but I am getting an error (badcontinue).
The request:
while('continue' in response):
    params.update(response['continue'])

session = mwapi.Session(
    host="https://en.wikipedia.org",
    user_agent="2022"
)
params = {
    "action": "query",
    "list": "recentchanges",
    "rcprop": "title|ids|user|tags|timestamp",  # information that we want for every change
    "rclimit": 500,  # no of changes we want
    "continue": idx
}
response = session.get(
    params
)
idx = response['continue']
Answers:
You didn’t specify in your question, but it looks like you’re using the mwapi module. If that’s the case, then judging by the documentation you should simply set continuation=True on your session.get request. Something like this:
import mwapi

session = mwapi.Session(
    host="https://en.wikipedia.org",
    user_agent="Outreachy round fall 2022"
)
params = {
    "action": "query",
    "list": "recentchanges",
    "rcprop": "title|ids|user|tags|timestamp",
    "rclimit": 100,
}

# When continuation=True, session.get returns a generator of
# responses
response = session.get(params, continuation=True)
for batch in response:
    for item in batch['query']['recentchanges']:
        print(item['type'], item.get('title', '<unknown>'))
Note that rclimit is not "the number of changes we want"; it is the number of changes included in each batch of responses. So in the nested loops above, with rclimit set to 100, for item in batch['query']['recentchanges'] will iterate up to 100 times for every iteration of the outer loop (the last batch may be shorter).
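Since rclimit only controls the batch size, getting exactly 5000 changes means stopping the iteration yourself. Here is a minimal sketch of one way to do that, together with a hand-written version of the continuation loop for comparison. The names paginate, take_changes and fetch are made up for this illustration and are not part of mwapi; fetch stands for any callable that takes a dict of query parameters and returns the decoded JSON response. The badcontinue error in the question is probably caused by passing the whole response['continue'] object as the value of the continue parameter; the keys inside it are meant to be merged into the next request instead.

```python
from itertools import islice


def paginate(fetch, params):
    """Yield one response per request, following MediaWiki 'continue' tokens.

    `fetch` is a hypothetical callable (not an mwapi function): it takes a
    dict of query parameters and returns the decoded JSON response.
    """
    request = dict(params)  # copy so the caller's dict is not mutated
    while True:
        response = fetch(request)
        yield response
        if "continue" not in response:
            break  # the API reported no further batches
        # Merge every key of the 'continue' object (for recentchanges these
        # are 'rccontinue' and 'continue') into the next request, rather
        # than nesting the whole object under one "continue" parameter.
        request.update(response["continue"])


def take_changes(batches, total):
    """Return at most `total` recent-change items from an iterable of batches."""
    items = (
        item
        for batch in batches
        for item in batch.get("query", {}).get("recentchanges", [])
    )
    # islice stops pulling from the lazy generator once `total` items have
    # been produced, so no extra requests are made after that point.
    return list(islice(items, total))
```

With the session and params from the answer above, take_changes(session.get(params, continuation=True), 5000) should collect the first 5000 items; with rclimit set to 500 that is about ten requests. The paginate helper is only needed if you are not relying on continuation=True.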