Python – How to get the page Wikipedia will redirect me to?
Question:
I want to store several Wikipedia links, but I don’t want to store two different links that point to the same page. For example, the following links are different but they point to the same Wikipedia page:
https://en.wikipedia.org/w/index.php?title=(1S)-1-Methyl-2,3,4,9-tetrahydro-1H-pyrido-3,4-b-indole&redirect=no
https://en.wikipedia.org/w/index.php?title=(1S)-1-methyl-2,3,4,9-tetrahydro-1H-pyrido-3,4-b-indole&redirect=no
__________________________________________________|___________________________________________________________
The only difference is one uppercase character. Or take the following links:
https://en.wikipedia.org/wiki/(0,1)-matrix
https://en.wikipedia.org/wiki/(0,1)_matrix
___________________________________|______
These differ only because one has ‘-’ and the other has ‘_’ (a space). So what I want is to store only one of them, or else the following links:
https://en.wikipedia.org/wiki/Tetrahydroharman
https://en.wikipedia.org/wiki/Logical_matrix
I have already tried the answer to this SO question, but it didn’t work for me: the result is the initial URL, not the one Wikipedia redirects me to in the browser. So how can I achieve what I’m looking for?
Answers:
Wikipedia doesn’t issue a proper 301/302 redirect. When you open the link, a normal 200 success response is returned and the URL is then changed using JavaScript.
I came up with a quick workable solution. First, remove &redirect=no from the URL. Then fetch the page and look for its canonical link.
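The first step, dropping redirect=no, can also be scripted rather than done by hand. The following is only a minimal sketch using the standard library; strip_redirect_param is a hypothetical helper name, not something from the original answer.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_redirect_param(url):
    """Return url with any 'redirect' query parameter removed (hypothetical helper)."""
    parts = urlsplit(url)
    # keep_blank_values preserves parameters that carry no value
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key != "redirect"]
    # safe="()," leaves characters common in Wikipedia titles unescaped
    return urlunsplit(parts._replace(query=urlencode(kept, safe="(),")))
```

URLs of the /wiki/Title form have no query string and pass through unchanged.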
In [42]: import requests
In [43]: r = requests.get('https://en.wikipedia.org/w/index.php?title=(1S)-1-Methyl-2,3,4,9-tetrahydro-1H-pyrido-3,4-b-indole')
In [44]: tmp = r.text.split('<link rel="canonical" href="')[-1]  # r.text is str; r.content is bytes in Python 3
In [45]: idx = tmp.find('"')  # closing quote of the href value
In [46]: real_link = tmp[:idx]
In [47]: real_link
Out[47]: 'https://en.wikipedia.org/wiki/Tetrahydroharman'
The real URL is stored in the href attribute of the <link rel="canonical"> tag.
The above method is good enough for your use case, but you can also use a library such as bs4 to parse the page and get the link, or use a regex to extract it.
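As a rough illustration of the "parse the page" option, here is a sketch that uses the standard library's html.parser instead of bs4 so that it has no extra dependencies. The names CanonicalLinkParser and extract_canonical are made up for this example.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Collect the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lower-cased names
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.canonical is None:
            self.canonical = attr.get("href")


def extract_canonical(html):
    """Return the canonical URL found in html, or None if there is none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical
```

With requests, this would be called as extract_canonical(requests.get(url).text).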
The MediaWiki API provides various endpoints used by Wikipedia. You can use the MediaWiki Action API to get the target page of a redirect.
The result can be returned in JSON format (for example); all you need to do is parse it and read the value of the "to" element or the "title" element.
This query will retrieve the target page for ‘Halab’:
https://en.wikipedia.org/w/api.php?action=query&titles=Halab&redirects&format=json
Result:
{
    "batchcomplete": "",
    "query": {
        "redirects": [
            {
                "from": "Halab",
                "to": "Aleppo"
            }
        ],
        "pages": {
            "159244": {
                "pageid": 159244,
                "ns": 0,
                "title": "Aleppo"
            }
        }
    }
}
In Python:
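import requests

# 'redirects' tells the Action API to resolve the title to its redirect target
query = requests.get('https://en.wikipedia.org/w/api.php?action=query&titles={}&redirects&format=json'.format('Halab'))
data = query.json()
# Each entry in data['query']['pages'] carries the resolved 'title' ('Aleppo' in the result above)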
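To go from the parsed response to a single target title, something like the sketch below could be used. resolve_title is a hypothetical helper. It also looks at the "normalized" list, which to my understanding the API adds when it rewrites the requested spelling (for example underscores to spaces); that list does not appear in the result shown above, so treat that part as an assumption.

```python
def resolve_title(data, title):
    """Follow the 'normalized' and 'redirects' mappings of an Action API response."""
    query = data.get("query", {})
    # 'normalized' maps the requested spelling to MediaWiki's own title form
    for entry in query.get("normalized", []):
        if entry["from"] == title:
            title = entry["to"]
    # 'redirects' maps a redirect page to its target page
    for entry in query.get("redirects", []):
        if entry["from"] == title:
            title = entry["to"]
    return title
```

For the response above, resolve_title(data, 'Halab') gives 'Aleppo'. A title that is neither normalized nor redirected is returned unchanged.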
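The answer of Amit Tripathi can throw an exception in Python 3, where r.content is bytes and its replace method does not accept str arguments. This is my answer, which parses the HTML instead: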
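import lxml.html
import requests

res = requests.get(url)  # url: the Wikipedia link to resolve
doc = lxml.html.fromstring(res.content)
# The canonical <link> element holds the URL of the page actually served
for t in doc.xpath("//link[contains(@rel, 'canonical')]"):
    new_url = str(t.attrib['href'])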
In my experience, the canonical URL may be the same as the original URL, so it is better to check (url != new_url) before using new_url.
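Once each link can be resolved to its canonical URL, avoiding duplicate storage comes down to keying on that URL. A minimal sketch, assuming resolve is any callable that returns the canonical URL for a link, or None if it could not be determined; both dedupe_links and resolve are illustrative names, not from the answers above.

```python
def dedupe_links(urls, resolve):
    """Return one canonical URL per distinct target, in first-seen order."""
    seen = {}
    for url in urls:
        # Fall back to the URL itself when no canonical URL is found
        target = resolve(url) or url
        seen.setdefault(target, url)
    return list(seen)
```

Passing the resolver in as an argument keeps the network access separate from the bookkeeping, so the deduplication logic can be tried with a plain dictionary lookup.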