Python – How to get the page Wikipedia will redirect me to?

Question:

I want to store a few different Wikipedia links, but I don’t want to store two different links that point to the same page. For example, the following links are different but point to the same Wikipedia page:

https://en.wikipedia.org/w/index.php?title=(1S)-1-Methyl-2,3,4,9-tetrahydro-1H-pyrido-3,4-b-indole&redirect=no 
https://en.wikipedia.org/w/index.php?title=(1S)-1-methyl-2,3,4,9-tetrahydro-1H-pyrido-3,4-b-indole&redirect=no
__________________________________________________|___________________________________________________________

The only difference is one uppercase character. Or the following links:

https://en.wikipedia.org/wiki/(0,1)-matrix 
https://en.wikipedia.org/wiki/(0,1)_matrix 
___________________________________|______ 

These differ only because one has '-' and the other has '_' (that is, a space ' '). So what I want is to store only one link from each pair, or else the following links:

https://en.wikipedia.org/wiki/Tetrahydroharman 
https://en.wikipedia.org/wiki/Logical_matrix 

I have already tried the answer to this SO question, but it didn’t work for me. (The result is the initial URL, not the one Wikipedia redirects me to in the browser.) So how can I achieve what I’m looking for?

Asked By: tgwtdt

Answers:

Wikipedia doesn’t issue a proper 301/302 redirect here. When you open the link, a normal 200 success response is returned, and the URL is then changed using JavaScript.

I came up with a quick workable solution. First, remove &redirect=no from the URL:

In [42]: import requests

In [43]: r = requests.get('https://en.wikipedia.org/w/index.php?title=(1S)-1-Methyl-2,3,4,9-tetrahydro-1H-pyrido-3,4-b-indole')

In [44]: tmp = r.text.replace('<link rel="canonical" href="', 'r@ndom}-=||').split('r@ndom}-=||')[-1]

In [45]: idx = tmp.find('"/>')

In [46]: real_link = tmp[:idx]

In [47]: real_link
Out[47]: 'https://en.wikipedia.org/wiki/Tetrahydroharman'

The real URL is stored in the href attribute of the <link rel="canonical"> tag.

The method above is good enough for your use case. Alternatively, you can use a library like bs4 to parse the page and get the link, or use a regex to extract it.
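As a rough sketch of the "parse the page" alternative, the snippet below uses the standard library's html.parser instead of bs4 or a regex (a substitution made here only to avoid an extra dependency). The names CanonicalLinkParser and extract_canonical are made up for illustration, and the function expects the page HTML as a string, for example r.text.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Collects the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; attribute names are lower-cased.
        attr_map = dict(attrs)
        rel_values = (attr_map.get("rel") or "").split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attr_map.get("href")


def extract_canonical(html_text):
    """Return the canonical URL found in html_text, or None if there is none."""
    parser = CanonicalLinkParser()
    parser.feed(html_text)
    return parser.canonical
```

Because it looks at parsed attributes, this does not depend on whether the tag is closed with "/>" or on the order of its attributes.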

Answered By: Amit Tripathi

The MediaWiki API provides the various endpoints used by Wikipedia. You can use the MediaWiki Action API to get the target page of a redirect.

The result can be returned in JSON format, for example.

All you need to do is parse it to get the value of the "to" element or the "title" element.

This query retrieves the target page for 'Halab':

https://en.wikipedia.org/w/api.php?action=query&titles=Halab&redirects&format=json

Result:

{  
   "batchcomplete":"",
   "query":{  
      "redirects":[  
         {  
            "from":"Halab",
            "to":"Aleppo"
         }
      ],
      "pages":{  
         "159244":{  
            "pageid":159244,
            "ns":0,
            "title":"Aleppo"
         }
      }
   }
}
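Pulling the "to" and "title" values out of a response of this shape might look like the sketch below. The helper names redirect_targets and page_titles are hypothetical, and sample_response simply repeats the result shown above as an already parsed dictionary.

```python
# The response shown above, as a parsed Python dictionary.
sample_response = {
    "batchcomplete": "",
    "query": {
        "redirects": [{"from": "Halab", "to": "Aleppo"}],
        "pages": {"159244": {"pageid": 159244, "ns": 0, "title": "Aleppo"}},
    },
}


def redirect_targets(data):
    """Return a {source title: target title} dict from a parsed response."""
    redirects = data.get("query", {}).get("redirects", [])
    return {entry["from"]: entry["to"] for entry in redirects}


def page_titles(data):
    """Return the titles of the pages the query finally resolved to."""
    pages = data.get("query", {}).get("pages", {})
    return [page["title"] for page in pages.values()]


print(redirect_targets(sample_response))
print(page_titles(sample_response))
```

When the requested title is not a redirect, the "redirects" list is absent, so redirect_targets returns an empty dict while page_titles still reports the page title.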

In Python:

import requests

# Let requests URL-encode the parameters; "redirects" asks the API to resolve redirects.
params = {'action': 'query', 'titles': 'Halab', 'redirects': 1, 'format': 'json'}
query = requests.get('https://en.wikipedia.org/w/api.php', params=params)

data = query.json()

Answered By: Abdulrahman Bres

Amit Tripathi’s answer threw an exception for me. This is my answer:

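import lxml.html
import requests

res = requests.get(url)  # url is the Wikipedia link to resolve
doc = lxml.html.fromstring(res.content)
# The canonical <link> element holds the URL of the page actually shown.
for t in doc.xpath("//link[contains(@rel, 'canonical')]"):
    new_url = str(t.attrib['href'])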

In my experience, the canonical link may point to the same URL you started with, so it is better to check (url != new_url) before using new_url.
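To tie this back to the goal of storing each page only once, here is a minimal sketch of a de-duplication step. The function unique_canonical_links and its resolve parameter are hypothetical: resolve stands for any callable that maps a URL to its canonical URL (for instance, a wrapper around the lookup above), and it is passed in so the logic can be tried without network access.

```python
def unique_canonical_links(urls, resolve):
    """Return canonical URLs in first-seen order, skipping duplicates.

    resolve is any callable that maps a URL to its canonical URL,
    or returns None when no canonical URL could be found.
    """
    seen = set()
    result = []
    for url in urls:
        # Fall back to the original URL if resolution yields nothing.
        canonical = resolve(url) or url
        if canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result


# Stand-in resolver for illustration only; a real one would fetch the page.
known = {
    "https://en.wikipedia.org/wiki/(0,1)-matrix": "https://en.wikipedia.org/wiki/Logical_matrix",
    "https://en.wikipedia.org/wiki/(0,1)_matrix": "https://en.wikipedia.org/wiki/Logical_matrix",
}
links = unique_canonical_links(list(known), known.get)
```

With the two (0,1) matrix links from the question as input, links ends up holding a single entry.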

Answered By: Kathy Razmadze