How can I un-shorten a URL using python?

Question:

I have seen this thread already – How can I unshorten a URL?

My issue with the accepted answer (which uses the unshort.me API) is that I am focusing on unshortening YouTube links. Since unshort.me is heavily used, almost 90% of the results come back as captchas, which I am unable to solve.

So far I am stuck with using:

def unshorten_url(url):
    resolvedURL = urllib2.urlopen(url)  
    print resolvedURL.url

    #t = Test()
    #c = pycurl.Curl()
    #c.setopt(c.URL, 'http://api.unshort.me/?r=%s&t=xml' % (url))
    #c.setopt(c.WRITEFUNCTION, t.body_callback)
    #c.perform()
    #c.close()
    #dom = xml.dom.minidom.parseString(t.contents)
    #resolvedURL = dom.getElementsByTagName("resolvedURL")[0].firstChild.nodeValue
    return resolvedURL.url

Note: everything in the comments is what I tried to do when using the unshort.me service which was returning captcha links.

Does anyone know of a more efficient way to complete this operation without using open (since it is a waste of bandwidth)?

Asked By: brandonmat

||

Answers:

You DO have to open it, otherwise you won’t know what URL it will redirect to. As Greg put it:

A short link is a key into somebody else’s database; you can’t expand the link without querying the database

Now to your question.

Does anyone know of a more efficient way to complete this operation
without using open (since it is a waste of bandwidth)?

A more efficient way is to keep the connection open in the background instead of closing it, by using HTTP’s Connection: keep-alive.

After a small test, unshorten.me seems to honor the HEAD method and answers with a redirect to itself:

> telnet unshorten.me 80
Trying 64.202.189.170...
Connected to unshorten.me.
Escape character is '^]'.
HEAD http://unshort.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp HTTP/1.1
Host: unshorten.me

HTTP/1.1 301 Moved Permanently
Date: Mon, 22 Aug 2011 20:42:46 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Location: http://resolves.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp
Cache-Control: private
Content-Length: 0

So if you use the HEAD HTTP method, instead of GET, you will actually end up doing the same work twice.

Instead, you should keep the connection alive. That saves only a little bandwidth, but it certainly saves the latency of establishing a new connection every time. Establishing a TCP/IP connection is expensive.

You should get away with a number of kept-alive connections to the unshorten service equal to the number of concurrent connections your own service receives.

You could manage these connections in a pool. That’s about as close as you can get, short of tweaking your kernel’s TCP/IP stack.
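
As a rough illustration of the keep-alive idea, here is a minimal Python 3 sketch that sends several HEAD requests over one http.client connection instead of opening a new one per link. http.client speaks HTTP/1.1, where connections are persistent by default, so no explicit header is needed. The names expand_many and expand_on_host, and the host in the usage comment, are made up for this example.

```python
import http.client


def expand_many(conn, paths):
    """Send one HEAD request per path over the same connection.

    `conn` is an http.client.HTTPConnection or HTTPSConnection to the
    shortener's host. Returns {path: Location header or None}.
    """
    targets = {}
    for path in paths:
        conn.request("HEAD", path)
        response = conn.getresponse()
        response.read()  # finish this response before reusing the connection
        targets[path] = response.getheader("Location")
    return targets


def expand_on_host(host, paths, timeout=10):
    """Open one HTTPS connection to `host`, resolve all paths, then close it."""
    conn = http.client.HTTPSConnection(host, timeout=timeout)
    try:
        return expand_many(conn, paths)
    finally:
        conn.close()


# Usage sketch (placeholder host and paths, not real short links):
# print(expand_on_host("short.example", ["/abc", "/def"]))
```

This only resolves one hop per link and only helps when many links share the same shortener host. A real pool would keep one such connection per host.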

Answered By: Flavius

Use the best rated answer (not the accepted answer) in that question:

# This is for Py2k.  For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse

def unshorten_url(url):
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    resource = parsed.path
    if parsed.query != "":
        resource += "?" + parsed.query
    h.request('HEAD', resource )
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return unshorten_url(response.getheader('Location')) # changed to process chains of short urls
    else:
        return url
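
For Python 3, the port that the comment at the top describes might look roughly like the sketch below. It goes slightly beyond the code above: it picks HTTPSConnection for https URLs, resolves relative Location headers with urljoin, and caps the number of hops through a max_hops parameter. Those three additions are assumptions of this sketch, not part of the original answer.

```python
import http.client
import urllib.parse


def unshorten_url(url, max_hops=10):
    """Follow redirects with HEAD requests and return the final URL."""
    parsed = urllib.parse.urlparse(url)
    if parsed.scheme == "https":
        conn_cls = http.client.HTTPSConnection
    else:
        conn_cls = http.client.HTTPConnection
    conn = conn_cls(parsed.netloc, timeout=10)
    resource = parsed.path or "/"
    if parsed.query:
        resource += "?" + parsed.query
    try:
        conn.request("HEAD", resource)
        response = conn.getresponse()
        status = response.status
        location = response.getheader("Location")
    finally:
        conn.close()
    # Integer division, as the Py2k comment recommends for Py3k.
    if status // 100 == 3 and location and max_hops > 0:
        # Location may be relative, so resolve it against the current URL.
        next_url = urllib.parse.urljoin(url, location)
        return unshorten_url(next_url, max_hops - 1)
    return url
```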
Answered By: Pedro Loureiro

A one-line function using the requests library. And yes, it follows chains of short URLs.

import requests

def unshorten_url(url):
    return requests.head(url, allow_redirects=True).url
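
If you want to see the intermediate hops as well, requests keeps them in response.history. The sketch below is a small variation on the one-liner. Note that requests.head does not follow redirects unless allow_redirects=True is passed, which is why the one-liner sets it. The function name redirect_chain and the optional session parameter are assumptions added for this example.

```python
import requests


def redirect_chain(url, session=None, timeout=10):
    """Return every URL visited while resolving `url`, final URL last."""
    http = session or requests  # a requests.Session reuses connections
    response = http.head(url, allow_redirects=True, timeout=timeout)
    return [hop.url for hop in response.history] + [response.url]
```

Passing a requests.Session as session lets repeated calls reuse kept-alive connections to the same shortener.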
Answered By: bersam

Here is source code that takes most of the useful corner cases into account:

  • set a custom timeout.
  • set a custom User-Agent.
  • check whether we have to use an http or https connection.
  • resolve the input url recursively and avoid getting stuck in a redirect loop.

The source code is on GitHub at https://github.com/amirkrifa/UnShortenUrl

Comments are welcome.

import logging
logging.basicConfig(level=logging.DEBUG)

TIMEOUT = 10
class UnShortenUrl:
    def process(self, url, previous_url=None):
        logging.info('Init url: %s'%url)
        import urlparse
        import httplib
        try:
            parsed = urlparse.urlparse(url)
            if parsed.scheme == 'https':
                h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
            else:
                h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
            resource = parsed.path
            if parsed.query != "": 
                resource += "?" + parsed.query
            try:
                h.request('HEAD',
                          resource,
                          headers={'User-Agent': 'curl/7.38.0'})
                response = h.getresponse()
            except Exception:
                # The request itself failed; log it and give up on this hop.
                import traceback
                traceback.print_exc()
                return url
            logging.info('Response status: %d'%response.status)
            if response.status/100 == 3 and response.getheader('Location'):
                red_url = response.getheader('Location')
                logging.info('Red, previous: %s, %s'%(red_url, previous_url))
                if red_url == previous_url:
                    return red_url
                return self.process(red_url, previous_url=url) 
            else:
                return url 
        except:
            import traceback
            traceback.print_exc()
            return None
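
One caveat: the code above compares each redirect target only with the immediately preceding URL, so a cycle of three or more URLs would keep recursing until Python's recursion limit is hit. A possible way around that, sketched here for Python 3, is to remember every URL seen so far and to cap the number of hops. The function follow_redirects and its fetch_location callable are hypothetical names for this example. fetch_location stands in for whatever performs the HEAD request and returns the Location header, or None when there is no redirect.

```python
from urllib.parse import urljoin


def follow_redirects(url, fetch_location, max_hops=10):
    """Follow redirects until a non-redirect, a repeated URL, or max_hops."""
    seen = {url}
    for _ in range(max_hops):
        location = fetch_location(url)
        if not location:
            return url  # no redirect: this is the final destination
        next_url = urljoin(url, location)  # Location may be relative
        if next_url in seen:
            return url  # cycle detected: stop at the last new URL
        seen.add(next_url)
        url = next_url
    return url  # hop limit reached


# Usage sketch with a dict standing in for the network:
# hops = {"http://a.test/1": "http://a.test/2"}
# follow_redirects("http://a.test/1", hops.get)
```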
Answered By: Amir Krifa

import requests

short_url = "<your short url goes here>"
long_url = requests.get(short_url).url
print(long_url)
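
Since the question is about bandwidth, note that a plain requests.get downloads the body of the final page. One possible variation, sketched below, tries HEAD first and falls back to GET with stream=True, which defers downloading the body until it is actually read. The function name expand_url, the optional session parameter, and the choice to fall back on any status of 400 or above are assumptions of this sketch.

```python
import requests


def expand_url(url, session=None, timeout=10):
    """Resolve `url` with HEAD, falling back to a streamed GET."""
    http = session or requests
    response = http.head(url, allow_redirects=True, timeout=timeout)
    if response.status_code < 400:
        return response.url
    # HEAD was refused or failed: retry with GET, but leave the body unread.
    with http.get(url, stream=True, timeout=timeout) as response:
        return response.url
```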
Answered By: dinesh mane