Problem with detecting if link is invalid

Question:

Is there any way to detect if a link is invalid using webbot?
I need to tell the user that the link they provided was unreachable.

Asked By: susthebus


Answers:

The only way to be completely sure that a URL sends you to a valid page is to fetch that page and check it works. You could try making a request other than GET (a HEAD request, say) to avoid the wasted bandwidth of downloading the page, but not all servers will respond to it: the only way to be absolutely sure is to GET and see what happens. Something like:

import requests
from requests.exceptions import RequestException

def check_url(url):
    """Return True if a GET request to url comes back with HTTP 200."""
    try:
        # Short timeout so the check doesn't hang on slow hosts.
        r = requests.get(url, timeout=1)
        return r.status_code == 200
    except RequestException:
        # Covers connection errors, timeouts, invalid URLs, etc.
        return False

Is this a good idea? It’s only a GET request, and GET is supposed to be idempotent, so you shouldn’t cause anybody any harm. On the other hand, what if a user sets up a script to add a new link every second pointing to the same website? Then you’re DDoSing that website. So when you allow users to cause your server to do things like this, you need to think about how you might protect it. (In this case: you could keep a cache of valid links expiring every n seconds, and only do the lookup when the cache doesn’t hold the link; see the sketch below.)
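A minimal sketch of such a cache (check_url is the function above; the dict-based cache and the 300-second TTL are illustrative choices, and eviction is ignored for brevity):

import time

_cache = {}          # url -> (timestamp, result)
CACHE_SECONDS = 300  # illustrative TTL; tune for your use case

def check_url_cached(url):
    """Reuse a recent result so repeated submissions don't hammer the target."""
    now = time.time()
    hit = _cache.get(url)
    if hit is not None and now - hit[0] < CACHE_SECONDS:
        return hit[1]
    ok = check_url(url)
    _cache[url] = (now, ok)
    return ok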

Note that if you just want to check that the link points to a valid domain it’s a bit easier: you can just do a DNS query. (The same point about caching and avoiding abuse probably applies.)
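For instance, a sketch of the DNS-only check using just the standard library (extracting the hostname with urlparse and resolving it with socket.getaddrinfo is one illustrative way to do it):

import socket
from urllib.parse import urlparse

def domain_resolves(url):
    """Return True if the URL's hostname resolves via DNS."""
    hostname = urlparse(url).hostname
    if not hostname:
        return False
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False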

Note that I used requests because it is easy, but you likely want to do this in the background, either with requests in a thread, or with one of the asyncio HTTP libraries and an asyncio event loop. Otherwise your code will block, potentially for the full timeout.
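A minimal sketch of the threaded option with the standard library’s ThreadPoolExecutor (check_url is the function above; the callback-based wrapper is just one illustrative way to hand the result back):

from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def check_url_in_background(url, callback):
    """Run check_url in a worker thread and pass the result to callback."""
    future = executor.submit(check_url, url)
    future.add_done_callback(lambda f: callback(f.result()))

# Usage: the caller returns immediately; the callback fires later.
check_url_in_background("https://example.com", lambda ok: print("reachable:", ok))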

(Another attack: this gets the whole page. What if a user links to a massive page? See this question for a discussion of protecting from oversize responses. For your use case you likely just want to get a few bytes. I’ve deliberately not complicated the example code here because I wanted to illustrate the principle.)
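If you do want to stop after a few bytes, a sketch using requests’ streaming mode (the 1024-byte limit is an arbitrary illustrative value):

import requests
from requests.exceptions import RequestException

def check_url_streaming(url):
    """Check reachability without downloading the whole response body."""
    try:
        with requests.get(url, timeout=1, stream=True) as r:
            # Pull at most one small chunk, then let the connection close.
            next(r.iter_content(chunk_size=1024), b"")
            return r.status_code == 200
    except RequestException:
        return False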

Note that this just checks that something is available at that URL. What if it’s one of the many dead links which redirect to a domain-name seller’s website? You could enforce ‘no redirects’, but then some redirects are valid. (Likewise, you could try to detect redirects up to the main domain or to a blacklist of vendors’ domains, but this will always be imperfect.) There is a tradeoff here to consider, which depends on your concrete use case, but it’s worth being aware of.
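If you do decide to reject redirects, requests makes it a one-parameter change (allow_redirects is a real requests parameter; treating every redirect as a failure is the illustrative policy here):

import requests
from requests.exceptions import RequestException

def check_url_no_redirects(url):
    """Only a direct 200 counts; any redirect is treated as a failure."""
    try:
        r = requests.get(url, timeout=1, allow_redirects=False)
        return r.status_code == 200
    except RequestException:
        return False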

Answered By: 2e0byo

You could try sending an HTTP request, opening the result, and checking it against a list of known error codes (404, etc.). This is easy to implement in Python, and it is efficient and quick. Be warned that sometimes (quite rarely) a website might detect your scraper and artificially return an error code to confuse you.
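A minimal sketch of that approach (the particular set of codes treated as failures is an illustrative choice, not a standard list):

import requests
from requests.exceptions import RequestException

# Illustrative set of status codes treated as "the link is broken".
ERROR_CODES = {400, 401, 403, 404, 410, 500, 502, 503}

def link_looks_valid(url):
    try:
        r = requests.get(url, timeout=1)
        return r.status_code not in ERROR_CODES
    except RequestException:
        return False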

Answered By: Sergei Kiselev