Robotparser doesn't seem to parse correctly

Question:

I am writing a crawler, and as part of it I am implementing a robots.txt parser using the standard library module robotparser.

It seems that robotparser is not parsing correctly; I am debugging my crawler against Google’s robots.txt.

(Following examples are from IPython)

In [1]: import robotparser

In [2]: x = robotparser.RobotFileParser()

In [3]: x.set_url("http://www.google.com/robots.txt")

In [4]: x.read()

In [5]: x.can_fetch("My_Crawler", "/catalogs") # This should return False, since it's on Disallow
Out[5]: False

In [6]: x.can_fetch("My_Crawler", "/catalogs/p?") # This should return True, since it's Allowed
Out[6]: False

In [7]: x.can_fetch("My_Crawler", "http://www.google.com/catalogs/p?")
Out[7]: False

It’s funny because sometimes it seems to “work” and sometimes it seems to fail; I also tried the same thing with the robots.txt files from Facebook and Stack Overflow. Is this a bug in the robotparser module, or am I doing something wrong here? If so, what?

I was also wondering whether this might be related to a known bug.

Asked By: user689383


Answers:

After a few Google searches I didn’t find anything about a robotparser issue. I ended up with something else: I found a module called reppy, did some testing with it, and it seems very powerful. You can install it through pip:

pip install reppy

Here are a few examples (in IPython) of using reppy, again with Google’s robots.txt:

In [1]: import reppy

In [2]: x = reppy.fetch("http://google.com/robots.txt")

In [3]: x.atts
Out[3]: 
{'agents': {'*': <reppy.agent at 0x1fd9610>},
 'sitemaps': ['http://www.gstatic.com/culturalinstitute/sitemaps/www_google_com_culturalinstitute/sitemap-index.xml',
  'http://www.google.com/hostednews/sitemap_index.xml',
  'http://www.google.com/sitemaps_webmasters.xml',
  'http://www.google.com/ventures/sitemap_ventures.xml',
  'http://www.gstatic.com/dictionary/static/sitemaps/sitemap_index.xml',
  'http://www.gstatic.com/earth/gallery/sitemaps/sitemap.xml',
  'http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml',
  'http://www.gstatic.com/trends/websites/sitemaps/sitemapindex.xml']}

In [4]: x.allowed("/catalogs/about", "My_crawler") # Should return True, since it's allowed.
Out[4]: True

In [5]: x.allowed("/catalogs", "My_crawler") # Should return False, since it's not allowed.
Out[5]: False

In [7]: x.allowed("/catalogs/p?", "My_crawler") # Should return True, since it's allowed.
Out[7]: True

In [8]: x.refresh() # Refresh robots.txt, perhaps a magic change?

In [9]: x.ttl
Out[9]: 3721.3556718826294

In [10]: # It also has an x.disallowed function, the opposite of x.allowed
Answered By: user689383

interesting question. i had a look at the source (i only have python 2.4 source available, but i bet it hasn’t changed) and the code normalises the url that is being tested by executing:

urllib.quote(urlparse.urlparse(urllib.unquote(url))[2]) 

which is the source of your problems:

>>> urllib.quote(urlparse.urlparse(urllib.unquote("/foo"))[2]) 
'/foo'
>>> urllib.quote(urlparse.urlparse(urllib.unquote("/foo?"))[2]) 
'/foo'

so it’s either a bug in python’s library, or google is breaking the robots.txt spec by including a “?” character in a rule (which is a bit unusual).

[just in case it’s not clear, i’ll say it again in a different way. the code above is used by the robotparser library as part of checking the url. so when the url ends in a “?” that character is dropped. so when you checked for /catalogs/p? the actual test executed was for /catalogs/p. hence your surprising result.]
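to make that concrete, here is the same normalisation applied to the three urls from the question (just a quick python 2 sketch reusing the expression above, nothing more):

import urllib
import urlparse

for url in ["/catalogs", "/catalogs/p?", "http://www.google.com/catalogs/p?"]:
    # the same normalisation robotparser applies before matching rules
    normalised = urllib.quote(urlparse.urlparse(urllib.unquote(url))[2])
    print("%s -> %s" % (url, normalised))

# /catalogs                          -> /catalogs
# /catalogs/p?                       -> /catalogs/p
# http://www.google.com/catalogs/p?  -> /catalogs/p

all three checks end up comparing a path with the “?” stripped, which is why /catalogs/p? is treated exactly like /catalogs/p.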

i’d suggest filing a bug with the python folks (you can post a link to here as part of the explanation) [edit: thanks]. and then using the other library you found…

Answered By: andrew cooke

About a week ago we merged a commit with a bug in it that’s causing this issue. We’ve just pushed version 0.2.2 to pip and to master in the repo, including a regression test for exactly this issue.

Version 0.2 contains a slight interface change — now you must create a RobotsCache object which contains the exact interface that reppy originally had. This was mostly to make the caching explicit and make it possible to have different caches within the same process. But behold, it now works again!

from reppy.cache import RobotsCache
cache = RobotsCache()
cache.allowed('http://www.google.com/catalogs', 'foo')
cache.allowed('http://www.google.com/catalogs/p', 'foo')
cache.allowed('http://www.google.com/catalogs/p?', 'foo')
Answered By: Dan Lecocq

This isn’t a bug, but rather a difference in interpretation. According to the draft robots.txt specification (which was never approved, nor is it likely to be):

To evaluate if access to a URL is allowed, a robot must attempt to
match the paths in Allow and Disallow lines against the URL, in the
order they occur in the record. The first match found is used. If no
match is found, the default assumption is that the URL is allowed.

(Section 3.2.2, The Allow and Disallow Lines)

Under that interpretation, “/catalogs/p?” should be rejected, because a “Disallow: /catalogs” directive appears earlier in the record.

At some point, Google started interpreting robots.txt differently from that specification. Their method appears to be:

Check for Allow. If it matches, crawl the page.
Check for Disallow. If it matches, don't crawl.
Otherwise, crawl.
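
To make the difference concrete, here is a minimal sketch of the two readings; the rule set and helper functions are purely illustrative, not taken from any real parser:

# A tiny Google-style rule set, in file order: Disallow first, then Allow.
rules = [("Disallow", "/catalogs"), ("Allow", "/catalogs/p")]

def allowed_first_match(path):
    # 1996 draft reading: the first Allow/Disallow line that matches wins.
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == "Allow"
    return True  # no rule matched: allowed by default

def allowed_allow_wins(path):
    # Google-style reading: a matching Allow wins even if a Disallow also matches.
    if any(kind == "Allow" and path.startswith(prefix) for kind, prefix in rules):
        return True
    if any(kind == "Disallow" and path.startswith(prefix) for kind, prefix in rules):
        return False
    return True

print(allowed_first_match("/catalogs/p123"))  # False: "Disallow: /catalogs" comes first
print(allowed_allow_wins("/catalogs/p123"))   # True: the Allow line matches, so it wins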

The problem is that there is no formal agreement on the interpretation of robots.txt. I’ve seen crawlers that use the Google method and others that use the draft standard from 1996. When I was operating a crawler, I got nastygrams from webmasters when I used the Google interpretation because I crawled pages they thought shouldn’t be crawled, and I got nastygrams from others if I used the other interpretation because stuff they thought should be indexed, wasn’t.

Answered By: Jim Mischel

While this is an old question by now, I thought I would add my findings.
The problem is with the read() method.

Looking at the source code for read():


    def read(self):
        """Reads the robots.txt URL and feeds it to the parser."""
        try:
            f = urllib.request.urlopen(self.url)
        except urllib.error.HTTPError as err:
            if err.code in (401, 403):
                self.disallow_all = True
            elif err.code >= 400 and err.code < 500:
                self.allow_all = True
        else:
            raw = f.read()
            self.parse(raw.decode("utf-8").splitlines())

You will notice in the exception handler that HTTPErrors are handled silently by setting self.disallow_all or self.allow_all.
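
If you stay with read(), one way to notice that this has happened is to inspect the flags it sets afterwards (a small sketch; the URL is a placeholder):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")  # placeholder URL
rp.read()
if rp.disallow_all:
    print("robots.txt request failed with 401/403; every path will be reported as disallowed")
elif rp.allow_all:
    print("robots.txt request failed with another 4xx; every path will be reported as allowed")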

I found that if I replace read() with a requests.get() call and feed the response text to parse(), things work as expected. I prefer to see what is happening and can choose to handle exceptions as required. This also makes it possible to use httpx for async programs.

import requests
from urllib.robotparser import RobotFileParser

# "sites" here is an iterable of hostnames, e.g. ["example.com"]
for site in sites:
    robot_url = f'https://{site}/robots.txt'
    rp = RobotFileParser()
    r = requests.get(robot_url)
    rp.parse(r.text.splitlines())
    print(f'{site}: {rp.can_fetch("MyUserAgent", "/api/")=}')
    print(f'{site}: {rp.can_fetch("*", "/media_proxy/")=}')

What is odd is that with read() there were a number of HTTPErrors (403), while using requests with the same sites I am not seeing any errors. I suspect those sites are blocking the default user agent used by urllib.
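
Since httpx was mentioned, here is a rough async equivalent; this is only a sketch, it assumes httpx is installed, and the commented-out site list is a placeholder:

import asyncio
import httpx
from urllib.robotparser import RobotFileParser

async def fetch_rules(client, site):
    # Fetch robots.txt ourselves so HTTP errors surface instead of being swallowed.
    r = await client.get(f'https://{site}/robots.txt')
    r.raise_for_status()
    rp = RobotFileParser()
    rp.parse(r.text.splitlines())
    return site, rp

async def main(sites):
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(fetch_rules(client, s) for s in sites))
    for site, rp in results:
        print(f'{site}: {rp.can_fetch("MyUserAgent", "/api/")=}')

# asyncio.run(main(["example.com"]))  # placeholder site list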

Answered By: EGarbus