Scraping Google Scholar with urllib2 instead of requests

Question:

I have the simple script below which works just fine for fetching a list of articles from Google Scholar searching for a term of interest.

import urllib
import urllib2
import requests
from bs4 import BeautifulSoup

SEARCH_SCHOLAR_HOST = "https://scholar.google.com"
SEARCH_SCHOLAR_URL = "/scholar"

def searchScholar(searchStr, limit=10):
    """Search Google Scholar for articles and publications containing terms of interest"""
    url = SEARCH_SCHOLAR_HOST + SEARCH_SCHOLAR_URL + "?q=" + urllib.quote_plus(searchStr) + "&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search"
    content = requests.get(url, verify=False).text
    page = BeautifulSoup(content, 'lxml')
    results = {}
    count = 0
    for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
        if count >= limit:
            break
        try:
            text = entry.a.text.encode("ascii", "ignore")
            url = entry.a['href']
            results[url] = text
            count += 1
        except (AttributeError, KeyError):  # entry without a link, skip it
            pass
    return results

queryStr = "Albert einstein"
pubs = searchScholar(queryStr, 10)
if len(pubs) == 0:
    print "No articles found"
else:   
    for pub in pubs.keys():
        print pub + ' ' + pubs[pub]

However, I want to run this script as a CGI application on a remote server where I have no console access, so I cannot install any external Python modules. (I managed to 'install' BeautifulSoup without resorting to pip or easy_install by simply copying the bs4 directory into my cgi-bin directory, but this trick did not work with requests because of its large number of dependencies.)

So, my question is: is it possible to use the built-in urllib2 or httplib modules instead of requests to fetch the Google Scholar page and then pass it to BeautifulSoup? It should be, because I found some code here which scrapes Google Scholar using just the standard library plus BeautifulSoup, but it is rather convoluted. I would prefer a much simpler solution, just adapting my script to use the standard library instead of requests.

Could anyone give me some help?

Asked By: maurobio


Answers:

This code is enough to perform a simple request using urllib2:

def get(url):
    req = urllib2.Request(url)
    req.add_header('User-Agent', 'Mozilla/2.0 (compatible; MSIE 5.5; Windows NT)')
    return urllib2.urlopen(req).read()

If you need to do something more advanced in the future, it will take more code. What requests does is simplify usage compared to the standard library.
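To show how this helper slots into the question's script, here is a sketch of the fetch side using only the standard library (untested against the live site; `buildScholarUrl` is a helper name introduced here for illustration, and the try/except import is an optional shim so the same code also runs on Python 3, where urllib2 became urllib.request):

```python
# Sketch: the original script's fetch logic, adapted to the standard
# library only. BeautifulSoup parsing stays unchanged.
try:
    import urllib2                        # Python 2, as in the question
    from urllib import quote_plus
except ImportError:
    import urllib.request as urllib2      # Python 3 equivalent
    from urllib.parse import quote_plus

SEARCH_SCHOLAR_HOST = "https://scholar.google.com"
SEARCH_SCHOLAR_URL = "/scholar"

def get(url):
    """Fetch a page with urllib2, sending a browser-like User-Agent."""
    req = urllib2.Request(url)
    req.add_header('User-Agent', 'Mozilla/2.0 (compatible; MSIE 5.5; Windows NT)')
    return urllib2.urlopen(req).read()

def buildScholarUrl(searchStr):
    """Build the same query URL the original script passed to requests.get."""
    return (SEARCH_SCHOLAR_HOST + SEARCH_SCHOLAR_URL
            + "?q=" + quote_plus(searchStr)
            + "&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search")

# Inside searchScholar, the only change from the original is the fetch line:
#     content = get(buildScholarUrl(searchStr))
#     page = BeautifulSoup(content, 'lxml')
```

The rest of the script (the BeautifulSoup loop and the printing) needs no changes, since `get()` returns the page body just as `requests.get(url).text` did.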

Answered By: SimonF