I think I have to use urlopen to open each URL and then use urlretrieve to download each PGN by accessing it from the download button near the bottom of each game. Do I have to create a new BeautifulSoup object for each game? I’m also unsure of how
    import urllib
    from urllib.request import urlopen, urlretrieve, quote
    from bs4 import BeautifulSoup

    url = 'http://www.chessgames.com/perl/chesscollection?cid=1014492'
    u = urlopen(url)
    html = u.read().decode('utf-8')

    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all('a'):
        urlopen('http://chessgames.com'+link.get('href'))
There is no short answer to your question. I will show you a complete solution and comment on each part of the code.
First, import the necessary modules:
    from bs4 import BeautifulSoup
    import requests
    import re
Next, fetch the index page and create a BeautifulSoup object from it:
    req = requests.get("http://www.chessgames.com/perl/chesscollection?cid=1014492")
    soup = BeautifulSoup(req.text, "lxml")
I strongly advise using the lxml parser rather than the common html.parser.
After that, build the list of links to the individual games:
    # "\?" matches a literal question mark; an unescaped "?" would only make the "e" optional
    pages = soup.find_all('a', href=re.compile(r'chessgame\?'))
This works by selecting only the links whose href contains the word ‘chessgame’.
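One detail worth knowing when filtering links with a regular expression: BeautifulSoup applies a compiled pattern with search semantics, and an unescaped `?` is a quantifier rather than a literal character. The sketch below compares a loose and a strict pattern using only the `re` module. The hrefs in the list are made up for illustration and are not taken from the real page.

```python
import re

# Hypothetical hrefs, invented for this example; the real page may differ.
hrefs = [
    "/perl/chessgame?gid=1234567",
    "/perl/chessgam",
    "/perl/chesscollection?cid=1014492",
    "/perl/chessplayer?pid=100",
]

# Unescaped "?": the final "e" becomes optional, so "chessgam" also matches.
loose = re.compile(r".*chessgame?.*")
# Escaped "\?": requires the literal text "chessgame?".
strict = re.compile(r"chessgame\?")

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]

print(loose_hits)   # the first two hrefs
print(strict_hits)  # only the first href
```

On a page where every relevant link really does contain "chessgame?", both patterns select the same links, so the difference only matters if similar-looking hrefs appear.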
Now write a function that downloads a file for you:
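    def download_file(url):
        # Last path segment without the query string; [0] picks the name
        # out of the list returned by split('?')
        path = url.split('/')[-1].split('?')[0]

        r = requests.get(url, stream=True)
        if r.status_code == 200:
            with open(path, 'wb') as f:
                for chunk in r:
                    f.write(chunk)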
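As an alternative to splitting the URL by hand, the file name can be taken from the parsed URL path with the standard library. This is only a sketch: `filename_from_url` and its `default` parameter are hypothetical names, and the URL in the usage line is an invented example, not a real download link.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = urlsplit(url).path          # e.g. "/pgn/example.pgn"
    name = posixpath.basename(path)    # e.g. "example.pgn"
    # Fall back to a default when the URL ends in "/" and has no file name.
    return name or default


# Hypothetical URL, used only to show the behaviour.
print(filename_from_url("http://www.chessgames.com/pgn/example.pgn?gid=1234567"))
# → example.pgn
```

Parsing the URL first keeps the query string and fragment out of the file name without relying on the position of the `?`.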
The final step is to repeat the previous steps for every game page, building the link that is passed to the file downloader:
    host = 'http://www.chessgames.com'

    for page in pages:
        url = host + page.get('href')
        req = requests.get(url)
        soup = BeautifulSoup(req.text, "lxml")
        file_link = soup.find('a', text=re.compile('.*download.*'))
        file_url = host + file_link.get('href')
        download_file(file_url)
(First you search each game page for the link whose text contains ‘download’, then you build the full URL by concatenating the hostname and the path, and finally you download the file.)
I hope you can use this code without correction!
The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection).
Next, asyncio, multiprocessing or multithreading are available to parallelize the workload. Each has tradeoffs respective to the task at hand and which you choose is likely best determined by benchmarking and profiling. This page offers great examples of all three.
For the purposes of this post, I’ll show multithreading. The impact of the GIL shouldn’t be too much of a bottleneck because the tasks are mostly IO-bound, consisting of babysitting requests on the air to wait for the response. When a thread is blocked on IO, it can yield to a thread parsing HTML or doing other CPU-bound work.
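Before the full script, here is a minimal sketch of the idea using only the standard library. `fake_download` is a hypothetical stand-in for a network request: it sleeps instead of contacting a server, which releases the GIL in the same way a blocking socket read does. The file names are invented.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(name):
    """Stand-in for a network request (an assumption for this example)."""
    if not name.endswith(".pgn"):
        raise ValueError(f"unexpected file name: {name}")
    time.sleep(0.2)  # simulated IO wait; other threads can run meanwhile
    return f"{name}: ok"


def run_all(names, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns a lazy iterator. Consuming it with list() yields
        # results in input order and re-raises any exception from a worker.
        return list(pool.map(fake_download, names))


names = [f"game{i}.pgn" for i in range(4)]
start = time.perf_counter()
results = run_all(names)
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")  # typically well below 4 * 0.2s
```

Note that if the iterator returned by `pool.map` is never consumed, exceptions raised inside the workers are not re-raised in the calling thread, so failed downloads can pass unnoticed.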
Here’s the code:
    import os
    import re
    from concurrent.futures import ThreadPoolExecutor

    import requests
    from bs4 import BeautifulSoup


    def download_pgn(task):
        session, host, page, destination_path = task

        # Fetch the game page and locate its "download" link.
        response = session.get(host + page)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        game_url = host + soup.find("a", text="download").get("href")

        # Take the file name (word characters followed by ".pgn") from the URL.
        filename = re.search(r"\w+\.pgn", game_url).group()
        path = os.path.join(destination_path, filename)

        # Stream the PGN to disk in chunks.
        response = session.get(game_url, stream=True)
        response.raise_for_status()
        with open(path, "wb") as f:
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)


    def main():
        host = "http://www.chessgames.com"
        url_to_scrape = host + "/perl/chesscollection?cid=1014492"
        destination_path = "pgns"
        max_workers = 8

        if not os.path.exists(destination_path):
            os.makedirs(destination_path)

        with requests.Session() as session:
            session.headers["User-Agent"] = (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/188.8.131.52 Safari/537.36"
            )

            # Collect the links to the individual game pages.
            response = session.get(url_to_scrape)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "lxml")
            pages = soup.find_all("a", href=re.compile(r"chessgame\?"))
            tasks = [
                (session, host, page.get("href"), destination_path)
                for page in pages
            ]

            # Download the games concurrently.
            with ThreadPoolExecutor(max_workers=max_workers) as pool:
                pool.map(download_pgn, tasks)


    if __name__ == "__main__":
        main()
I used response.iter_content here, which is unnecessary for such tiny text files, but it generalizes the code so that it also handles larger files in a memory-friendly way.
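The same chunked-writing idea can be shown without any network access. In this sketch, in-memory streams stand in for the HTTP response and the output file; `copy_in_chunks` is a hypothetical helper and the payload is made-up move text.

```python
import io


def copy_in_chunks(source, destination, chunk_size=1024):
    """Copy a binary stream piece by piece and return the bytes written."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes object signals end of stream
            break
        destination.write(chunk)
        total += len(chunk)
    return total


# Made-up payload standing in for a downloaded file.
payload = b"1. e4 e5 2. Nf3 Nc6 " * 500
source = io.BytesIO(payload)
destination = io.BytesIO()

written = copy_in_chunks(source, destination)
print(written == len(payload))  # → True
```

Only one chunk is held in memory at a time, so peak memory use depends on `chunk_size` rather than on the size of the file.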
Results from a rough benchmark (the first request is a bottleneck):