Save HTML to a file to work with later using Beautiful Soup
Question:
I am doing a lot of work with Beautiful Soup. However, my supervisor does not want me doing the work "in real time" from the web. Instead, he wants me to download all the text from a webpage and then work on it later. He wants to avoid repeated hits on a website.
Here is my code:
import requests
from bs4 import BeautifulSoup
url = 'https://scholar.google.com/citations?user=XpmZBggAAAAJ'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
I am unsure whether I should save "page" as a file and then import that into Beautiful Soup, or whether I should save "soup" as a file to open later. I also do not know how to save this as a file in a way that can be accessed as if it were "live" from the internet. I know almost nothing about Python, so I need the absolute easiest and simplest process for this.
Answers:
So saving soup would be tough, and out of my experience (read more about pickling if interested). You can save the page as follows:
page = requests.get(url)
with open('path/to/saving.html', 'wb+') as f:
    f.write(page.content)
Then later, when you want to do analysis on it:
with open('path/to/saving.html', 'rb') as f:
    soup = BeautifulSoup(f.read(), 'lxml')
Something like that, anyway.
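A minimal end-to-end sketch of that save/load round trip. The inline html_bytes string stands in for page.content (so the example runs without hitting the network), and the stdlib 'html.parser' is used in place of 'lxml' so no extra install is needed:

```python
from bs4 import BeautifulSoup

# Stand-in for page.content -- in practice this comes from requests.get(url)
html_bytes = b"<html><head><title>Cached page</title></head><body><p>Hello</p></body></html>"

# Save the raw bytes exactly as received
with open("saving.html", "wb") as f:
    f.write(html_bytes)

# Later session: read the file back and parse it as if it were live
with open("saving.html", "rb") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

print(soup.title.string)  # -> Cached page
```

Saving the raw bytes (rather than decoded text) means you sidestep any encoding guesswork when you re-parse later.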
The following code iterates over url_list and saves all the responses into the list all_pages, which is then stored in the responses.pickle file.
import pickle
import requests
from bs4 import BeautifulSoup
all_pages = []
for url in url_list:
    all_pages.append(requests.get(url))

with open("responses.pickle", "wb") as f:
    pickle.dump(all_pages, f)
Then later on, you can load this data, "soupify" each response and do whatever you need with it.
with open("responses.pickle", "rb") as f:
    all_pages = pickle.load(f)

for page in all_pages:
    soup = BeautifulSoup(page.text, 'lxml')
    # do stuff
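A variation worth considering: instead of pickling the Response objects themselves, pickle just the HTML strings keyed by URL. This keeps the pickle small and avoids depending on Response internals being stable across requests versions. A sketch with inline strings standing in for requests.get(url).text:

```python
import pickle

# Stand-in for {url: requests.get(url).text} -- pickling plain strings
# avoids relying on Response objects surviving a pickle round trip
pages = {
    "https://example.com/a": "<html><body><h1>A</h1></body></html>",
    "https://example.com/b": "<html><body><h1>B</h1></body></html>",
}

with open("responses.pickle", "wb") as f:
    pickle.dump(pages, f)

# Later session: load the dict and soupify each entry as needed
with open("responses.pickle", "rb") as f:
    restored = pickle.load(f)

print(restored == pages)  # -> True
```

From here, BeautifulSoup(restored[url], 'lxml') gives you the same soup you would have built live.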
Working with our request:
url = 'https://scholar.google.com/citations?user=XpmZBggAAAAJ'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
you can also save the parsed HTML like this (note that prettify() is a method of the soup, not of the response):
with open("path/page.html", "w") as f:
    f.write(soup.prettify())
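One caveat with writing in text mode: you should pass an explicit encoding (ideally matching page.encoding) so the file re-reads cleanly. A sketch, with an inline string standing in for the downloaded page:

```python
from bs4 import BeautifulSoup

# Stand-in for the fetched page text
html = "<html><body><p>saved earlier</p></body></html>"

# Write and read with an explicit encoding so both sides agree
with open("page.html", "w", encoding="utf-8") as f:
    f.write(html)

with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

print(soup.p.string)  # -> saved earlier
```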