Save HTML to a file to work with later using Beautiful Soup
Question:
I am doing a lot of work with Beautiful Soup. However, my supervisor does not want me doing the work "in real time" from the web. Instead, he wants me to download all the text from a webpage and then work on it later. He wants to avoid repeated hits on a website.
Here is my code:
import requests
from bs4 import BeautifulSoup
url = 'https://scholar.google.com/citations?user=XpmZBggAAAAJ'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
I am unsure whether I should save "page" as a file and then import that into Beautiful Soup, or whether I should save "soup" as a file to open later. I also do not know how to save this as a file in a way that can be accessed as if it were "live" from the internet. I know almost nothing about Python, so I need the absolute easiest and simplest process for this.
Answers:
So saving soup would be tough, and out of my experience (read more about pickling if interested). You can save the page as follows:
page = requests.get(url)
with open('path/to/saving.html', 'wb+') as f:
    f.write(page.content)
Then later, when you want to do analysis on it:
with open('path/to/saving.html', 'rb') as f:
    soup = BeautifulSoup(f.read(), 'lxml')
Something like that, anyway.
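A minimal end-to-end sketch of that save/load round trip. The inline html_bytes string stands in for page.content (so the example runs without hitting the network), and the stdlib 'html.parser' is used in place of 'lxml' so no extra install is needed:

```python
from bs4 import BeautifulSoup

# Stand-in for page.content -- in practice this comes from requests.get(url)
html_bytes = b"<html><head><title>Cached page</title></head><body><p>Hello</p></body></html>"

# Save the raw bytes exactly as received
with open("saving.html", "wb") as f:
    f.write(html_bytes)

# Later session: read the file back and parse it as if it were live
with open("saving.html", "rb") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

print(soup.title.string)  # -> Cached page
```

Saving the raw bytes (rather than decoded text) means you sidestep any encoding guesswork when you re-parse later.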
The following code iterates over url_list and saves all the responses into the list all_pages, which is then stored in the responses.pickle file.
import pickle
import requests
from bs4 import BeautifulSoup
all_pages = []
for url in url_list:
    all_pages.append(requests.get(url))

with open("responses.pickle", "wb") as f:
    pickle.dump(all_pages, f)
Then later on, you can load this data, "soupify" each response and do whatever you need with it.
with open("responses.pickle", "rb") as f:
    all_pages = pickle.load(f)

for page in all_pages:
    soup = BeautifulSoup(page.text, 'lxml')
    # do stuff
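A variation worth considering: instead of pickling the Response objects themselves, pickle just the HTML strings keyed by URL. This keeps the pickle small and avoids depending on Response internals being stable across requests versions. A sketch with inline strings standing in for requests.get(url).text:

```python
import pickle

# Stand-in for {url: requests.get(url).text} -- pickling plain strings
# avoids relying on Response objects surviving a pickle round trip
pages = {
    "https://example.com/a": "<html><body><h1>A</h1></body></html>",
    "https://example.com/b": "<html><body><h1>B</h1></body></html>",
}

with open("responses.pickle", "wb") as f:
    pickle.dump(pages, f)

# Later session: load the dict and soupify each entry as needed
with open("responses.pickle", "rb") as f:
    restored = pickle.load(f)

print(restored == pages)  # -> True
```

From here, BeautifulSoup(restored[url], 'lxml') gives you the same soup you would have built live.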
Working with our request:
url = 'https://scholar.google.com/citations?user=XpmZBggAAAAJ'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
you can also save the parsed HTML like this (note that prettify() is a method of the soup, not of the response):
with open("path/page.html", "w") as f:
    f.write(soup.prettify())
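One caveat with writing in text mode: you should pass an explicit encoding (ideally matching page.encoding) so the file re-reads cleanly. A sketch, with an inline string standing in for the downloaded page:

```python
from bs4 import BeautifulSoup

# Stand-in for the fetched page text
html = "<html><body><p>saved earlier</p></body></html>"

# Write and read with an explicit encoding so both sides agree
with open("page.html", "w", encoding="utf-8") as f:
    f.write(html)

with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

print(soup.p.string)  # -> saved earlier
```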