Creating a DataFrame from a beautifulsoup4 result does not work due to its structure
Question:
I get "ValueError: No tables found".
I try to scrape html from a website as follows:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import lxml
def getHTMLdocument(url):
    response = requests.get(url)
    return response.text
url_to_scrape = "https://website.com"
html_document = getHTMLdocument(url_to_scrape)
soup = BeautifulSoup(html_document, 'lxml')
table = soup.find_all('h3')
table
As output I get the following (which I am totally fine with):
<h3><a class="xyz-link" href="https//address1.com">address1</a></h3>,
<h3><a class="xyz-link" href="https//address2.com">address2</a></h3>,
...
After that I try using
df = pd.read_html(str(table))[0]
df
but I get "ValueError: No tables found". I think this is because of the structure of my BeautifulSoup result. I want to extract the addresses (e.g. https//address1.com) and also the associated link text (address1) into a DataFrame. Any ideas?
Answers:
You can try:
data = [{'Link': t['href'], 'Name': t.text} for t in soup.select('h3 > a')]
df = pd.DataFrame(data)
print(df)
# Output
Link Name
0 https//address1.com address1
1 https//address2.com address2
Rather than extracting all <a> or <h3> tags separately, we select only the <a> tags that are direct children of an <h3> tag, using the select method (which takes a CSS selector, here h3 > a) instead of find_all. The rest is a list comprehension that builds a dict from the href attribute and the link text of each match. Finally, pd.DataFrame creates a DataFrame from that list of dicts.
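For reference, here is the whole approach run end-to-end against the sample markup from the question, with placeholder URLs standing in for the real site (the question's own URLs are redacted, so these are assumptions):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Sample markup mirroring the structure shown in the question
html = """
<h3><a class="xyz-link" href="https://address1.com">address1</a></h3>
<h3><a class="xyz-link" href="https://address2.com">address2</a></h3>
"""

# html.parser is the stdlib parser; 'lxml' works equally if installed
soup = BeautifulSoup(html, "html.parser")

# CSS selector 'h3 > a': <a> tags that are direct children of an <h3>
data = [{"Link": a["href"], "Name": a.text} for a in soup.select("h3 > a")]
df = pd.DataFrame(data)
print(df)
```

Note that pd.read_html only parses <table> elements, which is why it raised "ValueError: No tables found" on this markup; for non-table tags, extracting the values yourself and passing a list of dicts to pd.DataFrame is the usual route.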