Creating dataframe from beautifulsoup4 result does not work due to structure

Question:

I get "ValueError: No tables found".

I am trying to scrape HTML from a website as follows:

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import lxml

def getHTMLdocument(url):
    response = requests.get(url)
    return response.text


url_to_scrape = "https://website.com"
html_document = getHTMLdocument(url_to_scrape)

soup = BeautifulSoup(html_document, 'lxml')
table = soup.find_all('h3')

table

As output I get the following (which is what I expect):

<h3><a class="xyz-link" href="https//address1.com">address1</a></h3>,
<h3><a class="xyz-link" href="https//address2.com">address2</a></h3>,
...

After that I try using

df = pd.read_html(str(table))[0]
df

but I get "ValueError: No tables found". I think this is because of the structure of my BeautifulSoup result. I want to extract the addresses (e.g. https//address1.com) and also the link text that follows (address1) into a dataframe. Any ideas?

Asked By: question12

||

Answers:

You can try:

data = [{'Link': t['href'], 'Name': t.text} for t in soup.select('h3 > a')]
df = pd.DataFrame(data)
print(df)

# Output
                  Link      Name
0  https//address1.com  address1
1  https//address2.com  address2

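pd.read_html only parses <table> elements, and your result contains only <h3> and <a> tags, which is why it raises "ValueError: No tables found". Rather than collecting every <h3> with find_all, use the select method with the CSS selector 'h3 > a', which returns each <a> tag that is a direct child of an <h3>. The list comprehension then builds one dict per link from its href attribute and its text. Finally, pd.DataFrame turns that list of dicts into a DataFrame.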
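
For reference, here is a minimal self-contained sketch of the same approach that runs without a network request. The inline HTML string, the "https://" URLs, and the extra heading without a link are made-up stand-ins for the real page, and the built-in "html.parser" is used only so the snippet does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML standing in for the page returned by requests.get(url).text
html_document = """
<h3><a class="xyz-link" href="https://address1.com">address1</a></h3>
<h3><a class="xyz-link" href="https://address2.com">address2</a></h3>
<h3>heading without a link</h3>
"""

soup = BeautifulSoup(html_document, "html.parser")

# 'h3 > a' matches only <a> tags that are direct children of an <h3>,
# so the heading without a link contributes no row.
rows = [
    {"Link": a.get("href"), "Name": a.get_text(strip=True)}
    for a in soup.select("h3 > a")
]

# Passing columns explicitly keeps the column layout even if rows is empty.
df = pd.DataFrame(rows, columns=["Link", "Name"])
print(df)
```

Using a.get("href") instead of a["href"] is a defensive choice: it gives None for an <a> tag that has no href rather than raising a KeyError.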

Answered By: Corralien