Why can't BeautifulSoup detect the table on this website?
Question:
I tried to scrape the table from "https://racing.hkjc.com/racing/information/English/Jockey/JockeyRanking.aspx" into an Excel sheet using BeautifulSoup and pandas. This is my code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Send a GET request to the URL
url = "https://racing.hkjc.com/racing/information/English/Jockey/JockeyRanking.aspx"
response = requests.get(url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Find the table element and extract the data
table = soup.find("table", {"class": "table_bd "})
if table is None:
    print("Table not found.")
else:
    df = pd.read_html(str(table))[0]
    # Save the data to an Excel spreadsheet
    df.to_excel("hkjc.xlsx", index=False)
For some reason, it prints "Table not found.", but there is clearly a table on the website.
Can someone explain why this is happening and how the code can be changed to fix it?
Answers:
The issue might be that the class name you are searching for is "table_bd " with a trailing space, which will not match a table whose class is "table_bd". Try removing the trailing space in your find call, like so:
table = soup.find("table", {"class": "table_bd"})
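To see why the trailing space matters, here is a minimal offline sketch. The HTML snippet is made up for illustration (it is not the real HKJC markup); it only shows that BeautifulSoup compares the search string against the class name exactly, so "table_bd " does not match a table whose class is "table_bd".

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only, not copied from the HKJC page.
sample_html = '<table class="table_bd"><tr><td>1</td></tr></table>'
soup = BeautifulSoup(sample_html, "html.parser")

# The trailing space makes the search string differ from the class name.
with_space = soup.find("table", {"class": "table_bd "})
# Without the space, the search string equals the class name.
without_space = soup.find("table", {"class": "table_bd"})

print(with_space is None)
print(without_space is not None)
```

Note that this only rules out one cause: if the table is not present in the downloaded HTML at all, fixing the class name alone will still give "Table not found."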
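You can use Playwright to scrape the content, since requests won’t be enough to get the table data: requests only downloads the HTML the server returns and does not run the page's JavaScript, whereas Playwright loads the page in a real browser first. The HTML parsing can still be handled by BeautifulSoup.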
import pandas as pd
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

url = "https://racing.hkjc.com/racing/information/English/Jockey/JockeyRanking.aspx"

with sync_playwright() as pw:
    browser = pw.chromium.launch()
    page = browser.new_page()
    page.goto(url, wait_until="networkidle")
    soup = BeautifulSoup(page.content(), "html.parser")

table = soup.select_one(".table_bd")
if table is None:
    print("Table not found.")
else:
    df = pd.read_html(str(table))[0]
    # Save the data to an Excel spreadsheet
    df.to_excel("hkjc.xlsx", index=True)
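Before reaching for a browser, it can help to confirm that the table really is missing from the HTML that requests receives. The sketch below is one way to check. `contains_table` is a hypothetical helper, and both HTML strings are invented stand-ins (one for a server response with an empty placeholder, one for the DOM after scripts have run), not the real page.

```python
from bs4 import BeautifulSoup


def contains_table(html: str, css_class: str) -> bool:
    """Return True if the HTML contains a <table> with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("table", class_=css_class) is not None


# Invented stand-in for what a server might send before any script runs.
static_html = '<div id="ranking"></div><script src="load.js"></script>'
# Invented stand-in for the DOM after a browser has executed the scripts.
rendered_html = (
    '<div id="ranking"><table class="table_bd">'
    "<tr><td>1</td></tr></table></div>"
)

print(contains_table(static_html, "table_bd"))
print(contains_table(rendered_html, "table_bd"))
```

With the real page you would pass `response.text` from requests and `page.content()` from Playwright. If only the second call returns True, the table is most likely being added by JavaScript after the initial load.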