Why can't BeautifulSoup detect the table on this website?

Question:

I tried to scrape the table from "https://racing.hkjc.com/racing/information/English/Jockey/JockeyRanking.aspx" into an Excel sheet with BeautifulSoup and pandas. This is my code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Send a GET request to the URL
url = "https://racing.hkjc.com/racing/information/English/Jockey/JockeyRanking.aspx"
response = requests.get(url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Find the table element and extract the data
table = soup.find("table", {"class": "table_bd "})
if table is None:
    print("Table not found.")
else:
    df = pd.read_html(str(table))[0]
    # Save the data to an Excel spreadsheet
    df.to_excel("hkjc.xlsx", index=False)

For some reason, it prints "Table not found.", but there is clearly a table on the website.

Can someone explain why this is happening, and how the code can be changed to fix it?

Asked By: Nicholas Chan

||

Answers:

The issue might be that the class name you pass for the table is "table_bd " with a trailing space. Try removing the trailing space in your find call, like so:

table = soup.find("table", {"class": "table_bd"})
Answered By: sketchtheme
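
As a quick check of the trailing-space suggestion above, the following minimal sketch runs both lookups against a small inline HTML snippet. The snippet and its contents are made up for illustration (it is not the real HKJC markup), and the behaviour shown is what BeautifulSoup does as far as I can tell: a class string is compared against each individual class and against the whole attribute value, and "table_bd " with a trailing space equals neither.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; NOT the real HKJC markup.
html = """
<table class="table_bd f_tac">
  <tr><th>Jockey</th><th>Wins</th></tr>
  <tr><td>A. Rider</td><td>10</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# With the trailing space, neither the single class "table_bd" nor the
# full value "table_bd f_tac" equals "table_bd ", so nothing is found.
with_space = soup.find("table", {"class": "table_bd "})

# Without it, the lookup matches any table that has the class "table_bd".
without_space = soup.find("table", {"class": "table_bd"})

print(with_space is None)
print(without_space is not None)
```

If the lookup without the space still returns None on the live page, the table is probably not present in the HTML that requests received.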

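You can use Playwright to scrape the content, since requests alone won't be enough to get the table data. requests only downloads the initial HTML and does not execute JavaScript, so a table that the page builds in the browser never appears in the response. The HTML parsing can still be handled by BeautifulSoup.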

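from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

url = "https://racing.hkjc.com/racing/information/English/Jockey/JockeyRanking.aspx"

with sync_playwright() as pw:
    # Launch a headless Chromium browser so the page's JavaScript runs
    browser = pw.chromium.launch()
    page = browser.new_page()
    # Wait until network activity settles so the table has time to load
    page.goto(url, wait_until="networkidle")
    # Hand the rendered HTML to BeautifulSoup
    soup = BeautifulSoup(page.content(), "html.parser")
    table = soup.select_one(".table_bd")
    browser.close()

    if table is None:
        print("Table not found.")
    else:
        # Wrap the markup in StringIO; newer pandas versions deprecate raw HTML strings
        df = pd.read_html(StringIO(str(table)))[0]
        # Save the data to an Excel spreadsheet
        df.to_excel("hkjc.xlsx", index=True)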
Answered By: Joshua
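
To tell which of the two explanations applies to a given page, it can help to check whether the table is present in the raw HTML at all. The sketch below is only an illustration: `has_table` is a hypothetical helper name, and the two HTML strings are made-up stand-ins for a static response and a JavaScript-driven one, not real HKJC output.

```python
from bs4 import BeautifulSoup


def has_table(html: str, selector: str = "table.table_bd") -> bool:
    """Return True if the HTML contains an element matching the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one(selector) is not None


# Made-up examples standing in for real responses.
static_html = '<div><table class="table_bd"><tr><td>1</td></tr></table></div>'
js_only_html = '<div id="app"></div><script>/* table is built here */</script>'

print(has_table(static_html))
print(has_table(js_only_html))
```

Passing `requests.get(url).text` and Playwright's `page.content()` to the same helper shows whether the browser step is actually needed for the page in question.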