Scrape HTML with Beautiful Soup

Question:

I’m having trouble scraping content from the following page: https://pregame.com/game-center/171763/consensus-archive. I’m using Beautiful Soup and only getting back snippets of the HTML, without any of the data that I can clearly see embedded in the code.

This is the latest iteration of code I’ve used after a handful of attempts (this attempt only grabs the dates column, but I want to grab the entire table)…

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://pregame.com/game-center/171763/consensus-archive'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')

# Container that holds the consensus table
results = soup.find(class_="pg-move-list")
print(results.prettify())

# Print each date cell, then the text of the <p> inside it
dates = results.find_all("td", class_="pg-col pg-col--date")
for date in dates:
    print(date, end="\n" * 2)
for date in dates:
    data_date = date.find("p", class_="pg-col-data")
    print(data_date.text)

The HTML as well…

[screenshot of the page’s HTML]

I realize there are many similar questions here and elsewhere on the web, but I’m still stuck after referencing them. Thank you in advance for the help.

Asked By: Abb


Answers:

The data is generated dynamically and loaded from an external source via an API. BeautifulSoup can’t parse/render JavaScript, which is why you’re only getting the static portion of the HTML. You can request that API endpoint directly instead.

Example:

import pandas as pd
import requests

# JSON endpoint the page calls for the consensus history data
api_url = 'https://pregame.com/api/gamecenter/consensushistory?e=171763&s=40&r=1000&a=1&c=1&t=693'
r = requests.get(api_url)

# The 'Items' list in the JSON response holds the table rows
df = pd.DataFrame(r.json()['Items'])
print(df)

Output:

         Id                  DateTime   Odds  ...  IsPickActionChanged  PickAction  PickPercentage
0    60149470   2021-10-17T12:39:05.18Z    +10  ...                False         172              86
1    60147744  2021-10-17T12:16:32.793Z    +10  ...                False         169              86
2    60146757   2021-10-17T12:00:41.64Z    +10  ...                False         162              86
3    60146458  2021-10-17T11:55:49.823Z    +10  ...                False         162              86
4    60146333  2021-10-17T11:53:50.477Z    +10  ...                False         162              86
..        ...                       ...    ...  ...                  ...         ...             ...
130  59716689  2021-10-12T17:41:27.397Z    +10  ...                False          14              82
131  59716636   2021-10-12T17:40:44.01Z    +10  ...                False          14              82
132  59716531  2021-10-12T17:39:28.603Z    +10  ...                False          14              82
133  59715523  2021-10-12T17:24:22.067Z    +10  ...                False          13              81
134  59655757  2021-10-11T01:02:33.873Z  Other  ...                 True           1             100

[135 rows x 12 columns]
Answered By: F.Hoque
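
Building on that answer, here is a minimal sketch of how the same request could be parameterised with requests’ params argument and the timestamps parsed with pandas. The fetch_consensus helper name is hypothetical, and the meaning of the query parameters (beyond e matching the game id in the page URL) is an assumption:

import pandas as pd
import requests

# Hypothetical helper; the parameter names are copied from the API URL above,
# and their exact meaning (beyond e = game/event id) is an assumption.
def fetch_consensus(event_id):
    api_url = 'https://pregame.com/api/gamecenter/consensushistory'
    params = {'e': event_id, 's': 40, 'r': 1000, 'a': 1, 'c': 1, 't': 693}
    r = requests.get(api_url, params=params)
    r.raise_for_status()  # fail loudly on a bad response
    df = pd.DataFrame(r.json()['Items'])
    # Convert the ISO timestamp strings into real datetimes for sorting/filtering
    df['DateTime'] = pd.to_datetime(df['DateTime'])
    return df

df = fetch_consensus(171763)
print(df[['Id', 'DateTime', 'Odds', 'PickPercentage']].head())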

In case this is helpful to anyone: I created the following code, based on the answer above, to access the API and loop through a list of URLs.

import time

import pandas as pd
import requests
from tqdm import tqdm

# Read CSV file of API URLs (one per row in a 'URL' column)
URLs = pd.read_csv("URLs_1_250.csv")

# Convert column of URLs to a list
odds = URLs['URL'].tolist()

# Loop through list of URLs to scrape odds data, collecting each response
frames = []
for url in tqdm(odds):
    r = requests.get(url)
    frames.append(pd.DataFrame(r.json()['Items']))
    time.sleep(1)  # pause between requests

# Combine everything into a single dataframe
nfl_data = pd.concat(frames, ignore_index=True)
Answered By: Abb
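
As a follow-up to the looping approach above: some URLs in the list may error out or return no data, which would stop the loop partway through. Below is a minimal, hedged variation that skips those cases; the assumption that a response can fail or carry an empty 'Items' list is mine and is not confirmed by the answers above.

import time

import pandas as pd
import requests
from tqdm import tqdm

# Same input as above: a CSV with one API URL per row in a 'URL' column
odds = pd.read_csv("URLs_1_250.csv")['URL'].tolist()

frames = []
for url in tqdm(odds):
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        items = r.json().get('Items', [])
    except (requests.RequestException, ValueError):
        # Skip URLs that fail or do not return JSON
        continue
    if items:
        frames.append(pd.DataFrame(items))
    time.sleep(1)  # pause between requests

nfl_data = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(nfl_data.shape)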