Retrieving all hrefs in anchor tags

Question:

import warnings
import numpy as np
from datetime import datetime
import json
import requests
from bs4 import BeautifulSoup

warnings.filterwarnings('ignore')

url = "https://understat.com/league/EPL/2022"
response = requests.get(url)

soup = BeautifulSoup(response.content, "html.parser")

for link in soup.find_all("a", class_="match-info"):
    href = link.get("href")
    print(href)

Unfortunately, this code does not find any results. The desired results are the hrefs in this part of the webpage:

<a class="match-info" data-isresult="true" href="match/18265">

Any ideas?

Asked By: Paul Corcoran


Answers:

The data you see on the page is encoded inside a <script> element, so BeautifulSoup doesn't see it as HTML. To decode it and load it into a pandas DataFrame, you can use the following example:

import re
import json
import requests
import pandas as pd


url = "https://understat.com/league/EPL/2022"
html_doc = requests.get(url).text

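# Grab the escaped JSON string passed to JSON.parse('...') for datesData
data = re.search(r"datesData\s*=\s*JSON\.parse\('(.*?)'\)", html_doc).group(1)
# Turn literal \xNN escapes back into the characters they encode
data = re.sub(r'\\x([\dA-F]{2})', lambda g: chr(int(g.group(1), 16)), data)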
data = json.loads(data)

all_data = []
for d in data:
    all_data.append({
        'Team 1': d['h']['title'],
        'Team 2': d['a']['title'],
        'Goals': f'{d["goals"]["h"]} - {d["goals"]["a"]}',
        'Date': d['datetime'],
        'xG': [d['xG']['h'], d['xG']['a']],
        'forecast': list(d.get('forecast', {}).values())
    })

df = pd.DataFrame(all_data)
print(df)

Prints:

                      Team 1                   Team 2        Goals                 Date                     xG                  forecast
0             Crystal Palace                  Arsenal        0 - 2  2022-08-05 19:00:00     [1.20637, 1.43601]  [0.2864, 0.2912, 0.4224]
1                     Fulham                Liverpool        2 - 2  2022-08-06 11:30:00     [1.26822, 2.34111]  [0.1225, 0.2133, 0.6642]
2                Bournemouth              Aston Villa        2 - 0  2022-08-06 14:00:00   [0.588341, 0.488895]   [0.3213, 0.4397, 0.239]
3                      Leeds  Wolverhampton Wanderers        2 - 1  2022-08-06 14:00:00     [0.88917, 1.10119]  [0.2798, 0.3166, 0.4036]
4           Newcastle United        Nottingham Forest        2 - 0  2022-08-06 14:00:00     [1.8591, 0.235825]  [0.8023, 0.1695, 0.0282]
5                  Tottenham              Southampton        4 - 1  2022-08-06 14:00:00     [1.6172, 0.386546]  [0.7002, 0.2209, 0.0789]
6                    Everton                  Chelsea        0 - 1  2022-08-06 16:30:00    [0.541983, 1.92315]    [0.06, 0.1717, 0.7683]
7          Manchester United                 Brighton        1 - 2  2022-08-07 13:00:00      [1.42103, 1.7289]      [0.281, 0.269, 0.45]
8                  Leicester                Brentford        2 - 2  2022-08-07 13:00:00   [0.455695, 0.931067]  [0.1615, 0.3491, 0.4894]

...and so on.
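
Since the question asks for the hrefs, the same decoded data could be used to rebuild them. The following is only a minimal sketch: it assumes each decoded record carries an 'id' key and that the link has the form match/<id>, as in the anchor quoted in the question. The helper names and the sample string are hypothetical and use only the standard library.

```python
import json
import re


def decode_hex_escapes(text):
    """Replace backslash-x hex escapes with the characters they encode."""
    return re.sub(r"\\x([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), text)


def match_hrefs(matches):
    """Build relative links, assuming each record has an 'id' key (an assumption)."""
    return [f"match/{m['id']}" for m in matches if "id" in m]


# Hypothetical sample shaped like the escaped payload; not real site data.
sample = r"\x5B\x7B\x22id\x22\x3A\x2218265\x22\x7D\x5D"
records = json.loads(decode_hex_escapes(sample))
print(match_hrefs(records))
```

With the real page, `records` would be the `data` list produced by `json.loads(data)` above, provided the 'id' assumption holds.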
Answered By: Andrej Kesely

The problem is that those anchors are generated by JavaScript. They are not part of the response retrieved by requests.get.

As an alternative to requests, you could use a browser automation tool like Selenium to load the page and render its full content. Because Selenium controls a real browser, it will execute the JS that renders the HTML you're expecting.

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
url = ...
driver.get(url)

# page_source is the DOM as currently rendered by the browser
rendered_html = driver.page_source
driver.quit()

soup = BeautifulSoup(rendered_html, "html.parser")

for link in soup.find_all("a", class_="match-info"):
    href = link.get("href")
    print(href)

Alternatively, you could inspect the contents of the page source as returned by requests.get (not the rendered HTML) and rework your parsing based on that content. If the JS makes additional server requests to render a page, you’ll have to consider that as well.
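
One way to do that inspection is to list the <script> blocks in the raw response and look for the one holding the data. The sketch below is only an illustration: it uses the standard-library html.parser, the class and function names are hypothetical, and the sample HTML is made up rather than taken from the site.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so append to the last entry
        if self._in_script:
            self.scripts[-1] += data


def scripts_mentioning(html, needle):
    """Return the bodies of <script> elements that contain the given text."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return [s.strip() for s in collector.scripts if needle in s]


# Made-up sample standing in for the raw (unrendered) page source.
sample_html = """
<html><body>
<div id="app"></div>
<script>var datesData = JSON.parse('[]');</script>
<script>console.log("unrelated");</script>
</body></html>
"""
hits = scripts_mentioning(sample_html, "JSON.parse")
print(len(hits), hits[0][:40])
```

Against the real page you would pass requests.get(url).text instead of sample_html and then decide how to parse whichever script body turns up.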

Answered By: sytech