Retrieving all hrefs in anchor tags
Question:
import warnings
import requests
import numpy as np
from datetime import datetime
import json
from bs4 import BeautifulSoup

warnings.filterwarnings('ignore')

url = "https://understat.com/league/EPL/2022"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

for link in soup.find_all("a", class_="match-info"):
    href = link.get("href")
    print(href)
Unfortunately, this code does not find any results. The desired results are the hrefs in this part of the webpage:
<a class="match-info" data-isresult="true" href="match/18265">
Any ideas?
Answers:
The data you see on the page is encoded inside a <script> element, so BeautifulSoup doesn't see it. To decode it and load it into a pandas DataFrame, you can use the following example:
import re
import json
import requests
import pandas as pd

url = "https://understat.com/league/EPL/2022"
html_doc = requests.get(url).text

# Extract the string handed to JSON.parse('...') that holds the match data:
data = re.search(r"datesData\s*=\s*JSON\.parse\('(.*?)'\)", html_doc).group(1)
# Decode the \xNN hex escapes the page uses inside that string:
data = re.sub(r'\\x([0-9a-fA-F]{2})', lambda g: chr(int(g.group(1), 16)), data)
data = json.loads(data)
all_data = []
for d in data:
    all_data.append({
        'Team 1': d['h']['title'],
        'Team 2': d['a']['title'],
        'Goals': f'{d["goals"]["h"]} - {d["goals"]["a"]}',
        'Date': d['datetime'],
        'xG': [d['xG']['h'], d['xG']['a']],
        'forecast': list(d.get('forecast', {}).values())
    })

df = pd.DataFrame(all_data)
print(df)
Prints:
               Team 1                   Team 2  Goals                 Date                    xG                  forecast
0      Crystal Palace                  Arsenal  0 - 2  2022-08-05 19:00:00    [1.20637, 1.43601]  [0.2864, 0.2912, 0.4224]
1              Fulham                Liverpool  2 - 2  2022-08-06 11:30:00    [1.26822, 2.34111]  [0.1225, 0.2133, 0.6642]
2         Bournemouth              Aston Villa  2 - 0  2022-08-06 14:00:00  [0.588341, 0.488895]   [0.3213, 0.4397, 0.239]
3               Leeds  Wolverhampton Wanderers  2 - 1  2022-08-06 14:00:00    [0.88917, 1.10119]  [0.2798, 0.3166, 0.4036]
4    Newcastle United        Nottingham Forest  2 - 0  2022-08-06 14:00:00    [1.8591, 0.235825]  [0.8023, 0.1695, 0.0282]
5           Tottenham              Southampton  4 - 1  2022-08-06 14:00:00    [1.6172, 0.386546]  [0.7002, 0.2209, 0.0789]
6             Everton                  Chelsea  0 - 1  2022-08-06 16:30:00   [0.541983, 1.92315]    [0.06, 0.1717, 0.7683]
7   Manchester United                 Brighton  1 - 2  2022-08-07 13:00:00     [1.42103, 1.7289]      [0.281, 0.269, 0.45]
8           Leicester                Brentford  2 - 2  2022-08-07 13:00:00  [0.455695, 0.931067]  [0.1615, 0.3491, 0.4894]
...and so on.
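Since the original goal was the hrefs themselves: each decoded match dict also carries the match id, so the "match/<id>" links can be rebuilt directly from the decoded data. A minimal sketch, assuming each dict has an 'id' key (the sample list below is a stand-in for the real decoded data, using the id from the question):

```python
# Sketch: rebuild "match/<id>" hrefs from decoded match dicts.
# Assumption: each match dict carries an 'id' key.
def match_hrefs(matches):
    return [f"match/{m['id']}" for m in matches if 'id' in m]

# Stand-in for the decoded data:
sample = [{'id': '18265'}, {'id': '18266'}]
print(match_hrefs(sample))  # ['match/18265', 'match/18266']
```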
The problem is that those anchors on that page are generated by JavaScript; they are not part of the response retrieved by requests.get.
You could use a browser-automation tool like Selenium to fetch the page and render its full content, as an alternative to using requests. Because Selenium controls a real browser, it executes the JavaScript that produces the HTML you're expecting.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()

url = ...
driver.get(url)
rendered_html = driver.page_source
driver.quit()

soup = BeautifulSoup(rendered_html, "html.parser")
for link in soup.find_all("a", class_="match-info"):
    href = link.get("href")
    print(href)
Alternatively, you could inspect the contents of the page source as returned by requests.get (not the rendered HTML) and rework your parsing based on that content. If the JavaScript makes additional server requests to render the page, you'll have to account for those as well.
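One way to rework the parsing along those lines is to extract the JSON.parse('...') payload from the raw source and decode its \xNN hex escapes, which you can exercise on a small stand-in before pointing it at the live page. A sketch (the inline html_doc below is a hypothetical stand-in for requests.get(url).text; the real page embeds the data the same way, just with a much larger payload):

```python
import json
import re

# Stand-in for the raw, un-rendered HTML; the payload decodes to [{"id":"18265"}]:
html_doc = r"<script>var datesData = JSON.parse('\x5B\x7B\x22id\x22:\x2218265\x22\x7D\x5D');</script>"

# Grab the string handed to JSON.parse:
raw = re.search(r"datesData\s*=\s*JSON\.parse\('(.*?)'\)", html_doc).group(1)
# Decode the \xNN hex escapes into real characters:
decoded = re.sub(r'\\x([0-9a-fA-F]{2})', lambda g: chr(int(g.group(1), 16)), raw)
print(json.loads(decoded))  # [{'id': '18265'}]
```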