BeautifulSoup not extracting an <a> tag from a dynamic website
Question:
I have a website https://dip.bundestag.de/aktivit%C3%A4t/Dr--Holger-Becker-MdB-SPD/1628877 and I want to extract the HTML connected to "BT-Plenarprotokoll 20/86, S. 10313C". The HTML chunk is:
<a title="PDF Bundestags-Plenarprotokoll öffnen" aria-label="BT-Plenarprotokoll" href="https://dserver.bundestag.de/btp/20/20086.pdf#P.10313" target="_self" class="hsbfb4-0 sc-1xaeas4-1 hTYfHF FZiNn"><svg viewBox="0 0 10 12" class="sc-1c5ggr5-17 cYBAUx"><g stroke="currentColor" fill="none" fill-rule="evenodd"><path d="M6.14.5H.5v11h9V3.86z"></path><path d="M5.56 2.01v2.51H9.5"></path></g></svg><span class="sc-1xaeas4-3 iZuhXx">BT-Plenarprotokoll 20/86, S. 10313C</span></a>
For some reason, BeautifulSoup does not recognize any tags on this webpage. I tried two different snippets:
import requests
from bs4 import BeautifulSoup

url = "https://dip.bundestag.de/aktivit%C3%A4t/Dr--Holger-Becker-MdB-SPD/1628877"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Find the anchor tag with matching title and aria-label attributes
a_tag = soup.find_all('a', {'title': 'PDF Bundestags-Plenarprotokoll öffnen', 'aria-label': 'BT-Plenarprotokoll'})
and
import requests
from bs4 import BeautifulSoup

url = "https://dip.bundestag.de/aktivit%C3%A4t/Dr--Holger-Becker-MdB-SPD/1628877"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Find all anchor tags on the page
a_tag = soup.find_all('a')
In both cases, a_tag is an empty list, which I don't understand, since this webpage clearly has more than one link.
Answers:
The page is rendered client-side with JavaScript, so the raw HTML that requests downloads contains none of the links you see in the browser. Use their API to get the data instead:
Note: The API key is valid through May and can be found here.
For example:
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'Authorization': 'ApiKey GmEPb1B.bfqJLIhcGAsH9fTJevTglhFpCoZyAAAdhp',
}
url = "https://search.dip.bundestag.de/api/v1/aktivitaet?f.id=1628877"
documents = requests.get(url, headers=headers).json()["documents"]
print(documents[0]["fundstelle"]["pdf_url"])
Output:
https://dserver.bundestag.de/btp/20/20086.pdf#P.10313
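For context on why the scraping attempts returned nothing: a client-rendered app ships an almost empty HTML shell and injects the links with JavaScript, which requests/BeautifulSoup never execute. A small stdlib sketch with stand-in markup (the shell string below is illustrative, not the site's actual response):

```python
from html.parser import HTMLParser

# Stand-in for the kind of HTML shell a client-rendered page serves:
# no <a> elements at all, just a mount point and a script tag.
SHELL = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

class AnchorCounter(HTMLParser):
    """Count <a> start tags in a raw HTML string."""
    def __init__(self):
        super().__init__()
        self.anchors = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchors += 1

parser = AnchorCounter()
parser.feed(SHELL)
print(parser.anchors)  # 0 -- no anchors in the raw HTML, matching the question
```

This is why find_all('a') returns an empty list even though the rendered page is full of links.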
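If you need more than one document, the same lookup can be made defensive. A minimal sketch of pulling every PDF link out of a response shaped like the one above (only the "documents" → "fundstelle" → "pdf_url" field names from the answer's snippet are assumed):

```python
def extract_pdf_urls(payload):
    """Return the fundstelle PDF URL of each document in an API response dict."""
    urls = []
    for doc in payload.get("documents", []):
        # Not every document is guaranteed to carry a fundstelle with a PDF.
        url = doc.get("fundstelle", {}).get("pdf_url")
        if url:
            urls.append(url)
    return urls

# Stand-in for a real API response, built from the href in the question:
sample = {
    "documents": [
        {"fundstelle": {"pdf_url": "https://dserver.bundestag.de/btp/20/20086.pdf#P.10313"}}
    ]
}
print(extract_pdf_urls(sample))
```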