BS4: Doesn't detect all tags with find_all
Question:
I'm trying to scrape this URL: https://baloncestoenvivo.feb.es/partido/2218269
I want to get all the divs with the class "box-datos-partido". When I try to get them with:
soup.find_all("div", class_="box-datos-partido")
I get only one of the two divs that are on the web page: the result is a list with a single element. The content of this element is:
<div class="box-datos-partido">
<div class="fecha">
<span class="label">Fecha</span>
<span class="txt">31/10/2021 - 12:00</span>
</div>
<div class="arbitros">
<span class="label">Árbitros</span>
<span class="txt referee">DIAZ DE SARRALDE MARTIN, IÑIGO</span>
<span class="txt referee">SANCHEZ NUÑEZ, UNAI</span>
<span class="txt referee"></span>
</div>
<div class="pista">
<span class="label">Pista</span>
<span class="txt pabellon">POLIDEPORTIVO URRETA</span>
<span class="txt direccion">Galdakao (Vizcaya)</span>
</div>
</div>
I should instead receive a list with two elements. The content of these two elements should be:
<div class="box-datos-partido">
<div class="fecha">
<span class="label">Fecha</span>
<span class="txt">31-10-2021 - 12:00</span>
</div>
<div class="arbitros">
<span class="label">Árbitros</span>
<span class="txt referee">DIAZ DE SARRALDE MARTIN, IÑIGO</span><span class="txt referee">SANCHEZ NUÑEZ, UNAI</span><span class="txt referee"></span>
</div>
<div class="pista">
<span class="label">Pista</span>
<span class="txt pabellon">POLIDEPORTIVO URRETA</span><span class="txt direccion">BIZKAIA KALEA, S/N, Vizcaya (Galdakao)</span>
</div>
</div>
<div class="box-datos-partido">
<div class="fecha">
<span class="label">Fecha</span>
<span class="txt">31/10/2021 - 12:00</span>
</div>
<div class="arbitros">
<span class="label">Árbitros</span>
<span class="txt referee">DIAZ DE SARRALDE MARTIN, IÑIGO</span>
<span class="txt referee">SANCHEZ NUÑEZ, UNAI</span>
<span class="txt referee"></span>
</div>
<div class="pista">
<span class="label">Pista</span>
<span class="txt pabellon">POLIDEPORTIVO URRETA</span>
<span class="txt direccion">Galdakao (Vizcaya)</span>
</div>
</div>
How is that possible? What am I doing wrong that I get only one of the two elements?
Answers:
The data you see is loaded via JavaScript from an external URL. To load it, you can use the requests
module (this example loads the players into two pandas DataFrames):
import json
import requests
import pandas as pd
# Note: the bearer token below is a JWT with an expiry (exp) claim, so it stops working once it expires.
headers = {
"Authorization": "Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6ImQzOWE5MzlhZTQyZmFlMTM5NWJjODNmYjcwZjc1ZDc3IiwidHlwIjoiSldUIn0.eyJuYmYiOjE2NTkyNjM1MDUsImV4cCI6MTY1OTM0OTkwNSwiaXNzIjoiaHR0cHM6Ly9pbnRyYWZlYi5mZWIuZXMvaWRlbnRpdHkuYXBpIiwiYXVkIjpbImh0dHBzOi8vaW50cmFmZWIuZmViLmVzL2lkZW50aXR5LmFwaS9yZXNvdXJjZXMiLCJsaXZlc3RhdHMuYXBpIl0sImNsaWVudF9pZCI6ImJhbG9uY2VzdG9lbnZpdm9hcHAiLCJpZGFtYml0byI6IjEiLCJyb2xlIjpbIk92ZXJWaWV3IiwiVGVhbVN0YXRzIiwiU2hvdENoYXJ0IiwiUmFua2luZyIsIktleUZhY3RzIiwiQm94U2NvcmUiXSwic2NvcGUiOlsibGl2ZXN0YXRzLmFwaSJdfQ.YDVnzLhZAw8kzE2LLjiS8VZayY-sfUgqMN4zdnjROLImHRamOJ_Htz4ehK26QcpywfZmrD5iUWnFnRFJrJyZdhudOp09B0tmn4HnWs4JHcQBirUpdLi4oDqONctn1J31OktVhHYpAS36Fs-2KTjwHcgR4G-EQsA6vxjkLKYjw6we0oY5w1Q_GUqRmEvfDQY3b2a-VlFEcxMQBS6XFfEL4naSz84w9aW2e7UCnic_Mm4CHzN1RzitcBSiunQyINshQzg-1G4STARAZZjfaVZCP8SDB4bWeuaXYxkwX40vbisJD8mXFP1xN93THlIg-d0LNfZg8iqD0Lx8xRf9nRdXug"
}
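url = "https://intrafeb.feb.es/LiveStats.API/api/v1/BoxScore/2218269"
response = requests.get(url, headers=headers)
response.raise_for_status()  # stop on an HTTP error status (4xx/5xx) instead of parsing an error body
data = response.json()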
# uncomment to print all data:
# print(json.dumps(data, indent=4))
t1 = data["BOXSCORE"]["TEAM"][0]["PLAYER"]
t2 = data["BOXSCORE"]["TEAM"][1]["PLAYER"]
df1 = pd.DataFrame(t1)
df2 = pd.DataFrame(t2)
print(df1)
print(df2)
Prints:
p1m p1a p1p p2m p2a p2p p3m p3a p3p fgm fga fgp min minFormatted sta bs tc mt ro rd rt rf to st ind pllss val assist reb pf pts inn id no name logo
0 4 6 66,7 0 5 0,0 0 6 0,0 0 11 0,0 1812 30:12 None 0 0 0 0 3 3 5 6 1 None -1 None 1 3 1 4 1 2188507 0 J. ROYALE SACRISTAN https://competiciones.feb.es/estadisticas/Foto.aspx?c=2188507
1 0 0 0,0 0 5 0,0 0 0 0,0 0 5 0,0 1021 17:01 None 0 0 0 1 5 6 0 2 1 None -20 None 0 6 0 0 0 2188508 2 O. ARENAS DE LA HOZ https://competiciones.feb.es/estadisticas/Foto.aspx?c=2188508
2 0 0 0,0 1 2 50,0 0 1 0,0 1 3 33,3 1363 22:43 None 0 0 0 0 2 2 1 2 1 None -4 None 1 2 0 2 0 2277838 4 A. RAMASCO CERECERO https://competiciones.feb.es/estadisticas/Foto.aspx?c=2277838
...
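To see why find_all returns a single element, it helps to remember that BeautifulSoup parses only the HTML text it is given and never runs scripts. The sketch below uses a made-up page (the HTML string and the script inside it are illustrative assumptions, not the real source of the site): one div is in the static markup, and a second one would be appended by JavaScript in a browser.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML, standing in for what an HTTP client would download.
# In a browser the script would append a second div; BeautifulSoup does not run it.
STATIC_HTML = """
<html><body>
<div class="box-datos-partido"><span class="txt">31/10/2021 - 12:00</span></div>
<div id="app"></div>
<script>
  var box = document.createElement("div");
  box.className = "box-datos-partido";
  document.getElementById("app").appendChild(box);
</script>
</body></html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")
boxes = soup.find_all("div", class_="box-datos-partido")
print(len(boxes))  # only the div present in the static markup is found
```

The script body is kept as plain text by the parser, so the div it would create never shows up in the result.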
You are right that there are two divs with the class "box-datos-partido". But if you disable JavaScript in the browser, you will notice that the same selection matches only the first one, because the rest are loaded dynamically by JavaScript. If you want to pull them all, you can use a browser automation tool such as Selenium. Here I use Selenium with bs4 to grab the divs from the rendered HTML.
Example:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("./chromedriver")  # your chromedriver path
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
url = "https://baloncestoenvivo.feb.es/partido/2218269"
driver.get(url)
driver.maximize_window()
time.sleep(5)  # give the page's JavaScript time to load the remaining content
# Parse the rendered page, which now includes the divs inserted by JavaScript
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()  # close the browser once the HTML has been captured
for card in soup.select("div.box-datos-partido"):
    print(card.prettify())
Output:
<div class="box-datos-partido">
<div class="fecha">
<span class="label">
Fecha
</span>
<span class="txt">
31-10-2021 - 12:00
</span>
</div>
<div class="arbitros">
<span class="label">
Árbitros
</span>
<span class="txt referee">
DIAZ DE SARRALDE MARTIN, IÑIGO
</span>
<span class="txt referee">
SANCHEZ NUÑEZ, UNAI
</span>
<span class="txt referee">
</span>
</div>
<div class="pista">
<span class="label">
Pista
</span>
<span class="txt pabellon">
POLIDEPORTIVO URRETA
</span>
<span class="txt direccion">
BIZKAIA KALEA, S/N, Vizcaya (Galdakao)
</span>
</div>
</div>
<div class="box-datos-partido">
<div class="fecha">
<span class="label">
Fecha
</span>
<span class="txt">
31/10/2021 - 12:00
</span>
</div>
<div class="arbitros">
<span class="label">
Árbitros
</span>
<span class="txt referee">
DIAZ DE SARRALDE MARTIN, IÑIGO
</span>
<span class="txt referee">
SANCHEZ NUÑEZ, UNAI
</span>
<span class="txt referee">
</span>
</div>
<div class="pista">
<span class="label">
Pista
</span>
<span class="txt pabellon">
POLIDEPORTIVO URRETA
</span>
<span class="txt direccion">
Galdakao (Vizcaya)
</span>
</div>
</div>
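Once the divs are available, it may be handier to work with plain values than with prettified HTML. The following is a minimal sketch, assuming each box keeps the structure shown in the output above (child divs with one "label" span and one or more "txt" spans). The function name parse_boxes and the SAMPLE string are hypothetical, introduced only for illustration; the sample markup is copied from the second div of the output.

```python
from bs4 import BeautifulSoup


def parse_boxes(html):
    """Return one dict per div.box-datos-partido, keyed by section label."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for box in soup.select("div.box-datos-partido"):
        fields = {}
        # Each direct child div (fecha, arbitros, pista) holds one label span
        # followed by one or more value spans carrying the class "txt".
        for section in box.find_all("div", recursive=False):
            label_tag = section.find("span", class_="label")
            if label_tag is None:
                continue  # skip sections that do not follow the assumed layout
            label = label_tag.get_text(strip=True)
            values = [s.get_text(strip=True)
                      for s in section.find_all("span", class_="txt")]
            fields[label] = [v for v in values if v]  # drop empty spans
        result.append(fields)
    return result


# Hypothetical sample input, copied from the second div in the output above.
SAMPLE = """
<div class="box-datos-partido">
<div class="fecha">
<span class="label">Fecha</span>
<span class="txt">31/10/2021 - 12:00</span>
</div>
<div class="arbitros">
<span class="label">Árbitros</span>
<span class="txt referee">DIAZ DE SARRALDE MARTIN, IÑIGO</span>
<span class="txt referee">SANCHEZ NUÑEZ, UNAI</span>
<span class="txt referee"></span>
</div>
<div class="pista">
<span class="label">Pista</span>
<span class="txt pabellon">POLIDEPORTIVO URRETA</span>
<span class="txt direccion">Galdakao (Vizcaya)</span>
</div>
</div>
"""

rows = parse_boxes(SAMPLE)
print(rows)
```

With the Selenium example, the same function could presumably be called as parse_boxes(driver.page_source) before the browser is closed.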