How to scrape data from a paginated table?

Question

I need help automating the scraping of this web page. I want to collect the data for all the players across its different pages.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.mlb.com/es/stats/spring-training'

pagina = requests.get(url)

soup = BeautifulSoup(pagina.text, 'lxml')

table = soup.find('table', {'class':"bui-table is-desktop-sKqjv9Sb"})


encabezados = []

for i in table.find_all('th')[:18]:
    datos = i.find_all('button')
    for td in datos:
     titulo = td.text.strip()
    encabezados.append(titulo)

datos_mlb = pd.DataFrame(columns = encabezados)


nombres = []

for i in table.find_all('th')[18:]:
    datos = i.find_all('a')
    for td in datos:
     jugadores = td.text.strip() 
    nombres.append(jugadores)
    
datos_mlb['JUGADOR'] = nombres


for fila in table.find_all('tr')[1:]:
    data = fila.find_all('td')
    data_fila = [td.text.strip() for td in data]
    largo = len(datos_mlb)-1
    datos_mlb.iloc[:,1:] = data_fila

I have managed to extract most of the information, but I cannot fill in the data correctly or iterate over all the pages.

Asked By: Esteban Madrigal

Answer 1

Try using the structured data from the JSON response of the XHR request to build your DataFrame. Inspect the Network tab in your browser's devtools to see which parameters to send and what you get back:

import pandas as pd
import requests

data = []

for i in range(0,175,25):
    data.extend(
        requests.get(
            f'https://bdfed.stitch.mlbinfra.com/bdfed/stats/player?stitch_env=prod&season=2022&sportId=1&stats=season&group=hitting&gameType=S&limit=25&offset={i}&sortStat=onBasePlusSlugging&order=desc', 
            headers = {'user-agent': 'Mozilla/5.0'}
        ).json()['stats']
    )
pd.DataFrame(data)

Output

	playerId	playerName	…	type	atBatsPerHomeRun
0	502671	Paul Goldschmidt	…	player	5.5
1	621439	Byron Buxton	…	player	6.4
2	547180	Bryce Harper	…	player	4.38
3	658668	Edward Olivares	…	player	11.33
4	670351	Jose Rojas	…	player	9
…		…		…
156	593871	Jorge Polanco	…	player	32.00
157	676475	Alec Burleson	…	player	-.--
158	608385	Jesse Winker	…	player	-.--
159	641355	Cody Bellinger	…	player	-.--
160	660162	Yoan Moncada	…	player	-.--

[161 rows x 72 columns]
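
The loop above hard-codes an upper bound of 175, which only fits the 161 players returned at the time. As a rough sketch, the same request can be repeated until a page comes back short, so the total row count does not have to be known in advance. The helper names `fetch_page` and `fetch_all`, the `timeout` value, and the assumption that the last page contains fewer than `limit` rows are illustrative and not confirmed by the API:

```python
from typing import Callable, Dict, List

# Endpoint taken from the request URL used in this answer.
API_URL = "https://bdfed.stitch.mlbinfra.com/bdfed/stats/player"


def fetch_page(offset: int, limit: int) -> List[Dict]:
    """Request one page of stats; the query parameters mirror the URL above."""
    import requests  # imported lazily so fetch_all can be reused without it

    params = {
        "stitch_env": "prod",
        "season": 2022,
        "sportId": 1,
        "stats": "season",
        "group": "hitting",
        "gameType": "S",
        "limit": limit,
        "offset": offset,
        "sortStat": "onBasePlusSlugging",
        "order": "desc",
    }
    response = requests.get(
        API_URL,
        params=params,
        headers={"user-agent": "Mozilla/5.0"},
        timeout=30,  # assumed value, adjust as needed
    )
    response.raise_for_status()
    return response.json()["stats"]


def fetch_all(get_page: Callable[[int, int], List[Dict]], limit: int = 25) -> List[Dict]:
    """Collect rows page by page until a page has fewer than `limit` rows."""
    rows: List[Dict] = []
    offset = 0
    while True:
        page = get_page(offset, limit)
        rows.extend(page)
        if len(page) < limit:  # assumption: a short or empty page is the last one
            break
        offset += limit
    return rows
```

With pandas installed, `pd.DataFrame(fetch_all(fetch_page))` would then build the same kind of table as the loop above. Passing the page function as an argument keeps the pagination logic easy to try out with fake data.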

Answered By: HedgeHog

Answer 2

You are not getting all the required data because it is loaded dynamically via an API, so you have to pull the data from the API.

Example:

import pandas as pd
import requests
api_url = 'https://bdfed.stitch.mlbinfra.com/bdfed/stats/player?stitch_env=prod&season=2022&sportId=1&stats=season&group=hitting&gameType=S&limit=161&offset=0&sortStat=onBasePlusSlugging&order=desc'  
req = requests.get(api_url).json()

data =[]
for item in req['stats']:
    playerName=item['playerName']
    data.append({
        'playerName':playerName
        })

df = pd.DataFrame(data)
print(df)

Output:

        playerName
0    Paul Goldschmidt
1        Byron Buxton
2        Bryce Harper
3     Edward Olivares
4          Jose Rojas
..                ...
156     Jorge Polanco
157     Alec Burleson
158      Jesse Winker
159    Cody Bellinger
160      Yoan Moncada

[161 rows x 1 columns]
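
If more than one column is wanted, the same idea extends to a list of field names. The following is only a sketch: `select_fields` is a hypothetical helper, and the sample records are shaped like the items in `req['stats']`, with values copied from the output shown in Answer 1 (the exact value types returned by the API are an assumption):

```python
from typing import Dict, Iterable, List


def select_fields(records: Iterable[Dict], fields: List[str]) -> List[Dict]:
    """Keep only the named keys from each record; missing keys become None."""
    return [{name: record.get(name) for name in fields} for record in records]


# Sample records (assumed shape) standing in for req['stats'].
sample = [
    {"playerId": 502671, "playerName": "Paul Goldschmidt",
     "type": "player", "atBatsPerHomeRun": "5.5"},
    {"playerId": 621439, "playerName": "Byron Buxton",
     "type": "player", "atBatsPerHomeRun": "6.4"},
]

rows = select_fields(sample, ["playerName", "atBatsPerHomeRun"])
print(rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` in the same way as `data` in the example above.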

Answered By: F.Hoque
