Scraping data from https://www.transfermarkt.co.uk/ in Python

Question:

I’m trying to follow the steps in this article to scrape data from the Transfermarkt website, but I’m not getting the desired output. It seems some of the classes have changed since the article was written, so I’ve had to change

Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
to

Players = pageSoup.find_all("td", {"class": "hauptlink"})

from bs4 import BeautifulSoup
import requests
import pandas as pd

headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.100 Safari/537.36'}

page = "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

Players = pageSoup.find_all("td", {"class": "hauptlink"})
Values = pageSoup.find_all("td", {"class": "rechts hauptlink"})

PlayersList = []
ValuesList = []

for i in range(0,25):
    PlayersList.append(Players[i].text)
    ValuesList.append(Values[i].text)
    
df = pd.DataFrame({"Players":PlayersList,"Values":ValuesList})

df.head(10)

The problem with this is that it also finds other cells with the same class and adds them to the Players variable, e.g. Players[0].text returns '\nLuís Figo ' and Players[1].text returns '\nReal Madrid', because team names use the same hauptlink class as player names. How can I select only the first hauptlink cell, or somehow differentiate which one I want if they share the same class?

Asked By: seevans38


Answers:

You were fairly close! I tried using requests as well, but every time I did I received a 404 response. I would infer that Transfermarkt has some bot detection in place that recognizes the session is not coming from a real browser.

To circumvent this, I resorted to Selenium, which simulates a browsing session in Chrome. With that, I am finally able to get the HTML that serves as the source for our scraping. The rest of the code is a slight adaptation of what you’ve built yourself.

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service


# Setting up the webdriver (Selenium 4 takes the driver path via a Service object)
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(service=Service('/path/to/your/chromedriver'), options=options)

# Navigating to the webpage
url = 'https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000'
driver.get(url)


# Getting the HTML of the webpage
html = driver.page_source

# Using BeautifulSoup to parse the HTML
soup = BeautifulSoup(html, 'html.parser')

# Closing the webdriver
driver.quit()

# Restricting the search to the transfers table
table = soup.find('table', class_='items') 

# Initializing the lists
PlayersList = []
ValuesList = []

# Finding all the table rows within it (they alternate 'odd' and 'even' class names)
trs = table.find_all('tr', class_=['odd', 'even']) 

# Looping through the table rows
for tr in trs:
    tds = tr.find_all('td', class_='hauptlink')
    player_name = tds[0].a.text
    fee = tr.find('td', class_='rechts hauptlink').text.strip()
    PlayersList.append(player_name)
    ValuesList.append(fee)

# Creating the dataframe and printing it
df = pd.DataFrame({'Player Name': PlayersList, 'Fee': ValuesList})
print(df)

The output I get is a dataframe of the football players and their respective transfer values in the year 2000.

Answered By: shannontesla

I would not recommend Selenium here, as it is significantly slower than requests. To circumvent your issue, first select all information per player (i.e. per table row) and then subselect the information you are interested in.
While the other posted answer is correct, find below an alternative piece of code that’s a bit faster and arguably a bit cleaner.

from bs4 import BeautifulSoup
import requests
import pandas as pd

headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5481.100 Safari/537.36'}

page = "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

# Get all player info
players = pageSoup.find_all('tr', {'class': ['odd', 'even']}) 

#Select names
PlayersList = [player.find_all('td', {'class': 'hauptlink'})[0].text.strip() for player in players]

#Select values
ValuesList = [player.find('td', {'class': 'rechts hauptlink'}).text.strip() for player in players]

df = pd.DataFrame({'Players': PlayersList, 'Values': ValuesList})
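Note that the fees come back as strings like '£60.00m' or '£900k'. If you want them as numbers (e.g. to sort or aggregate), you could map a small helper over the column. This is my own sketch with hypothetical names, assuming the fee strings follow Transfermarkt's usual '£<amount>m' / '£<amount>k' format:

```python
import re

def fee_to_millions(fee):
    """Parse a fee string such as '£60.00m' or '£900k' into a float
    in millions; return None for unparseable values like 'Free transfer'."""
    m = re.search(r'(\d+(?:\.\d+)?)\s*(m|k)?', fee.lower())
    if not m:
        return None
    value = float(m.group(1))
    if m.group(2) == 'k':
        value /= 1000  # thousands -> millions
    return value
```

You could then add a numeric column with `df['ValueM'] = df['Values'].map(fee_to_millions)`.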

Answered By: Mitchell Olislagers