Web scraping with Python. Get href from "a" elements


With the following code I can get all the data from a given range of pages at the URL below:

import pandas as pd

F, L = 1, 2 # first and last pages

dico = {}
for page in range(F, L+1):
    url = f'https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page={page}&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17'
    sub_df = pd.read_html(url, parse_dates=True)[0]
    #sub_df.insert(0, "page_number", page)
    sub_df.insert(1, "Year", "AT")
    sub_df.insert(2, "Ind_Out", "I")
    sub_df.insert(3, "Gender", "M")
    sub_df.insert(4, "Event", "MILLA")
    sub_df.insert(5, "L_N", "L")
    dico[page] = sub_df
out = pd.concat(dico, ignore_index=True)

But I also need each athlete's code, which is in the link in the "Competitor" field.

How could I insert a field with the href of each competitor?

Asked By: CarlosFC



I'm not really sure why your code does everything it does, but to get the table on that page with an additional column holding the competitor code from each link, I would do this (the example covers only the first page, but you can extend it to the others):

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

url = "https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page=1&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17"
req = requests.get(url)

# This gets you the whole table, as is:
sub_df = pd.read_html(req.text)[0]
# We need this to extract the codes:
soup = bs(req.text, "html.parser")
codes = [comp['href'].split('=')[1] for comp in soup.select('table.records-table td[data-th="Competitor"] a')]

# We then insert the codes as a new column in the df
sub_df.insert(3, 'Code', codes)

You should now have a new column right after Competitor. You can drop whatever column you don’t want, add other columns and so on.
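One caveat with the snippet above: `insert` raises an error if the number of links found differs from the number of table rows, for example when a row has no competitor link. A minimal sketch of a per-row variant follows. The function name `competitor_codes` and the `SAMPLE_HTML` fragment are made up for illustration (the real page's markup may differ), and it assumes header rows contain only `th` cells.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's table; values are placeholders, not real data.
SAMPLE_HTML = """
<table class="records-table">
  <thead><tr><th>Rank</th><th>Competitor</th></tr></thead>
  <tbody>
    <tr><td data-th="Rank">1</td>
        <td data-th="Competitor"><a href="/athletes/x?id=111">Runner A</a></td></tr>
    <tr><td data-th="Rank">2</td>
        <td data-th="Competitor">Runner B</td></tr>
  </tbody>
</table>
"""


def competitor_codes(html):
    """Return one code (or None) per data row of the records table."""
    soup = BeautifulSoup(html, "html.parser")
    codes = []
    # Walk the rows rather than the links, so the list stays aligned with the table.
    for row in soup.select("table.records-table tr"):
        if row.find("td") is None:
            continue  # header row: only <th> cells, no data
        link = row.select_one('td[data-th="Competitor"] a[href]')
        if link is not None and "=" in link["href"]:
            # Same rule as the answer: take what follows the first "=".
            codes.append(link["href"].split("=")[1])
        else:
            codes.append(None)  # keep a placeholder so lengths still match
    return codes


print(competitor_codes(SAMPLE_HTML))
```

The returned list can be passed to `sub_df.insert(3, 'Code', codes)` as before. To cover several pages, one option is to run this once per page on `requests.get(url).text` and combine the resulting frames with `pd.concat(..., ignore_index=True)`, as in the question.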

Answered By: Jack Fleeting