How can I get the URLs (links) behind clickable text using Python?

Question:

import requests
from bs4 import BeautifulSoup
import pandas as pd
session = requests.Session()
session.verify = False
session.trust_env = False
url = 'https://basketball.realgm.com/nba/teams'
response = session.get(url)
soup = BeautifulSoup(response.text, 'html.parser')


teams = soup.findAll('div',{'class':'small-column-left'})

for team in teams:
    name = team.get_text().strip()
    schedule_url = team.get('a[href]')

    print(name)

I get the result as:
Atlanta Hawks

Roster | Schedule | Stats

Charlotte Hornets

Roster | Schedule | Stats

Miami Heat

Roster | Schedule | Stats

Orlando Magic

Roster | Schedule | Stats

Washington Wizards

Roster | Schedule | Stats: https://basketball.realgm.com/nba/teams/Atlanta-Hawks/1/Home
Northwest Division

Denver Nuggets

Roster | Schedule | Stats

Minnesota Timberwolves

Roster | Schedule | Stats

Oklahoma City Thunder

Roster | Schedule | Stats

Portland Trail Blazers

Roster | Schedule | Stats

Utah Jazz

Roster | Schedule | Stats: https://basketball.realgm.com/nba/teams/Denver-Nuggets/7/Home
Pacific Division

But I want only the Schedule URLs, which are behind the clickable text.

Asked By: Rajendra Jadhav


Answers:

In newer code, avoid the old syntax findAll(); use find_all() instead, or select() with CSS selectors. For more, take a minute to check the docs.
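
It may also help to see why the question's team.get('a[href]') comes back empty: Tag.get() reads an attribute of the tag itself and does not interpret CSS selectors. A minimal sketch, assuming a hand-written HTML snippet as a stand-in for the real page (the markup below is hypothetical and the live page may be structured differently):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed for illustration; the real page may differ.
html = """
<div class="small-column-left">
  <a href="/nba/teams/Atlanta-Hawks/1/Home">Atlanta Hawks</a>
  <a href="/nba/teams/Atlanta-Hawks/1/Rosters">Roster</a> |
  <a href="/nba/teams/Atlanta-Hawks/1/Schedule/2023">Schedule</a> |
  <a href="/nba/teams/Atlanta-Hawks/1/Stats">Stats</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() is the current spelling of findAll()
team = soup.find_all("div", {"class": "small-column-left"})[0]

# get() looks for an attribute on the <div> literally named 'a[href]';
# no such attribute exists, so the result is None.
wrong = team.get("a[href]")

# select_one() accepts a CSS selector and returns the first matching <a> tag.
link = team.select_one('a[href*="Schedule"]')
schedule_href = link.get("href")

print(wrong)
print(schedule_href)
```

Here get() is only useful once you already hold the a tag whose href you want.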

So select your elements more specifically, for example with CSS selectors:

[(t.parent.a.text,'https://basketball.realgm.com'+t.get('href')) for t in soup.select('.basketball a[href*="Schedule"]')]
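
To try the selector without hitting the site, here is a small offline sketch. The HTML is a hypothetical stand-in (the .basketball wrapper and the flat link layout are assumptions about the page, not copied from it), and urljoin from the standard library replaces the string concatenation as one possible way to build absolute URLs:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://basketball.realgm.com"

# Hypothetical markup, assumed for illustration; the real page may differ.
html = """
<div class="basketball">
  <div class="small-column-left">
    <a href="/nba/teams/Atlanta-Hawks/1/Home">Atlanta Hawks</a>
    <a href="/nba/teams/Atlanta-Hawks/1/Rosters">Roster</a> |
    <a href="/nba/teams/Atlanta-Hawks/1/Schedule/2023">Schedule</a> |
    <a href="/nba/teams/Atlanta-Hawks/1/Stats">Stats</a>
  </div>
  <div class="small-column-left">
    <a href="/nba/teams/Boston-Celtics/2/Home">Boston Celtics</a>
    <a href="/nba/teams/Boston-Celtics/2/Rosters">Roster</a> |
    <a href="/nba/teams/Boston-Celtics/2/Schedule/2023">Schedule</a> |
    <a href="/nba/teams/Boston-Celtics/2/Stats">Stats</a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# a[href*="Schedule"] matches only anchors whose href contains "Schedule".
# a.parent.a is the first <a> inside the same parent, here the team name link.
schedules = [
    (a.parent.a.get_text(strip=True), urljoin(BASE_URL, a["href"]))
    for a in soup.select('.basketball a[href*="Schedule"]')
]

for team_name, schedule_url in schedules:
    print(team_name, schedule_url)
```

Note that a.parent.a only yields the team name when the name link is the first anchor in the same parent element as the Schedule link, which is what this sketch assumes.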

Example

import requests
from bs4 import BeautifulSoup
session = requests.Session()
session.verify = False
session.trust_env = False

url = 'https://basketball.realgm.com/nba/teams'
response = session.get(url)
soup = BeautifulSoup(response.text, 'html.parser')


[(t.parent.a.text,'https://basketball.realgm.com'+t.get('href')) for t in soup.select('.basketball a[href*="Schedule"]')]

Output (only the URLs are shown here; the comprehension above yields (team, URL) tuples)

['https://basketball.realgm.com/nba/teams/Boston-Celtics/2/Schedule/2023',
 'https://basketball.realgm.com/nba/teams/Brooklyn-Nets/38/Schedule/2023',
 'https://basketball.realgm.com/nba/teams/New-York-Knicks/20/Schedule/2023',
 'https://basketball.realgm.com/nba/teams/Philadelphia-Sixers/22/Schedule/2023',
 'https://basketball.realgm.com/nba/teams/Toronto-Raptors/28/Schedule/2023',
 'https://basketball.realgm.com/nba/teams/Chicago-Bulls/4/Schedule/2023',
 'https://basketball.realgm.com/nba/teams/Cleveland-Cavaliers/5/Schedule/2023',
 'https://basketball.realgm.com/nba/teams/Detroit-Pistons/8/Schedule/2023',
 'https://basketball.realgm.com/nba/teams/Indiana-Pacers/11/Schedule/2023',
 'https://basketball.realgm.com/nba/teams/Milwaukee-Bucks/16/Schedule/2023',
 'https://basketball.realgm.com/nba/teams/Atlanta-Hawks/1/Schedule/2023',...]

EDIT

Based on the additional comment, simply create a list of dicts and convert it to a DataFrame:

import pandas as pd

# One dict per Schedule link: team name plus absolute schedule URL
pd.DataFrame(
    [{'team':t.parent.a.text,'url':'https://basketball.realgm.com'+t.get('href')} for t in soup.select('.basketball a[href*="Schedule"]')]
)
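
As a quick offline check of the list-of-dicts pattern, the sketch below builds the same kind of DataFrame from two hand-typed sample rows. The URLs are taken from the output above; pairing them with these team names is an assumption based on the URL slugs:

```python
import pandas as pd

# Sample rows typed by hand; team names are inferred from the URL slugs.
rows = [
    {
        "team": "Boston Celtics",
        "url": "https://basketball.realgm.com/nba/teams/Boston-Celtics/2/Schedule/2023",
    },
    {
        "team": "Brooklyn Nets",
        "url": "https://basketball.realgm.com/nba/teams/Brooklyn-Nets/38/Schedule/2023",
    },
]

# Each dict becomes one row; the dict keys become the column names.
df = pd.DataFrame(rows)

print(df.columns.tolist())
print(len(df))
```

From here, df.to_csv() or df.to_dict("records") can be used if the table needs to be saved or passed on.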
Answered By: HedgeHog