Python 404'ing on urllib.request

Question:

The basics of the code are below. I know for a fact that the way I'm retrieving these pages works for other URLs, as I just wrote a script scraping a different page in the same way. However, with this specific URL it keeps throwing "urllib.error.HTTPError: HTTP Error 404: Not Found" in my face. I replaced the URL with a different one (https://www.premierleague.com/clubs), and it works completely fine. I'm very new to Python, so perhaps there's a really basic step or piece of knowledge I haven't found, but the resources I've found online relating to this didn't seem relevant. Any advice would be great, thanks.

Below is the barebones of the script:

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv

myurl = "https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1"

uClient = uReq(myurl)  # this line raises urllib.error.HTTPError: HTTP Error 404: Not Found
Asked By: Danny


Answers:

The problem is most likely that the site you are trying to access is actively blocking crawlers; you can change the user agent to work around this. See this question for more information (the solution prescribed in that post seems to work for your URL too).

If you want to stick with urllib, this post explains how to alter the user agent.
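In case it helps, here is a minimal sketch of that urllib approach, assuming the site only checks the User-Agent header; the "Mozilla/5.0" value is just an illustrative browser-like string, not something this site specifically requires:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

myurl = "https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1"

# Send a browser-like User-Agent so the request is not rejected as a bot
req = Request(myurl, headers={"User-Agent": "Mozilla/5.0"})
with urlopen(req) as uClient:
    page_html = uClient.read()

page_soup = BeautifulSoup(page_html, "html.parser")
print(page_soup.title)  # quick check that the page actually parsed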

Answered By: jpw

It is showing a 404 because the server is responding as if the page doesn't exist.

You can try a different module, like requests.

Here is the code using requests:

import requests

resp = requests.get("https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1")
print(resp.text)  # prints the page's HTML source
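If the default requests user agent happens to be blocked in the same way as urllib's, the header trick from the answer above applies here too; a minimal sketch, with the User-Agent value again purely illustrative:

import requests

headers = {"User-Agent": "Mozilla/5.0"}  # illustrative browser-like User-Agent
resp = requests.get(
    "https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1",
    headers=headers,
)
resp.raise_for_status()  # raise if the server still returns an error status
print(resp.text)  # page source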

I hope it works!

Answered By: Vihaan Mody