Why did my web-scraping method stop working on one particular site?

Question:

Several months ago I regularly used a Python script to scrape and parse basketball odds from a particular website. After a couple of months without using it, I tried to run the same script, only to find that it now throws an error.

I’m looking for 1) the reason the script now fails, and 2) a functioning workaround.

The line of code that is the source of the error is below. I use this same method to scrape other websites without issue.

source = requests.get('https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball').json()

Previously, the above command would return usable source data. Now it raises:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
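That error means the response body was not valid JSON at all (for example, an empty body or an HTML block page), so `.json()` fails at the very first character. A quick offline illustration of the same exception, plus the usual way to inspect what the server actually sent:

```python
import json

# An empty or non-JSON body triggers exactly the error seen above;
# requests raises this internally when .json() is called on such a response.
try:
    json.loads('')  # nothing to parse
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 1 (char 0)

# When debugging, look at the raw response before decoding:
# resp = requests.get(api_url)
# print(resp.status_code, resp.text[:200])
```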

I tried an alternate scraping method for the same target site. Interestingly, when I enter the commands (below) line by line in an interactive session, I can successfully acquire the data. When I run the same code as a script, no data is acquired.

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball')
page_source = browser.page_source

Is this specific target site somehow protected against automated scraping? Are there any workarounds?

Asked By: timmy_pickens


Answers:

I was able to get a correct response from the server by setting the User-Agent header and forcing the cache to be bypassed with a dummy URL parameter. E.g.:

import time
import requests

api_url = 'https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
}

source = requests.get(api_url, headers=headers, params={'_t': int(time.time())})
print(source.json())

Prints:

[
    {
        "path": [
            {
                "id": "11344232",
                "link": "/basketball/nba-futures/nba-championship-2023-24",
                "description": "NBA Championship 2023/24",
                "type": "LEAGUE",
                "sportCode": "BASK",
                "order": 9223372036854775807,

...
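The dummy `_t` parameter works as a cache-buster because `requests` serializes `params` into the query string, so each call uses a fresh URL and any cache along the way treats it as a new resource. A small sketch of what URL is actually sent (with a fixed timestamp for illustration):

```python
import requests

api_url = 'https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball'

# Build (but don't send) the request to see the final URL requests would use.
prepared = requests.Request('GET', api_url, params={'_t': 1700000000}).prepare()
print(prepared.url)  # ...basketball?_t=1700000000
```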
Answered By: Andrej Kesely

This works when 1) a valid User-Agent is set and 2) a requests.Session is used to get the homepage first (maybe it sets some cookie).

import requests
from pprint import pp


base_url = 'https://www.bovada.lv'
url = (
    'https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball'
    )

session, timeout = requests.Session(), 3.05
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:104.0) Gecko/20100101 Firefox/104.0'
    })
session.mount(base_url, requests.adapters.HTTPAdapter())

response = session.get(base_url, timeout=timeout)
response = session.get(url, timeout=timeout)
pp(response.json())
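The reason fetching the homepage first can matter is that `requests.Session` keeps any cookies the server sets and automatically attaches them to later requests to the same domain. A minimal offline sketch of that behavior (the cookie name and value here are made up):

```python
import requests

session = requests.Session()
# Simulate the server having set a cookie on the first (homepage) request.
session.cookies.set('session_id', 'abc123', domain='www.bovada.lv')

# A later request to the same domain is prepared with that cookie attached.
req = requests.Request('GET', 'https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball')
prepared = session.prepare_request(req)
print(prepared.headers.get('Cookie'))  # session_id=abc123
```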
Answered By: richard