Why did my web-scraping method stop working on one particular site?
Question:
Several months ago I regularly used a Python script to scrape and parse basketball odds from a particular website. After a couple of months without using it, I tried to run the same script, only to find it now throws an error.
I'm looking for 1) the reason the script now fails, and 2) a functioning workaround.
The line of code which is the source of the error is below. I use this method to scrape other websites without issue.
source = requests.get('https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball').json()
Previously, the above command would acquire usable source data. Now it raises:
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
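That JSONDecodeError just means the response body no longer starts with JSON; typically the server is now sending an HTML block or challenge page instead. A minimal illustration of the failure mode, using a canned HTML body in place of a live request:

```python
import json

# Simulate what the endpoint now returns: an HTML page instead of JSON.
body = "<html><body>Access denied</body></html>"

try:
    json.loads(body)
except json.JSONDecodeError as err:
    # "Expecting value: line 1 column 1 (char 0)" -- the same error as above.
    print(f"Not JSON: {err}")
```

With requests, inspecting `response.status_code` and `response.text` before calling `.json()` shows what the server actually sent.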
I tried an alternate scraping method for the same target site. Interestingly, when I enter the commands below line by line, I can successfully acquire the data, but when I run the same code as a script, no data is acquired.
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball')
page_source = browser.page_source
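One wrinkle worth knowing about here: when Chrome displays a raw JSON endpoint, it wraps the text in a `<pre>` element, so `page_source` is HTML rather than JSON. A small helper to recover the payload from such a wrapped source (the `extract_json` name and the canned example are my own illustration, not part of Selenium):

```python
import json
import re

def extract_json(page_source: str):
    """Pull the JSON payload out of a <pre>-wrapped page source."""
    match = re.search(r'<pre[^>]*>(.*?)</pre>', page_source, re.DOTALL)
    if match is None:
        raise ValueError('no <pre> block found in page source')
    return json.loads(match.group(1))

# Canned page source standing in for browser.page_source:
source = '<html><body><pre>[{"sportCode": "BASK"}]</pre></body></html>'
print(extract_json(source))  # [{'sportCode': 'BASK'}]
```

As for the line-by-line vs. script difference: typed interactively, the page has time to finish loading before `page_source` is read; in a script, an explicit wait (e.g. Selenium's `WebDriverWait`) closes that gap.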
Is this specific target site somehow protected against automated scraping? Are there any workarounds?
Answers:
I was able to get a correct response from the server by setting the User-Agent header and disabling caching with a dummy URL parameter, e.g.:
import time
import requests

api_url = 'https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
}
# The dummy `_t` parameter changes on every call, which defeats any cached response.
source = requests.get(api_url, headers=headers, params={'_t': int(time.time())})
print(source.json())
Prints:
[
  {
    "path": [
      {
        "id": "11344232",
        "link": "/basketball/nba-futures/nba-championship-2023-24",
        "description": "NBA Championship 2023/24",
        "type": "LEAGUE",
        "sportCode": "BASK",
        "order": 9223372036854775807,
        ...
It works when 1) a valid User-Agent is set and 2) a requests.Session is used to fetch the homepage first (perhaps that sets some cookie).
import requests
from pprint import pp

base_url = 'https://www.bovada.lv'
url = (
    'https://www.bovada.lv/services/sports/event/v2/events/A/description/basketball'
)

session, timeout = requests.Session(), 3.05
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:104.0) Gecko/20100101 Firefox/104.0'
})
session.mount(base_url, requests.adapters.HTTPAdapter())

# Hit the homepage first so the session picks up any cookies,
# then request the API endpoint with the same session.
response = session.get(base_url, timeout=timeout)
response = session.get(url, timeout=timeout)
pp(response.json())
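Whichever variant is used, it is worth guarding the `.json()` call so a blocked response fails with a useful message instead of a bare traceback. A sketch (the `fetch_json` helper is my own addition, not part of either answer):

```python
import requests

def fetch_json(session: requests.Session, url: str, timeout: float = 3.05):
    """GET a URL and decode JSON, failing loudly when the body is not JSON."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # turn 403/5xx into an explicit HTTPError
    content_type = response.headers.get('Content-Type', '')
    if 'json' not in content_type:
        raise ValueError(f'expected JSON, got {content_type!r}: {response.text[:200]!r}')
    return response.json()
```

If the site starts serving an HTML challenge page again, this raises a ValueError showing the first part of the body, which makes the cause obvious immediately.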