How do I efficiently check if data was returned in my GET request?
Question:
I am web scraping and need to parse the responses from a few thousand GET requests at a time. Sometimes these requests fail with 429 and/or 403 errors, so I need to check whether there is data before parsing the response. I wrote this function:
from bs4 import BeautifulSoup

def check_response(response):
    if not response or not response.content:
        return False
    else:
        soup = BeautifulSoup(response.content, "html.parser")
        if not soup or not soup.find_all(attrs={"class": "stuff"}):
            return False
    return True
This works, but it can take quite a while to loop through a few thousand responses. Is there a better way?
Answers:
You can use the response.status_code attribute to check the status code of the response. MDN has a full list of HTTP status codes, but any code >= 400 is definitely an error. Try using this code:
def check_response(response):
    if not response or not response.content or response.status_code >= 400:
        return False
    else:
        soup = BeautifulSoup(response.content, "html.parser")
        if not soup or not soup.find_all(attrs={"class": "stuff"}):
            return False
        return True
Note that return True is indented one level inwards here so that it sits inside the else branch. Leaving it at function level behaves the same, because that line is only reached when neither return False runs.
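As a rough sketch of the same status-code idea applied to a whole batch: the responses can be split into usable and failed ones before any HTML is parsed, so the 429 and 403 cases never reach BeautifulSoup and can be retried later. FakeResponse and split_by_status are hypothetical names used only for illustration; a real requests response exposes the same status_code and content attributes.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a requests response (illustration only)."""
    status_code: int
    content: bytes


def split_by_status(responses):
    """Separate responses worth parsing from failed or empty ones."""
    ok, failed = [], []
    for r in responses:
        # Cheap attribute checks only; no HTML parsing happens here.
        if r.status_code >= 400 or not r.content:
            failed.append(r)
        else:
            ok.append(r)
    return ok, failed


responses = [
    FakeResponse(200, b'<p class="stuff">data</p>'),
    FakeResponse(429, b"Too Many Requests"),
    FakeResponse(403, b""),
]
ok, failed = split_by_status(responses)
print(len(ok), "to parse,", len(failed), "to retry")
```

Only the responses in ok would then be handed to check_response, and the ones in failed are candidates for a retry with a delay.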
Notwithstanding the comments by @Michael M, I propose the following:
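from bs4 import BeautifulSoup

def check_response(response):
    # The value passed is returned by requests.get, so it is never None
    # (a Response is falsy only for 4xx/5xx statuses, which raise_for_status handles).
    try:
        response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
        soup = BeautifulSoup(response.text, 'lxml')  # 'lxml' requires the lxml package
        if soup.find_all(attrs={"class": "stuff"}):
            return True
    except Exception:
        pass
    return False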
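If all that is needed is a yes/no answer, one possible way to save time is to stop at the first match instead of building a full tree. The sketch below is not taken from either answer: it swaps BeautifulSoup for the standard library's html.parser, runs a cheap substring test first, and aborts parsing as soon as an element with the class is seen. ClassFinder, has_class and the _Found exception are made-up names, and it assumes the class name appears literally in the markup. Whether this is actually faster on your pages is something to measure.

```python
from html.parser import HTMLParser


class _Found(Exception):
    """Raised internally to stop parsing at the first match."""


class ClassFinder(HTMLParser):
    """Looks for any start tag whose class attribute contains class_name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        for name, value in attrs:
            if name == "class" and value and self.class_name in value.split():
                raise _Found


def has_class(html_text, class_name="stuff"):
    """Return True if some element carries the given class."""
    # Cheap pre-check: if the name never occurs, skip parsing entirely.
    # Assumption: the class name is written literally, not as character references.
    if class_name not in html_text:
        return False
    finder = ClassFinder(class_name)
    try:
        finder.feed(html_text)
    except _Found:
        return True
    return False


print(has_class('<div class="stuff other">x</div>'))
print(has_class('<div class="nostuff">stuff</div>'))
```

With a requests response this would be called as has_class(response.text) after the status check, in place of the BeautifulSoup lookup.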