Connection timeouts as protection against site scraping?
Question:
I am new to Python and web scraping, but for the past two weeks I have been periodically scraping one website and successfully downloading images from it. I use different proxies and change them from time to time. Starting yesterday, however, all my proxies suddenly stopped working with a timeout error. I've tried a whole list of them, and every one fails.
Could this be some kind of anti-scraping protection on the site? If so, is there a way to get around it?
import requests
from bs4 import BeautifulSoup

header = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
proxies = {
    "http": "http://188.114.99.153",
    "https": "http://180.94.69.66:8080"
}
url = 'https://parovoz.com/newgallery/index.php?&LNG=RU&NO_ICONS=0&CATEG=-1&HOWMANY=192'
# every proxy now fails here with a connect timeout
html = requests.get(url, headers=header, proxies=proxies, timeout=10).text
soup = BeautifulSoup(html, 'lxml')
Error message:
ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001536A8E7190>, 'Connection to 180.94.69.66 timed out. (connect timeout=10)'))
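A quick way to narrow this down is to check whether the proxies themselves are dead or the target site is refusing proxy traffic. A minimal sketch along those lines (httpbin.org/ip is only an assumed neutral test endpoint; the proxy address is the one from the code above):
import requests

# the HTTPS proxy from the question; httpbin.org/ip is just an assumed neutral test endpoint
proxies = {"https": "http://180.94.69.66:8080"}

tests = {
    "neutral endpoint, direct":    ("https://httpbin.org/ip", None),
    "neutral endpoint, via proxy": ("https://httpbin.org/ip", proxies),
    "target site, direct":         ("https://parovoz.com/", None),
    "target site, via proxy":      ("https://parovoz.com/", proxies),
}

for name, (url, prox) in tests.items():
    try:
        r = requests.get(url, proxies=prox, timeout=10)
        print(f"{name}: HTTP {r.status_code}")
    except requests.exceptions.RequestException as e:
        print(f"{name}: failed ({type(e).__name__})")

# If only the "via proxy" rows time out, the proxies are dead or blocked upstream;
# if the target site fails through every proxy while a neutral endpoint still works,
# the site is probably rejecting those proxy IPs.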
Answer:
The snippet below will GET the URL and retry up to 3 times on ConnectTimeoutError. It also applies an increasing delay between attempts, which helps avoid failing again if the errors are caused by a periodic request quota. Take a look at urllib3.util.retry.Retry; it has many options for fine-tuning retries.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup

header = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
url = 'https://parovoz.com/newgallery/index.php?&LNG=RU&NO_ICONS=0&CATEG=-1&HOWMANY=192'

session = requests.Session()
# retry failed connection attempts up to 3 times, with an increasing delay between them
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

html = session.get(url, headers=header).text
soup = BeautifulSoup(html, 'lxml')
print(soup)
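Note that the retry-enabled session above no longer passes the proxies or an explicit timeout from the question; if those are still needed, they can be supplied per request on the same session. A sketch reusing the proxy addresses from the question (which may of course still be blocked):
# same session as above, so the Retry-backed adapter still applies
proxies = {
    "http": "http://188.114.99.153",
    "https": "http://180.94.69.66:8080"
}
html = session.get(url, headers=header, proxies=proxies, timeout=10).text
soup = BeautifulSoup(html, 'lxml')
With backoff_factor=0.5, the waits between connection retries grow exponentially, on the order of 0.5 to 2 seconds, with the exact schedule depending on the urllib3 version.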