Scrapy and proxies
Question:
How do you utilize proxy support with the Python web-scraping framework Scrapy?
Answers:
From the Scrapy FAQ,
Does Scrapy work with HTTP proxies?
Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.
The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell.
C:\>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port
If you want to use an HTTPS proxy to visit HTTPS websites, set the environment variable https_proxy instead:
C:\>set https_proxy=https://proxy:port
csh% setenv https_proxy https://proxy:port
sh$ export https_proxy=https://proxy:port
If your proxy requires authentication, that would be:
export http_proxy=http://user:password@proxy:port
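As far as I know, Scrapy's HttpProxyMiddleware reads these variables through the standard library's urllib.request.getproxies(), so you can check what will be picked up from the environment directly (the proxy address below is a placeholder):

```python
import os
import urllib.request

# Placeholder proxy address -- substitute your own
os.environ["http_proxy"] = "http://user:password@proxy.example.com:8080"

# getproxies() returns the proxy mapping the environment advertises
proxies = urllib.request.getproxies()
print(proxies["http"])  # → http://user:password@proxy.example.com:8080
```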
Single Proxy
- Enable HttpProxyMiddleware in your settings.py, like this:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1
}
- Pass the proxy to the request via request.meta:
request = Request(url="http://example.com")
request.meta['proxy'] = "http://host:port"
yield request
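One gotcha: the value in meta['proxy'] should be a full URL including the scheme; a bare host:port is easy to get wrong. A quick stdlib sanity check (the helper and addresses here are illustrative, not part of Scrapy):

```python
from urllib.parse import urlparse

def looks_like_proxy_url(value):
    # A usable proxy URL needs a scheme, a hostname and a port
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.hostname) and parsed.port is not None

print(looks_like_proxy_url("proxy.example.com:8080"))        # False -- scheme missing
print(looks_like_proxy_url("http://proxy.example.com:8080")) # True
```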
Multiple Proxies
You can also choose a proxy address at random if you have an address pool, like this:
import random

from scrapy import Request, Spider


class MySpider(Spider):  # BaseSpider was renamed to Spider in later Scrapy versions

    name = "my_spider"

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.proxy_pool = ['proxy_address1', 'proxy_address2', ..., 'proxy_addressN']

    def parse(self, response):
        ...parse code...
        if something:
            yield self.get_request(url)

    def get_request(self, url):
        req = Request(url=url)
        if self.proxy_pool:
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req
1 – Create a new file called “middlewares.py” in your Scrapy project and add the following code to it:
import base64


class ProxyMiddleware(object):
    # overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy
        # (base64.encodestring was removed in Python 3; use b64encode on bytes)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
2 – Open your project’s configuration file (./project_name/settings.py) and add the following code
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
Now your requests should be passed through this proxy. Simple, isn’t it?
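For reference, the Proxy-Authorization value above is just HTTP Basic auth (RFC 7617): base64 of "user:password". A minimal sketch with placeholder credentials:

```python
import base64

def proxy_auth_header(user, password):
    # Basic auth token: base64-encode "user:password" (no trailing newline)
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token

header = proxy_auth_header("USERNAME", "PASSWORD")
print(header)  # → Basic VVNFUk5BTUU6UEFTU1dPUkQ=
```

Note that Python 2’s base64.encodestring appended a trailing newline, which could corrupt the header; b64encode does not.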
In Windows I put together a couple of previous answers and it worked. I simply did:
C:\> set http_proxy=http://username:password@proxy:port
and then I launched my program:
C:/.../RightFolder> scrapy crawl dmoz
where “dmoz” is the spider name (I’m writing it because it’s the one you find in the tutorial on the internet, and if you’re here you have probably started from the tutorial).
As I’ve had trouble setting the environment variable in /etc/environment, here is what I’ve put in my spider (Python):
import os
os.environ["http_proxy"] = "http://localhost:12345"
There is a nice middleware written for this: scrapy-proxies (https://github.com/aivarsk/scrapy-proxies).
I would recommend using a middleware such as scrapy-proxies. You can rotate proxies, filter out bad proxies, or use a single proxy for all your requests. Also, using a middleware will save you the trouble of setting up a proxy on every run.
This is directly from the GitHub README.
- Install the scrapy-proxies library
pip install scrapy_proxies
- In your settings.py, add the following settings:
# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'
# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0
# If proxy mode is 2, uncomment this line:
#CUSTOM_PROXY = "http://host1:port"
Here you can change the retry times and set a single or rotating proxy.
- Then add your proxies to a list.txt file like this:
http://host1:port
http://username:password@host2:port
http://host3:port
After this, all requests for that project will be sent through the proxy. The proxy is rotated randomly for every request. It will not affect concurrency.
Note: if you do not want to use a proxy, you can simply comment out the scrapy_proxies middleware line.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    # 'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
Happy crawling!!!
Here is what I do:
Method 1:
Create a downloader middleware like this:
class ProxiesDownloaderMiddleware(object):

    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://user:pass@host:port'
and enable it in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'my_scrapy_project_directory.middlewares.ProxiesDownloaderMiddleware': 600,
}
That is it: the proxy will now be applied to every request.
Method 2:
Just enable HttpProxyMiddleware in settings.py and then do this for each request:
yield Request(url=..., meta={'proxy': 'http://user:pass@host:port'})