Scrapy and proxies

Question:

How do you utilize proxy support with the python web-scraping framework Scrapy?

Asked By: no1


Answers:

From the Scrapy FAQ,

Does Scrapy work with HTTP proxies?

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.

The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell.

C:>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port

If you want to use an HTTPS proxy for visiting HTTPS sites, set the environment variable https_proxy instead:

C:>set https_proxy=https://proxy:port
csh% setenv https_proxy https://proxy:port
sh$ export https_proxy=https://proxy:port
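
Scrapy's HttpProxyMiddleware picks these variables up through urllib's getproxies(), so you can preview from Python what the middleware will see. A minimal sketch (the proxy address is a placeholder):

import os
from urllib.request import getproxies

# Placeholder proxy address; replace with your real proxy.
os.environ["http_proxy"] = "http://proxy:8080"
os.environ["https_proxy"] = "http://proxy:8080"

# HttpProxyMiddleware builds its proxy map from getproxies(),
# so this prints what Scrapy will use.
print(getproxies())
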
Answered By: ephemient

If the proxy requires authentication, that would be:

export http_proxy=http://user:password@proxy:port

Answered By: laurent alsina

Single Proxy

  1. Enable HttpProxyMiddleware in your settings.py, like this:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1
    }
    
  2. Pass the proxy to the request via request.meta (a fuller spider sketch follows this list):

    request = Request(url="http://example.com")
    request.meta['proxy'] = "http://host:port"
    yield request
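
Put together, a minimal spider using this approach could look like the following sketch (the URL and the proxy address are placeholders):

import scrapy


class SingleProxySpider(scrapy.Spider):
    name = "single_proxy"

    def start_requests(self):
        # Placeholder URL and proxy address; replace with your own values.
        yield scrapy.Request(
            url="http://example.com",
            meta={"proxy": "http://host:port"},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)
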
    

You can also pick a proxy address at random if you have an address pool, like this:

Multiple Proxies

import random

from scrapy import Request, Spider


class MySpider(Spider):
    name = "my_spider"

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.proxy_pool = ['proxy_address1', 'proxy_address2', ..., 'proxy_addressN']

    def parse(self, response):
        ...parse code...
        if something:
            yield self.get_request(url)

    def get_request(self, url):
        req = Request(url=url)
        if self.proxy_pool:
            # Attach a randomly chosen proxy from the pool to this request.
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req
Answered By: Amom

1. Create a new file called “middlewares.py” in your Scrapy project and add the following code to it:

import base64


class ProxyMiddleware(object):
    # Override process_request to route every request through the proxy
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy (Python 3: b64encode, not encodestring)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

2. Open your project’s configuration file (./project_name/settings.py) and add the following code:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}

Now your requests should be passed through this proxy. Simple, isn’t it?
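
Note that recent Scrapy versions can also extract the credentials for you if they are embedded in the proxy URL: HttpProxyMiddleware parses the user:password part of meta['proxy'] and sets the Proxy-Authorization header itself. Under that assumption, the custom middleware can shrink to a single line (a sketch with placeholder values):

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # On a recent Scrapy, HttpProxyMiddleware strips the credentials from
        # this URL and adds the Proxy-Authorization header for you.
        request.meta['proxy'] = "http://USERNAME:PASSWORD@YOUR_PROXY_IP:PORT"
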

Answered By: Shahryar Saljoughi

On Windows I put together a couple of the previous answers and it worked. I simply did:

C:>  set http_proxy=http://username:password@proxy:port

and then I launched my program:

C:/.../RightFolder> scrapy crawl dmoz

where “dmoz” is the spider name (I’m writing it because it’s the one you find in the Scrapy tutorial on the internet, and if you’re here you have probably started from that tutorial).

Answered By: Andrea Ianni ௫

As I had trouble setting the environment variable in /etc/environment, here is what I put in my spider (Python):

os.environ["http_proxy"] = "http://localhost:12345"
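
Note that HttpProxyMiddleware typically reads the proxy environment variables once, when the crawler starts, so the assignment has to run before that, e.g. at module level of the spider file. A minimal sketch (the proxy address and URL are placeholders):

import os

import scrapy

# Set the proxy before the crawler (and its HttpProxyMiddleware) is created;
# module level in the spider file is early enough.
os.environ["http_proxy"] = "http://localhost:12345"


class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ["http://example.com"]  # placeholder URL

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)
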
Answered By: user494599

There is a nice proxy middleware written by someone else: scrapy-proxies (https://github.com/aivarsk/scrapy-proxies).

Answered By: Niranjan Sagar

I would recommend using a middleware such as scrapy-proxies. You can rotate proxies, filter out bad proxies, or use a single proxy for all your requests. Using a middleware also saves you the trouble of setting up the proxy on every run.

This is directly from the GitHub README.

  • Install the scrapy-proxies library

    pip install scrapy_proxies

  • In your settings.py add the following settings

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'

# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0

# If proxy mode is 2, uncomment this line:
#CUSTOM_PROXY = "http://host1:port"

Here you can change the retry settings and choose a single or rotating proxy.

  • Then add your proxies to the list.txt file, one per line:
http://host1:port
http://username:password@host2:port
http://host3:port

After this, all requests for that project will be sent through the proxy. The proxy is rotated randomly for every request, and this does not affect concurrency.
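
If you want to confirm the rotation is working, one option is to log the proxy that was attached to each request from your callback; the request metadata is available on the response. A small sketch (placeholder URL):

import scrapy


class ProxyCheckSpider(scrapy.Spider):
    name = "proxy_check"
    start_urls = ["http://example.com"]  # placeholder URL

    def parse(self, response):
        # The proxy chosen for this request (by scrapy_proxies or any other
        # middleware) is stored on the request meta.
        self.logger.info("Fetched %s via %s", response.url, response.request.meta.get("proxy"))
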

Note: if you do not want to use a proxy, you can simply comment out the scrapy_proxies middleware line.

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
#    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

Happy crawling!!!

Answered By: Amit

Here is what I do

Method 1:

Create a downloader middleware like this:

class ProxiesDownloaderMiddleware(object):

    def process_request(self, request, spider):
        # Route every request through the proxy (placeholder credentials and address).
        request.meta['proxy'] = 'http://user:pass@host:port'

and enable that in settings.py

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'my_scrapy_project_directory.middlewares.ProxiesDownloaderMiddleware': 600,
}

That is it; the proxy will now be applied to every request.

Method 2:

Just enable HttpProxyMiddleware in settings.py and then do this for each request

yield Request(url=..., meta={'proxy': 'http://user:pass@host:port'})
Answered By: Umair Ayub