Scrapy and proxies
Question:
How do you utilize proxy support with the Python web-scraping framework Scrapy?
Answers:
From the Scrapy FAQ,
Does Scrapy work with HTTP proxies?
Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.
The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell.
C:\>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port
If you want to use an HTTPS proxy to visit HTTPS websites, set the environment variable https_proxy instead:
C:\>set https_proxy=https://proxy:port
csh% setenv https_proxy https://proxy:port
sh$ export https_proxy=https://proxy:port
If your proxy requires authentication, that would be:
export http_proxy=http://user:password@proxy:port
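As far as I know, Scrapy's HttpProxyMiddleware reads these variables through the standard library's urllib.request.getproxies(), so you can check what will be picked up from the environment directly (the proxy address below is a placeholder):

```python
import os
import urllib.request

# Placeholder proxy address -- substitute your own
os.environ["http_proxy"] = "http://user:password@proxy.example.com:8080"

# getproxies() returns the proxy mapping the environment advertises
proxies = urllib.request.getproxies()
print(proxies["http"])  # → http://user:password@proxy.example.com:8080
```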
Single Proxy
- Enable HttpProxyMiddleware in your settings.py, like this:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1
}
- Pass the proxy to the request via request.meta:
request = Request(url="http://example.com")
request.meta['proxy'] = "http://host:port"
yield request
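One gotcha: the value in meta['proxy'] should be a full URL including the scheme; a bare host:port is easy to get wrong. A quick stdlib sanity check (the helper and addresses here are illustrative, not part of Scrapy):

```python
from urllib.parse import urlparse

def looks_like_proxy_url(value):
    # A usable proxy URL needs a scheme, a hostname and a port
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.hostname) and parsed.port is not None

print(looks_like_proxy_url("proxy.example.com:8080"))        # False -- scheme missing
print(looks_like_proxy_url("http://proxy.example.com:8080")) # True
```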
Multiple Proxies
You can also choose a proxy address at random if you have an address pool, like this:
import random

from scrapy import Request, Spider


class MySpider(Spider):  # BaseSpider was renamed to Spider in later Scrapy versions

    name = "my_spider"

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.proxy_pool = ['proxy_address1', 'proxy_address2', ..., 'proxy_addressN']

    def parse(self, response):
        ...parse code...
        if something:
            yield self.get_request(url)

    def get_request(self, url):
        req = Request(url=url)
        if self.proxy_pool:
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req
1 – Create a new file called “middlewares.py” in your Scrapy project and add the following code to it:
import base64


class ProxyMiddleware(object):
    # overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy
        # (base64.encodestring was removed in Python 3; use b64encode on bytes)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
2 – Open your project’s configuration file (./project_name/settings.py) and add the following code
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
Now your requests should be passed through this proxy. Simple, isn’t it?
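For reference, the Proxy-Authorization value above is just HTTP Basic auth (RFC 7617): base64 of "user:password". A minimal sketch with placeholder credentials:

```python
import base64

def proxy_auth_header(user, password):
    # Basic auth token: base64-encode "user:password" (no trailing newline)
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token

header = proxy_auth_header("USERNAME", "PASSWORD")
print(header)  # → Basic VVNFUk5BTUU6UEFTU1dPUkQ=
```

Note that Python 2’s base64.encodestring appended a trailing newline, which could corrupt the header; b64encode does not.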
In Windows I put together a couple of previous answers and it worked. I simply did:
C:\> set http_proxy=http://username:password@proxy:port
and then I launched my program:
C:/.../RightFolder> scrapy crawl dmoz
where “dmoz” is the spider name (I’m writing it because it’s the one you find in the tutorial on the internet, and if you’re here you have probably started from the tutorial).
As I’ve had trouble setting the environment variable in /etc/environment, here is what I’ve put in my spider (Python):
import os
os.environ["http_proxy"] = "http://localhost:12345"
There is a nice middleware written for this: scrapy-proxies (https://github.com/aivarsk/scrapy-proxies).
I would recommend using a middleware such as scrapy-proxies. You can rotate proxies, filter out bad proxies, or use a single proxy for all your requests. Also, using a middleware will save you the trouble of setting up a proxy on every run.
This is directly from the GitHub README.
- Install the scrapy-proxies library
pip install scrapy_proxies
- In your settings.py, add the following settings:
# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'
# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0
# If proxy mode is 2, uncomment this line:
#CUSTOM_PROXY = "http://host1:port"
Here you can change the retry times and set a single or rotating proxy.
- Then add your proxies to a list.txt file like this:
http://host1:port
http://username:password@host2:port
http://host3:port
After this, all requests for that project will be sent through the proxy. The proxy is rotated randomly for every request. It will not affect concurrency.
Note: if you do not want to use a proxy, you can simply comment out the scrapy_proxies middleware line.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    # 'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
Happy crawling!!!
Here is what I do:
Method 1:
Create a downloader middleware like this:
class ProxiesDownloaderMiddleware(object):

    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://user:pass@host:port'
and enable it in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'my_scrapy_project_directory.middlewares.ProxiesDownloaderMiddleware': 600,
}
That is it: the proxy will now be applied to every request.
Method 2:
Just enable HttpProxyMiddleware in settings.py and then do this for each request:
yield Request(url=..., meta={'proxy': 'http://user:pass@host:port'})