How to get the proxy used for each request in scrapy logs?
Question:
I am using a custom proxy middleware for rotating proxies and I would like to get a log for the proxy used for each request:
import random

packetstream_proxies = [
    settings.get("PS_PROXY_USA"),
    settings.get("PS_PROXY_CA"),
    settings.get("PS_PROXY_IT"),
    settings.get("PS_PROXY_GLOBAL"),
]
unlimited_proxies = [
    settings.get("UNLIMITED_PROXY_1"),
    settings.get("UNLIMITED_PROXY_2"),
    settings.get("UNLIMITED_PROXY_3"),
    settings.get("UNLIMITED_PROXY_4"),
    settings.get("UNLIMITED_PROXY_5"),
    settings.get("UNLIMITED_PROXY_6"),
]

class SdtProxyMiddleware(object):
    def process_request(self, request, spider):
        # retries = request.meta.get("retry_times", 0)
        # Default to a PacketStream proxy; switch pools on the first retry.
        request.meta["proxy"] = random.choice(packetstream_proxies)
        if request.meta.get("retry_times") == 1:
            request.meta["proxy"] = random.choice(unlimited_proxies)
        return None

    def process_response(self, request, response, spider):
        spider.logger.info(
            f"""Processed {response.url} with {request.meta.get("proxy")}"""
        )
        return response
I tried to log the proxy in process_response like this:
    def process_response(self, request, response, spider):
        spider.logger.info(
            f"""Processed {response.url} with {request.meta.get("proxy")}"""
        )
        return response
but it only works for some URLs, for example:
    Processed https://www.saksfifthavenue.com/c/men/apparel with http://proxyserver:8884
Most of the time the proxy is logged as None:
    Processed https://www.saksfifthavenue.com/c/women-s-apparel?start=168&sz=24 with None
Answers:
In recent versions of Scrapy, the HttpProxy middleware removes the proxy metadata from request.meta['proxy'] after it has updated the request (see the related lines in the Scrapy source code).
As a result, you get None values in the log entries written from process_response.
This change was made in Scrapy 2.6.2 (2022-07-25) because of a security issue described at:
https://docs.scrapy.org/en/latest/news.html#scrapy-2-6-2-2022-07-25
https://github.com/advisories/GHSA-9x8m-2xpf-crp3
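One possible workaround, sketched here under assumptions rather than taken from the Scrapy docs, is to copy the chosen proxy into a second meta key that no built-in middleware touches, and to log that key in process_response. The key name "proxy_used", the class name RememberProxyMiddleware and the PROXIES list are all made up for this illustration.

```python
import random

# Hypothetical proxy pool; in the question these values come from settings.
PROXIES = ["http://proxy-a.example:8000", "http://proxy-b.example:8000"]


class RememberProxyMiddleware:
    """Sketch of a downloader middleware that remembers which proxy it chose."""

    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        request.meta["proxy"] = proxy
        # "proxy_used" is an invented key, so later changes to
        # request.meta["proxy"] do not affect it.
        request.meta["proxy_used"] = proxy
        return None

    def process_response(self, request, response, spider):
        spider.logger.info(
            "Processed %s with %s", response.url, request.meta.get("proxy_used")
        )
        return response
```

Because the log line reads the copy instead of request.meta["proxy"], it should still show the proxy even when the original key has been reset by the time the response arrives.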
Update
Proxy logging can be implemented with a separate middleware like this:
    class ProxyLoggingMiddleware:
        def process_request(self, request, spider):
            spider.logger.info(
                f"""Request {request.url} sent with {request.meta.get("proxy")}"""
            )
In settings:
    DOWNLOADER_MIDDLEWARES = {
        "pathto...SdtProxyMiddleware": 0,
        "pathto...ProxyLoggingMiddleware": 950,
    }
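The priorities matter because Scrapy calls process_request hooks in increasing priority order and process_response hooks in decreasing order. The small sketch below only illustrates that ordering; it does not call Scrapy. The middleware names are shortened placeholders for the real import paths, and 750 is the default priority of Scrapy's built-in HttpProxyMiddleware.

```python
# Priorities: the two custom entries mirror the settings above (names are
# shortened placeholders); 750 is Scrapy's default for HttpProxyMiddleware.
middlewares = {
    "SdtProxyMiddleware": 0,
    "HttpProxyMiddleware": 750,
    "ProxyLoggingMiddleware": 950,
}

# process_request() hooks run in increasing priority order ...
request_order = sorted(middlewares, key=middlewares.get)
# ... and process_response() hooks run in the reverse order.
response_order = list(reversed(request_order))

print("process_request order: ", request_order)
print("process_response order:", response_order)
```

With these values the logging middleware sees each outgoing request last, after both the custom proxy middleware and the built-in one have processed it.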