How to add random user agent to scrapy spider when calling spider from script?

Question:

I want to add a random user agent to every request for a spider that is called from another script. My implementation is as follows:

CoreSpider.py

import glob
import os
import re

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from boilerpipe.extract import Extractor  # python-boilerpipe, provides the Extractor used in parse()

import ContentHandler_copy


class CoreSpider(scrapy.Spider):
    name = "final"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = self.read_url()
        self.rules = (
            Rule(
                LinkExtractor(
                    unique=True,
                ),
                callback='parse',
                follow=True
            ),
        )

    def read_url(self):
        # Collect seed URLs from every *.list file in the seed directory.
        urlList = []
        for filename in glob.glob(os.path.join("/root/Public/company_profiler/seed_list", '*.list')):
            with open(filename, "r") as f:
                for line in f.readlines():
                    url = re.sub('\n', '', line)
                    if "http" not in url:
                        url = "http://" + url
                    # print(url)
                    urlList.append(url)
        return urlList

    def parse(self, response):
        print("URL is: ", response.url)
        print("User agent is : ", response.request.headers['User-Agent'])
        filename = '/root/Public/company_profiler/crawled_page/%s.html' % response.url
        article = Extractor(extractor='LargestContentExtractor', html=response.body).getText()
        print("Article is :", article)
        if len(article.split("\n")) < 5:
            print("Skipping to next url : ", article.split("\n"))
        else:
            print("Continue parsing: ", article.split("\n"))
            ContentHandler_copy.ContentHandler_copy.start(article, response.url)

I run this spider from a script, RunSpider.py, as follows:

from CoreSpider import CoreSpider
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(CoreSpider)  # pass the spider class, not an instance
process.start()

It works fine. Now I want to use a different, random user agent for each request. I have successfully used random user agents in a Scrapy project, but I am unable to integrate that with this spider when calling it from another script.

My settings.py for the working Scrapy project:

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tutorial (+http://www.yourdomain.com)'

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'random_useragent.RandomUserAgentMiddleware': 320,
}

USER_AGENT_LIST = "tutorial/user-agent.txt"

How can I tell my CoreSpider.py to use this settings.py configuration programmatically?

Asked By: Om Prakash


Answers:

Take a look at the documentation, specifically the Common Practices section. You can supply the settings as an argument to the CrawlerProcess constructor. Or, if you use a Scrapy project and want to take the settings from settings.py, you can do it like this:

...
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
...
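
For the first option (no Scrapy project on hand), here is a minimal sketch that passes a settings dict directly to the CrawlerProcess constructor; the middleware entries and the user-agent list path are carried over from the question's settings.py and assumed to be valid in your environment:

from scrapy.crawler import CrawlerProcess

from CoreSpider import CoreSpider

# Settings dict mirroring the question's settings.py; the
# random_useragent middleware and the user-agent file path are
# assumptions taken from the question, not verified here.
process = CrawlerProcess(settings={
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'random_useragent.RandomUserAgentMiddleware': 320,
    },
    'USER_AGENT_LIST': 'tutorial/user-agent.txt',
})
process.crawl(CoreSpider)
process.start()

With this approach, every setting that would normally live in settings.py can be supplied from the calling script, so no Scrapy project layout is required.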
Answered By: Tomáš Linhart

Please don’t randomise your user agent; think about what you’re doing first. If your script is causing trouble for a server, the system administrator needs an easy way to deny you access.

Answered By: depeje