What's the difference between scrapy.cmdline.execute and executing a shell command, when running a scrapy spider in a python script?

Question:

When I want to run a scrapy spider, I could do it by calling either scrapy.cmdline.execute(['scrapy', 'crawl', 'myspider']) or os.system('scrapy crawl myspider') or subprocess.run(['scrapy', 'crawl', 'myspider']).

My question is: Why would I prefer to use scrapy.cmdline.execute over subprocess.run or os.system?

I haven’t found anything about this function in the Scrapy docs, and it has no docstring either, but I see that it’s actively used in some tutorials and code examples.

Asked By: whatserface


Answers:

os.system and subprocess.run both run the command in a subprocess, whereas with scrapy.cmdline.execute you call the Scrapy entry point function directly, so all of the code runs in the same process as the script that called the function.
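To make the process difference concrete, here is a minimal sketch. The helper names (child_pid, crawl_in_subprocess, crawl_in_process) are made up for illustration, and 'myspider' is just the placeholder name from the question. As far as I know, scrapy.cmdline.execute finishes by calling sys.exit, which is why the in-process variant catches SystemExit.

```python
import os
import subprocess
import sys


def child_pid():
    """Start a child Python process and return the PID it reports."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getpid())"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


def crawl_in_subprocess(spider_name):
    """Run the scrapy CLI in a separate process and return its exit code."""
    # The crawl is isolated from this script: only the exit code comes back.
    return subprocess.run(["scrapy", "crawl", spider_name]).returncode


def crawl_in_process(spider_name):
    """Call the scrapy CLI entry point inside the current process."""
    # Imported here so the sketch can be loaded without Scrapy installed.
    from scrapy.cmdline import execute

    try:
        execute(["scrapy", "crawl", spider_name])
    except SystemExit as exc:
        # Assumption: execute() ends with sys.exit(), so without this
        # handler the calling script would stop right here.
        return exc.code
```

Calling child_pid() returns a PID different from os.getpid(), which is the same separation you get with crawl_in_subprocess('myspider'). With crawl_in_process('myspider'), the crawl shares the interpreter, imported modules, and event loop of your script.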

  • Why would you choose one over the other?

As a general rule, Python officially recommends the subprocess module over calls to os.system (see the documentation for os.system for more information). The subprocess API is also easier to use and offers more control, so the os.system option shouldn’t really be considered.

For the other two, while I am sure there are many reasons to choose one over the other, I wouldn’t recommend using either of these methods. Scrapy offers tools for running spiders from scripts, such as CrawlerProcess and CrawlerRunner, that should make it unnecessary to invoke the CLI in a subprocess or to call the CLI entry point function directly from your script (although I am sure there are plenty of exceptions to this).

Instead, I recommend using the CLI tool as a CLI tool, and using CrawlerProcess or similar when you need to control Scrapy from Python code.

See Running scrapy from a script to learn more about how to run scrapy from python code.
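As a rough sketch of the CrawlerProcess approach, assuming the script runs inside a Scrapy project so the spider can be looked up by name (run_spider is a hypothetical helper, and 'myspider' is the placeholder name from the question):

```python
def run_spider(spider_name, **spider_kwargs):
    """Run one spider of the current Scrapy project and block until it finishes."""
    # Imported here so the sketch can be loaded without Scrapy installed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Load the project's settings.py so the crawl behaves like the CLI would.
    process = CrawlerProcess(get_project_settings())

    # Inside a project, a spider can be referenced by its name;
    # extra keyword arguments are passed on to the spider.
    process.crawl(spider_name, **spider_kwargs)

    # Starts the Twisted reactor and returns once the crawl is done.
    process.start()
```

You would call it as run_spider('myspider'). Note that the Twisted reactor cannot be restarted, so this can only be done once per Python process; to run several crawls from one script, schedule them all with process.crawl before calling process.start, or look at CrawlerRunner.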

Answered By: Alexander