How can I trigger the same Cloud Run job/service using different arguments?

Question:

I’m trying to run a Scrapy scraper on Cloud Run. The main idea is that every 20 minutes a Cloud Scheduler cron job should trigger the web scraper and get data from different sites. All sites have the same structure, so I would like to use the same code and parallelize the execution of the scraping job, doing something like scrapy crawl scraper -a site=www.site1.com and scrapy crawl scraper -a site=www.site2.com.

I have already deployed a version of the scraper, but it can only run scrapy crawl scraper. How can I change the site argument of the command at execution time?

Also, should I be using a Cloud Run job or a service?

Asked By: xerac


Answers:

There is no direct way.

Cloud Scheduler can call your application with parameters, but you would need to create a new job for each set of parameters.
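
If the scraper is deployed as a service, one way to picture this is that each Cloud Scheduler job sends its own request body and the service turns it into spider arguments. The following is only a minimal sketch under assumptions: the JSON field name site and the helper crawl_command are hypothetical, and the HTTP handler that would receive the body is left out.

```python
import json


def crawl_command(request_body, spider="scraper"):
    """Translate a request body into a scrapy command line.

    Assumes (hypothetically) that the scheduler sends a JSON body
    such as {"site": "www.site1.com"}.
    """
    payload = json.loads(request_body)
    site = payload["site"]  # hypothetical field name, not a Cloud Run convention
    # Same shape as the command from the question: scrapy crawl scraper -a site=...
    return ["scrapy", "crawl", spider, "-a", f"site={site}"]
```

The returned list could then be passed to subprocess.run by whatever handler receives the request.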

Cloud Run supports environment variables, but you would need to redeploy your app to modify them.

You can store parameters for your application in Secret Manager or Cloud Storage. Your app would then read the current configuration from one of those locations.

Answered By: John Hanley

According to the Cloud Run jobs documentation, there is a trick.

  • Define a number of tasks: for example, set the number of tasks equal to the number of sites to scrape. Use the --tasks parameter for that.
  • In your container (or in Cloud Storage, but in that case you have to download the file before starting the process), add a file with one website to scrape per line.
  • At runtime, use the CLOUD_RUN_TASK_INDEX environment variable. That variable indicates the index of the task within the execution, starting at 0. For each index, pick the matching line in your file of websites (the line number equals the value of the environment variable).

That way, you can take advantage of Cloud Run jobs and their parallelism.
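
The steps above could be sketched as a small entrypoint script for the container. This is an illustration under assumptions rather than a definitive implementation: the file name sites.txt and the helper pick_site are hypothetical, the spider name and -a site= argument come from the question, and blank lines in the file are skipped.

```python
import os
import subprocess
from pathlib import Path

SITES_FILE = Path("sites.txt")  # hypothetical file: one website per line


def pick_site(lines, task_index):
    """Return the site assigned to this task (index 0 is the first site)."""
    # Ignore blank lines so a trailing newline does not count as a site.
    sites = [line.strip() for line in lines if line.strip()]
    return sites[task_index]


def main():
    # Cloud Run sets CLOUD_RUN_TASK_INDEX for each task of a job execution.
    task_index = int(os.environ["CLOUD_RUN_TASK_INDEX"])
    site = pick_site(SITES_FILE.read_text().splitlines(), task_index)
    # Same command as in the question, with the site chosen per task.
    subprocess.run(["scrapy", "crawl", "scraper", "-a", f"site={site}"], check=True)


# Only crawl when running inside a Cloud Run job task, where the variable is set.
if __name__ == "__main__" and "CLOUD_RUN_TASK_INDEX" in os.environ:
    main()
```

With a file of three sites, the job would be configured with three tasks, and the tasks with index 0, 1 and 2 would each crawl a different site.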


The main tradeoff here is that the list of websites to scrape is static.

Answered By: guillaume blaquiere