Cloud Run with Gunicorn Best Practice

Question:

I am currently working on a service that is supposed to provide an HTTP endpoint in Cloud Run and I don’t have much experience. I am currently using flask + gunicorn and can also call the service. My main problem now is optimising for multiple simultaneous requests. Currently, the service in Cloud Run has 4GB of memory and 1 CPU allocated to it. When it is called once, the instance that is started directly consumes 3.7GB of memory and about 40-50% of the CPU (I use a neural network to embed my data). Currently, my settings are very basic:

  • memory: 4096M
  • CPU: 1
  • min-instances: 0
  • max-instances: 1
  • concurrency: 80
  • Workers: 1 (Gunicorn)
  • Threads: 1 (Gunicorn)
  • Timeout: 0 (Gunicorn, as recommended by Google)

If I increase the number of workers to two, I would need to increase the memory to 8GB. If I do that, my service should be able to work on two requests simultaneously with one instance, provided the 1 allocated CPU has more than one core. But what happens if there is a third request? I would like to think that Cloud Run will start a second instance. Does the new instance also get 1 CPU and 8GB of memory, and if not, what is the best practice for me?

Asked By: F3Tz


Answers:

One best practice is to let Cloud Run scale automatically instead of trying to optimize each instance. Using 1 worker is a good idea because it limits the memory footprint and shortens the cold start.
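To make the scaling behaviour concrete, here is a rough back-of-the-envelope sketch in Python. It assumes Cloud Run adds instances once in-flight requests exceed the concurrency setting, stays within min-instances and max-instances, and gives every instance the same configured CPU and memory. The function name and the simple ceiling formula are my own simplification, not how the autoscaler is actually implemented (it also takes CPU utilisation into account).

```python
import math


def estimated_instances(in_flight_requests: int, concurrency: int,
                        max_instances: int, min_instances: int = 0) -> int:
    """Rough estimate of how many instances are running for a given load.

    Hypothetical helper for illustration only; the real autoscaler is
    more involved than a ceiling division.
    """
    if in_flight_requests <= 0:
        return min_instances
    wanted = math.ceil(in_flight_requests / concurrency)
    # Clamp between the configured lower and upper bounds.
    return max(min_instances, min(wanted, max_instances))


# Settings from the question: concurrency 80, max-instances 1.
print(estimated_instances(3, concurrency=80, max_instances=1))
# Concurrency 1 but still max-instances 1: no second instance can start.
print(estimated_instances(3, concurrency=1, max_instances=1))
# Concurrency 1 with room to scale: roughly one instance per request.
print(estimated_instances(3, concurrency=1, max_instances=10))
```

With max-instances: 1 as in the question, a third request cannot trigger a second instance; it waits on the single existing one. Raising max-instances is what allows the automatic scaling described above.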

I recommend experimenting with the number of threads, typically setting it to 8 or 16, so that Gunicorn can actually make use of the concurrency parameter.

If you set the thread count too low, the Cloud Run internal load balancer will route requests to the instance, thinking it will be able to serve them, but if Gunicorn can’t accept new requests, you will have issues.

Tune the CPU and memory of your service, as well as the thread count and the concurrency, until you find the right values. Hey is a useful tool for stress-testing your service and observing what happens when it scales.
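As a quick sanity check of the mismatch described above, the small Python sketch below compares the Cloud Run concurrency with what Gunicorn can process at once. It assumes a threaded Gunicorn setup where each worker handles up to `threads` requests in parallel, so the capacity is roughly workers × threads; the function names are made up for illustration.

```python
def gunicorn_capacity(workers: int, threads: int) -> int:
    """Approximate number of requests one instance can process at once."""
    return workers * threads


def requests_left_waiting(cloud_run_concurrency: int, workers: int, threads: int) -> int:
    """Requests Cloud Run may send to an instance beyond what Gunicorn can process."""
    return max(0, cloud_run_concurrency - gunicorn_capacity(workers, threads))


# Settings from the question: concurrency 80, 1 worker, 1 thread.
print(requests_left_waiting(80, workers=1, threads=1))
# 8 threads help, but a concurrency of 80 still oversubscribes the instance.
print(requests_left_waiting(80, workers=1, threads=8))
# Concurrency lowered to match workers * threads: nothing is left waiting.
print(requests_left_waiting(8, workers=1, threads=8))
```

Threads live in the same process, so they should be able to share one loaded copy of the model, whereas each extra worker is a separate process that loads its own copy. Whether 8 threads are actually faster for CPU-heavy inference is something to measure under load rather than assume.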

Answered By: guillaume blaquiere

The best practice so far is: for environments with multiple CPU cores, increase the number of workers to equal the number of cores available. The timeout is set to 0 to disable worker timeouts and let Cloud Run handle instance scaling. Adjust the number of workers and threads on a per-application basis. For example, try using a number of workers equal to the cores available, make sure there is a performance improvement, and then adjust the number of threads, e.g.:

CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
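
If you prefer to keep these values out of the Dockerfile, the same settings can also go into a Gunicorn config file, which is plain Python. This is only a sketch: the file name gunicorn.conf.py and the fallback port 8080 are assumptions on my side, and the numbers simply mirror the command above.

```python
# gunicorn.conf.py (assumed file name)
import os

# Cloud Run passes the port to listen on in the PORT environment variable;
# 8080 is only a fallback for running locally.
port = os.environ.get("PORT", "8080")

bind = f":{port}"  # same as --bind :$PORT
workers = 1        # one process, so the model is loaded only once
threads = 8        # same as --threads 8
timeout = 0        # same as --timeout 0, disables worker timeouts
```

You would then start the service with something like `gunicorn -c gunicorn.conf.py main:app`.
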
Answered By: Eddie Villalba