Gunicorn worker terminated with signal 9

Question:

I am running a Flask application and hosting it on Kubernetes from a Docker container. Gunicorn is managing workers that reply to API requests.

The following warning message appears regularly, and it seems like requests are being cancelled for some reason. In Kubernetes, the pod shows no restarts or other odd behavior and stays within 80% of its memory and CPU limits.

[2021-03-31 16:30:31 +0200] [1] [WARNING] Worker with pid 26 was terminated due to signal 9

How can we find out why these workers are being killed?

Asked By: Jodiug


Answers:

I encountered the same warning message.

[WARNING] Worker with pid 71 was terminated due to signal 9

I came across this FAQ entry, which says that "A common cause of SIGKILL is when OOM killer terminates a process due to low memory condition."

I used dmesg and realized that the worker had indeed been killed because the system was running out of memory.

Out of memory: Killed process 776660 (gunicorn)
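
To run the same check yourself (assuming you have root access on the host or a privileged container), something along these lines should surface OOM kills:

# Search the kernel log for OOM-killer activity
sudo dmesg | grep -i -E "out of memory|killed process"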
Answered By: Simon

In my case, the problem was a long application startup caused by ML model warm-up (over 3s).

Answered By: EgurnovD

I encountered the same warning message when I limited the Docker container's memory, e.g. with -m 3000m.

See docker-memory and the Gunicorn FAQ entry "Why are Workers Silently Killed?".

The simple way to avoid this is to give Docker a higher memory limit, or not to set one at all.
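
As a rough sketch (the image name and the 4g figure are placeholders, not from the original answer):

# Give the container more headroom than the previous -m 3000m limit
docker run -m 4g my-flask-app

# Or drop -m entirely so Docker imposes no memory limit
docker run my-flask-app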

Answered By: hstk

In our case, the application was taking around 5-7 minutes to load ML models and dictionaries into memory, so adding a timeout of 600 seconds solved the problem for us.

gunicorn main:app \
   --workers 1 \
   --worker-class uvicorn.workers.UvicornWorker \
   --bind 0.0.0.0:8443 \
   --timeout 600
Answered By: ACL

I was using AWS Elastic Beanstalk to deploy my Flask application and I had a similar error.

In the log I saw:

  • web: MemoryError
  • [CRITICAL] WORKER TIMEOUT
  • [WARNING] Worker with pid XXXXX was terminated due to signal 9

I was using a t2.micro instance, and when I changed it to t2.medium my app worked fine. In addition, I changed the timeout in my nginx config file.
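
For reference, the nginx proxy timeouts involved look roughly like this (the 600s values are placeholders; choose something longer than your slowest request):

# Inside the relevant server/location block of the nginx config
proxy_read_timeout 600s;
proxy_send_timeout 600s;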

Answered By: Vkey

It may be that your liveness check in Kubernetes is killing your workers.

If your liveness check is configured as an http request to an endpoint in your service, your main request may block the health check request, and the worker gets killed by your platform because the platform thinks that the worker is unresponsive.

That was my case. I have a gunicorn app with a single uvicorn worker, which only handles one request at a time. It worked fine locally, but the worker would be sporadically killed when deployed to Kubernetes. It would only happen during a call that takes about 25 seconds, and not every time.

It turned out that my liveness check was configured to hit the /health route every 10 seconds, time out in 1 second, and retry 3 times. So this call would time out sometimes, but not always.

If this is your case, a possible solution is to reconfigure your liveness check (or whatever health check mechanism your platform uses) so it can wait until your typical request finishes, or to allow more threads, so that the health check is never blocked long enough to trigger a worker kill.
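
For example, a Kubernetes livenessProbe along these lines (the path, port, and timings are assumptions based on the numbers above) gives slow requests room to finish:

livenessProbe:
  httpGet:
    path: /health      # assumed health endpoint
    port: 8000         # assumed container port
  periodSeconds: 30    # probe less often
  timeoutSeconds: 30   # well above the old 1s, so a ~25s request cannot starve the probe
  failureThreshold: 3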

This is also why adding more workers may help with (or merely hide) the problem.

Also, see this reply to a similar question: https://stackoverflow.com/a/73993486/2363627

Answered By: Gena Kukartsev

I encountered the same problem too, and it was because Docker's memory usage was limited to 2 GB. If you are using Docker Desktop, you just need to go to Settings > Resources and increase the memory dedicated to Docker (if not, you need to do it from the Docker command line, as sketched below).
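
If you are not on Docker Desktop, the command-line equivalent is roughly the following (the 4g value and container ID are placeholders):

# Raise the memory limit of an already-running container
docker update --memory 4g --memory-swap 4g <container-id>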

If that doesn't solve the problem, then it might be the timeout that kills the worker; you will need to add a timeout argument to the gunicorn command:

CMD ["gunicorn","--workers", "3", "--timeout", "1000", "--bind", "0.0.0.0:8000", "wsgi:app"]
Answered By: Najlae Lemrabet

Check memory usage

In my case, I could not use the dmesg command, so I checked memory usage with a Docker command:

sudo docker stats <container-id>

CONTAINER ID   NAME               CPU %     MEM USAGE / LIMIT   MEM %     NET I/O        BLOCK I/O         PIDS
289e1ad7bd1d   funny_sutherland   0.01%     169MiB / 1.908GiB   8.65%     151kB / 96kB   8.23MB / 21.5kB   5

In my case, the workers were not being terminated because of memory.

Answered By: Yoooda

In my case, I needed to connect to a remote database on a private network, which requires connecting to a VPN first, and I had forgotten that.

So check your database connection, or anything else that causes your app to wait for a long time.
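
A quick way to check basic reachability from inside the container (the host, port, and availability of netcat are assumptions) is something like:

# Test whether the database host/port is reachable at all
nc -zv db.internal.example 5432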

Answered By: afifabroory

In my case, I first noticed that decreasing the number of workers from 4 to 2 worked. However, I believe the problem was related to the database connection: I tried -w 4 again after restarting the server that hosts the database, and it worked perfectly.

Answered By: Mithsew