Get stacktrace from stuck python process that does not accept signals

Question:

I have to run a legacy Zope2 website and have some grievances with it. The biggest issue is that, occasionally, it just locks up, running at 100% CPU load and no longer responding to requests. While the problem isn’t reproducible on a regular basis, one page containing 3 dynamic graphs triggers it sometimes, so I suspect some kind of race condition that leads to an endless loop or a stuck busy-wait.

The problem is, I have not yet found a way to debug this thing. There’s nothing in the Zope logs and nothing in the system logs. I tried the suggestions from this question to get a stacktrace, but the only signal that has any effect is SIGKILL.

Is there another possibility to find out where exactly the process is when it gets stuck?

Asked By: Benjamin Wohlwend


Answers:

You could try to attach a debugger to the running process. See also this question.

Answered By: Thomas

If the process is stuck in a way that no other signal gets through, you might want to consider running it from a debugger, instead of trying to attach to it at runtime.

Also, it might be useful to try other debugging tactics, like turning off certain parts of the code to find the minimal case in which the problem is still reproducible, which makes it easier to see what causes it.

Answered By: abyx

See my answer to this SO question: use Products.signalstack. It registers the same handler as the answer you already found, at Product registration time. Perhaps it works better for you.

If not, you probably have an OS-level I/O problem on your hands, and your only hope is attaching gdb to the process. Search Stack Overflow for gdb answers; there is a wealth of information here!
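For reference, the signal-handler approach mentioned above can be sketched roughly as follows. This is not the actual Products.signalstack code; the handler name and the choice of SIGUSR1 are assumptions made for illustration.

```python
import signal
import sys
import traceback


def dump_stacks(signum, frame):
    """Write the current stack of every thread to stderr."""
    for thread_id, thread_frame in sys._current_frames().items():
        sys.stderr.write("Stack for thread {}\n".format(thread_id))
        traceback.print_stack(thread_frame, file=sys.stderr)
        sys.stderr.write("\n")


# SIGUSR1 is an arbitrary choice here and only exists on POSIX systems.
if hasattr(signal, "SIGUSR1"):
    signal.signal(signal.SIGUSR1, dump_stacks)
```

With this installed, `kill -USR1 <PID>` should print the stacks. Note that Python-level signal handlers only run in the main thread, between bytecode instructions, so a main thread that is stuck inside a C call never gets to execute the handler. That would be consistent with the symptoms in the question.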

Answered By: Martijn Pieters

After running around the internet in circles for a while, I finally ended up here: http://podoliaka.org/2016/04/10/debugging-cpython-gdb/ which describes in detail how all the pieces fit together. The money quote for me was ‘gdb /usr/bin/python -p $PID’: the name of the executable is required in order for gdb to find the correct debug info files.
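As a rough sketch, the same invocation can be scripted from Python so that gdb attaches, prints the Python-level backtraces, and detaches again. This assumes gdb's Python extension (which provides py-bt) is installed and loaded, and that /usr/bin/python is the interpreter the stuck process is actually running; the helper name is made up for illustration.

```python
import subprocess


def gdb_backtrace_command(pid, python_executable="/usr/bin/python"):
    """Build a gdb command line that dumps Python backtraces for all threads."""
    return [
        "gdb", python_executable,
        "-p", str(pid),
        "-batch",                          # run the commands, then exit
        "-ex", "thread apply all py-bt",   # py-bt needs the python-gdb extension
    ]


def print_python_backtraces(pid):
    # Attaching to another process may require root (or relaxed ptrace settings).
    return subprocess.call(gdb_backtrace_command(pid))
```

For example, `print_python_backtraces(12345)` would attach to process 12345. If py-bt is not available, replacing it with plain `bt` still shows the C-level stack.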

Answered By: Baczek

You can print out a nice stack trace using pyrasite.

First, you’ll need to have gdb installed.

# Redhat, CentOS, etc
$ yum install gdb

# Ubuntu, Debian, etc
$ apt-get update && apt-get install gdb

Then, install pyrasite.

$ pip install pyrasite

Use ps or some other method to find the process ID for the stuck python process and run pyrasite-shell with it.

# Assuming process ID is 12345
$ pyrasite-shell 12345

You should now see a python REPL. Run the following in the REPL to see stack traces for all threads.

import sys, traceback
for thread_id, frame in sys._current_frames().items():
    print('Stack for thread {}'.format(thread_id))
    traceback.print_stack(frame)
    print('')

Answered By: Sean

While pyrasite might work, it does not handle some corner cases and can hang or fail silently.

If the package does not work as expected, it’s possible to do what the package does under the hood manually to figure out what went wrong.

  • Attach gdb to the Python process: gdb -p <PID> (may need sudo.)
  • Call the following functions by typing these commands into gdb:
set $gstate = PyGILState_Ensure()
call          PyRun_SimpleString(" <some Python code> ")
call          PyGILState_Release($gstate)

See the Python C API documentation for these functions (PyGILState_Ensure, PyGILState_Release and PyRun_SimpleString).


In case Python is not compiled with debug symbols, it’s necessary to provide the explicit data types for the functions:

According to the Python source code (https://github.com/python/cpython/blob/4fe5585240f64c3d14eb635ff82b163f92074b3a/Include/pystate.h#L86-L88), the type PyGILState_STATE is an enum with 2 values, so we "guess" that int can be used in its place (although it may not work).

In conclusion, according to the documentation, the "correct (subject to the restriction above)" commands for the functions are

set $gstate = ((int (*)())            PyGILState_Ensure ) ()
call          ((int (*)(const char*)) PyRun_SimpleString) (" <some Python code> ")
call          ((void(*)(int))         PyGILState_Release) ($gstate)

This solution does not rely on the Python-debugging extension for gdb. Otherwise it’s possible to simply run py-bt.
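As an illustration of what could go in place of `<some Python code>`, here is a small sketch that turns a multi-line Python payload into the three gdb commands above. The helper name, the payload, and the output path /tmp/stacks.txt are all assumptions made for this example. The escaping only covers backslashes, double quotes, and newlines, on the assumption that gdb reads the argument as a C-style string literal.

```python
def build_gdb_commands(python_source):
    """Return the three gdb commands that run python_source in the target."""
    escaped = (
        python_source
        .replace("\\", "\\\\")   # escape backslashes first
        .replace('"', '\\"')     # then double quotes
        .replace("\n", "\\n")    # gdb commands must fit on one line
    )
    return [
        "set $gstate = ((int (*)()) PyGILState_Ensure)()",
        'call ((int (*)(const char*)) PyRun_SimpleString)("{}")'.format(escaped),
        "call ((void (*)(int)) PyGILState_Release)($gstate)",
    ]


# Hypothetical payload: write every thread's stack to a file, because the
# stuck process's stdout/stderr may not be visible to you.
PAYLOAD = (
    "import sys, traceback, os\n"
    "out = open('/tmp/stacks.txt', 'w')\n"
    "for tid, frame in sys._current_frames().items():\n"
    "    out.write('Thread %s' % tid + os.linesep)\n"
    "    traceback.print_stack(frame, file=out)\n"
    "out.close()\n"
)

if __name__ == "__main__":
    for command in build_gdb_commands(PAYLOAD):
        print(command)
```

The printed lines can be pasted into a gdb session attached to the process. Writing to a file rather than printing is a deliberate choice here, since a daemonized process often has its standard streams redirected.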


I have a more up-to-date fork of pyrasite, (currently) named pyrasite-ng. If there’s any bug it can be reported there, hopefully I can fix it quickly.

Answered By: user202729

With Python 3.3 or later you can also use faulthandler, which is part of the standard library:

import faulthandler
faulthandler.enable()
# Dump the traceback of all threads after a timeout of 10 seconds
faulthandler.dump_traceback_later(timeout=10)

For more info, check out the faulthandler documentation.
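A slightly fuller sketch, assuming you can add a few lines to the application's startup code: it registers a faulthandler dump on a signal and, optionally, a repeating watchdog dump. The function name, the use of SIGUSR1, and the log path in the usage note are choices made for this example, not something the original answer prescribes.

```python
import faulthandler
import signal


def install_stack_dumpers(log_file, watchdog_seconds=None):
    """Dump all thread stacks to log_file on SIGUSR1 and, optionally, on a timer.

    log_file must be a real file object (it needs fileno()) that stays open.
    """
    # faulthandler.register() is not available on Windows.
    if hasattr(faulthandler, "register") and hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)
    if watchdog_seconds is not None:
        # Dump every watchdog_seconds until cancel_dump_traceback_later() is called.
        faulthandler.dump_traceback_later(watchdog_seconds, repeat=True, file=log_file)
```

For example, `install_stack_dumpers(open("/tmp/stacks.log", "w"), watchdog_seconds=60)` at startup, followed later by `kill -USR1 <PID>`. faulthandler installs its handler at the C level, so it may still produce output when a handler written in Python never gets to run. It will not help if the process cannot receive the signal at all.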

Answered By: Divine Soul