Python program hangs forever when called from subprocess

Question:

The pip test suite uses subprocess calls to run integration tests. Recently a PR was opened which removed some older compatibility code. Specifically, it replaced a b() function with explicit uses of the b"" literal. However, this has seemingly broken something such that a particular subprocess call will hang forever. To make matters worse, it only hangs forever on Python 3.3 (maybe only Python 3.3.5), and it cannot easily be reproduced outside of Travis.

Relevant Pull Requests:

A similar problem occurs with other Pull Requests, however they fail on different versions of Python and different test cases. These Pull Requests are:

Another user reported a similar issue to me today in IRC. They say they can reproduce it locally on Ubuntu 14.04 with Python 3.3 from deadsnakes (but not on OSX), and not only on Travis, which is mostly where I’ve been able to reproduce it thus far. They’ve sent me steps to reproduce, which are:

$ git clone git@github.com:xavfernandez/pip.git
$ cd pip
$ git checkout debug_stuck
$ pip install pytest==2.5.2 scripttest==1.3 virtualenv==1.11.6 mock==1.0.1 pretend==1.0.8 setuptools==4.0
$ # The below should pass just fine
$ py.test -k test_env_vars_override_config_file -v -s
$ # Now edit pip/req/req_set.py and remove method remove_me_to_block or change its content to print('KO') or pass
$ # The below should hang forever
$ py.test -k test_env_vars_override_config_file -v -s

In the above example, the remove_me_to_block method is not called anywhere. Its mere existence is enough to make the test not block, and its non-existence (or changing its contents) is enough to make the test block forever.

Most of the debugging has been done with the changes in this PR (https://github.com/pypa/pip/pull/1901). Pushing one commit at a time, the tests passed until this particular commit was applied: https://github.com/dstufft/pip/commit/d296df620916b4cd2379d9fab988cbc088e28fe0. Specifically, either the change to use b'\r\n' or the change to (entry + endline).encode("utf-8") will trigger it; however, neither of these is in the execution path for pip install -vvv INITools, which is the command that hangs.

In attempting to trace down the problem I’ve noticed that if I replace at least one call to "something".encode("utf8") with (lambda: "something")().encode("utf8") it works.

Another issue while attempting to debug this has been that various things I’ve tried (adding print statements, no-op atexit functions, using trollius for async subprocess) simply shift the problem from a particular test case on a particular Python version to different test cases on different versions of Python.

I am aware that the subprocess module can deadlock if you read/write from subprocess.Popen().stdout/stderr/stdin directly. However, this code is using the communicate() method, which is supposed to work around these issues. It is inside the wait() call made by communicate() that the process hangs forever, waiting for the pip process to exit.
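
For reference, the communicate() pattern described above can be sketched roughly as follows. This is not the code the pip test suite actually uses; run_and_capture is a hypothetical helper, and the timeout is an assumption added so that a child that never exits shows up as an error rather than an indefinite hang.

```python
import subprocess
import sys


def run_and_capture(args, timeout=30):
    """Run a command and return (returncode, stdout, stderr) as bytes.

    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    try:
        # communicate() drains both pipes and then waits for the child to exit,
        # which avoids the classic full-pipe deadlock of reading them by hand.
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The child did not exit in time: kill it and collect what it wrote.
        proc.kill()
        out, err = proc.communicate()
        raise RuntimeError("child did not exit; partial stdout: %r" % (out,))
    return proc.returncode, out, err


if __name__ == "__main__":
    code, out, err = run_and_capture([sys.executable, "-c", "print('hello')"])
    print(code, out)
```

A timeout does not explain why the child fails to exit, but it does make the failing test case easy to spot in CI logs.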

Other information:

  • It is very heisenbug-ey; I’ve managed to make it go away or shift based on various things that should not have any effect on it.
  • I’ve traced the execution inside of pip itself all the way through to the end of the code paths until sys.exit() is called.
  • Replacing sys.exit() with os._exit() fixes all the hanging issues, however I’d rather not do that as we’re then skipping the clean up that the Python interpreter does.
  • There are no additional threads running (verified with threading.enumerate).
  • I’ve had some combination of changes which have had it hang even without subprocess.PIPE being used for stdout/stderr/stdin, however other combinations will have it not hang if those are not used (or it’ll shift to a different test case/python version).
  • It does not appear to be timing related; any particular commit will either fail 100% of the time on the affected test cases/Pythons or fail 0% of the time.
  • Often times the code that was changed isn’t even being executed by that particular code path in the pip subprocess, however the mere existence of the change seems to break it.
  • I’ve tried disabling bytecode generation using PYTHONDONTWRITEBYTECODE=1 and that had an effect in one combination, but in others it’s had no effect.
  • The command that the subprocess calls does not hang in every invocation (similar commands are issued through the test suite) however it does always hang in the exact same place for a particular commit.
  • So far I’ve been completely unable to reproduce this outside of being called via subprocess in the test suite, however I don’t know for a fact whether it is or isn’t related to that.
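
To illustrate the difference between sys.exit() and os._exit() mentioned in the list above, here is a minimal sketch. The child script and the run_child helper are made up for this example and have nothing to do with pip itself; the point is only that os._exit() ends the process without running the interpreter's normal shutdown, so atexit handlers (and other cleanup) never execute.

```python
import subprocess
import sys

# Source for a tiny child process; {exit_call} is filled in per run.
CHILD_TEMPLATE = """
import atexit, os, sys
atexit.register(lambda: print("cleanup ran", flush=True))
{exit_call}
"""


def run_child(exit_call):
    """Run the child script with the given exit statement; return its stdout."""
    source = CHILD_TEMPLATE.format(exit_call=exit_call)
    result = subprocess.run(
        [sys.executable, "-c", source],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=30,
    )
    return result.stdout.decode()


normal = run_child("sys.exit(0)")   # normal shutdown: atexit handlers run
abrupt = run_child("os._exit(0)")   # immediate exit: no handlers, no cleanup

print("sys.exit  ->", repr(normal))
print("os._exit  ->", repr(abrupt))
```

That skipped shutdown phase is also where the interpreter does its final garbage collection, which would be consistent with os._exit() hiding a hang that occurs during cleanup.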

I’m completely at a loss for what could be causing this.

Update #1

Using faulthandler.dump_traceback_later() I got this result:

Timeout (0:00:05)!
Current thread 0x00007f417bd92740:
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
  [ Duplicate Lines Snipped ]
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/requests/packages/urllib3/response.py", line 287 in closed
Timeout (0:00:05)!
Current thread 0x00007f417bd92740:
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
  [ Duplicate Lines Snipped ]
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/requests/packages/urllib3/response.py", line 287 in closed
Timeout (0:00:05)!
Current thread 0x00007f417bd92740:
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
  [ Duplicate Lines Snipped ]
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
Timeout (0:00:05)!
Current thread 0x00007f417bd92740:
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
  [ Duplicate Lines Snipped ]
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
Timeout (0:00:05)!
Current thread 0x00007f417bd92740:
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
  [ Duplicate Lines Snipped ]
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
Timeout (0:00:05)!
Current thread 0x00007f417bd92740:
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
  [ Duplicate Lines Snipped ]
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
Timeout (0:00:05)!
Current thread 0x00007f417bd92740:
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
  [ Duplicate Lines Snipped ]
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
Timeout (0:00:05)!
Current thread 0x00007f417bd92740:
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
  [ Duplicate Lines Snipped ]
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/requests/packages/urllib3/response.py", line 285 in closed
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
  [ Duplicate Lines Snipped ]
  File "/tmp/pytest-10/test_env_vars_override_config_file0/pip_src/pip/_vendor/cachecontrol/filewrapper.py", line 24 in __getattr__
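
A watchdog like the one that produced the dumps above can be set up along these lines. This is only a sketch: run_with_watchdog is a hypothetical helper, the short timeout and the temporary log file are chosen so the example finishes quickly, and in the situation described here one would more likely pass sys.stderr and a timeout of several seconds.

```python
import faulthandler
import tempfile
import time


def run_with_watchdog(func, timeout, log_file):
    """Call func(); while it runs, dump thread tracebacks to log_file
    every `timeout` seconds. Hypothetical helper for illustration."""
    faulthandler.dump_traceback_later(timeout, repeat=True, file=log_file)
    try:
        return func()
    finally:
        # Always cancel, otherwise the watchdog keeps firing after func().
        faulthandler.cancel_dump_traceback_later()


with tempfile.TemporaryFile() as log_file:
    # time.sleep stands in for the code that hangs.
    run_with_watchdog(lambda: time.sleep(0.5), timeout=0.1, log_file=log_file)
    log_file.seek(0)
    report = log_file.read().decode()

print(report)
```

faulthandler writes straight to the file descriptor, so the target must be a real file (something with a working fileno()).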

This suggests to me that maybe the problem is something to do with the garbage collection and urllib3? The Filewrapper in pip._vendor.cachecontrol.filewrapper is used as a wrapper around a urllib3 response object (which subclasses io.IOBase) so that we can tee the read() method to store the results of each read call in a buffer as well as returning it, and then once the file has been completely consumed run a callback with the contents of that buffer so that we can store the item in the cache. Could this be interacting with the GC in some way?
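
As a rough sketch of the tee-style wrapper described above (this is not the actual cachecontrol code; TeeReader and its attribute names are invented for illustration), the idea looks something like the following. The __getattr__ here reads the wrapped object out of the instance __dict__ on purpose: if the attribute holding it is missing, for example on a partially torn-down instance, a plain self._fp lookup inside __getattr__ would call __getattr__ again, and repeated __getattr__ frames like those in the traceback are what such a loop would look like.

```python
import io


class TeeReader:
    """Wrap a file-like object, copy everything read into a buffer, and call
    `callback` with the full contents once the stream is exhausted."""

    def __init__(self, fp, callback):
        self._fp = fp
        self._buf = io.BytesIO()
        self._callback = callback

    def __getattr__(self, name):
        # Only reached when normal lookup fails. Fetch the wrapped object via
        # __dict__ so a missing _fp raises AttributeError instead of recursing.
        try:
            fp = self.__dict__["_fp"]
        except KeyError:
            raise AttributeError(name)
        return getattr(fp, name)

    def read(self, amt=-1):
        data = self._fp.read(amt)
        self._buf.write(data)
        exhausted = not data or amt is None or amt < 0
        if exhausted and self._callback is not None:
            # Hand over the collected bytes exactly once.
            self._callback(self._buf.getvalue())
            self._callback = None
        return data


seen = []
tee = TeeReader(io.BytesIO(b"abcdef"), seen.append)
while tee.read(4):
    pass
print(seen)
```

Note that the callback, the buffer and the wrapped object all hang off the wrapper, so if the callback in turn holds a reference back to the response, the objects form a reference cycle.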

Update #2

If I add a def __del__(self): pass method to the Filewrapper class, then everything works correctly in the cases I’ve tried. I tested to ensure that this wasn’t because I just happened to define a method (which “fixes” it sometimes) by changing that to def __del2__(self): pass and it started failing again. I’m not sure why this works exactly and a no-op __del__ method seems like it’s less than optimal.

Update #3

Adding import gc; gc.set_debug(gc.DEBUG_UNCOLLECTABLE) printed output to stderr twice during the execution of the pip command that has been hanging. The two dumps are:

gc: uncollectable <CallbackFileWrapper 0x7f66385c1cd0>
gc: uncollectable <dict 0x7f663821d5a8>
gc: uncollectable <functools.partial 0x7f663831de10>
gc: uncollectable <_io.BytesIO 0x7f663804dd50>
gc: uncollectable <method 0x7f6638219170>
gc: uncollectable <tuple 0x7f663852bd40>
gc: uncollectable <HTTPResponse 0x7f663831c7d0>
gc: uncollectable <PreparedRequest 0x7f66385c1a90>
gc: uncollectable <dict 0x7f663852cb48>
gc: uncollectable <dict 0x7f6637fdcab8>
gc: uncollectable <HTTPHeaderDict 0x7f663831cb90>
gc: uncollectable <CaseInsensitiveDict 0x7f66385c1ad0>
gc: uncollectable <dict 0x7f6638218ab8>
gc: uncollectable <RequestsCookieJar 0x7f663805d7d0>
gc: uncollectable <dict 0x7f66382140e0>
gc: uncollectable <dict 0x7f6638218680>
gc: uncollectable <list 0x7f6638218e18>
gc: uncollectable <dict 0x7f6637f14878>
gc: uncollectable <dict 0x7f663852c5a8>
gc: uncollectable <dict 0x7f663852cb00>
gc: uncollectable <method 0x7f6638219d88>
gc: uncollectable <DefaultCookiePolicy 0x7f663805d590>
gc: uncollectable <list 0x7f6637f14518>
gc: uncollectable <list 0x7f6637f285a8>
gc: uncollectable <list 0x7f6637f144d0>
gc: uncollectable <list 0x7f6637f14ab8>
gc: uncollectable <list 0x7f6637f28098>
gc: uncollectable <list 0x7f6637f14c20>
gc: uncollectable <list 0x7f6637f145a8>
gc: uncollectable <list 0x7f6637f14440>
gc: uncollectable <list 0x7f663852c560>
gc: uncollectable <list 0x7f6637f26170>
gc: uncollectable <list 0x7f663821e4d0>
gc: uncollectable <list 0x7f6637f2d050>
gc: uncollectable <list 0x7f6637f14fc8>
gc: uncollectable <list 0x7f6637f142d8>
gc: uncollectable <list 0x7f663821d050>
gc: uncollectable <list 0x7f6637f14128>
gc: uncollectable <tuple 0x7f6637fa8d40>
gc: uncollectable <tuple 0x7f66382189e0>
gc: uncollectable <tuple 0x7f66382183f8>
gc: uncollectable <tuple 0x7f663866cc68>
gc: uncollectable <tuple 0x7f6637f1e710>
gc: uncollectable <tuple 0x7f6637fc77a0>
gc: uncollectable <tuple 0x7f6637f289e0>
gc: uncollectable <tuple 0x7f6637f19f80>
gc: uncollectable <tuple 0x7f6638534d40>
gc: uncollectable <tuple 0x7f6637f259e0>
gc: uncollectable <tuple 0x7f6637f1c7a0>
gc: uncollectable <tuple 0x7f6637fc8c20>
gc: uncollectable <tuple 0x7f6638603878>
gc: uncollectable <tuple 0x7f6637f23440>
gc: uncollectable <tuple 0x7f663852c248>
gc: uncollectable <tuple 0x7f6637f2a0e0>
gc: uncollectable <tuple 0x7f66386a6ea8>
gc: uncollectable <tuple 0x7f663852f9e0>
gc: uncollectable <tuple 0x7f6637f28560>

and then

gc: uncollectable <CallbackFileWrapper 0x7f66385c1350>
gc: uncollectable <dict 0x7f6638c33320>
gc: uncollectable <HTTPResponse 0x7f66385c1590>
gc: uncollectable <functools.partial 0x7f6637f03ec0>
gc: uncollectable <_io.BytesIO 0x7f663804d600>
gc: uncollectable <dict 0x7f6637f1f680>
gc: uncollectable <method 0x7f663902d3b0>
gc: uncollectable <tuple 0x7f663852be18>
gc: uncollectable <HTTPMessage 0x7f66385c1c10>
gc: uncollectable <HTTPResponse 0x7f66385c1450>
gc: uncollectable <PreparedRequest 0x7f66385cac50>
gc: uncollectable <dict 0x7f6637f2f248>
gc: uncollectable <dict 0x7f6637f28b90>
gc: uncollectable <dict 0x7f6637f1e638>
gc: uncollectable <list 0x7f6637f26cb0>
gc: uncollectable <list 0x7f6637f2f638>
gc: uncollectable <HTTPHeaderDict 0x7f66385c1f90>
gc: uncollectable <CaseInsensitiveDict 0x7f66385b2890>
gc: uncollectable <dict 0x7f6638bd9200>
gc: uncollectable <RequestsCookieJar 0x7f663805da50>
gc: uncollectable <dict 0x7f6637f28a28>
gc: uncollectable <dict 0x7f663853aa28>
gc: uncollectable <list 0x7f663853a6c8>
gc: uncollectable <dict 0x7f6638ede5f0>
gc: uncollectable <dict 0x7f6637f285f0>
gc: uncollectable <dict 0x7f663853a4d0>
gc: uncollectable <method 0x7f663911f710>
gc: uncollectable <DefaultCookiePolicy 0x7f663805d210>
gc: uncollectable <list 0x7f6637f28ab8>
gc: uncollectable <list 0x7f6638215050>
gc: uncollectable <list 0x7f663853a200>
gc: uncollectable <list 0x7f6638215a28>
gc: uncollectable <list 0x7f663853a950>
gc: uncollectable <list 0x7f663853a998>
gc: uncollectable <list 0x7f6637f21638>
gc: uncollectable <list 0x7f6637f0cd40>
gc: uncollectable <list 0x7f663853ac68>
gc: uncollectable <list 0x7f6637f22c68>
gc: uncollectable <list 0x7f663853a170>
gc: uncollectable <list 0x7f6637fa6a28>
gc: uncollectable <list 0x7f66382153b0>
gc: uncollectable <list 0x7f66386a5e60>
gc: uncollectable <list 0x7f663852f2d8>
gc: uncollectable <list 0x7f66386a3320>
    [<pip._vendor.cachecontrol.filewrapper.CallbackFileWrapper object at 0x7f66385c1cd0>, <pip._vendor.cachecontrol.filewrapper.CallbackFileWrapper object at 0x7f66385c1350>]

Is that useful information? I’ve never used that flag before so I have no idea if that is unusual or not.
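
For anyone else unfamiliar with the flag, here is a minimal sketch of how it is typically used. Node is a made-up class that exists only to build a cycle with a __del__ method. On Python 3.3 and earlier, such a cycle is reported as uncollectable and left in gc.garbage; from Python 3.4 onward (PEP 442) it is collected normally, so the leftover list is expected to be empty there.

```python
import gc


class Node:
    """Hypothetical object used only to build a reference cycle."""

    def __init__(self):
        self.other = None

    def __del__(self):
        pass


def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a   # a and b now keep each other alive


was_enabled = gc.isenabled()
gc.disable()                  # keep automatic collections out of the way
gc.collect()                  # start from a clean slate
already_there = len(gc.garbage)

gc.set_debug(gc.DEBUG_UNCOLLECTABLE)   # report uncollectable objects on stderr
make_cycle()
found = gc.collect()          # number of unreachable objects found
leftover = gc.garbage[already_there:]  # objects the collector could not free

gc.set_debug(0)
if was_enabled:
    gc.enable()

print("unreachable objects found:", found)
print("left in gc.garbage:", leftover)
```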

Asked By: Donald Stufft


Answers:

In Python 2 (and in Python 3 before 3.4, which implemented PEP 442), if a set of objects is linked together in a reference cycle and at least one of them has a __del__ method, the garbage collector will not delete these objects. If you have a reference cycle, adding a __del__() method may just hide the bug rather than fix it.

According to your update #3, it looks like you have such an issue.
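
One common way to avoid such a cycle in the first place is to make the back-reference weak. The sketch below uses hypothetical Response and Wrapper classes, not the real urllib3 or cachecontrol ones, and relies on CPython's reference counting to free the object as soon as the last strong reference goes away.

```python
import gc
import weakref


class Response:
    """Hypothetical stand-in for an object that owns a wrapper."""

    def __init__(self):
        self.wrapper = Wrapper(self)


class Wrapper:
    def __init__(self, owner):
        # Weak reference back to the owner, so no strong cycle is formed.
        self._owner = weakref.ref(owner)

    @property
    def owner(self):
        return self._owner()   # None once the owner has been freed


gc.disable()  # show that reference counting alone is enough (CPython)
try:
    response = Response()
    probe = weakref.ref(response)
    wrapper = response.wrapper
    del response                       # last strong reference goes away
    freed_without_gc = probe() is None
    owner_after_free = wrapper.owner
finally:
    gc.enable()

print("freed without the cycle collector:", freed_without_gc)
print("wrapper.owner after free:", owner_after_free)
```

With no cycle, the cycle collector never has to decide in which order to finalize the objects, so the presence or absence of __del__ stops mattering.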

Answered By: vstinner