Subprocess, repeatedly write to STDIN while reading from STDOUT (Windows)

Question:

I want to call an external process from Python. The process reads an input string, prints the tokenized result, and then waits for the next input (the binary is the MeCab tokenizer, if that helps).

I need to tokenize thousands of lines of string by calling this process.

The problem is that Popen.communicate() works, but it waits for the process to exit before returning the STDOUT result. I don’t want to keep closing and opening new subprocesses thousands of times. (And I don’t want to send the whole text at once; it may easily grow to tens of thousands of long lines in the future.)

from subprocess import PIPE, Popen

with Popen("mecab -O wakati".split(), stdin=PIPE,
           stdout=PIPE, stderr=PIPE, close_fds=False,
           universal_newlines=True, bufsize=1) as proc:
    output, errors = proc.communicate("foobarbaz")

print(output)

I’ve tried calling proc.stdout.read() instead of communicate(), but it blocks and doesn’t return any results until proc.stdin.close() is called. Which, again, means I need to create a new process every time.

I’ve tried to implement queues and threads from a similar question, as below, but either it doesn’t return anything, so it’s stuck in the while True loop, or, when I force the stdin buffer to fill by repeatedly sending strings, it outputs all the results at once.

from subprocess import PIPE, Popen
from threading import Thread
from queue import Queue, Empty

def enqueue_output(out, queue):
    for line in iter(out.readline, b''):
        queue.put(line)
    out.close()

p = Popen('mecab -O wakati'.split(), stdout=PIPE, stdin=PIPE,
          universal_newlines=True, bufsize=1, close_fds=False)
q = Queue()
t = Thread(target=enqueue_output, args=(p.stdout, q))
t.daemon = True
t.start()

p.stdin.write("foobarbaz")
while True:
    try:
        line = q.get_nowait()
    except Empty:
        pass
    else:
        print(line)
        break

I also looked at the Pexpect route, but its Windows port doesn’t support some important modules (the pty-based ones), so I couldn’t apply that either.

I know there are a lot of similar answers, and I’ve tried most of them. But nothing I’ve tried seems to work on Windows.

EDIT: Some info on the binary I’m using: when I run it from the command line, it tokenizes each sentence I give it until I’m done and forcibly close the program.

(…waits_for_input -> input_received -> output -> waits_for_input…)

Thanks.

Asked By: umutto

||

Answers:

If mecab uses C FILE streams with default buffering, then piped stdout has a 4 KiB buffer. The idea here is that a program can efficiently use small, arbitrary-sized reads and writes to the buffers, and the underlying standard I/O implementation handles automatically filling and flushing the much-larger buffers. This minimizes the number of required system calls and maximizes throughput. Obviously you don’t want this behavior for interactive console or terminal I/O or writing to stderr. In these cases the C runtime uses line-buffering or no buffering.

A program can override this behavior, and some do have command-line options to set the buffer size. For example, Python has the “-u” (unbuffered) option and PYTHONUNBUFFERED environment variable. If mecab doesn’t have a similar option, then there isn’t a generic workaround on Windows. The C runtime situation is too complicated. A Windows process can link statically or dynamically to one or several CRTs. The situation on Linux is different since a Linux process generally loads a single system CRT (e.g. GNU libc.so.6) into the global symbol table, which allows an LD_PRELOAD library to configure the C FILE streams. Linux stdbuf uses this trick, e.g. stdbuf -o0 mecab -O wakati.
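As a rough illustration of what changes when the child does not buffer its stdout, here is a sketch that swaps mecab for a Python child started with -u. The child script, the upper-casing, and the `roundtrip` helper are all made up for the demo; only the -u option comes from the paragraph above. Because the child's replies are not held back, each one can be read right after its input is written, without closing stdin.

```python
import subprocess
import sys

# Hypothetical stand-in for an interactive tool: echo each input line upper-cased.
CHILD = (
    "import sys\n"
    "for line in iter(sys.stdin.readline, ''):\n"
    "    sys.stdout.write(line.upper())\n"
)


def roundtrip(lines):
    """Send lines one at a time and read one reply per line."""
    replies = []
    # -u makes the child's stdout unbuffered, so replies are not held back.
    with subprocess.Popen(
        [sys.executable, "-u", "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        bufsize=1,
    ) as proc:
        for line in lines:
            proc.stdin.write(line + "\n")  # the newline terminates the request
            proc.stdin.flush()             # push it out of the parent-side buffer
            replies.append(proc.stdout.readline().rstrip("\n"))
        # Leaving the with block closes stdin, so the child sees EOF and exits.
    return replies


if __name__ == "__main__":
    print(roundtrip(["foo", "bar"]))
```

Dropping the -u from the command line is a quick way to see the stall described above: the child then keeps its output in its own buffer and the first readline() waits indefinitely.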


One option to experiment with is to call CreateConsoleScreenBuffer and get a file descriptor for the handle from msvcrt.open_osfhandle. Then pass this as stdout instead of using a pipe. The child process will see this as a TTY and use line buffering instead of full buffering. However managing this is non-trivial. It would involve reading (i.e. ReadConsoleOutputCharacter) a sliding buffer (call GetConsoleScreenBufferInfo to track the cursor position) that’s actively written to by another process. This kind of interaction isn’t something that I’ve ever needed or even experimented with. But I have used a console screen buffer non-interactively, i.e. reading the buffer after the child has exited. This allows reading up to 9,999 lines of output from programs that write directly to the console instead of stdout, e.g. programs that call WriteConsole or open “CON” or “CONOUT$”.

Answered By: Eryk Sun

Here is a workaround for Windows. It should also be adaptable to other operating systems.
Download a console emulator like ConEmu (https://conemu.github.io/) and start it, instead of mecab, as your subprocess.

p = Popen(['conemu'] , stdout=PIPE, stdin=PIPE,
      universal_newlines=True, bufsize=1, close_fds=False)

Then send the following as the first input:

mecab -O wakati & exit

You are letting the emulator handle the output issues for you, the way it normally does when you interact with it manually.
I am still looking into this, but it already looks promising…

The only problem is that ConEmu is a GUI application, so if there is no other way to hook into its input and output, one might have to tweak it and rebuild it from source (it’s open source). I haven’t found any other way, but this should work.

I have asked a question about running it in some sort of console mode here, so you can check that thread as well. The author, Maximus, is on SO…

Answered By: Seyi Shoboyejo

The code

while True:
    try:
        line = q.get_nowait()
    except Empty:
        pass
    else:
        print(line)
        break

is essentially the same as

print(q.get())

except less efficient because it burns CPU time while waiting. The explicit loop won’t make data from the subprocess arrive sooner; it arrives when it arrives.
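A minimal sketch of the blocking version, assuming a child that flushes after each line. The child here is a made-up Python stand-in rather than mecab, and `ask` and `demo` are hypothetical helper names. Two details differ from the question's snippet: in text mode the end-of-file sentinel for readline is '' rather than b'', and each request ends with a newline and is flushed. The question's snippet writes "foobarbaz" with no newline, so with line buffering it may never leave the parent's buffer.

```python
import subprocess
import sys
from queue import Queue, Empty
from threading import Thread

# Hypothetical stand-in child: replies to every line and flushes immediately.
CHILD = (
    "import sys\n"
    "for line in iter(sys.stdin.readline, ''):\n"
    "    sys.stdout.write('got ' + line)\n"
    "    sys.stdout.flush()\n"
)


def enqueue_output(out, queue):
    # Text-mode pipes signal end-of-file with '' (not b'').
    for line in iter(out.readline, ""):
        queue.put(line)
    out.close()


def ask(proc, queue, text, timeout=10.0):
    """Send one line and block until a reply arrives or the timeout expires."""
    proc.stdin.write(text + "\n")
    proc.stdin.flush()
    try:
        return queue.get(timeout=timeout).rstrip("\n")  # sleeps instead of spinning
    except Empty:
        return None  # no reply in time, e.g. the child is buffering its output


def demo():
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        bufsize=1,
    )
    q = Queue()
    reader = Thread(target=enqueue_output, args=(proc.stdout, q), daemon=True)
    reader.start()
    replies = [ask(proc, q, s) for s in ("foo", "bar")]
    proc.stdin.close()  # EOF lets the child exit
    proc.wait()
    reader.join(timeout=10.0)
    return replies


if __name__ == "__main__":
    print(demo())
```

The timeout is optional; it only keeps the caller from hanging forever when the child never answers.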

For dealing with uncooperative binaries I have a few suggestions, from best to worst:

  1. Find a Python library and use that instead. It appears that there’s an official Python binding in the MeCab source tree and I see some prebuilt packages on PyPI. You can also look for a DLL build that you can call with ctypes or another Python FFI. If that doesn’t work…

  2. Find a binary that flushes after each line of output. The most recent Win32 build I found online, v0.98, does flush after each line. Failing that…

  3. Build your own binary that flushes after each line. It should be easy enough to find the main loop and insert a flush call in it. But MeCab seems to explicitly flush already, and git blame says that the flush statement was last changed in 2011, so I’m surprised you ever had this problem and I suspect that there may have just been a bug in your Python code. Failing that…

  4. Process the output asynchronously. If your concern is that you want to deal with the output in parallel with the tokenization for performance reasons, you can mostly do that, after the first 4K. Just do the processing in the second thread instead of stuffing the lines in a queue. If you can’t do that…

  5. This is a terrible hack but it may work in some cases: intersperse your inputs with dummy inputs that produce at least 4K of output. For example, you could output 2047 blank lines after every real input line (2047 CRLFs plus the CRLF from the real output = 4K), or a single line of b'A' * 4092 + b'\r\n', whichever is faster.
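For suggestion 1, here is a rough sketch of what in-process tokenizing might look like. It assumes a binding such as the mecab-python3 package from PyPI, which provides `MeCab.Tagger` with a `parse` method; whether that package suits your setup is something you would need to check. The helper name `tokenize_lines` and the `WhitespaceTagger` stand-in are hypothetical, and the helper accepts any object with a `parse(str)` method so it can be tried without MeCab installed.

```python
def tokenize_lines(lines, tagger):
    """Tokenize each line in-process; there is no pipe, so no output buffering."""
    # With wakati output the tokens come back space-separated on one line.
    return [tagger.parse(line).split() for line in lines]


class WhitespaceTagger:
    """Hypothetical stand-in with the same parse() shape, for trying the helper."""

    def parse(self, text):
        return " ".join(text.split()) + " \n"


if __name__ == "__main__":
    try:
        import MeCab  # assumption: mecab-python3 and a dictionary are installed
        tagger = MeCab.Tagger("-Owakati")
    except (ImportError, RuntimeError):
        tagger = WhitespaceTagger()  # fall back to the stand-in
    print(tokenize_lines(["foo bar baz"], tagger))
```

Since the tagger object lives in your own process, thousands of lines can go through the same instance with no subprocess to start or stop.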

Not on this list at all is an approach suggested by the two previous answers: directing the output to a Win32 console and scraping the console. This is a terrible idea because scraping gets you cooked output as a rectangular array of characters. The scraper has no way to know whether two lines were originally one overlong line that wrapped. If it guesses wrong, your outputs will get out of sync with your inputs. It’s impossible to work around output buffering in this way if you care at all about the integrity of the output.

Answered By: benrg

I guess the answer, if not the solution, can be found here
https://github.com/ikriv/ConsoleProxy/blob/master/src/Tools/Exec/readme.md

I say “guess” because I had a similar problem, which I worked around, and I could not try this route: the tool is not available for Windows 2003, which is the OS I had to use (in a VM for a legacy application).

I’d like to know if I guessed right.

Answered By: Marco Gamberoni