Wrap an open stream with io.TextIOWrapper

Question:

How can I wrap an open binary stream – a Python 2 file, a Python 3 io.BufferedReader, an io.BytesIO – in an io.TextIOWrapper?

I’m trying to write code that will work unchanged:

  • Running on Python 2.
  • Running on Python 3.
  • With binary streams generated from the standard library (i.e. I can’t control what type they are).
  • With binary streams made to be test doubles (i.e. no file handle, can’t re-open).
  • Producing an io.TextIOWrapper that wraps the specified stream.

The io.TextIOWrapper is needed because its API is expected by other parts of the standard library. Other file-like types exist, but don’t provide the right API.

Example

Wrapping the binary stream presented as the subprocess.Popen.stdout attribute:

import subprocess
import io

gnupg_subprocess = subprocess.Popen(
        ["gpg", "--version"], stdout=subprocess.PIPE)
gnupg_stdout = io.TextIOWrapper(gnupg_subprocess.stdout, encoding="utf-8")

In unit tests, the stream is replaced with an io.BytesIO instance to control its content without touching any subprocesses or filesystems.

gnupg_subprocess.stdout = io.BytesIO("Lorem ipsum".encode("utf-8"))

That works fine on the streams created by Python 3’s standard library. The same code, though, fails on streams generated by Python 2:

[Python 2]
>>> type(gnupg_subprocess.stdout)
<type 'file'>
>>> gnupg_stdout = io.TextIOWrapper(gnupg_subprocess.stdout, encoding="utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'file' object has no attribute 'readable'

Not a solution: Special treatment for file

An obvious response is to have a branch in the code which tests whether the stream actually is a Python 2 file object, and handle that differently from io.* objects.

That’s not an option for well-tested code, because it makes a branch that unit tests – which, in order to run as fast as possible, must not create any real filesystem objects – can’t exercise.

The unit tests will be providing test doubles, not real file objects. So creating a branch which won’t be exercised by those test doubles is defeating the test suite.
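For concreteness, such a branch might look like the following sketch (wrap_stream is a hypothetical helper, not code from the question); the file branch only ever runs against a real Python 2 file object, which is exactly what the test doubles never are:

import io

def wrap_stream(stream, encoding="utf-8"):
    # Detect Python 2's built-in 'file' type; the name doesn't exist on Python 3.
    try:
        is_py2_file = isinstance(stream, file)
    except NameError:
        is_py2_file = False

    if is_py2_file:
        # Re-open via the descriptor to get an io.BufferedReader.
        stream = io.open(stream.fileno(), mode="rb", closefd=False)
    return io.TextIOWrapper(stream, encoding=encoding)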

Not a solution: io.open

Some respondents suggest re-opening (e.g. with io.open) the underlying file handle:

gnupg_stdout = io.open(
        gnupg_subprocess.stdout.fileno(), mode='r', encoding="utf-8")

That works on both Python 3 and Python 2:

[Python 3]
>>> type(gnupg_subprocess.stdout)
<class '_io.BufferedReader'>
>>> gnupg_stdout = io.open(gnupg_subprocess.stdout.fileno(), mode='r', encoding="utf-8")
>>> type(gnupg_stdout)
<class '_io.TextIOWrapper'>
[Python 2]
>>> type(gnupg_subprocess.stdout)
<type 'file'>
>>> gnupg_stdout = io.open(gnupg_subprocess.stdout.fileno(), mode='r', encoding="utf-8")
>>> type(gnupg_stdout)
<type '_io.TextIOWrapper'>

But of course it relies on re-opening a real file from its file handle. So it fails in unit tests when the test double is an io.BytesIO instance:

>>> gnupg_subprocess.stdout = io.BytesIO("Lorem ipsum".encode("utf-8"))
>>> type(gnupg_subprocess.stdout)
<type '_io.BytesIO'>
>>> gnupg_stdout = io.open(gnupg_subprocess.stdout.fileno(), mode='r', encoding="utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
io.UnsupportedOperation: fileno

Not a solution: codecs.getreader

The standard library also has the codecs module, which provides wrapper features:

import codecs

gnupg_stdout = codecs.getreader("utf-8")(gnupg_subprocess.stdout)

That’s good because it doesn’t attempt to re-open the stream. But it fails to provide the io.TextIOWrapper API. Specifically, it doesn’t inherit from io.IOBase and doesn’t have an encoding attribute:

>>> type(gnupg_subprocess.stdout)
<type 'file'>
>>> gnupg_stdout = codecs.getreader("utf-8")(gnupg_subprocess.stdout)
>>> type(gnupg_stdout)
<type 'instance'>
>>> isinstance(gnupg_stdout, io.IOBase)
False
>>> gnupg_stdout.encoding
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/codecs.py", line 643, in __getattr__
    return getattr(self.stream, name)
AttributeError: 'file' object has no attribute 'encoding'

So codecs doesn’t provide objects which substitute for io.TextIOWrapper.

What to do?

So how can I write code that works for both Python 2 and Python 3, with both the test doubles and the real objects, which wraps an io.TextIOWrapper around the already-open byte stream?

Asked By: bignose

Answers:

Use codecs.getreader to produce a wrapper object:

text_stream = codecs.getreader("utf-8")(bytes_stream)

Works on Python 2 and Python 3.

Answered By: jbg

Based on multiple suggestions in various forums, and on experimenting with the standard library to meet the criteria, my current conclusion is that this can’t be done with the standard library types as they currently exist.

Answered By: bignose

It turns out you just need to wrap your io.BytesIO in an io.BufferedReader, which exists on both Python 2 and Python 3.

import io

reader = io.BufferedReader(io.BytesIO("Lorem ipsum".encode("utf-8")))
wrapper = io.TextIOWrapper(reader, encoding="utf-8")
wrapper.read()  # returns u'Lorem ipsum'

This answer originally suggested using os.pipe, but the read-side of the pipe would have to be wrapped in io.BufferedReader on Python 2 anyway to work, so this solution is simpler and avoids allocating a pipe.
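For reference, the pipe-based variant would have looked roughly like this sketch (a reconstruction, not the answer’s original code), with io.open() supplying the buffered reader on the read side:

import io
import os

# Build a pipe and feed it the test data.
pipe_r, pipe_w = os.pipe()
os.write(pipe_w, u"Lorem ipsum".encode("utf-8"))
os.close(pipe_w)

# io.open() on the read descriptor yields an io.BufferedReader on both
# Python 2 and Python 3, so io.TextIOWrapper accepts it directly.
reader = io.open(pipe_r, mode="rb")
wrapper = io.TextIOWrapper(reader, encoding="utf-8")
print(wrapper.read())  # Lorem ipsum
wrapper.close()  # also closes the pipe's read descriptor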

Answered By: jbg

Okay, this seems to be a complete solution, for all cases mentioned in the question, tested with Python 2.7 and Python 3.5. The general solution ended up being re-opening the file descriptor, but instead of io.BytesIO you need to use a pipe for your test double so that you have a file descriptor.

import io
import subprocess
import os

# Example function, re-opens a file descriptor for UTF-8 decoding,
# reads until EOF and prints what is read.
def read_as_utf8(fileno):
    fp = io.open(fileno, mode="r", encoding="utf-8", closefd=False)
    print(fp.read())
    fp.close()

# Subprocess
gpg = subprocess.Popen(["gpg", "--version"], stdout=subprocess.PIPE)
read_as_utf8(gpg.stdout.fileno())

# Normal file (contains "Lorem ipsum." as UTF-8 bytes)
normal_file = open("loremipsum.txt", "rb")
read_as_utf8(normal_file.fileno())  # prints "Lorem ipsum."

# Pipe (for test harness - write whatever you want into the pipe)
pipe_r, pipe_w = os.pipe()
os.write(pipe_w, "Lorem ipsum.".encode("utf-8"))
os.close(pipe_w)
read_as_utf8(pipe_r)  # prints "Lorem ipsum."
os.close(pipe_r)

Answered By: jbg

I needed this as well, but based on the thread here, I determined that it was not possible using just Python 2’s io module. While this breaks your “Special treatment for file” rule, the technique I went with was to create an extremely thin wrapper for file (code below) that could then be wrapped in an io.BufferedReader, which can in turn be passed to the io.TextIOWrapper constructor. It will be a pain to unit test, as obviously the new code path can’t be tested on Python 3.

Incidentally, the reason the results of an open() can be passed directly to io.TextIOWrapper in Python 3 is because a binary-mode open() actually returns an io.BufferedReader instance to begin with (at least on Python 3.4, which is where I was testing at the time).
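A quick check of that claim (an illustrative snippet, not part of the original answer):

import io

# On Python 3, binary-mode open() returns an io.BufferedReader,
# which is why io.TextIOWrapper accepts it directly.
with open(__file__, "rb") as raw:
    print(isinstance(raw, io.BufferedReader))  # True on Python 3, False on Python 2

The wrapper itself: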

import io
import six  # for six.PY2

if six.PY2:
    class _ReadableWrapper(object):
        def __init__(self, raw):
            self._raw = raw

        def readable(self):
            return True

        def writable(self):
            return False

        def seekable(self):
            return True

        def __getattr__(self, name):
            return getattr(self._raw, name)

def wrap_text(stream, *args, **kwargs):
    # Note: order important here, as 'file' doesn't exist in Python 3
    if six.PY2 and isinstance(stream, file):
        stream = io.BufferedReader(_ReadableWrapper(stream))

    return io.TextIOWrapper(stream, *args, **kwargs)

At least this is small, so hopefully it minimizes the exposure for parts that cannot easily be unit tested.
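A usage sketch (illustrative, not from the original answer), against the question’s example stream and its io.BytesIO test double:

# Real stream from the question (a Python 2 'file' or a Python 3 BufferedReader):
text_stream = wrap_text(gnupg_subprocess.stdout, encoding="utf-8")

# The io.BytesIO test double needs no special treatment on either version:
double = io.BytesIO("Lorem ipsum".encode("utf-8"))
print(wrap_text(double, encoding="utf-8").read())  # Lorem ipsum

On Python 3, or with the test double, the stream goes straight to io.TextIOWrapper; only a real Python 2 file object takes the _ReadableWrapper/io.BufferedReader detour.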

Answered By: Vek

Here’s some code that I’ve tested in both python 2.7 and python 3.6.

The key here is that you need to use detach() on your previous stream first. This does not close the underlying file; it just rips out the raw stream object so that it can be reused. detach() will return an object that can be wrapped with io.TextIOWrapper.

As an example, I open a file in binary read mode, do a read on it in that mode, then switch to a UTF-8 decoded text stream via io.TextIOWrapper.

I saved this example as this-file.py

import io

fileName = 'this-file.py'
fp = io.open(fileName,'rb')
fp.seek(20)
someBytes = fp.read(10)
print(type(someBytes), len(someBytes))

# now let's do some wrapping to get a new text (non-binary) stream
pos = fp.tell() # we're about to lose our position, so let's save it
newStream = io.TextIOWrapper(fp.detach(),'utf-8') # FYI -- fp is now unusable
newStream.seek(pos)
theRest = newStream.read()
print(type(theRest), len(theRest))

Here’s what I get when I run it with both python2 and python3.

$ python2.7 this-file.py 
(<type 'str'>, 10)
(<type 'unicode'>, 406)
$ python3.6 this-file.py 
<class 'bytes'> 10
<class 'str'> 406

Obviously the output format differs (Python 2 treats the parenthesized print arguments as a tuple), and as expected the variable types differ between Python versions, but it works as it should in both cases.

Answered By: Grey Christoforo